CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection
Empirical Software Engineering (IF 4.1), Pub Date: 2024-04-08, DOI: 10.1007/s10664-024-10445-9
Amal Alazba, Hamoud Aljamaan, Mohammad Alshayeb

Context

Code smell detection is the process of identifying poorly designed and implemented code fragments. Machine learning-based approaches require enormous amounts of manually labeled data, which are costly to obtain and difficult to scale. Unsupervised semantic feature learning, that is, learning without manual annotation, is therefore vital for effectively harvesting the enormous amount of available data.

Objective

The objective of this study is to propose a new code smell detection approach that uses self-supervised learning to learn intermediate representations without the need for labels, and then fine-tunes these representations on multiple downstream tasks.

Method

We propose Code Representation with Transformers (CoRT), which learns semantic and structural features of source code by training Transformers to predict reserved words that have been masked in the input code. We empirically demonstrate that this proxy task provides a powerful signal for learning semantic and structural features. We exhaustively evaluate our approach on four downstream tasks: detection of the Data Class, God Class, Feature Envy, and Long Method code smells. Moreover, we compare our results with those of two paradigms: supervised learning and a feature-based approach. Finally, we conduct a cross-project experiment to evaluate the generalizability of our method to unseen labeled data.
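
As a rough illustration of the masking proxy task, the Python sketch below replaces a subset of Java reserved words in a code snippet with a mask token and keeps the original words as prediction targets. The keyword list, the crude regex tokenizer, and the masking rate are illustrative assumptions; the actual CoRT preprocessing and Transformer pre-training setup may differ.

import random
import re

# Illustrative subset of Java reserved words (assumption; CoRT's full list may differ).
JAVA_RESERVED = {
    "public", "private", "protected", "class", "static", "void", "int",
    "return", "if", "else", "for", "while", "new", "final", "try", "catch",
}

MASK_TOKEN = "[MASK]"

def mask_reserved_words(code, mask_prob=0.5, seed=0):
    """Mask reserved words in a code snippet and record them as targets."""
    rng = random.Random(seed)
    # Crude lexer: identifiers/keywords or single non-space symbols.
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    labels = [None] * len(tokens)  # None = position is not a prediction target
    for i, tok in enumerate(tokens):
        if tok in JAVA_RESERVED and rng.random() < mask_prob:
            labels[i] = tok          # the model must recover this reserved word
            tokens[i] = MASK_TOKEN
    return tokens, labels

if __name__ == "__main__":
    snippet = "public static int add(int a, int b) { return a + b; }"
    masked, targets = mask_reserved_words(snippet)
    print(" ".join(masked))
    print([t for t in targets if t is not None])

During pre-training, such (masked tokens, target words) pairs would drive a masked-language-modelling-style loss; the resulting encoder is then fine-tuned on the labeled smell-detection datasets.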

Results

The results indicate that the proposed method achieves high detection performance for code smells. For instance, on Data Class, CoRT achieved an F1 score between 88.08 and 99.4, an Area Under the Curve (AUC) between 89.62 and 99.88, and a Matthews Correlation Coefficient (MCC) between 75.28 and 98.8, while on God Class it achieved an F1 between 86.32 and 99.03, an AUC between 92.1 and 99.85, and an MCC between 76.15 and 98.09. Compared with the baseline model and the feature-based approach, CoRT achieved better detection performance and showed a high capability to detect code smells in unseen datasets.
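
For reference, the sketch below shows how the reported metrics are typically computed for a binary smell detector using scikit-learn; the labels and scores are made up purely for illustration, and the paper reports these metrics on a 0 to 100 (percentage) scale rather than the raw 0 to 1 values returned here.

from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

# Hypothetical ground truth and model outputs (1 = smelly, 0 = clean).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # thresholded predictions

print("F1 :", f1_score(y_true, y_pred))              # in [0, 1]
print("AUC:", roc_auc_score(y_true, y_prob))         # computed from the scores
print("MCC:", matthews_corrcoef(y_true, y_pred))     # in [-1, 1]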

Conclusions

The proposed method has been shown to be effective in detecting both class-level and method-level code smells.



Updated: 2024-04-09