APPT: Boosting Automated Patch Correctness Prediction via Fine-Tuning Pre-Trained Models
IEEE Transactions on Software Engineering (IF 7.4), Pub Date: 2024-01-17, DOI: 10.1109/tse.2024.3354969
Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, Zhenyu Chen

Automated program repair (APR) aims to fix software bugs automatically without human debugging efforts and plays a crucial role in software development and maintenance. Despite the recent significant progress in the number of fixed bugs, APR is still challenged by a long-standing overfitting problem (i.e., the generated patch is plausible but overfitting). Various techniques have thus been proposed to address the overfitting problem. Recently, researchers have employed BERT to extract code features, which are then used to train a classifier for patch correctness prediction, indicating the potential of such pre-trained models in reasoning about patch correctness. However, BERT is restricted to feature extraction for classifier training without benefiting from the training process, potentially generating sub-optimal vector representations for patched code snippets. In this paper, we propose APPT, a pre-trained model-based automated patch correctness assessment technique by both pre-training and fine-tuning. APPT adopts a pre-trained model as the encoder stack, followed by an LSTM stack and a deep learning classifier. More importantly, the pre-trained model is fine-tuned in conjunction with other components as a whole pipeline to fully adapt it specifically for reasoning about patch correctness. Although our idea is general and can be built on various existing pre-trained models, we have implemented APPT based on the BERT model. We conduct an extensive experiment on 1,183 Defects4J patches and the experimental results show that APPT achieves prediction accuracy of 79.7% and recall of 83.2%, outperforming the state-of-the-art technique CACHE by 4.3% and 6.7%. Our additional investigation on 49,694 real-world patches shows that APPT achieves the optimum performance (exceeding 99% in five common metrics for assessing patch classification techniques) compared with existing representation learning techniques. We further investigate the impact of each component and find that they all positively contribute to APPT, e.g., the fine-tuning process and the LSTM stack increase F1-score by 10.22% and 4.11%, respectively. We also prove that adopting advanced pre-trained models can further provide substantial advancement (e.g., GraphCodeBERT-based APPT improves BERT-based APPT by 2.8% and 3.3% in precision and AUC, respectively), highlighting the generalizability of APPT. Overall, our study highlights the promising future of fine-tuning pre-trained models to assess patch correctness and reduce the manual inspection effort of debugging experts when deploying APR tools in practice.
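
The abstract above outlines APPT's pipeline: a pre-trained encoder (BERT in the reference implementation), an LSTM stack, and a deep-learning classifier, all fine-tuned jointly rather than using the encoder as a frozen feature extractor. The sketch below illustrates that idea in PyTorch with the Hugging Face transformers library; it is not the authors' code, and the model name, layer sizes, and the way buggy and patched code are concatenated are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PatchCorrectnessModel(nn.Module):
    """Pre-trained encoder + LSTM stack + classifier, trained end to end."""
    def __init__(self, encoder_name="bert-base-uncased", lstm_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)     # pre-trained encoder stack
        hidden = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(hidden, lstm_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)  # LSTM stack
        self.classifier = nn.Sequential(                           # deep-learning classifier
            nn.Linear(2 * lstm_hidden, 128),
            nn.ReLU(),
            nn.Linear(128, 2))                                      # correct vs. overfitting

    def forward(self, input_ids, attention_mask):
        token_states = self.encoder(input_ids=input_ids,
                                    attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(token_states)
        return self.classifier(lstm_out[:, 0, :])   # state at the [CLS] position

# Because the encoder is not frozen, back-propagation updates it together with
# the LSTM and classifier -- the key difference from using BERT purely as a
# fixed feature extractor.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["buggy_code [SEP] patched_code"], return_tensors="pt",
                  truncation=True, padding=True)
model = PatchCorrectnessModel()
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))   # 1 = correct patch (assumed label)
loss.backward()

A real training loop would iterate over labeled plausible patches (e.g., the Defects4J set mentioned above) with an optimizer such as AdamW; the snippet is only meant to show how joint fine-tuning lets the pre-trained representations adapt to patch-correctness reasoning.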

Updated: 2024-01-17