Federated Learning for Software Engineering: A Case Study of Code Clone Detection and Defect Prediction
IEEE Transactions on Software Engineering (IF 7.4), Pub Date: 2024-01-03, DOI: 10.1109/tse.2023.3347898
Yanming Yang, Xing Hu, Zhipeng Gao, Jinfu Chen, Chao Ni, Xin Xia, David Lo

In various research domains, artificial intelligence (AI) has gained significant prominence, leading to the development of numerous learning-based models in research laboratories that are evaluated on benchmark datasets. While models proposed in previous studies may perform satisfactorily on benchmark datasets, translating academic findings into practical applications for industry practitioners presents challenges. Practitioners must either adopt trained academic models directly into industrial applications, which leads to a performance decrease, or retrain the models with industrial data, a task often hindered by insufficient data instances or skewed data distributions. Real-world industrial data is typically far more intricate than benchmark datasets, frequently exhibiting data-skewing issues such as label distribution skews and quantity skews. Furthermore, accessing industrial data, particularly source code, can prove challenging for Software Engineering (SE) researchers due to privacy policies. This limitation hinders SE researchers' ability to gain insight into industry developers' concerns and subsequently enhance their proposed models. To bridge the divide between academic models and industrial applications, we introduce a federated learning (FL)-based framework called Almity. Our aim is to simplify the process of putting research findings into practical use for both SE researchers and industry developers. Almity enhances model performance on sensitive, skewed data distributions while ensuring data privacy and security. It introduces an innovative aggregation strategy that takes into account three key attributes: data scale, data balance, and minority class learnability. This strategy is used to refine model parameters, thereby improving model performance on sensitive skewed datasets. In our evaluation, we employ two well-established SE tasks, code clone detection and defect prediction. We compare the performance of Almity on both machine learning (ML) and deep learning (DL) models against two mainstream training methods, the Centralized Training Method (CTM) and Vanilla Federated Learning (VFL), to validate Almity's effectiveness and generalizability. Our experimental results demonstrate that our framework is not only feasible but also practical in real-world scenarios. Almity consistently enhances the performance of learning-based models, outperforming the baseline training methods across all types of data distributions.
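The abstract describes Almity's aggregation strategy only at a high level and does not give its weighting formula. The Python sketch below is therefore a hypothetical illustration, not the paper's method: it shows one way an attribute-weighted, FedAvg-style aggregation could blend the three named attributes (data scale, data balance, minority class learnability) into per-client weights. All names here (ClientReport, aggregate, minority_recall) and the equal mixing coefficients are assumptions made for illustration only.

```python
# Illustrative sketch only: the abstract does not specify Almity's actual
# weighting formula, so the attribute scores and their combination below
# are assumptions, not the published method.
from dataclasses import dataclass
from typing import Dict, List
import math


@dataclass
class ClientReport:
    params: Dict[str, List[float]]   # local model parameters after training
    n_samples: int                   # data scale
    label_counts: Dict[int, int]     # per-label counts, for balance statistics
    minority_recall: float           # proxy for minority-class learnability (0..1)


def _balance_score(label_counts: Dict[int, int]) -> float:
    """Normalized label-distribution entropy: 1.0 means perfectly balanced."""
    total = sum(label_counts.values())
    probs = [c / total for c in label_counts.values() if c > 0]
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


def aggregate(reports: List[ClientReport]) -> Dict[str, List[float]]:
    """Weighted parameter averaging; weights blend scale, balance, learnability."""
    total_samples = sum(r.n_samples for r in reports)
    weights = []
    for r in reports:
        scale = r.n_samples / total_samples
        balance = _balance_score(r.label_counts)
        # Equal mixing of the three attributes is an assumption for illustration.
        weights.append((scale + balance + r.minority_recall) / 3.0)
    norm = sum(weights)
    weights = [w / norm for w in weights]

    # Weighted average of each parameter tensor (stored here as flat lists).
    agg = {k: [0.0] * len(v) for k, v in reports[0].params.items()}
    for w, r in zip(weights, reports):
        for name, values in r.params.items():
            for i, x in enumerate(values):
                agg[name][i] += w * x
    return agg
```

Dropping the balance and learnability terms would reduce this to a plain sample-count-weighted average as in vanilla FedAvg; the sketch's only purpose is to show where skew-aware attributes could enter the aggregation step.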

Updated: 2024-01-03