Assessing Effectiveness of Test Suites: What Do We Know and What Should We Do?, ACM Transactions on Software Engineering and Methodology

ACM Transactions on Software Engineering and Methodology (IF 4.4), Pub Date: 2024-04-17, DOI: 10.1145/3635713
Peng Zhang₁, Yang Wang₂, Xutong Liu₂, Zeyu Lu₂, Yibiao Yang₂, Yanhui Li₂, Lin Chen₂, Ziyuan Wang₃, Chang-ai Sun₄, Xiao Yu₅, Yuming Zhou₂

Background. Software testing is a critical activity for ensuring the quality and reliability of software systems. To evaluate the effectiveness of different test suites, researchers have developed a variety of metrics. Problem. However, comparing these metrics is challenging because there is no standardized evaluation framework that covers a comprehensive set of factors. As a result, researchers often focus on a single factor (e.g., size), which leads to different or even contradictory conclusions. After comparing dozens of studies in detail, we found two main problems that most trouble our community: (1) researchers tend to oversimplify the description of the ground truth they use, and (2) data involving real defects is not suitable for analysis with traditional statistical indicators. Objective. We aim to scrutinize the whole process of comparing test suites for our community. Method. To this end, we propose a framework, ASSENT (evAluating teSt Suite EffectiveNess meTrics), to guide follow-up research on evaluating a test suite effectiveness metric. ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. It works as follows. First, users clarify the ground truth for determining the real order in effectiveness among test suites. Second, users generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, users use the metric to derive the order in effectiveness for the same test suites. Finally, users calculate the agreement indicator between the two orders (the ground truth order and the metric-derived order). Result. With ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score and code coverage metrics. Our results show that, based on real faults, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, when mutants are used instead of real faults, test effectiveness is overestimated by more than 20% in value. Conclusion. We recommend that the standardized evaluation framework ASSENT be used for evaluating and comparing test effectiveness metrics in future work.
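The four-step process described in the abstract can be illustrated with a small sketch. The abstract does not say which agreement indicator ASSENT uses, so the pairwise agreement rate below is an assumption chosen only for illustration, and the suite names, fault counts, and metric values are hypothetical rather than data from the paper.

```python
from itertools import combinations


def pairwise_agreement(ground_truth, metric):
    """Fraction of strictly ordered ground-truth pairs that the metric
    orders the same way. This is an illustrative stand-in, not
    necessarily the indicator defined by ASSENT."""
    comparable = 0
    concordant = 0
    for a, b in combinations(sorted(ground_truth), 2):
        gt_diff = ground_truth[a] - ground_truth[b]
        if gt_diff == 0:
            continue  # the ground truth does not rank this pair
        comparable += 1
        m_diff = metric[a] - metric[b]
        if gt_diff * m_diff > 0:
            concordant += 1  # metric and ground truth agree on the order
    if comparable == 0:
        raise ValueError("ground truth does not order any pair of suites")
    return concordant / comparable


# Steps 1-2 (hypothetical): ground truth = number of real faults each
# benchmark test suite detects.
ground_truth = {"T1": 3, "T2": 5, "T3": 5, "T4": 1}

# Step 3 (hypothetical): values assigned to the same suites by two metrics.
mutation_score = {"T1": 0.60, "T2": 0.80, "T3": 0.75, "T4": 0.40}
statement_coverage = {"T1": 0.90, "T2": 0.70, "T3": 0.95, "T4": 0.50}

# Step 4: agreement between the ground-truth order and each metric's order.
print(pairwise_agreement(ground_truth, mutation_score))      # 1.0
print(pairwise_agreement(ground_truth, statement_coverage))  # 0.8
```

In this toy data the ground truth ties T2 and T3, so that pair is skipped; a metric is only scored on pairs the ground truth actually orders. A higher rate means the metric's ranking is closer to the ground-truth ranking.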

Updated: 2024-04-17
