A random forest model for early-stage software effort estimation for the SEERA dataset, Information and Software Technology


A random forest model for early-stage software effort estimation for the SEERA dataset
Information and Software Technology (IF 3.9), Pub Date: 2024-02-03, DOI: 10.1016/j.infsof.2024.107413
Emtinan I. Mustafa, Rasha Osman

Publicly available software cost estimation datasets are outdated and may not represent current industrial environments. Thus, most research has concentrated on the development and evaluation of estimation models, with limited evidence of their applicability to industrial practice. Moreover, these datasets and models may not be applicable in (under-represented) technically and economically constrained environments such as the software development environment in Sudan. This paper aims to develop a machine learning model that is suitable for the Sudanese software industry. To demonstrate the suitability of our approach, we evaluate our model using the publicly available SEERA (Software enginEERing in SudAn) dataset, which is a software cost estimation dataset from organizations in Sudan. We demonstrated the suitability of the SEERA dataset for effort estimation by comparing the attributes that had a high correlation with effort and duration to the cost factors identified by (Sudanese) experts. In addition, we developed an early-stage Random Forest model to estimate project effort and duration from the SEERA dataset. Early-stage estimation is in line with current Sudanese industrial practice. We investigated the impact of oversampling, feature selection, heterogeneity and local environmental factors on model accuracy. Our experimental results showed that the Random Forest model with oversampling and feature selection provided accurate estimates that were better than random guessing (standardized accuracy > 70%). Our results were similar to accuracies reported in the literature. In addition, we demonstrated that our Random Forest model provided estimates that were more accurate than (Sudanese) expert judgement. This study has demonstrated the feasibility of our Random Forest model for early-stage effort and duration estimation for Sudanese software projects. The results demonstrate the importance of representative models and datasets for non-traditional technical environments. Further research is required to investigate the impact of local environmental factors on software cost estimation.
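The abstract reports accuracy as "standardized accuracy > 70%" relative to random guessing but does not define the metric. As a minimal sketch, assuming the definition commonly used in the effort estimation literature (the model's mean absolute error compared with the expected mean absolute error of guessing another project's actual value at random), the calculation might look as follows. The function names and the effort values are hypothetical and are not taken from the paper or from the SEERA dataset.

```python
from itertools import permutations


def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def random_guess_mae(actual):
    """Expected MAE of the random-guessing baseline.

    Assumption: each project's value is 'predicted' by picking the actual
    value of a different project uniformly at random, so the expectation
    is the mean absolute difference over all ordered pairs of projects.
    """
    pairs = list(permutations(actual, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def standardized_accuracy(actual, predicted):
    """Percentage improvement of the model over random guessing."""
    return (1 - mae(actual, predicted) / random_guess_mae(actual)) * 100


# Hypothetical effort values in person-hours (illustrative only).
actual = [100.0, 200.0, 400.0, 800.0]
predicted = [120.0, 180.0, 450.0, 700.0]

sa = standardized_accuracy(actual, predicted)
print(f"Standardized accuracy: {sa:.1f}%")
```

Under this definition, a value of 0% means the model is no better than random guessing, and a negative value means it is worse, so a result above 70% indicates a large margin over the baseline.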


Updated: 2024-02-03
