Machine Learning for Accurate Software Development Cost Estimation in Economically and Technically Limited Environments

Cost estimation for software development is crucial for project planning and management. Several regression models have been developed to predict software development costs from historical datasets of previous projects. The accuracy of these estimates depends heavily on the relevance and quality of the cost estimation dataset and on its suitability to the software development environment. The currently available cost estimation datasets are limited to North American and European environments, leaving a gap in the representation of other economically and technically constrained software industries. In this article, the authors evaluate the performance of regression models using the SEERA dataset, which represents these constrained environments. This study provides insights into selecting regression models for cost estimation in software development. It highlights the importance of choosing models appropriate to the specific software development model and dataset used in the estimation process. Eight regression models (elastic net, lasso regression, linear regression, neural network, RANSACRegressor, random forest, ridge regression, and SVM) were evaluated for cost estimation under different software development models, and correlation coefficients and accuracy indicators were reported. The results showed that SVM and random forest achieved superior performance. However, the elastic net, lasso regression, linear regression, neural network, and RANSACRegressor models also performed well in cost estimation.


INTRODUCTION
Cost estimation is a critical aspect of software development (Rankovic, Rankovic, Ivanovic, & Lazic, 2021; Rankovic, Rankovic, Ivanovic, & Lazic, 2021; Mukherjee & Malu, 2014), as it helps to predict the resources required for a project and to keep the project within budget and on time (Pandey et al., 2020). However, estimating the cost and effort for different software development models can be challenging due to their unique characteristics and requirements (Boehm, 2017; Kumar et al., 2020).
Several essential features must be considered when estimating the cost and effort for different software development models. One of the most crucial factors is the size of the project, which refers to the number of software components or functions that need to be developed. The larger the project, the more effort and resources it will require, ultimately impacting the cost estimation (Saavedra Martínez et al., 2020; Mahmood et al., 2021). The project's complexity is another critical feature affecting cost and effort estimation. The complexity of the software model can vary based on various factors, such as the number of interrelated components, the number of decision points, and the level of customization required. Developing more complex software models will require more effort and resources, resulting in higher costs (Mahmood et al., 2021).
The development team's expertise is another vital factor when estimating the cost and effort for different software models. The level of experience, knowledge, and skills of the team will significantly impact the development time and cost. A team with more experience and knowledge can develop a project more efficiently, resulting in lower costs (Nassif et al., 2019).
The development process also plays a crucial role in cost and effort estimation. The development process can be iterative or sequential, and each approach has advantages and disadvantages. The sequential approach, also known as the Waterfall model, is more structured, which can help ensure that each development phase is completed before moving on to the next. In contrast, the iterative approach, exemplified by the Agile model, is more flexible and adaptable, allowing for changes throughout the development process.
Finally, the software development environment also affects cost and effort estimation. The environment can include hardware and software tools, such as integrated development environments, version control systems, and testing tools necessary to complete the project. The cost and availability of these tools and resources will impact the cost estimation for the project. In summary, several important features need to be considered when estimating the cost and effort for different software models: the project's size and complexity, the development team's expertise, the development process, and the software development environment. An accurate model for estimating cost and effort helps ensure that the project is completed within budget and on time, providing significant benefits to the development team and the organization.
We observed first-hand the challenges that local software development teams faced due to limited resources and infrastructure constraints. Accurately estimating costs was critical for project planning and management under these conditions. However, existing cost estimation techniques and datasets did not adequately account for the realities of working in such constrained environments. We were motivated to address this research gap and help improve cost estimation practices for software teams operating under similar limitations.
Machine learning plays an essential role in determining the critical features that affect the cost and effort estimation of different software models (Safari & Erfani, 2020; Holtkamp et al., 2015; Casado-Lumbreras et al., 2014). With the help of machine learning algorithms, large and complex datasets can be analyzed to identify patterns and relationships between cost and effort variables and the factors that affect them. By utilizing machine learning techniques, such as regression analysis, decision trees, and neural networks, it is possible to identify the most significant variables that impact software development cost and effort.
The ability of machine learning algorithms to identify essential features can lead to more accurate and reliable cost and effort estimates for different software development models. By considering a wide range of features, including technical, organizational, and cultural factors, machine learning algorithms can produce models that better reflect the complex nature of software development projects. Furthermore, machine learning techniques can continually update and refine cost and effort estimation models as new data becomes available, ensuring that the models remain accurate and current.
The key gaps we identified in previous research were the lack of focus on non-Western software development contexts and the inadequacy of existing cost estimation datasets in representing technically and economically constrained industries. Most available datasets represented North American and European environments, leaving significant room to improve the external validity and generalizability of cost estimation models for other global contexts.
Overall, using machine learning algorithms to determine the critical features that affect the cost and effort estimation of different software models is essential in providing accurate and reliable estimates. With the ever-increasing complexity of software development projects, identifying and considering a wide range of factors is crucial in ensuring that projects are delivered on time and within budget. Machine learning provides a powerful tool to help achieve these goals, enabling more accurate and reliable cost and effort estimation for software development projects of all types and sizes.
The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the methodology and data used in this study. Section 4 presents the results and discussion. Finally, Section 5 provides conclusions and suggestions for future work.

RELATED WORKS
Cost estimation for software development varies, depending on the development model and the application domain. Each application domain has its unique characteristics, requirements, and constraints, which may affect the cost of development (Ilyas et al., 2020; Akbar et al., 2019). The cost estimation model for an application domain must consider factors such as the system's complexity, size, functionality, and technologies. For instance, a software system for a financial application may require complex algorithms for data processing, security features, and integration with multiple databases. In contrast, a software system for a simple game may require less complex functionalities. The choice of the development model also affects the cost estimation. For instance, the Waterfall model may suit application domains requiring a well-defined and predictable process. In contrast, the Agile model may suit application domains requiring frequent changes and iterations. In conclusion, the cost estimation model for software development must consider both the application domain and the development model to produce accurate and reliable cost estimates (Ali & Gravino, 2021).
There are various approaches for cost estimation in software development, each with its strengths and weaknesses. One of the most common approaches is the algorithmic model, which uses mathematical formulas to estimate the cost of software development based on project size and complexity. This approach works well for well-defined projects, with general requirements and specifications. However, it may not be suitable for more complex or innovative projects without fully defined requirements. Another approach is the expert judgment model, which relies on the expertise of experienced professionals to estimate the cost of software development. This approach can be practical for complex and innovative projects, allowing for greater flexibility and adaptability in the estimation process. However, it is highly subjective and may be affected by biases or limitations in the knowledge and experience of the experts involved.
Machine learning algorithms have also been used for cost estimation in software development, with promising results (Kumar et al., 2021; Zhao & Zhang, 2020; Panda & Majhi, 2020; Promise Software Engineering Repository, n.d.). These algorithms can analyze large amounts of data and identify patterns and relationships that are not easily discernible through other approaches. They can also adapt to changing project conditions and provide more accurate and reliable estimates. Table 2 compares different machine learning algorithms used for cost estimation in software development. The most effective approach for cost estimation in software development will depend on the specific characteristics of the project and the available resources and expertise. A combination of different approaches, including algorithmic models, expert judgment, and machine learning, may be needed to achieve the most accurate and reliable estimates (Chhabra & Singh, 2020; Mohammed & Jamal, 2022).
WATERFALL MODEL
Description: A linear sequential approach where each phase must be completed before moving on to the next phase.
Advantages: Clear and structured development process.
Disadvantages: Not suitable for projects with changing requirements.

AGILE MODEL
Description: An iterative and incremental approach with a focus on adaptability and customer satisfaction.
Advantages: Flexibility to adapt to changing requirements. Collaboration between developers and customers.
Disadvantages: Requires active customer involvement and can be challenging to manage for large projects.

SPIRAL MODEL
Description: A risk-driven model that involves continuous feedback loops and risk analysis.
Advantages: Allows for better risk management and more accurate cost and schedule estimation.
Disadvantages: Complex and can be time-consuming.

V-MODEL
Description: A sequential approach with related testing activities for each development stage.
Advantages: Ensures that all requirements are met and that testing is integrated throughout development.
Disadvantages: Can be inflexible and may not allow for changes in requirements.

INCREMENTAL MODEL
Description: A model where software is developed in smaller increments or modules.
Advantages: Allows for faster delivery and testing of working software.
Disadvantages: Can be challenging to manage for larger projects and may require more resources.

Comparing mathematical formulas and machine learning regression for cost estimation in different software models is a complex task that requires a thorough analysis of the strengths and limitations of each approach (Jha & Jha, 2020; Emtinan, 2020). Here are some potential points of comparison:
Accuracy: Machine learning regression algorithms have the potential to offer more accurate cost estimation than mathematical formulas, because they can consider more variables and nonlinear relationships between variables. However, the accuracy of machine learning models depends on the quality and diversity of the training data, as well as the choice of algorithm and hyperparameters. Mathematical formulas can be accurate for simple software models, but may not be able to capture the complexity of more advanced models.
Interpretability: Mathematical formulas are often more interpretable than machine learning models. They provide explicit equations that can be used to calculate cost estimates from input variables. Machine learning models, on the other hand, are often seen as black boxes because it can be challenging to understand how they arrive at their predictions. However, some machine learning models, such as decision trees, can provide interpretable rules that explain their predictions.
Flexibility: Machine learning models can adapt to new data and software models without requiring manual adjustments. In contrast, mathematical formulas can become obsolete if the underlying software model changes or new variables become necessary for cost estimation.
Data requirements: Machine learning regression algorithms generally require more data than mathematical formulas to achieve reasonable accuracy. They need to learn patterns from the data, which requires a large and diverse dataset. Mathematical formulas, on the other hand, can often be derived from a smaller dataset or expert knowledge.
Development and maintenance: Developing and maintaining a machine learning regression model can be more time-consuming and resource-intensive than creating a mathematical formula. It requires expertise in machine learning algorithms, data preprocessing, feature selection, and hyperparameter tuning. In contrast, mathematical formulas can be developed and maintained by domain experts, without requiring specialized knowledge in machine learning.
Different mathematical formulas can be used for cost estimation in different software models. Here are a few examples:
COCOMO (Constructive Cost Model): This is a popular model for estimating software project cost, which uses a set of equations based on project size, complexity, and development environment. The COCOMO model has three versions: Basic, Intermediate, and Detailed (also called Advanced).
Function Points Analysis: This model uses the number of function points in a software system to estimate development effort and cost. Function points are a measure of the functionality provided by the system and can be calculated based on the number of inputs, outputs, inquiries, files, and interfaces in the system.
Putnam Model: This model uses a set of equations based on project size, development team experience, and development environment to estimate project cost and schedule. The model also considers the number of software defects likely to be found during development and testing.
PERT (Program Evaluation and Review Technique): This model uses a probabilistic approach to estimate project costs and schedules. PERT uses three different estimates for each activity in the project: optimistic, most likely, and pessimistic. These estimates are then used to calculate a weighted average for each activity, which is used to estimate the overall project cost and schedule.
Function Point Analysis Mark II: This extension of the original model includes additional factors, such as data communication, distributed data processing, and transaction rates.
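As a minimal sketch of how two of the formula-based estimators above work, the Python below implements the PERT three-point weighted average and the Basic COCOMO effort equation. The (O + 4M + P) / 6 weighting and the Basic COCOMO coefficients are the commonly published textbook values, not figures taken from this study, and the function names are illustrative.

```python
# Basic COCOMO coefficients (a, b) per project class, as commonly published
# for the original model; they are not taken from this study.
BASIC_COCOMO = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def pert_estimate(optimistic, most_likely, pessimistic):
    """Weighted average of the three PERT estimates for one activity."""
    return (optimistic + 4 * most_likely + pessimistic) / 6


def basic_cocomo_effort(kloc, project_class="organic"):
    """Effort in person-months, computed as a * KLOC ** b."""
    a, b = BASIC_COCOMO[project_class]
    return a * kloc ** b


if __name__ == "__main__":
    # Hypothetical activity: 2 days at best, 4 most likely, 12 at worst.
    print(pert_estimate(2, 4, 12))
    # Hypothetical 10 KLOC organic project.
    print(round(basic_cocomo_effort(10), 2))
```

Both functions need only a handful of inputs, which illustrates why such formulas are easy to interpret but cannot capture interactions among many project attributes.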
These formulas are often used with other techniques, such as expert judgment and historical data analysis, and may not always provide accurate estimates. Machine learning regression models can also supplement or replace these formulas in some cases, mainly when dealing with complex and diverse datasets. Table 3 compares the traditional and machine learning approaches.
The general framework for software cost estimation prediction, using machine learning regression and preprocessing, can be applied to SEERA, a software cost estimation dataset for constrained environments. SEERA is a publicly available dataset containing data on software development projects subject to time and cost constraints (Panda & Majhi, 2020). The dataset can be used to develop and evaluate machine learning models for predicting software development costs in constrained environments. The first step in preprocessing the SEERA dataset is to clean the data and remove any irrelevant or redundant data. Missing values are then filled, and any necessary feature engineering is applied. Feature engineering in the SEERA dataset includes using expert judgment, process metrics, and other software metrics to create new features that can provide valuable insights into the software development process.
The SEERA dataset (Al Asheeri & Hammad, 2019) can be segmented by software development model into Waterfall, Agile, and Hybrid subsets. In the Waterfall model, the development process is sequential, and each project phase must be completed before moving on to the next one. In the Agile model, the development process is iterative and incremental. The Hybrid model combines the Waterfall and Agile models. Each segment can be used to train machine learning models that predict the costs of software development projects following the corresponding model.
Segmenting the SEERA dataset based on software development models allows for more accurate predictions of software development costs for specific software development projects. Machine learning models can be trained on the segmented data to predict the costs of software development projects that follow a particular development model, helping organizations to make informed decisions regarding software development budgets and timelines.
After feature engineering, the data is normalized to ensure that features are on the same scale and do not bias the model's performance. Normalization helps ensure that features with large ranges do not overshadow features with smaller ranges.
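The text does not state which normalization was applied. As one common choice, and purely as an assumption, the sketch below uses min-max scaling to map every feature column to the range [0, 1]; the function name and the sample values are illustrative.

```python
def min_max_scale(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical projects: (team size, estimated size in function points).
projects = [[3, 120.0], [5, 480.0], [12, 2400.0]]
print(min_max_scale(projects))
```

Without such rescaling, the second column here would dominate any distance-based or gradient-based model simply because its raw values are two orders of magnitude larger.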
The final step in preprocessing the SEERA dataset, segmented by software development model, is splitting the data into training and testing sets. The training set is used to train the machine learning model, and the testing set is used to evaluate its performance. Fig. 1 shows the software cost estimation prediction framework, including the data preprocessing phase. The framework helps organizations make informed decisions regarding software development budgets and timelines in constrained environments.

FLEXIBILITY
Machine learning: More adaptable to new data and software models without manual adjustments.

DATA REQUIREMENTS
Mathematical formulas: Can often be derived from smaller datasets or expert knowledge.
Machine learning: Requires large and diverse datasets to learn patterns and achieve good accuracy.

DEVELOPMENT AND MAINTENANCE
Mathematical formulas: Can be developed and maintained by domain experts without specialized knowledge in machine learning.
Machine learning: More time-consuming and resource-intensive due to the expertise required in machine learning algorithms, data preprocessing, feature selection, and hyperparameter tuning.

Experimental Data
This paper uses the SEERA (Software enginEERing in SudAn) cost estimation dataset, a collection of 120 software development projects from 42 organizations in Sudan. Unlike current cost estimation datasets, the SEERA dataset contains 76 attributes and is augmented with metadata and raw data.
The SEERA data used in this study is thoroughly described in (Al Asheeri & Hammad, 2019). The paper discusses the data collection process, submitting organizations, and project characteristics.
The dataset was specifically collected to better represent software development projects facing the cost and time constraints common in developing country contexts, like Sudan. In evaluating its suitability, we considered characteristics such as the sample size, the number and relevance of project features collected, the representation of local industry factors, and the documentation of data collection procedures. Compared to other publicly available datasets, SEERA captures the unique challenges and trade-offs faced by software teams operating under constrained budgets and deadlines.
The authors in (Al Asheeri & Hammad, 2019) also provide a general analysis of the dataset projects, illustrating the impact of local factors on software project cost and comparing the data quality of the SEERA dataset to public datasets from the PROMISE repository.
The SEERA dataset significantly contributes to the diversity of cost estimation datasets, filling the gap in constrained environment representation. Researchers can use the SEERA dataset to develop new cost estimation techniques that are more suitable for these environments.
For the regression models, we chose algorithms commonly used in previous software cost estimation studies, including Linear Regression, Decision Trees, Support Vector Machines, Neural Networks, and ensemble methods. This allowed for comparing SEERA-based results to prior work, while exploring a variety of model types with different strengths.
Table 4 provides information about the methodologies used in a sample of 120 software development projects. The most common methodology was the Hybrid methodology, employed in 42 projects, representing 35% of the total. The Waterfall model was the second most used methodology, with 41 projects, making up 34% of the total. The Agile model was used in 27 projects, representing 23% of the total. The remaining methodologies, Prototyping, no methodology, and others, were used in 5%, 3%, and 1% of the projects, respectively. The table's final column, Total, indicates that 120 software development projects were included in the sample. The information in this table can be used to analyze the characteristics of different software development methodologies and assist in making informed decisions regarding software development budgets and timelines based on the results of the cost estimation predictions.
The type of application domain influences the choice of software development model. For example, the Waterfall model is often used in safety-critical domains, such as aerospace and medical devices, because it provides a structured and predictable approach to development, and the requirements are typically well-defined and unchanging. On the other hand, in domains such as e-commerce and web development, the Agile model is more commonly used because it is more flexible and allows for iterative development to quickly adapt to changing market demands. Similarly, in game development, the iterative and collaborative nature of the Agile model is preferred due to the constant need for feedback and adaptation to meet player expectations. The choice of development model also impacts the cost estimation process, with the Agile model typically requiring more frequent updates to the cost estimates due to the iterative and evolving nature of development. Therefore, the application domain and the associated software development model must be considered when developing cost estimation models to ensure they are appropriate for the specific development environment. Table 5 shows the project development type and application domain in the SEERA dataset.

CE-SWM-PRED: Approach
This section presents the CE-SWM-PRED, a machine learning-based approach for predicting cost estimation under different software development approaches in a constrained environment. The CE-SWM-PRED utilizes eight different machine learning algorithms, including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Random Forests (R.F.), and generalized linear models. The approach proceeds as follows:
1. Read the dataset: The experimental data described in the previous section is read into the system.
2. Segmentation: The SEERA dataset is segmented into three sub-datasets, each representing one of the software development approaches (Waterfall, Agile, Hybrid).
3. Preprocessing stage: The data is normalized to ensure that the values of the different variables are on the same scale.
4. Data split: The data is divided into a training set (70%) and a testing set (30%) to evaluate the performance of the models.
5. Model training: The eight machine learning algorithms are applied to the training data to build models.
6. Model testing: The trained models are tested using the testing data.
7. Evaluation metrics: Each model's performance is evaluated using metrics such as mean absolute error, root mean squared error, and R-squared.
8. Repeat the process: Steps 3 to 6 are repeated 30 times to ensure the results are robust and reliable.
Fig. 2 depicts the procedure of the proposed approach, CE-SWM-PRED. The CE-SWM-PRED is designed to provide an accurate prediction of the cost estimation based on the input features.
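The procedure above can be sketched roughly as follows. This is not the authors' implementation: it assumes scikit-learn, substitutes synthetic data for one SEERA segment, uses a single model (random forest) in place of all eight, and fits the scaler on the training split only, which is a design choice of this sketch. The helper name `repeated_holdout` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def repeated_holdout(X, y, make_model, n_repeats=30, test_size=0.3):
    """Average MAE, RMSE and R^2 over repeated 70/30 train/test splits."""
    scores = []
    for seed in range(n_repeats):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # Assumption: the scaler is fitted on the training split only,
        # so no information from the test split leaks into training.
        scaler = MinMaxScaler().fit(X_train)
        model = make_model().fit(scaler.transform(X_train), y_train)
        pred = model.predict(scaler.transform(X_test))
        scores.append((
            mean_absolute_error(y_test, pred),
            np.sqrt(mean_squared_error(y_test, pred)),
            r2_score(y_test, pred),
        ))
    mae, rmse, r2 = np.mean(scores, axis=0)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Synthetic stand-in for one segment (for example, the Waterfall projects).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(40, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, size=40)

result = repeated_holdout(
    X, y, lambda: RandomForestRegressor(n_estimators=50, random_state=0)
)
print(result)
```

Averaging over 30 different random splits reduces the dependence of the reported metrics on any single partition, which matters for segments that contain only a few dozen projects.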

Applied Regression Algorithms
SVM: Support Vector Machines (SVMs) are machine learning algorithms for classification and regression. They were developed by Vapnik in 1995 and are based on statistical learning theory. The main idea behind SVM classification is to find the linear classifier that separates the classes with the largest margin, that is, the widest gap between the two classes, as defined by the distances from the closest points in each class to the hyperplane. In cases where a single hyperplane cannot separate the two classes, SVM tries to find the hyperplane that balances the margin against the number of misclassifications. A positive constant controls the trade-off between the margin and the misclassification error. SVMs can also be used for nonlinear decision surfaces by mapping the original variables into a higher-dimensional feature space and then defining a linear classification problem in that space. This allows SVMs to handle complex relationships between variables and targets, providing a flexible and effective solution for many real-world problems.
Ridge Regression: Ridge Regression (R.R.) is a regularized linear regression method that addresses multicollinearity among predictors. It minimizes the residual sum of squares (RSS) between the observed and predicted response values while adding a penalty term to the least squares cost function, which shrinks the coefficients of the predictor variables toward zero and helps avoid overfitting. The objective function of Ridge Regression can be formulated as follows:
$$\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$$
where β is the vector of coefficients, X is the design matrix, y is the target vector, and α is the regularization parameter, a hyperparameter that controls the magnitude of the penalty term. The R.R. algorithm is a powerful tool for linear regression that allows us to account for multicollinearity and overfitting, while still obtaining meaningful predictions.
Random Forest: Random Forest (R.F.) is an ensemble machine learning algorithm widely used for regression and classification problems. It improves on decision trees, which are prone to overfitting, by aggregating the results of many individual trees to make a prediction. R.F. works by constructing multiple decision trees using bootstrapped samples of the data and a random subset of the features at each split. The trees are constructed independently, and their predictions are combined by taking the mean or mode, depending on the problem being solved. This process makes the prediction less susceptible to overfitting than that of a single decision tree. The two main parameters in R.F. are the number of trees in the forest and the number of variables considered at each split in the tree. The number of trees in the forest can be increased to reduce variance, but this can increase the computational time and memory requirements. The number of variables considered at each split can be increased to reduce bias, but this can also lead to overfitting the data. It is essential to tune these parameters through cross-validation and testing to find the optimal balance between variance and bias for a given data set.
RANSACRegressor: RANSAC, or the RANdom SAmple Consensus algorithm, is a robust regression method that fits a model to data containing outliers. The method works by randomly selecting a subset of the data and using it to fit the model. The model is then evaluated based on the number of points (the support size) with residual errors lower than a threshold value. The goal is to maximize the support size and minimize the outliers' effect on the model. The main parameters in the RANSACRegressor are max_trials and residual_threshold. The max_trials parameter determines the maximum number of iterations RANSAC will perform before it terminates and is used to control the computational cost of the algorithm. The residual_threshold parameter determines the maximum residual error that a data point can have and still be considered an inlier. Points with residual errors greater than the threshold are considered outliers and are not included in the model fit. The residual error is the difference between a given data point's predicted and actual values.
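To illustrate the two parameters just described, the following sketch fits scikit-learn's RANSACRegressor (with its default linear base model) to synthetic data. The data, the threshold of 1.0, and the outlier positions are assumptions chosen for the illustration and are unrelated to the SEERA experiments.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic data: 45 points on the line y = 2x + 1 plus 5 gross outliers.
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
outlier_idx = [3, 11, 24, 37, 48]
y[outlier_idx] += 50.0

ransac = RANSACRegressor(
    max_trials=100,          # upper bound on random sampling iterations
    residual_threshold=1.0,  # largest absolute residual counted as an inlier
    random_state=0,
)
ransac.fit(X, y)

# The final model is refitted on the inliers of the best consensus set.
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)
print(int(ransac.inlier_mask_.sum()), "inliers")
```

The fitted attribute `inlier_mask_` marks which samples fell within the residual threshold, so the shifted points can be inspected separately rather than silently distorting the fit.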
Artificial Neural Network: Artificial Neural Networks (ANNs) are a type of machine learning model inspired by the human brain's structure and function. In the context of regression, ANNs are used to model the relationship between a set of inputs and a continuous output variable. The basic building block of an ANN is the artificial neuron, which is a mathematical function that receives input from other neurons, processes it, and outputs a value. Multiple neurons are connected to form a network that can learn complex relationships between inputs and outputs. The training process of an ANN involves adjusting the weights of the connections between neurons so that the output of the network matches the target values as closely as possible, using an optimization algorithm such as gradient descent. The main parameters that can be adjusted in an ANN include the number of hidden layers, the number of neurons in each hidden layer, the activation function used by the neurons, the learning rate, and the optimization algorithm used to adjust the weights. The choice of these parameters can significantly impact the network's performance, so it is essential to experiment with different combinations to find the best values for a specific dataset.
Lasso Regression: Lasso Regression is a regularized form of linear regression. It relies on shrinkage, in which coefficient estimates are pulled toward zero. Lasso Regression is particularly well-suited to models that contain a significant amount of multicollinearity, as it can help to eliminate noise and select only the most relevant variables. To achieve this, Lasso Regression adds a constraint on the absolute size of the coefficients to the standard least squares (L.S.) technique. This constraint shrinks the L.S. coefficients of less significant variables to zero, or close to zero. As a result, Lasso acts as a variable selection technique, yielding a simpler and more interpretable model. However, it is essential to note that the estimates obtained through Lasso Regression are subject to some bias. Lasso Regression has been used in previous studies to analyze data with many explanatory variables, where its coefficients are better suited to picking out explanatory factors than those produced by traditional regression methods. When applied to data that contains grouped variables, the original form of Lasso Regression may not achieve the desired level of precision. In response to this limitation, Yuan and Lin developed a strategy called Group Lasso, which improves the precision of Lasso Regression when applied to grouped variables.
Elastic Net: Elastic Net Regression, or ELNET, is a statistical technique that combines the regularization methods of Ridge Regression and Lasso Regression. This combination allows ELNET to handle high multicollinearity between the predictor variables, which causes problems in traditional regression models because highly correlated predictors degrade the accuracy of the predictions. Zou and Hastie introduced ELNET in 2005, building on the Ridge Regression of Hoerl and Kennard (1970). Since then, it has been widely used to regularize models and select the most significant predictor variables, which simplifies the model and improves the accuracy of predictions. In addition, ELNET can overcome limitations of time series variables used in regression analysis: when the behavior of time series variables is nonstationary and nonlinear, or when there is a problem with multicollinearity, ELNET can provide a solution. In that setting, ELNET involves the decomposition of the original multivariate time-series predictors and the examination of their impact on the response variable, which helps to combat the correlation between the predictors. ELNET can eliminate or retain highly correlated predictor variables in the final model, which improves the accuracy of the predictions (Liu and Li, 2017).
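For reference, the combined penalty can be written out explicitly. The form below follows the parameterization used by scikit-learn's `ElasticNet`, which is an assumption about tooling rather than something stated in the text: $n$ is the number of projects, $w$ the coefficient vector, $\alpha$ the overall penalty strength, and $\rho$ the L1 ratio.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

With $\rho = 1$ the penalty reduces to the Lasso (L1) term, and with $\rho = 0$ it reduces to a pure Ridge (L2) term; intermediate values mix the two.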
The selection of hyperparameters and variables is crucial for improving the performance of the prediction models and ensuring the accuracy of the results. Further experimentation and evaluation may lead to better results and an improved understanding of the relationships between the variables and the software development cost. In the SVM model, the radial basis function (RBF) was used as the kernel function, with a cost parameter of 1,000,000 and an epsilon of 0.001. The Random Forest model used 1,000 decision trees. In the Neural Network model, two hidden layers of 100 nodes each were used, and the maximum number of iterations was set to 1,000. The Elastic Net model used an alpha value of 0.1, an L1 ratio of 0.9, and random coefficient selection.
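The reported settings can be collected in one place as follows. This is a sketch that assumes the models were built with scikit-learn (the name RANSACRegressor suggests this, but the text does not confirm it); the mapping of "cost" to `C` and of "random selection" to `selection="random"` is likewise an assumption, and any parameter not mentioned in the text is left at the library default.

```python
# Hyperparameters as reported in the text, keyed by the assumed
# scikit-learn estimator class name.
HYPERPARAMS = {
    "SVR": {"kernel": "rbf", "C": 1_000_000, "epsilon": 0.001},
    "RandomForestRegressor": {"n_estimators": 1000},
    "MLPRegressor": {"hidden_layer_sizes": (100, 100), "max_iter": 1000},
    "ElasticNet": {"alpha": 0.1, "l1_ratio": 0.9, "selection": "random"},
}


def build_models():
    """Instantiate the four estimators (requires scikit-learn to be installed)."""
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import ElasticNet
    from sklearn.neural_network import MLPRegressor
    from sklearn.svm import SVR

    return {
        "SVM": SVR(**HYPERPARAMS["SVR"]),
        "R.F.": RandomForestRegressor(**HYPERPARAMS["RandomForestRegressor"]),
        "ANN": MLPRegressor(**HYPERPARAMS["MLPRegressor"]),
        "EN": ElasticNet(**HYPERPARAMS["ElasticNet"]),
    }
```

Calling `build_models()` returns unfitted estimators; each would then be trained with `fit(X, y)` on the relevant sub-dataset.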

Evaluation Metrics
We used several widely used regression metrics to evaluate the performance of these algorithms (Brown, 2018). CE-SWM-PRED uses MAE, MSE, RMSE, and R2 to evaluate the quality of the created models and compare their performance. MAE (Mean Absolute Error) measures the average magnitude of the errors in a set of predictions, without considering their direction. It is calculated as the average of the absolute differences between prediction and actual observation over the test sample. MSE (Mean Squared Error) is the average of the squared differences between the predicted and actual values; it gives more weight to large errors and is therefore more sensitive to outliers. RMSE (Root Mean Squared Error), the square root of the MSE, measures the absolute fit of the model to the data in the units of the target and is interpreted as the standard deviation of the unexplained variance. Finally, R2 (Coefficient of Determination) is the proportion of the variation in the response variable that is explained by the model's predictor variables. It typically ranges between 0 and 1, with a higher value indicating a better fit, although it can be negative on held-out data when a model predicts worse than the mean of the observations. These metrics provide valuable information about the performance of the machine learning algorithms and help determine which algorithm best suits a given problem (Kaur, 2020).
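The four metrics can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function names and the sample cost values are hypothetical and are not taken from the SEERA dataset.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """One minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted project costs.
actual = [100.0, 200.0, 300.0]
close_predictions = [110.0, 190.0, 330.0]
reversed_predictions = [300.0, 200.0, 100.0]

print(mae(actual, close_predictions), rmse(actual, close_predictions))
print(r2(actual, close_predictions), r2(actual, reversed_predictions))
```

In this example the single 30-unit error dominates the MSE, which shows its sensitivity to large errors, and the reversed predictions give a negative R2 because they fit worse than simply predicting the mean.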
MSE (Mean Squared Error)-The mean squared error is a standard measure of how well a model fits a dataset. It is the average of the squared differences between the predicted and actual values, that is, the average squared deviation of the predictions from the actual values.
From Table 6, which presents the models based on the dataset of the Agile software development approach, it can be observed that the Random Forest (R.F.) algorithm had the highest maximum MAE and MSE values. In contrast, the Artificial Neural Network (ANN) had the lowest minimum MAE and MSE values. Similarly, the R.F. algorithm had the highest maximum RMSE value, while the Support Vector Machine (SVM) had the lowest minimum RMSE value. The average MAE, MSE, and RMSE values were the lowest for the ANN algorithm.
Regarding the coefficient of determination (R2), the ANN and SVM algorithms had the highest maximum R2 values. In contrast, the Ridge Regression (R.R.) and RANSACRegressor (RAN) algorithms had the lowest minimum R2 values. Based on the evaluation results presented in Table 6 and shown in Figure 3, the ANN and SVM algorithms are the most promising learning algorithms for Agile software development in terms of their overall performance across the different measures.
Table 7 presents the results of the learning algorithms on the Waterfall dataset for each performance metric. For readability, the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R2 score are also shown in Figure 4. The following observations can be made. The MAE metric measures the average absolute difference between the predicted and actual values. The lowest MAE value is achieved by the SVM learning algorithm (75.64), followed by R.F. (282.72), LSR (2075.76), ANN (3689), L.R. (3977.3), RAN (4030), R.R. (4397.04), and EN (6754.9). Therefore, SVM is the best algorithm for minimizing MAE, while EN is the worst. The MSE metric measures the average squared difference between the predicted and actual values. The lowest MSE value is achieved by the SVM learning algorithm (103.54), followed by R.F. (2574.86), LSR (118244.54), ANN (316777.84), L.R. (382251.08), RAN (382298.2), R.R. (433495.94), and EN (932171.86). Therefore, SVM is the best algorithm for minimizing MSE, while EN is the worst. The RMSE metric measures the square root of the average squared difference between the predicted and actual values. The lowest RMSE value is achieved by the SVM learning algorithm (10.…). Therefore, SVM is the best algorithm for minimizing RMSE, while EN is the worst. The R2 score measures the proportion of the variance in the target variable that is predictable from the independent variables. The highest R2 score is achieved by SVM (0.91) and R.F. (0.91), followed by ANN (0.84), LSR (0.82), RAN (0.82), L.R. (0.81), R.R. (0.62), and EN (0.5). Therefore, SVM and R.F. are the best algorithms for maximizing the R2 score, while EN is the worst. The SVM learning algorithm appears to be the best-performing overall, as it achieves the lowest values for MAE, MSE, and RMSE and the highest R2 score.
Table 8 presents the evaluation results of the learning algorithms using the Hybrid software development approach dataset. The SVM algorithm has the lowest MAE, indicating that it performs better at predicting the target variable.
In contrast, the ANN algorithm has the highest MAE, implying the largest absolute difference between the predicted and actual values. The R.F. algorithm has the highest MSE, indicating that it performs the worst in predicting the target variable. The EN algorithm has the lowest MSE, implying the smallest squared difference between the predicted and actual values. The SVM algorithm has the lowest RMSE, indicating that it performs better at predicting the target variable. The ANN algorithm has the highest RMSE, implying the largest difference between the predicted and actual values. The EN algorithm has the highest R2, indicating that it explains the most variance in the target variable. The R.R. and LSR algorithms have the lowest R2, implying that they explain the least variance in the target variable. Moreover, the results in Figure 5 suggest that the SVM and EN algorithms best predict the target variable, while the R.F. algorithm performs the worst. However, the choice of algorithm may also depend on other factors, such as the computational complexity and interpretability of the model.
Based on the evaluation results presented in Tables 6-8 and Figures 3, 4, and 5, it can be concluded that the performance of the learning algorithms varies depending on the dataset used and the performance metric evaluated. The SVM and ANN algorithms appear to be the most promising for Agile software development, while the SVM algorithm best predicts the target variable across all three software development approaches. It is important to note that the choice of an algorithm may also depend on other factors such as interpretability, computational complexity, and the specific problem being addressed.
Looking at the values in Table 9 for the MAE of cost estimation for the three software development approaches, we can see that SVM has the lowest MAE values across all three software development models. This suggests that SVM performs best in terms of accuracy for cost estimation.
In terms of the different software development models, the Waterfall model has the lowest average MAE value (2795.58), compared to the Hybrid (3066.12) and Agile (3291.57) models. The results suggest that the cost estimation performance for the Waterfall model is better than for the other two models. It is important to note that the MAE values for all the models are relatively high, indicating room for improvement in the cost estimation process, regardless of the software development model used.
Looking at the RMSE values for the three software development models, we can see that SVM has the lowest RMSE value in all three cases, which indicates that SVM is the most accurate model. It is worth noting that the RMSE values are relatively high in absolute terms, indicating that the model's predictions may still have a significant margin of error. Nonetheless, the relative differences between the models suggest that SVM is the most accurate of the models presented here.
The overall results for all evaluation metrics are shown in Figure 6. When selecting a model for software development cost estimation, it is essential to consider the R-squared and RMSE values as well as other factors, such as model interpretability, data availability, and computational efficiency. Moreover, we can see that the SVM model has the lowest MSE for all three software development models, followed closely by the Random Forest (R.F.) model. These models can estimate the cost of software development projects more accurately than the other models. On the other hand, the Ridge Regression (R.R.) and Elastic Net (EN) models have the highest MSE values for all three software development models and are therefore less accurate in estimating the cost of software development projects. The R2 value is a statistical measure that indicates the proportion of the variation in the dependent variable (i.e., cost) explained by the independent variables (i.e., features). In this case, we can see that the R2 values for the three software development models (Hybrid, Waterfall, and Agile) are relatively similar across all eight machine learning algorithms. The highest R2 value is observed for the Support Vector Machine (SVM) and Random Forest (R.F.) algorithms for both the Waterfall and Agile models. In contrast, the highest R2 value for the Hybrid model is observed for the SVM algorithm. Overall, the R2 values indicate that the machine learning algorithms can explain a significant portion of the variation in software model cost estimates, with R2 values ranging from 0.39 to 0.91. However, the R2 values are not extremely high, indicating that there may still be some unexplained variation in the cost estimates. Additionally, the R2 values alone do not provide information about the accuracy of the cost estimates. Other performance metrics, such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), should also be considered when evaluating the performance of the machine learning algorithms.
Selecting the appropriate software development model for cost estimation in a constrained environment requires careful consideration of project scope, requirements, and available resources. Each model has its benefits and drawbacks, and choosing a suitable model can significantly impact the accuracy and reliability of cost estimation. A Hybrid model may be a good option, as it balances flexibility and structure while considering the constraints of the project. Choosing a suitable software development model can be challenging in a constrained environment with limited resources. Among the different software development models, Waterfall, Agile, and Hybrid are popular options for cost estimation. The Waterfall model is a linear sequential approach that follows a strict order of development phases, which can be helpful in small projects with a well-defined scope and requirements. However, in constrained environments it may not be practical, as it requires extensive planning and documentation, which can be time-consuming and costly.
Conversely, the Agile model is an iterative approach focusing on flexibility, adaptability, and collaboration between the development team and the client. It can be helpful in constrained environments, as it allows changes to the project scope and requirements based on client feedback, which can help save time and resources. However, estimating costs accurately in an Agile model can also be challenging, as the scope and requirements change frequently. The Hybrid model combines Waterfall and Agile elements, which can be beneficial in constrained environments. It allows for flexibility and adaptability while maintaining some of the structure and planning of the Waterfall model. It can also be helpful in projects with evolving requirements and an uncertain scope. However, it can be challenging to estimate costs accurately in a Hybrid model, as it requires balancing the benefits of both Waterfall and Agile models while accounting for their drawbacks.
Cost estimation becomes critical to project planning and management in constrained environments. It involves predicting the resources needed for a project, including time, money, and personnel, and accurate estimates are needed to ensure that a project can be completed within the allocated budget and timeline. In practical terms, several factors should be considered when estimating cost in constrained environments. One of the most important is the accuracy of the estimates. Estimation techniques such as parametric estimation, analogous estimation, and expert judgment can be used to develop accurate cost estimates. It is also essential to consider the level of uncertainty in the estimates and to use contingency planning to account for unexpected events that may impact project costs. Other factors to consider are cost, scope, and schedule trade-offs. In constrained environments, it is often necessary to balance the resources available with the project's scope and the completion timeline, which requires making difficult decisions about project priorities, such as reducing the project scope or extending the timeline for completion. Finally, effective communication and collaboration among team members are essential for cost estimation in constrained environments. Project stakeholders, including team members, sponsors, and clients, must be involved in the estimation process and kept informed of any changes or updates to the estimates, so that everyone shares the same expectations and there are no surprises later in the project lifecycle. In summary, cost estimation in constrained environments requires a careful balance between accuracy, scope, and timeline. By using effective estimation techniques, contingency planning, and communication and collaboration among team members, project managers can develop and manage cost estimates that help ensure the success of their projects.
Cost estimation using different software development models requires practical considerations in constrained environments, that is, environments with limited resources such as time, budget, or personnel. One practical consideration is carefully evaluating the chosen model's ability to adapt to changes in project scope or requirements. Agile and Hybrid development models are better suited to adapting to changes because they are iterative and focus on continuous improvement, allowing for more flexibility in cost estimation as the project progresses. Another consideration is to ensure that the estimated costs align with the project's objectives and desired outcomes. In constrained environments, allocating resources efficiently and ensuring that costs do not exceed the available budget is crucial. Waterfall development models are typically more rigid and may not allow cost adjustments during the project's lifecycle. Hence, it is essential to evaluate the project's requirements carefully before choosing the model for cost estimation. Finally, in constrained environments it is vital to monitor the project's progress and adjust the cost estimate accordingly, which calls for effective project management practices and tools that allow real-time tracking of project costs, risks, and timelines. Regularly updating and reviewing the cost estimate can help ensure that the project remains on track and within budget. Evaluating the chosen model's ability to adapt to changes, ensuring that the estimated costs align with the project's objectives, and regularly monitoring the project's progress are all critical factors to consider. With proper planning and project management, organizations can develop accurate cost estimates that align with their constraints and deliver successful projects.
The benefits of cost estimation using machine learning models for industries in developing countries with constrained environments are significant. Firstly, machine learning models can handle large volumes of data, enabling accurate cost estimation, which is instrumental for developing countries with limited resources. Data collection can be challenging due to inadequate infrastructure, but machine learning models can learn from the available data and use this knowledge to make predictions, making it easier to estimate costs and make informed decisions. Secondly, machine learning models can help industries in developing countries optimize their resources and minimize costs, which is significant in constrained environments where resources are scarce and every resource is critical. Machine learning models can analyze data and identify cost-saving opportunities, helping industries make better decisions on where to allocate their resources, optimize their operations, reduce waste, and improve overall efficiency. Lastly, cost estimation using machine learning models can help industries in developing countries increase their competitiveness. By accurately estimating costs and optimizing their resources, industries can offer their products and services at competitive prices, making them more attractive to customers, which helps industries grow, creates jobs, and contributes to the country's economic development. In summary, by leveraging machine learning models, industries in developing countries with constrained environments can accurately estimate costs, optimize their resources, and increase competitiveness, ultimately contributing to the country's economic development.

CONCLUSION
This paper aimed to evaluate the performance of various regression models for software development cost estimation in constrained environments. The study utilized the SEERA dataset, which represents economically and technically constrained software industries. The eight regression models evaluated were Elastic Net, Lasso Regression, Linear Regression, Neural Network, RANSACRegressor, Random Forest, Ridge Regression, and SVM. The performance of these models was assessed using correlation coefficients and accuracy indicators. The results showed that the SVM and Random Forest models best estimated software development costs in constrained environments. However, the Elastic Net, Lasso Regression, Linear Regression, Neural Network, and RANSACRegressor models also performed well. These findings provide insights into selecting appropriate regression models based on the specific software model and dataset used in the estimation process. The study highlights the importance of using relevant, high-quality datasets for accurate cost estimation in software development. The availability of datasets representing constrained environments is limited, and this paper fills a gap in the literature by using the SEERA dataset. The study also emphasizes the challenges that software development industries face in constrained environments, such as limited resources, infrastructure, and expertise.
Overall, the study provides valuable insights into software development cost estimation in constrained environments. It can help project managers and developers select appropriate regression models for cost estimation in these settings. Further research can expand on this work by incorporating additional datasets and exploring the use of other machine learning techniques for cost estimation in constrained environments.

Figure 1. General framework for SW-cost estimation prediction

Figure 2. CE-SWM-PRED ALGO
MAE (Mean Absolute Error)-The mean absolute error measures how well a model fits a dataset. It is the average of the absolute differences between the predicted and actual values, that is, the average magnitude of the errors.
RMSE (Root Mean Squared Error)-The root mean squared error measures how well a model fits a dataset. It is the square root of the average of the squared differences between the predicted and actual values, and it expresses the typical size of the errors in the units of the target variable.
R2 (R-squared)-R-squared measures how well a model fits a dataset. It is the proportion of the variance in the dependent variable that the model explains, and it indicates how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
Table 6. Evaluation results of various learning algorithms (ANN, SVM, R.F., L.R., EN, R.R., LSR, RAN) for the Agile software development approach based on different measures (MAE, MSE, RMSE, R2)
In Table 6, the R.R. and RAN algorithms had the lowest minimum R2 values, and the average R2 values were highest for the ANN and SVM algorithms. The standard deviation (Stdev) values indicate the variability of the evaluation results for each algorithm. The R.F. and EN algorithms had the highest Stdev values for MAE, MSE, and RMSE, indicating greater variability in their performance. On the other hand, the SVM algorithm had the lowest Stdev values for MAE, MSE, and RMSE, indicating a more consistent performance.

Figure 3. Summary of fitting indicators for the eight models used: Mean absolute error (MAE), mean square error (MSE), and the square root of the mean of the square of all of the errors (RMSE)

Figure 4. Summary of fitting indicators for the eight models used: Mean absolute error (MAE), mean square error (MSE), and the square root of the mean of the square of all of the errors (RMSE)

Figure 5. Summary of fitting indicators for the eight models used: Mean absolute error (MAE), mean square error (MSE), and the square root of the mean of the square of all of the errors (RMSE)

Figure 6. Summary of fitting indicators for the eight models used: Mean absolute error (MAE), mean square error (MSE), and the square root of the mean of the square of all of the errors (RMSE)

Table 2. Comparison between different machine learning algorithms used for cost estimation in software development
Software Development Model | Dataset Name | Machine Learning Model | Performance Measures
Note: RMSE stands for Root Mean Square Error, and R2 stands for Coefficient of Determination. These performance measures are commonly used in evaluating the accuracy of machine learning models for cost estimation.

Table 7. Evaluation results of learning algorithms on the Waterfall sub-dataset (ANN, SVM, R.F., L.R., EN, R.R., LSR, RAN)
In terms of RMSE, SVM is the most accurate model for cost estimation among the four models. The other models have similar RMSE values, with R.F. and L.R. having slightly lower values than the others.