Introduction
Since software projects play a major role in today’s industry, accurate estimation of software development cost is crucial. According to the Standish Group report (2009), just 32% of software projects were on time and on budget in 2009, 44% were challenged, and 24% had been canceled. Designing, developing, testing, and all other aspects of software projects are affected by the relevant estimations and predictions. Software testing is known as a major factor in increasing development cost. Faulty modules pose significant risk by decreasing customer satisfaction and by increasing testing and maintenance costs. Early detection of fault-prone software components could enable verification experts and testers to concentrate their time and resources on the problematic areas of the system under development. The area of software fault prediction still poses many challenges, and unfortunately, none of the techniques proposed within the last decade have achieved widespread applicability in the software industry. During the recent decade, many software fault prediction models have been proposed; however, selecting the best method among them seems impossible because the performance of each method depends on various factors such as the software measurement metrics used, the available information, the machine-learning techniques, and so on. Nevertheless, the main aim of all methods is to present accurate results.
Soft computing methods have recently become popular in all prediction areas. Soft computing is a field within computer science characterized by the use of inexact solutions. It differs from conventional (hard) computing in that it deals with imprecision, uncertainty, partial truth, and approximation to achieve practicability, robustness, and low solution cost (Zadeh, 1965). Components of soft computing include neural networks, support vector machines, fuzzy logic, evolutionary computation, and so on.
In the fault prediction process, previously reported fault data along with distinct software metrics are used to identify fault-prone modules. However, outliers and irrelevant data in the training set can lead to imprecise predictions. In fact, in many engineering problems we encounter vagueness in the information and uncertainty in the training sets, and these phenomena prevent a proposed solution from reaching the expected results. Our system models the vagueness of the input information through fuzzy clusters, and fault prediction is done based on majority ranking of the three fuzzy clusters most similar to the test data (a minimal illustrative sketch follows the research questions below). This system provides more accurate results compared to existing methods based on different classification techniques. Based on our proposed model, we construct three research questions, listed as follows:
- RQ1: Does fuzzy clustering with majority ranking perform better than two well-performing learning methods in fault prediction modeling, namely naïve Bayes and random forest?
- RQ2: Does fuzzy clustering with majority ranking perform better than these two learning methods, namely naïve Bayes and random forest, when two-stage outlier removal is applied to the data sets?
- RQ3: How does our proposed model perform when different data sets are used for training and testing?
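To make the idea above concrete, the following is a minimal sketch of fuzzy clustering with majority ranking, assuming a plain fuzzy c-means implementation, Euclidean similarity to cluster centers, and two illustrative metrics (lines of code and cyclomatic complexity). The function names, cluster count, and toy data are our own assumptions and do not reproduce the authors’ exact algorithm or data sets.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=4, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means; returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                     # memberships sum to 1 per module
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # membership-weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / dist ** (2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)         # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

def predict_fault_prone(x, centers, cluster_labels, k=3):
    """Rank clusters by similarity to the test module x and vote over the
    dominant fault labels of the k most similar clusters (majority ranking)."""
    order = np.argsort(np.linalg.norm(centers - x, axis=1))
    votes = cluster_labels[order[:k]]
    return int(votes.sum() * 2 > k)                       # 1 = predicted fault-prone

# Toy usage: two metrics per module (e.g. LOC and cyclomatic complexity).
X_train = np.array([[120, 10], [300, 25], [80, 5], [500, 40],
                    [90, 7], [350, 30], [60, 4], [420, 35]], dtype=float)
y_train = np.array([0, 1, 0, 1, 0, 1, 0, 1])              # 1 = module was reported faulty
centers, U = fuzzy_c_means(X_train, n_clusters=4)
# Label each cluster by the membership-weighted majority of its training labels.
cluster_labels = np.array([int((U[:, j] * y_train).sum() > (U[:, j] * (1 - y_train)).sum())
                           for j in range(centers.shape[0])])
print(predict_fault_prone(np.array([320.0, 28.0]), centers, cluster_labels))
```

The key design choice illustrated here is that the prediction for a test module is not taken from a single nearest cluster but from a vote over the three most similar clusters, which softens the effect of any one noisy or outlier-dominated cluster.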
The remainder of this paper continues with section 2, where a brief discussion of related work is presented. Fuzzy clustering is reviewed in section 3. Section 4 contains our proposed method. Experimental descriptions are presented in section 5. Experimental results and analysis are described in section 6, and finally, we summarize this paper in section 7.
Related Works
According to Catal (2011), software fault prediction has been one of the noteworthy research topics since 1990, and the field includes two recent and comprehensive systematic literature reviews (Catal & Diri, 2009b; Hall, Beecham, Bowes, Gray, & Counsell, 2012). The prediction techniques use approaches that originate from either statistics or machine learning. Some of these techniques are decision trees (Koprinska, Poon, Clark, & Chan, 2007), neural networks (Thwin & Quah, 2005), naïve Bayes (Menzies, Greenwald, & Frank, 2007), fuzzy logic (Yuan, Khoshgoftaar, Allen, & Ganesan, 2000), and the artificial immune recognition system algorithms in (Catal & Diri, 2007a, 2007b, 2009a). As the number of related works in this area is large, we present only some of them in this section.