Counting the Hidden Defects in Software Documents

Frank Padberg (Saarland University, Germany)
DOI: 10.4018/978-1-60566-766-9.ch025

The author uses neural networks to estimate how many defects are hidden in a software document. The inputs to the models are metrics collected while performing a standard quality assurance technique on the document, a software inspection. For inspections, the empirical data sets typically are small. The author identifies two key ingredients for a successful application of neural networks to small data sets: adapting the size, complexity, and input dimension of the networks to the amount of information available for training; and using Bayesian techniques instead of cross-validation for determining model parameters and selecting the final model. For inspections, the machine learning approach is highly successful and outperforms the previously existing defect estimation methods in software engineering by a factor of 4 in accuracy on the standard benchmark. The author’s approach also applies well in other contexts that are subject to small training data sets.
Chapter Preview


This chapter describes a novel application of machine learning to an important estimation problem in software engineering – estimating the number of hidden defects in software artifacts. The number of defects is a software metric that is indispensable for guiding decisions about software quality assurance during development. In engineering processes, management usually demands that a certain quality level be met for the products at each production step, for instance, that each product be 98 percent defect-free. Software can never be assumed defect-free. In order to assess whether additional quality assurance is required before a prescribed quality level is met, the software engineers must reliably estimate the defect content of their software document, be it code or some other intermediate software product, such as a requirements specification or a design artifact. This is a hard problem with substantial economic significance in software practice.

Software companies use defect classification schemes and possess broad practical experience with the kinds of errors typically committed by their developers, but software engineering does not have a general theory available that would explain how, when, and where defects get inserted into software artifacts: currently, no model explains the generating process of the software defects. As a consequence, any estimates must be based on (secondary) defect data that is observed during development and deployment of the software.

During development, defect data emerges mainly during software testing and software inspections. Software testing requires that executable code is available. A typical defect metric collected during testing is the number of defects detected in each test. Software reliability engineering uses this kind of data to predict the total number of defects (including the hidden ones) by extrapolating the cumulative number of defects found during testing. The rationale is that the quality of the code increases as testing progresses and defects get fixed, hence the growth of the defect curve should eventually flatten. Reliability models do not explain the defect generating process; nonetheless, they can yield good estimates for the defect content of code.
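
As a rough illustration of extrapolating a flattening defect curve, the sketch below assumes that the cumulative defect count follows an exponential saturation curve, y_k = a * (1 - r**k), observed at three equally spaced test stages. The chapter does not name a specific reliability model, so this curve shape, the function name, and the sample numbers are all assumptions made only for the example.

```python
def estimate_total_defects(y1, y2, y3):
    """Extrapolate the plateau of a flattening cumulative defect curve.

    Hypothetical model: y_k = a * (1 - r**k) at equally spaced stages
    k = 1, 2, 3. Because a - y_k is then geometric in k, the plateau a
    (the estimated total defect count) has a closed-form solution.
    """
    denom = 2 * y2 - y1 - y3
    if denom <= 0:
        # Increments are not shrinking, so the curve has no finite plateau.
        raise ValueError("curve is not flattening; no finite plateau")
    return (y2 * y2 - y1 * y3) / denom


# Made-up cumulative counts after three test stages.
found = (50, 75, 87.5)
total = estimate_total_defects(*found)
hidden = total - found[-1]  # defects estimated to remain undetected
```

If the increments between stages do not shrink, the assumed curve never levels off and the function refuses to extrapolate rather than return a meaningless number.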

For software documents other than code, inspections are the main quality assurance technique (Gilb & Graham, 1993). During an inspection, reviewers independently find defects in individual detection phases and group meetings in a structured way, using special reading techniques. Inspections are highly effective where testing is not possible, including textual specifications and design documents. Defect data that emerge during an inspection of a document include the number of defects found by each individual reviewer, or the number of different defects detected by the inspection team as a whole.

In principle, software reliability models can also be applied to inspections by replacing the code tests with the detection phases of the individual reviewers. Empirical studies show, though, that reliability models do not work well when transferred to inspections (see the next section). The models generally exhibit a high variation in their estimation error. In particular, the estimates can be extreme outliers without warning during the estimation procedure. As a result, there is a compelling need in software engineering for a reliable defect content estimation technique for inspections.

In this chapter, we view the defect content estimation problem for inspected documents as a machine learning problem. The goal is to learn the relationship between certain observable features of an inspection and the true number of defects in the inspected document, including the hidden ones. A typical feature that can be observed during an inspection is the total number of different defects detected by the inspection team; this feature provides a lower bound for the defect content of the document. Some inspection features may carry valuable nonlinear information about the defect content of the document; an example is the variation of the reviewers’ performance during the inspection. We identify such features using the information-theoretic concept of mutual information. In order to be able to capture any nonlinear relationships between the features and the target, we use neural networks for learning from the data.
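
To make the idea of nonlinear information concrete, the following sketch computes a simple plug-in estimate of mutual information for discrete values. The estimator, the function name, and the toy data are assumptions for illustration; the estimator actually used in the chapter is not described in this preview.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data: the target is the square of the feature. The linear covariance
# is zero, yet the feature determines the target completely.
feature = [-2, -1, 0, 1, 2]
target = [4, 1, 0, 1, 4]
dependent = mutual_information(feature, target)

# Independent toy data: every combination occurs exactly once.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In the first toy data set a linear correlation measure would report no relationship, while the mutual information is clearly positive; in the second it is zero. Continuous inspection metrics would need binning or a different estimator before this sketch applies.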

Key Terms in this Chapter

Defect Content: The number of defects contained in a software artifact. This number includes the defects that have not yet been found by quality assurance, that is, the hidden defects.

Model Selection: A systematic procedure that selects the best neural network from a set of trained networks as the final model. The best network should show a small training error and at the same time a high ability to generalize.

Empty Space Phenomenon: A case where the training patterns sparsely populate the input space. This occurs when the training data set is too small relative to the dimension of the input space. As a result, the models won’t be robust and small changes in the data can produce large variations in the estimates.

Bayesian Learning: A learning technique that determines model parameters (such as the network weights) by maximizing the posterior probability of the parameters given the training data. The idea is that some parameter values are more consistent with the observed data than others. By Bayes’ rule, maximizing the posterior probability amounts to maximizing the so-called model evidence, defined as the conditional probability of the training data given the model parameters. The evidence often can be approximated by some closed formula or by an update rule. Bayesian techniques make it possible to use all data for training instead of reserving patterns for cross-validation of parameters.

Generalization: The ability of a model to yield good estimates on previously unseen input.

Feature Selection: A systematic procedure that selects the most relevant features from a set of candidates as input for the networks. The goal is to select those features that together carry the most information about the target and to keep the input space from becoming too high-dimensional.

Mutual Information: An entropy-based measure for the degree of stochastic dependence between two random vectors. If the mutual information value is high, the vectors carry much information about each other.

Software Inspection: A core technique in software quality assurance where a group of reviewers independently and systematically examine software artifacts to find defects. Inspections are highly effective where software testing is not possible, in particular, for textual specifications and design documents.

Overfitting: Learning a complicated function that matches the training data closely but fails to recognize the underlying process that generates the data. As a result of overfitting, the model performs poorly on new input. Overfitting occurs when the training patterns are sparse in input space and/or the trained networks are too complex.

Regularization: Including a term in the error function such that the training process favours networks of moderate size and complexity, that is, networks with small weights and few hidden units. The goal is to avoid overfitting and support generalization.
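
As a minimal sketch of the regularization idea, the error function below adds a weight-decay penalty to a sum-of-squares data error. Weight decay is one common way to favour small weights; it is assumed here for illustration and is not necessarily the exact term used in the chapter. The function name, the `decay` parameter, and the sample values are hypothetical.

```python
def regularized_error(targets, predictions, weights, decay):
    """Sum-of-squares data error plus a weight-decay penalty (assumed form)."""
    data_error = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    penalty = decay * sum(w * w for w in weights)  # grows with weight size
    return data_error + penalty


# Two hypothetical networks with identical predictions but different weights.
targets = [3.0, 5.0]
predictions = [2.0, 5.0]
small_weights = regularized_error(targets, predictions, [0.1, -0.2], decay=0.5)
large_weights = regularized_error(targets, predictions, [3.0, -4.0], decay=0.5)
```

Both networks fit the data equally well, but the one with large weights receives the higher total error, so training that minimizes this function is steered toward the simpler network.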
