1. Introduction
In this study, new linear programming (LP) approaches are developed as nonparametric procedures for multiple-class discrimination and classification. Specifically, an LP model minimizing the sum of the classification errors, or the L1-norm, and an LP model minimizing the maximum of the classification errors, or the L∞-norm, are formulated. These new models are more straightforward and simpler than previous LP formulations for multiple-class discrimination and classification. Although of interest in their own right, these new models may also be viewed as generalizations of two-class techniques to the multiple-class case. The LP model minimizing the L1-norm is similar to the MSD (minimize the sum of deviations) models, and the model minimizing the L∞-norm is similar to the MMD (minimize the maximum of deviations) models for two-class classification. Properties of these new models are studied. The MSD model is immune to the difficulties caused by pathologies of earlier mathematical programming (MP) models for two-class classification, whereas the MMD model is very limited in use.
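To make the MSD idea concrete, the following is a minimal sketch of the classical two-class MSD formulation (in the spirit of Freed and Glover), solved with SciPy's `linprog`. The function name `msd_two_class` and the unit-gap normalization around the cutoff (used here to rule out the trivial solution w = 0, c = 0) are choices of this sketch, not notation from the paper; the literature uses several alternative normalizations.

```python
import numpy as np
from scipy.optimize import linprog

def msd_two_class(X1, X2):
    """Two-class MSD (minimize the sum of deviations) discriminant.

    Finds weights w and a cutoff c so that scores x.w fall below c for
    class 1 and above c for class 2, minimizing the total deviation of
    misclassified points. A unit gap around the cutoff is imposed as a
    normalization (an assumption of this sketch) to exclude the trivial
    solution w = 0, c = 0.
    """
    n1, p = X1.shape
    n2 = X2.shape[0]
    n = n1 + n2
    # Decision vector: [w (p entries), c (1 entry), d (n deviations)]
    cost = np.concatenate([np.zeros(p + 1), np.ones(n)])  # minimize sum d_i
    # Class 1 rows:  x.w - c - d_i <= -1
    # Class 2 rows: -x.w + c - d_i <= -1
    A1 = np.hstack([X1, -np.ones((n1, 1)), np.zeros((n1, n))])
    A2 = np.hstack([-X2, np.ones((n2, 1)), np.zeros((n2, n))])
    A = np.vstack([A1, A2])
    A[np.arange(n), p + 1 + np.arange(n)] = -1.0  # subtract d_i in row i
    b = -np.ones(n)
    bounds = [(None, None)] * (p + 1) + [(0, None)] * n  # d_i >= 0
    res = linprog(cost, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    w, c = res.x[:p], res.x[p]
    return w, c, res.fun  # res.fun is the minimized sum of deviations
```

The MMD model differs only in the objective: the per-observation deviations d_i are replaced by a single variable δ bounding the maximum deviation, which is then minimized. For linearly separable data the MSD optimum above is zero, i.e. every observation sits on the correct side of the cutoff.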
Discriminant and classification analysis has been fundamental to scientific research and practical applications. Discriminant analysis involves the study of the differences between two or more classes of objects represented by measurements, or variables, of different characteristics or attributes. Classification analysis involves the study of assigning new observations to one of the classes based on the measurements of the characteristics. Applications of discriminant and classification analysis are diverse. For example, reported applications in business include financial management, human resource management, and marketing; applications in biology and medicine include patient classification, disease diagnosis and species classification; and applications in environment and geography include remote sensing image pattern classification and pollution control. In fact, this study was motivated by the need to identify a few from many thousands of genes that can be used to classify tissue samples into two or more classes, such as normal and tumor tissues, and to identify genes responsible for certain diseases (Ambroise & McLachlan, 2002; Dudoit, Fridlyand, & Speed, 2002; Nguyen & Rocke, 2002; Sun, 2003; Sun & Xiong, 2003). The availability of large datasets collected using current information technology, such as the Internet, imaging and remote sensing, has made the traditional discriminant and classification techniques inadequate. Discriminant and classification analysis techniques will play a more important role in data analysis as information technology advances and as huge datasets need to be analyzed with data mining tools.
Based on known values of the attributes or variables and known class memberships of the observations in a sample, discriminant functions are constructed. The attribute values of an observation can be evaluated by these discriminant functions to obtain discriminant scores, on the basis of which the observation is assigned to a class. For many decades, statistical techniques, such as Fisher’s linear discriminant function (LDF) (Fisher, 1936), Smith’s quadratic discriminant function (Smith, 1947) and logistic regression (Hand, 1981), have been standard tools for this purpose. Statistical methods perform well when the data analyzed satisfy the underlying assumptions, such as multivariate normality and equal covariance matrices of the independent variables, although minor deviations from these assumptions do not severely affect the performance of these techniques. More recently, other techniques, such as MP (Freed & Glover, 1981a, 1981b; Hand, 1981), neural networks (Stern, 1996) and classification trees (Breiman et al., 1984), have become alternative tools rivaling statistical techniques. A revolutionary development in classification is the support vector machine (SVM) (Vapnik, 1995, 1998). A spectrum of techniques is needed because no single technique outperforms all others in every situation (Johnson & Wichern, 1988).