
Lei Xu (Chinese University of Hong Kong & Peking University, China) and Shun-ichi Amari (RIKEN Brain Science Institute, Japan)

DOI: 10.4018/978-1-59904-849-9.ch049

Chapter Preview

Expert combination is a classic strategy that has been widely used in various problem-solving tasks. A team of individuals with diverse and complementary skills tackles a task jointly such that, by integrating the strengths of the individuals, a performance better than any single individual can achieve is attained. Starting from the late 1980s in the handwritten character recognition literature, studies have been made on combining multiple classifiers. Also, from the early 1990s in the fields of neural networks and machine learning, efforts have been made under the names of ensemble learning and mixture of experts on how to jointly learn a mixture of experts (parametric models) and a combining strategy for integrating them in an optimal sense.

The article aims at a general sketch of these two streams of studies, not only with a re-elaboration of essential tasks, basic ingredients, and typical combining rules, but also with a general combination framework (especially a concise and more useful one-parameter modulated special case, called α-integration) suggested to unify a number of typical classifier combination rules and several mixture-based learning models, as well as the max rule and min rule used in the fuzzy systems literature. (Figure 1)

Both streams of studies feature two periods of development. The first period runs roughly from the late 1980s to the early 1990s. In the handwritten character recognition literature, various classifiers had been developed from different methodologies and different features, which motivated studies on combining multiple classifiers for a better performance. A systematic effort in this early stage was made in (Xu, Krzyzak & Suen, 1992), with an attempt to set up a general framework for classifier combination. As re-elaborated in Tab. 1, not only were two essential tasks identified and a framework of three-level combination presented for the second task to cope with different types of classifier output information, but several rules were also investigated for two of the three levels, in particular the proposed Bayes voting rule, product rule, and Dempster-Shafer rule. Subsequently, the remaining level (i.e., the rank level) was soon studied in (Ho, Hull, & Srihari, 1994) via Borda count.

Interestingly and complementarily, in almost the same period the first task happened to be the focus of studies in the neural network learning literature. Facing the problems that there are different choices for the same type of neural net obtained by varying its scale (e.g., the number of hidden units in a three-layer net), and different locally optimal results on the same neural net due to different initializations, studies have been made on how to train an ensemble of diverse and complementary networks via cross-validation partitioning, correlation-reduction pruning, performance-guided re-sampling, etc., such that the resulting combination produces a better generalization performance (Hansen & Salamon, 1990; Xu, Krzyzak, & Suen, 1991; Wolpert, 1992; Baxt, 1992; Breiman, 1992, 1994; Drucker, et al., 1994). In addition to classification, this stream also handles function regression via integrating individual estimators by a linear combination (Perrone & Cooper, 1993). Furthermore, this stream progresses to consider the two tasks in Tab. 1 jointly with the help of mixture-of-experts (ME) models (Jacobs, et al., 1991; Jordan & Jacobs, 1994; Xu & Jordan, 1993; Xu, Jordan & Hinton, 1994), which can learn either or both of the combining mechanism and the individual experts in a maximum likelihood sense.

Product Rule: When k classifiers are mutually independent, a combination is given by $p(y|x) \propto p_1(y|x)p_2(y|x)\cdots p_k(y|x)$, or concisely $p(y|x) \propto \prod_{j=1}^{k} p_j(y|x)$, which is also called the product rule
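The product rule above can be sketched numerically as follows. This is a minimal illustration, not part of the original article: the posterior values for three classifiers over four class labels are made up for demonstration.

```python
import numpy as np

# Hypothetical posterior estimates p_j(y|x) from k = 3 classifiers,
# each a distribution over the same 4 candidate class labels.
posteriors = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.50, 0.30, 0.10, 0.10],
])

# Product rule: multiply the per-label posteriors across classifiers,
# then renormalize so the combined scores form a distribution.
combined = posteriors.prod(axis=0)
combined /= combined.sum()

# The combined classifier picks the label with the largest product.
label = int(combined.argmax())
```

Note how the product sharpens agreement: label 0 is favored by all three classifiers, so its combined score dominates even more strongly than in any individual posterior.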

Conditional Distribution p(y|x): Describes the uncertainty with which an input x is mapped into an output y that simply takes one of several labels. In this case, x is classified into the class label y with a probability p(y|x). Also, y can be a real-valued vector, in which case x is mapped into y according to the density distribution p(y|x)

f-Mean: Given a set of non-negative numbers $x_1, \dots, x_k$, the f-mean is given by $m_f = f^{-1}\big(\sum_{j=1}^{k} w_j f(x_j)\big)$, where $f$ is a monotonic scalar function and $\sum_{j=1}^{k} w_j = 1$ with each $w_j \ge 0$. Particularly, one most interesting special case is that $f$ satisfies $f(cu) = a(c)f(u) + b(c)$ for any scale $c$, which is called the α-mean

Mixture of Experts: Each expert is described by a conditional distribution $p_j(y|x)$, either with y taking one of several labels for a classification problem or with y being a real-valued vector for a regression problem. A combination of experts is given by $p(y|x) = \sum_{j=1}^{k} g_j(x)\, p_j(y|x)$, with gating weights $g_j(x) \ge 0$ and $\sum_{j=1}^{k} g_j(x) = 1$, which is called a mixture-of-experts model. Particularly, for y being a real-valued vector, its regression form is $E(y|x) = \sum_{j=1}^{k} g_j(x)\, E_j(y|x)$
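The regression form of the mixture can be sketched as below. All parameters here (the linear gate matrix V and the linear expert matrix W) are hypothetical values chosen for illustration; in the ME literature they would be learned by maximum likelihood, as the article notes.

```python
import numpy as np

def softmax(z):
    """Normalized exponentials: non-negative weights summing to one."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical linear gating network: g(x) = softmax(V x) over k = 3 experts.
V = np.array([[0.5, -0.2],
              [0.1,  0.3],
              [-0.4, 0.6]])

def gate(x):
    return softmax(V @ x)

# Hypothetical linear experts: the j-th row of W gives E_j(y|x) = w_j . x.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

def mixture_regression(x):
    # E(y|x) = sum_j g_j(x) * E_j(y|x): a gate-weighted average of experts.
    return float(gate(x) @ (W @ x))

x = np.array([1.0, 2.0])
y_hat = mixture_regression(x)
```

Because the gating weights are a convex combination, the mixture prediction always lies between the smallest and largest individual expert outputs.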

Performance Evaluation Approach: It is usually adopted in the literature on classifier combination, with a flow that proceeds from considering a set of classifiers, to designing a combining mechanism according to certain principles, to evaluating the performance of the combination empirically via misclassification rates, with the help of samples with known correct labels

Error-Reduction Approach: It is usually adopted in the literature on mixture-based learning, where what need to be pre-designed are the structures of the classifiers or experts, as well as the combining structure with unknown parameters. A cost or error measure is evaluated via a set of training samples, and then minimized by learning all the unknown parameters

Classifier Combination: Given a number of classifiers, each classifies the same input x into a class label, and the labels may be different for different classifiers. We seek a rule M(x) that combines these classifiers into a new one that performs better than any one of them

Sum Rule (Bayes Voting): A classifier classifying x to a label y can be regarded as casting one vote for this label, so the simplest combination is to count the votes received by every candidate label. The j-th classifier classifying x to a label y with a probability $p_j(y|x)$ means that one vote is divided among the different candidates in fractions. We can sum up $p(y|x) = \sum_{j=1}^{k} p_j(y|x)$ to count the votes on a candidate label y, which is called Bayes voting since p(y|x) is usually called the Bayes posterior probability
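The sum rule can be sketched with the same kind of made-up posteriors as before; the numbers below are illustrative only. Averaging instead of summing changes nothing about which label wins, since the two differ by a constant factor of k.

```python
import numpy as np

# Hypothetical fractional votes p_j(y|x) from k = 3 classifiers
# over 4 candidate labels; the classifiers disagree on the top label.
posteriors = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.40, 0.35, 0.15, 0.10],
])

# Sum rule (Bayes voting): accumulate the fractional votes per label
# (divided by k here so the result is again a distribution).
votes = posteriors.mean(axis=0)

# The label with the most accumulated votes wins.
label = int(votes.argmax())
```

Unlike the product rule, the sum rule is robust to a single classifier assigning a near-zero posterior: one dissenting vote cannot veto a label outright.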

