Chapter Preview
Background
Big data refers to data that is large in size, large in dimensionality, or both. It becomes difficult to store, analyze, search, share and visualize such large data. Big data is being generated due to advancements in sensor technology, wireless sensor networks, information-gathering mobile devices, remote sensing devices, various data capturing tools and sophisticated cameras. Big data is characterized by high volume (both the number of patterns and the dimensionality may be large), high velocity (time-sensitive decisions must be made by processing the data as it streams into the enterprise), and/or high variety (structured and unstructured information assets). Traditional data processing applications and conventional decision-making algorithms become impractical for handling data of this scale.
A wide variety of sensors enable the collection of a large number of observations (patterns). They also enable the collection of a larger number of features describing each pattern, leading to high dimensional data. The classification of high dimensional data has become increasingly important in the fields of engineering, computational biology, genomics and pattern recognition. In high dimensional data, the number of features is generally greater than the number of patterns. One of the major challenges in classifying high dimensional data is high variance and overfitting (due to noise). The other concern is the curse of dimensionality, which makes conventional classification impractical. The curse of dimensionality refers to the demand for more observations as the dimensionality increases in order to obtain good estimates. The increase in dimensionality increases the noise, the overfitting and the complexity of any learning algorithm. These problems can be solved, and the generalization performance of the classifier improved, by 1) reducing the dimensionality of the data, 2) minimizing the VC dimension of the classifier, or 3) providing a large number of training samples.
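A simple way to see the curse of dimensionality at work is to observe how pairwise distances behave as the dimensionality grows: with a fixed number of points, the nearest and farthest neighbours become almost equidistant, so distance-based reasoning degrades. The following sketch is purely illustrative; it uses randomly generated uniform data, not any dataset from this chapter.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_points = 100                          # fixed sample size

for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_points, dim))
    dists = pdist(X)                    # all pairwise Euclidean distances
    # relative contrast between farthest and nearest pair shrinks as dimension grows
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")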
Dimensionality reduction involves feature extraction or feature selection. Feature extraction techniques such as Principal Component Analysis, Linear Discriminant Analysis, Random Projection and Independent Component Analysis have been extensively explored (Van der Maaten, Postma & Van den Herik, 2009; Subasi & Ismail Gursoy, 2010; Seetha, Murty & Saravanan, 2011a, 2011b; Deng et al., 2012). Feature selection techniques using wrappers and filters have also been well studied (Kohavi & John, 1997; Liu & Yu, 2005; Gao et al., 2011; Bermejo et al., 2012). The recent trend is towards pattern synthesis to combat the curse of dimensionality (Agrawal et al., 2005; Viswanath, Murty & Bhatnagar, 2006; Chen, Zhu & Nakagawa, 2011; Seetha, Saravanan & Murty, 2012).
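As a concrete illustration of feature extraction, the sketch below reduces a high dimensional data matrix with two of the techniques named above: Principal Component Analysis (computed here via the singular value decomposition) and a Gaussian random projection. The data matrix and the target dimensionality are arbitrary choices made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))             # 100 patterns, 2000 features (n << d)
k = 20                                       # target dimensionality (illustrative)

# Principal Component Analysis: project onto the top-k right singular vectors
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:k].T                        # shape (100, 20)

# Random Projection: multiply by a scaled Gaussian matrix
R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
X_rp = X @ R                                 # shape (100, 20)

print(X_pca.shape, X_rp.shape)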
Key Terms in this Chapter
Curse of Dimensionality: As the dimensionality increases, the available data becomes sparse, and any learning method that relies on statistical significance requires a large amount of data to produce a reliable result.
Capacity: Capacity is a measure of complexity; it measures the expressive power, or flexibility, of the decision boundary of the classifier.
Kernel: A kernel defines a similarity measure between two data points. Examples: Linear Kernel - K(x, y) = x · y (dot product), where x and y are two data points. Non Linear Kernels - Polynomial Kernel - K(x, y) = (x · y + c)^d, where c is a constant and d represents the degree of the polynomial; Gaussian Kernel - K(x, y) = exp(-||x - y||^2 / (2σ^2)), where σ is the kernel width.
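A minimal sketch of these kernel functions, assuming the standard forms given above; the constant c, the degree d and the width σ (sigma) are illustrative parameter choices.

import numpy as np

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3, c=1.0):
    """K(x, y) = (x . y + c) ** degree"""
    return (np.dot(x, y) + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))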
SMOTE: Synthetic Minority Oversampling Technique, used to reduce class imbalance in datasets by generating synthetic patterns in the minority class.
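A minimal sketch of the idea behind SMOTE, assuming the usual formulation in which a synthetic minority pattern is created by interpolating between a minority sample and one of its k nearest minority-class neighbours; the function name and the toy data are hypothetical.

import numpy as np

def smote_samples(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority patterns by interpolating each chosen
    sample with one of its k nearest neighbours within the minority class."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # k nearest neighbours, excluding x
        x_nn = minority[rng.choice(neighbours)]
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(x + gap * (x_nn - x))
    return np.array(synthetic)

minority = np.random.default_rng(0).normal(size=(20, 3))   # toy minority class
print(smote_samples(minority, n_new=5, rng=1).shape)        # (5, 3)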
Multiple Kernel Learning: A learning approach in which several kernels are combined into a single kernel.
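In its simplest form, the combined kernel is a weighted sum of several base kernel matrices computed on the same data. In the sketch below the weights are fixed by hand purely for illustration, whereas a multiple kernel learning algorithm would learn them from data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                          # toy data

# Base kernel matrices on the same data
K_linear = X @ X.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists / 2.0)

# Combined kernel: a convex combination of the base kernels
weights = np.array([0.3, 0.7])                        # illustrative fixed weights
K_combined = weights[0] * K_linear + weights[1] * K_gauss
print(K_combined.shape)                               # (30, 30)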
VC Dimension: The VC dimension of a set of concepts C is defined as the size of the largest set of points that can be shattered by the set of concepts C.
Shattering: A set of instances S is shattered by a set of concepts C if and only if for every dichotomy of S there exists some concept in C consistent with this dichotomy.
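To make the two previous definitions concrete, the sketch below checks by brute force that every one of the 2^3 dichotomies of three non-collinear points in the plane can be realized by a linear classifier, i.e. that the set is shattered (so the VC dimension of linear classifiers in two dimensions is at least 3). The three points and the perceptron routine are illustrative choices, not part of the chapter.

import numpy as np
from itertools import product

# Three non-collinear points in the plane (hypothetical example data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def perceptron_separates(X, y, max_iter=1000):
    """Return True if a perceptron finds a linear boundary consistent with labels y."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_iter):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:        # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi
                updated = True
        if not updated:
            return True                        # all points classified correctly
    return False

# Every dichotomy of the three points is realized by some line
for labels in product([-1, 1], repeat=3):
    print(labels, perceptron_separates(X, np.array(labels)))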
Overfitting: When a learning algorithm is applied to a small training set, the resulting model memorizes the training data and cannot predict well on new, unseen data.
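The sketch below illustrates this definition with a toy regression example: a high-degree polynomial fitted to a small, noisy training set achieves a very low training error but a much larger error on unseen data drawn from the same underlying function. All data here is synthetic and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                      # true underlying function

x_train = rng.uniform(size=10)
y_train = f(x_train) + rng.normal(scale=0.2, size=10)    # small, noisy training set
x_test = rng.uniform(size=200)
y_test = f(x_test) + rng.normal(scale=0.2, size=200)     # unseen data

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE={train_err:.4f}  test MSE={test_err:.4f}")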
Classification: The process of predicting the class label of an object whose class label is unknown; typically, a model that distinguishes objects of different classes is built from labeled samples.
NNC or NN Classifier: Nearest neighbor classifier.
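A minimal sketch of a 1-nearest-neighbour classifier, tying the two previous terms together: the class label of an unseen pattern is predicted as the label of the closest labeled training pattern. The toy data and function name are hypothetical.

import numpy as np

def nn_classify(X_train, y_train, x):
    """1-NN: return the label of the training pattern closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(nn_classify(X_train, y_train, np.array([0.2, 0.1])))    # -> 0
print(nn_classify(X_train, y_train, np.array([0.95, 1.0])))   # -> 1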
Bootstrapping: A statistical method that generates artificial samples from the original training samples by resampling with replacement.
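A minimal sketch of generating bootstrap samples: each artificial sample is drawn from the original training set by sampling rows uniformly with replacement. The data and function name below are purely illustrative.

import numpy as np

def bootstrap_samples(X, n_samples, rng=None):
    """Draw n_samples bootstrap replicates, each the same size as X,
    by sampling rows of X uniformly with replacement."""
    rng = np.random.default_rng(rng)
    n = len(X)
    return [X[rng.integers(0, n, size=n)] for _ in range(n_samples)]

X = np.arange(10).reshape(5, 2)         # 5 original training patterns
for b in bootstrap_samples(X, n_samples=3, rng=0):
    print(b.ravel())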