Semi-Supervised Dimension Reduction Techniques to Discover Term Relationships

Manuel Martín-Merino (University Pontificia of Salamanca, Spain)
Copyright: © 2015 | Pages: 11
DOI: 10.4018/978-1-4666-5888-2.ch721

Chapter Preview



The analysis of high-dimensional datasets remains a challenging task for common machine learning techniques because of the well-known ‘curse of dimensionality’ (Aggarwal, 2013; Cherkassky, 2007). It has been suggested in the literature (Kuman, 2006; Cevikalp, 2008) that dimension reduction techniques can help to overcome this problem because they reduce noise while keeping the main structure of the dataset. Several algorithms have been proposed for this purpose, such as Principal Component Analysis (PCA), Correspondence Analysis, and neural-based techniques (see, for instance, Kuman, 2006; Borg, 2005; Stuhlsatz, 2012). In this article, we study two non-linear techniques, the Sammon mapping (Martín-Merino, 2004) and Self-Organizing Maps (SOM) (Kaski, 2006). Both have been widely applied to visualize term relationships.

Non-linear dimension reduction techniques have been applied to discover semantic relations among terms or documents in textual databases (Kaski, 2006). However, the algorithms proposed in the literature often have low discriminant power; that is, different topics of the textual collection often overlap strongly in the projection. This is mainly due to the well-known ‘curse of dimensionality’ (Aggarwal, 2013; Wang, 2013) and to the unsupervised nature of the algorithms. As a result, the projections are often of little use for identifying the different semantic groups in a given textual collection (Martín-Merino, 2005; Gönen, 2010).

Unfortunately, the words of a textual collection cannot be organized in a supervised manner, because an a priori classification of terms into topics is usually not available (Martín-Merino, 2004). However, several search engines, such as Yahoo, provide a categorization for a small subset of documents (Martín-Merino, 2005; Manning, 2008) that may help to improve the discriminant power of dimension reduction techniques. The semi-supervised dimension reduction techniques proposed in the literature (Gönen, 2010) cannot be applied to this problem because only the documents are categorized, not the terms.

Key Terms in this Chapter

Semi-Supervised Learning: Estimation of the parameters of a model using both unlabeled data and a small subset of examples labeled by human experts.

Unsupervised Learning: Estimation of the parameters of a model using only unlabeled data, without the help of human experts.

Sammon: A non-linear dimension reduction technique used to visualize the underlying structure of high-dimensional data.

MDS: Multidimensional Scaling. A multivariate exploratory data analysis technique that obtains a visual representation of object relationships directly from a dissimilarity matrix.

IR: Information Retrieval. A broad discipline that studies the organization and retrieval of textual data.

Mutual Information: A non-linear correlation measure that allows us to evaluate the degree of association between the input variables and the response.

SOM: Self-Organizing Maps. An unsupervised neural network widely used in exploratory data analysis and for visualizing multivariate object relationships.
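Two of the terms above, the Sammon mapping and mutual information, have compact textbook definitions that can be sketched in a few lines of Python. The preview does not give the chapter's own formulas, so the block below is only an illustration under stated assumptions: it uses the standard Sammon stress (the squared difference between original and projected distances, weighted by the inverse of the original distance) and the standard mutual information between two discrete variables. The function names `sammon_stress` and `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def sammon_stress(high_dim, low_dim):
    """Standard Sammon stress of a projection (0 means distances are preserved).

    Assumes Euclidean distances and no duplicate points in the original
    space, so that every original distance is strictly positive.
    """
    total_orig = 0.0
    weighted_error = 0.0
    # Visit each unordered pair of objects exactly once.
    for i, j in combinations(range(len(high_dim)), 2):
        d_orig = math.dist(high_dim[i], high_dim[j])
        d_proj = math.dist(low_dim[i], low_dim[j])
        total_orig += d_orig
        # Errors on small original distances are penalized more heavily.
        weighted_error += (d_orig - d_proj) ** 2 / d_orig
    return weighted_error / total_orig


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Toy data (hypothetical): three points that already lie in a plane.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
flat = [p[:2] for p in points]   # drops a coordinate that is always zero
line = [p[:1] for p in points]   # collapses two of the points together

stress_flat = sammon_stress(points, flat)
stress_line = sammon_stress(points, line)

# Toy data (hypothetical): does a term occur in a document, and which
# category was the document assigned to?
term_present = [1, 1, 0, 0]
category = ["sports", "sports", "finance", "finance"]
mi_dependent = mutual_information(term_present, category)
mi_independent = mutual_information([1, 0, 1, 0], category)
```

In this toy setting, a projection that preserves every pairwise distance has zero stress, and a term whose occurrence is unrelated to the document categories has zero mutual information. Measuring the association between a term's occurrences and the document categories is one plausible way to carry category information from documents over to terms; whether the chapter does this is not stated in the preview.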
