Latent Semantic Analysis (LSA), called Latent Semantic Indexing (LSI) when applied to information retrieval, has been a major analysis approach in text mining. It is an extension of the vector space method in information retrieval, representing documents as numerical vectors but using a more sophisticated mathematical approach to characterize the essential features of the documents and reduce the number of features in the search space. This chapter summarizes several major approaches to this dimensionality reduction, each of which has strengths and weaknesses, and it describes recent breakthroughs and advances. It shows how the constructs and products of LSA applications can be made user-interpretable and reviews applications of LSA beyond information retrieval, in particular to text information visualization.
A vast amount of information exists in text form, such as free (unstructured) or semi-structured text, including many database fields, reports, memos, e-mail, Web sites, and news articles. Various Web mining and text mining methods have been developed to analyze textual resources. Latent Semantic Analysis (LSA) (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), or Latent Semantic Indexing (LSI) when it is applied to document retrieval, has been a major approach in text mining. It is an extension of the vector space method in Information Retrieval (Salton, Wong, & Yang, 1975): it also represents documents as numerical vectors, but it uses a more sophisticated means of characterizing the essential features of documents and reduces the number of dimensions needed to describe them to a manageable size. There have been several major approaches to this dimensionality reduction, each of which has strengths and weaknesses. A major challenge in using LSA is that it is typically considered a black box approach, which makes its results difficult to understand or interpret. However, more recent research has not only overcome this challenge but has also demonstrated that the use of LSA extends beyond IR and text document clustering into text information visualization, where it has become a major player. This chapter summarizes the major approaches to LSA, their strengths and weaknesses, and recent breakthroughs, advances, and applications beyond information retrieval.
Text mining has adopted certain techniques from the more general field of data analysis, including sophisticated methods for analyzing relationships among highly formatted data, such as numerical data or data with a relatively small fixed number of possible values. Such techniques can expose patterns and trends in this type of data. Text mining can identify relationships between individual unstructured or semi-structured text documents, as well as more general semantic patterns across large collections of such documents. Latent Semantic Analysis, like many other methods of text mining, depends on the twin concepts of “document” and “term.” As used in this chapter, a “document” refers to any body of unstructured or semi-structured text. The text may include the entire content of a document in the general sense, such as a book, an article, a paper, or the like—or only a portion of a document, such as an abstract, a paragraph, a sentence, or a title. Ideally, a “document” describes a coherent topic. In addition, a “document” can be the text field of a database, or encompass text generated from an image or graphic, or it may be text recovered from audio or video formats. We will use the term “document” in this general sense.
A document can be represented as a collection of “terms,” each of which can appear in multiple documents. Typically, a “term” consists of an individual word used in the text. However, a “term” can also include multiple words that are commonly used together, for example, “landing gear”, or it can even be a string that does not appear explicitly in the text but instead results from token normalization or standardization. Token normalization will be discussed further.
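The following minimal sketch illustrates how a document might be reduced to terms. It is an assumption-laden illustration, not a method prescribed by this chapter: the phrase list `PHRASES`, the normalization map `NORMALIZE`, and the function `extract_terms` are hypothetical names, and real systems typically use a stemmer or lemmatizer rather than a hand-written lookup table.

```python
import re

# Hypothetical phrase list and normalization map, chosen only for illustration.
PHRASES = {("landing", "gear"): "landing_gear"}
NORMALIZE = {"gears": "gear", "landed": "land"}


def extract_terms(text):
    """Lowercase and tokenize, normalize each token, then merge known two-word phrases."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [NORMALIZE.get(tok, tok) for tok in tokens]
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            # Two adjacent tokens form a known multi-word term.
            terms.append(PHRASES[pair])
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return terms


terms = extract_terms("The landing gears failed after the plane landed.")
print(terms)
```

In this sketch, "landing gears" becomes the single term `landing_gear`, and "landed" becomes `land`, a string that does not appear in the original sentence.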
In vector-based methods of text data analysis, after a suitable set of terms has been defined for a document collection, the collection can be represented as a set of vectors. With traditional vector space methods, individual documents are treated as vectors in a high-dimensional vector space in which each dimension corresponds to some feature of a document, typically a term. A collection of documents can thus be represented by a two-dimensional matrix A(t,d) of features (terms) and documents. In the typical case, the value of each matrix entry is the number of occurrences of that term in the specified document, or some weighting or principled transformation of that number. LSA, as an extension of the vector space method, involves methods of transforming A by various means, e.g. singular value decomposition (SVD) in the case of ‘classical’ LSA, which typically attempt to provide a more sophisticated set of features that better capture the latent semantics of the documents. We discuss various such matrix decomposition techniques in much more detail.
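As a rough sketch of the construction just described, the following code builds a term-by-document count matrix A(t,d) for a toy collection and applies a truncated SVD in the manner of classical LSA. The documents, the choice of k = 2, and the helper `cosine` are assumptions made for illustration. Raw counts are used here, whereas practical systems usually apply a weighting first.

```python
from collections import Counter

import numpy as np

# Toy collection (assumed data); each document is already reduced to a list of terms.
docs = [
    ["car", "engine", "wheel"],
    ["car", "wheel", "road"],
    ["bread", "flour", "oven"],
    ["bread", "oven", "yeast"],
]

vocab = sorted({term for doc in docs for term in doc})
row = {term: i for i, term in enumerate(vocab)}

# A[t, d] = number of occurrences of term t in document d.
A = np.zeros((len(vocab), len(docs)))
for d, doc in enumerate(docs):
    for term, count in Counter(doc).items():
        A[row[term], d] = count

# Classical LSA: compute the SVD of A and keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # one k-dimensional row per document


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cos_same = cosine(doc_vectors[0], doc_vectors[1])  # the two car documents
cos_diff = cosine(doc_vectors[0], doc_vectors[2])  # a car document vs. a bread document
print(cos_same, cos_diff)
```

In the reduced space, the two car documents end up with a cosine similarity close to 1, even though they differ in one term each. The car and bread documents, which share no terms, stay close to 0.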
Key Terms in this Chapter
Latent Semantic Space: The subspace of term space whose dimensions correspond to the features uncovered by Latent Semantic Analysis for representing documents in a more semantically useful form.
Singular Value Decomposition (SVD): A linear algebra method of decomposing an arbitrary matrix into three matrices, two of which are orthonormal (their columns, the left and right singular vectors, respectively, are mutually orthogonal and have length 1) and the third of which is a diagonal matrix whose diagonal values are the singular values of the matrix.
Basis Vectors (for a given space): A set of linearly independent vectors that define a space, in the sense that any vector in that space can be expressed as a linear combination (a weighted sum) of those vectors. Linearly independent means that none of them can be expressed as a linear combination (or weighted sum) of the others.
Latent Semantic Analysis (LSA): A method of representing text documents in terms of features that are weighted combinations of the frequencies of words or terms in the documents, which makes the “latent semantics” or topics treated in the documents more computationally accessible.
Subspace: A vector space with a lower dimensionality that is wholly contained in a larger vector space.
Dimensionality Reduction: The process of taking high dimensional data (data represented by a large number of features) and representing it with different and fewer features or dimensions (which may be combinations of the old features) in a principled fashion that preserves some properties of the original space.
Vector Space Methods: Methods of representing documents as numerical vectors, where the values represent the frequencies of the words or terms in the documents, or some weighting of these to represent their importance in the document set.
Principal Components Analysis (PCA): A statistical method for discovering the dimensions that maximize variability in high dimensional data. It is mathematically equivalent to SVD, except that it requires all of the data to be centered.
Topic Words: Words that summarize the important topics of a document or piece of text that are automatically assigned based on the representation of that document in latent semantic space.
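The SVD and PCA entries above can be checked numerically. The following is a minimal sketch on assumed random data (the matrix `X` and its size are arbitrary choices for illustration). It first confirms that the SVD factors have the properties stated in the definition. It then confirms that the variances found by PCA, computed here as eigenvalues of the covariance matrix, match those obtained from the SVD of the mean-centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # 6 observations (rows), 4 features (columns)
n = X.shape[0]

# SVD: X = U @ diag(s) @ Vt, with orthonormal columns in U and orthonormal rows in Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_ok = (
    np.allclose(U.T @ U, np.eye(4))
    and np.allclose(Vt @ Vt.T, np.eye(4))
    and np.allclose(U @ np.diag(s) @ Vt, X)
)

# PCA via eigendecomposition of the covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order; reverse it

# SVD of the centered data gives the same variances as s_i**2 / (n - 1).
sc = np.linalg.svd(Xc, compute_uv=False)
same_variances = np.allclose(eigvals, sc**2 / (n - 1))

print(svd_ok, same_variances)
```

Note that the equivalence holds only for the centered matrix `Xc`. Applying the SVD to the uncentered `X`, as classical LSA does with the term-by-document matrix, generally yields different dimensions.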