Spectral Methods for Data Clustering

Spectral Methods for Data Clustering

Wenyuan Li
Copyright: © 2009 |Pages: 7
DOI: 10.4018/978-1-60566-010-3.ch278
OnDemand:
(Individual Chapters)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

With the rapid growth of the World Wide Web and the capacity of digital data storage, tremendous amount of data are generated daily from business and engineering to the Internet and science. The Internet, financial realtime data, hyperspectral imagery, and DNA microarrays are just a few of the common sources that feed torrential streams of data into scientific and business databases worldwide. Compared to statistical data sets with small size and low dimensionality, traditional clustering techniques are challenged by such unprecedented high volume, high dimensionality complex data. To meet these challenges, many new clustering algorithms have been proposed in the area of data mining (Han & Kambr, 2001). Spectral techniques have proven useful and effective in a variety of data mining and information retrieval applications where massive amount of real-life data is available (Deerwester et al., 1990; Kleinberg, 1998; Lawrence et al., 1999; Azar et al., 2001). In recent years, a class of promising and increasingly popular approaches — spectral methods — has been proposed in the context of clustering task (Shi & Malik, 2000; Kannan et al., 2000; Meila & Shi, 2001; Ng et al., 2001). Spectral methods have the following reasons to be an attractive approach to clustering problem: • Spectral approaches to the clustering problem offer the potential for dramatic improvements in efficiency and accuracy relative to traditional iterative or greedy algorithms. They do not intrinsically suffer from the problem of local optima. • Numerical methods for spectral computations are extremely mature and well understood, allowing clustering algorithms to benefit from a long history of implementation efficiencies in other fields (Golub & Loan, 1996). • Components in spectral methods have the naturally close relationship with graphs (Chung, 1997). This characteristic provides an intuitive and semantic understanding of elements in spectral methods. It Spectral Methods for Data Clustering Wenyuan Li Nanyang Technological University, Singapore Wee Keong Ng Nanyang Technological University, Singapore is important when the data is graph-based, such as links of WWW, or can be converted to graphs. In this paper, we systematically discuss applications of spectral methods to data clustering.
Chapter Preview
Top

Introduction

With the rapid growth of the World Wide Web and the capacity of digital data storage, tremendous amount of data are generated daily from business and engineering to the Internet and science. The Internet, financial real-time data, hyperspectral imagery, and DNA microarrays are just a few of the common sources that feed torrential streams of data into scientific and business databases worldwide. Compared to statistical data sets with small size and low dimensionality, traditional clustering techniques are challenged by such unprecedented high volume, high dimensionality complex data. To meet these challenges, many new clustering algorithms have been proposed in the area of data mining (Han & Kambr, 2001).

Spectral techniques have proven useful and effective in a variety of data mining and information retrieval applications where massive amount of real-life data is available (Deerwester et al., 1990; Kleinberg, 1998; Lawrence et al., 1999; Azar et al., 2001). In recent years, a class of promising and increasingly popular approaches — spectral methods — has been proposed in the context of clustering task (Shi & Malik, 2000; Kannan et al., 2000; Meila & Shi, 2001; Ng et al., 2001). Spectral methods have the following reasons to be an attractive approach to clustering problem:

  • Spectral approaches to the clustering problem offer the potential for dramatic improvements in efficiency and accuracy relative to traditional iterative or greedy algorithms. They do not intrinsically suffer from the problem of local optima.

  • Numerical methods for spectral computations are extremely mature and well understood, allowing clustering algorithms to benefit from a long history of implementation efficiencies in other fields (Golub & Loan, 1996).

  • Components in spectral methods have the naturally close relationship with graphs (Chung, 1997). This characteristic provides an intuitive and semantic understanding of elements in spectral methods. It is important when the data is graph-based, such as links of WWW, or can be converted to graphs.

In this paper, we systematically discuss applications of spectral methods to data clustering.

Top

Background

To begin with the introduction of spectral methods, we first present the basic foundations that are necessary to understand spectral methods.

Mathematical Foundations

Data is typically represented as a set of vectors in a high-dimensional space. It is often referred as the matrix representation of the data. Two widely used spectral operations are defined on the matrix.

  • EIG(A) operation: Given a real symmetric matrix An×n, if there is a vector x ∈ Rn ≠ 0 such that Axx for some scalar λ, then λ is called the eigenvalue of A with corresponding (right) eigenvector x. EIG(A) is an operation to compute all eigenvalues and corresponding eigenvectors of A. All eigenvalues and eigenvectors are real, that is, guaranteed by Theorem of real schur decomposition (Golub & Loan, 1996).

  • SVD(A) operation: Given a real matrix Am×n, similarly, there always exists two orthogonal matrices U ∈ Rm×m and V ∈ Rn×n (UTU= I and VTV=I) to decompose A to the form A=USVT, where S = diag(σ1, ..., σr) ∈ Rr×r, r=rank(A) and σ1 ≥ σ2... ≥ σr = ... = σn = 0. Here, the σi are the singular values of A and the first r columns of U and V are the left and right (respectively) singular vectors of A. SVD(A) is called Singular Value Decomposition of A (Golub & Loan, 1996).

Typically, the set of eigenvalues (or singular values) is called the spectrum of A. Besides, eigenvectors (or singular vectors) are the other important components of spectral methods. These two spectral components have been widely used in various disciplines and adopted to analyze the key encoding information of a complex system. Therefore, they are also the principle objects in spectral methods for data clustering.

Complete Chapter List

Search this Book:
Reset