Feature Extraction/Selection in High-Dimensional Spectral Data

Seoung Bum Kim (The University of Texas at Arlington, USA)
Copyright: © 2009 | Pages: 7
DOI: 10.4018/978-1-60566-010-3.ch133

Abstract

Advances in sensing technology have multiplied the volume of spectral data, one of the most common data types in research fields that require advanced mathematical methods and highly efficient computation. Fields in which spectral data abound include near-infrared spectroscopy, mass spectroscopy, magnetic resonance imaging, and nuclear magnetic resonance spectroscopy. The introduction of a variety of spectroscopic techniques makes it possible to investigate changes in composition in a spectrum and to quantify them without complex preparation of samples. However, a major limitation in the analysis of spectral data lies in the complexity of the signals, which arises from the presence of a large number of correlated features. Figure 1 displays a high-level diagram of the overall process of modeling and analyzing spectral data. The collected spectra should first be preprocessed to ensure high-quality data. Preprocessing steps generally include denoising, baseline correction, alignment, and normalization. Feature extraction/selection identifies the important features for prediction, and relevant models are constructed through the learning processes. The feedback path from the results of the validation step enables control and optimization of all previous steps. Explanatory analysis and visualization can provide initial guidelines that make the subsequent steps more efficient. This chapter focuses on the feature extraction/selection step in the modeling and analysis of spectral data. In particular, the properties of feature extraction/selection procedures are demonstrated throughout the chapter with spectral data from high-resolution nuclear magnetic resonance spectroscopy, a widely used technique for studying metabolomics.

Introduction

Advances in sensing technology have multiplied the volume of spectral data, one of the most common data types in research fields that require advanced mathematical methods and highly efficient computation. Fields in which spectral data abound include near-infrared spectroscopy, mass spectroscopy, magnetic resonance imaging, and nuclear magnetic resonance spectroscopy.

The introduction of a variety of spectroscopic techniques makes it possible to investigate changes in composition in a spectrum and to quantify them without complex preparation of samples. However, a major limitation in the analysis of spectral data lies in the complexity of the signals generated by the presence of a large number of correlated features. Figure 1 displays a high-level diagram of the overall process of modeling and analyzing spectral data.

Figure 1. Overall process for the modeling and analysis of high-dimensional spectral data

The collected spectra should first be preprocessed to ensure high-quality data. Preprocessing steps generally include denoising, baseline correction, alignment, and normalization. Feature extraction/selection identifies the important features for prediction, and relevant models are constructed through the learning processes. The feedback path from the results of the validation step enables control and optimization of all previous steps. Explanatory analysis and visualization can provide initial guidelines that make the subsequent steps more efficient.
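The four preprocessing steps named above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the chapter's method: the specific choices (a moving-average filter for denoising, a constant median offset as the baseline, integer-lag alignment to the first spectrum, and total-area normalization) and all function names are assumptions made for the example. Real NMR pipelines typically use more refined techniques for each step.

```python
from statistics import median


def moving_average(y, window=3):
    """Denoise by averaging each point with its neighbours (window shrinks at the edges)."""
    half = window // 2
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out


def subtract_baseline(y):
    """Baseline correction under a crude assumption: a constant offset estimated by the median."""
    offset = median(y)
    return [v - offset for v in y]


def align_to(reference, y, max_shift=3):
    """Shift y by the integer lag in [-max_shift, max_shift] that best matches reference."""
    n = len(y)

    def shifted(lag):
        # Positions shifted in from outside the spectrum are filled with 0.
        return [y[i - lag] if 0 <= i - lag < n else 0.0 for i in range(n)]

    def match(lag):
        return sum(r * s for r, s in zip(reference, shifted(lag)))

    best_lag = max(range(-max_shift, max_shift + 1), key=match)
    return shifted(best_lag)


def normalize_area(y):
    """Scale a spectrum so that its absolute intensities sum to 1."""
    total = sum(abs(v) for v in y)
    return [v / total for v in y] if total else list(y)


def preprocess(spectra, window=3, max_shift=3):
    """Apply denoising, baseline correction, alignment, and normalization in that order."""
    cleaned = [subtract_baseline(moving_average(s, window)) for s in spectra]
    reference = cleaned[0]  # assumption: the first spectrum serves as the alignment target
    aligned = [align_to(reference, s, max_shift) for s in cleaned]
    return [normalize_area(s) for s in aligned]
```

Calling `preprocess(spectra)` on a list of equal-length intensity lists returns spectra that share a common peak position and scale, which is what the feature extraction/selection step expects as input.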

This chapter focuses on the feature extraction/selection step in the modeling and analysis of spectral data. In particular, the properties of feature extraction/selection procedures are demonstrated throughout the chapter with spectral data from high-resolution nuclear magnetic resonance spectroscopy, a widely used technique for studying metabolomics.

Background

Metabolomics is the global analysis for the detection and recognition of metabolic changes in biological systems in response to pathophysiological stimuli and to the intake of toxins or nutrition (Nicholson et al., 2002). A variety of techniques, including electrophoresis, chromatography, mass spectroscopy, and nuclear magnetic resonance, are available for studying metabolomics. Among these techniques, proton nuclear magnetic resonance (1H-NMR) has the advantages of high resolution, minimal cost, and little sample preparation (Dunn & Ellis, 2005). Moreover, the technique generates high-throughput data, which permits simultaneous investigation of hundreds of metabolite features. Figure 2 shows a set of spectra generated by 600 MHz 1H-NMR spectroscopy. The x-axis indicates the chemical shift in units of parts per million (ppm), and the y-axis indicates the intensity at each chemical shift. By convention, chemical shifts on the x-axis are listed from largest to smallest. Analysis of high-resolution NMR spectra usually involves combining multiple samples, each with tens of thousands of correlated metabolite features on different scales.

Figure 2. Multiple spectra generated by 600 MHz 1H-NMR spectroscopy

This leads to a huge number of data points and a situation that challenges analytical and computational capabilities. A variety of multivariate statistical methods have been introduced to reduce the complexity of metabolic spectra and thus help identify meaningful patterns in high-resolution NMR spectra (Holmes & Antti, 2002). Principal components analysis (PCA) and clustering analysis are examples of unsupervised methods that have been widely used to facilitate the extraction of implicit patterns and elicit the natural groupings of the spectral dataset without prior information about the sample class (e.g., Beckonert et al., 2003). Supervised methods have been applied to classify metabolic profiles according to their various conditions (e.g., Holmes et al., 2001). The widely used supervised methods in metabolomics include Partial Least Squares (PLS) methods, k-nearest neighbors, and neural networks (Lindon et al., 2001).
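As a rough illustration of how PCA reduces many correlated spectral features to a few scores, the sketch below computes the first principal component of a small set of spectra. It uses power iteration in plain Python so that it stays self-contained; this is an assumption made for the example, and the function names are hypothetical. In practice a linear algebra library would normally be used. Each spectrum is assumed to be a list of intensities of equal length (rows are samples, columns are features).

```python
import random


def mean_center(X):
    """Subtract the per-feature mean so that PCA describes variation around the average spectrum."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def project(Xc, w):
    """Scores: projection of each centered spectrum onto the loading vector w."""
    return [sum(x * wj for x, wj in zip(row, w)) for row in Xc]


def first_principal_component(X, n_iter=200, seed=0):
    """Return (loadings, scores) of the first principal component via power iteration.

    Each iteration applies Xc^T (Xc w) without forming the p x p covariance matrix,
    which matters when p (the number of spectral features) is in the tens of thousands.
    """
    Xc = mean_center(X)
    p = len(Xc[0])
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(p)]  # random starting direction
    for _ in range(n_iter):
        scores = project(Xc, w)
        w = [sum(s * row[j] for s, row in zip(scores, Xc)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm == 0:  # all spectra identical: no variance to explain
            break
        w = [v / norm for v in w]
    return w, project(Xc, w)


# Usage with synthetic data: two groups of spectra that differ in peak position.
if __name__ == "__main__":
    rng = random.Random(1)
    group_a = [[(1.0 if j == 5 else 0.0) + rng.gauss(0, 0.01) for j in range(30)] for _ in range(3)]
    group_b = [[(1.0 if j == 20 else 0.0) + rng.gauss(0, 0.01) for j in range(30)] for _ in range(3)]
    loadings, scores = first_principal_component(group_a + group_b)
    print([round(s, 2) for s in scores])
```

The scores give one number per sample, and samples from the two synthetic groups fall on opposite sides of zero without any class labels being supplied. The loadings indicate which spectral positions drive that separation, which is the sense in which PCA serves as an unsupervised feature extraction method.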
