Cluster Analysis in R With Big Data Applications

Alicia Taylor Lamere
DOI: 10.4018/978-1-7998-2768-9.ch004


This chapter discusses several popular clustering functions and open source software packages in R and the feasibility of using them on larger datasets. These include the kmeans() function, the pvclust package, and the dbscan package (DBSCAN: density-based spatial clustering of applications with noise), which implement K-means, hierarchical, and density-based clustering, respectively. Dimension reduction methods such as PCA (principal component analysis) and SVD (singular value decomposition), as well as the choice of distance measure, are explored as methods to improve the performance of hierarchical and model-based clustering methods on larger datasets. These methods are illustrated through an application to a dataset of RNA-sequencing expression data for cancer patients obtained from The Cancer Genome Atlas Kidney Clear Cell Carcinoma (TCGA-KIRC) data collection from The Cancer Imaging Archive (TCIA).
Chapter Preview


The basic concept behind all clustering methods is to group together similar data points based on the variables that describe them. Through this clustering, we can observe characteristics that distinguish data points from cluster to cluster, leading to potential hypotheses about our population. We can also use these clusters to identify subsets of our population, which we can focus on separately in future analysis. Clustering is generally accomplished by measuring the distance between data points and grouping those that have the smallest distances between them. The goal is to maximize the separation between different clusters while minimizing the separation between data points within each cluster. An important consideration, then, is how we choose to measure this distance.
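
To make the role of the distance measure concrete, the following is a minimal sketch in base R. It assumes a made-up toy matrix of four 2-D points (the names a1, a2, b1, b2 and their coordinates are illustrative only, not data from the chapter) and compares two measures available in the dist() function.

```r
# Toy data (hypothetical values): two well-separated groups of 2-D points.
x <- rbind(
  a1 = c(0, 0),
  a2 = c(1, 0),
  b1 = c(10, 10),
  b2 = c(10, 11)
)

# Pairwise distances under two common measures.
d_euclid <- dist(x, method = "euclidean")
d_manhat <- dist(x, method = "manhattan")

# Convert to full matrices so individual pairs are easy to inspect.
m_euclid <- as.matrix(d_euclid)
m_manhat <- as.matrix(d_manhat)

print(round(m_euclid, 2))
print(m_manhat)
```

In both matrices the within-group pairs (a1 with a2, b1 with b2) are much closer than the between-group pairs, but the two measures disagree on how large the between-group distances are. That difference is what makes the choice of measure matter on less clear-cut data.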

Key Terms in this Chapter

Density-Based Clustering: A clustering technique that seeks to identify areas of high density separated by areas of low density.

Linkage: A numerical method for determining the distance between clusters when performing hierarchical clustering.

Distance Measure: A numerical measure of the dissimilarity between two points, which must satisfy positivity, symmetry, and the triangle inequality.

RNA-Sequencing Expression: Measures of individual gene expression in the form of counts, usually obtained through technologies such as Illumina sequencing.

Clustering: An unsupervised data mining technique used to find useful or meaningful patterns and groupings within a dataset.

K-Means Clustering: A clustering technique that identifies a predetermined number of clusters, K, of similar size and shape through an iterative search that minimizes the total distance from data points to their cluster centers across all K clusters.

Pvclust: An R package for assessing the uncertainty in hierarchical cluster analysis. For each cluster in hierarchical clustering, quantities called p-values are calculated via multiscale bootstrap resampling.

Dimension Reduction: A method to capture the information, usually the variability, contained within a dataset at a lower dimension.

Hierarchical Clustering: A clustering technique that iteratively collects or separates data points into clusters using a given linkage method to evaluate the distance between clusters.
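
Several of the terms above can be tied together in a short base R sketch. It assumes simulated data (two groups of 50 observations on 20 variables, chosen only for illustration) and uses prcomp(), kmeans(), dist(), hclust(), and cutree() from the stats package. The pvclust and dbscan packages are separate installs and are deliberately not used here. This is one possible workflow under those assumptions, not the chapter's own analysis.

```r
set.seed(42)  # reproducible simulated data and k-means starts

# Simulated data (an assumption for illustration): two groups that differ in mean.
n_per_group <- 50
p <- 20
group1 <- matrix(rnorm(n_per_group * p, mean = 0), ncol = p)
group2 <- matrix(rnorm(n_per_group * p, mean = 5), ncol = p)
x <- rbind(group1, group2)

# Dimension reduction: PCA on centered, scaled variables; keep two components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# K-means with a predetermined K = 2, run on the reduced data.
# nstart = 10 tries several random starts and keeps the best solution.
km <- kmeans(scores, centers = 2, nstart = 10)

# Hierarchical clustering: Euclidean distance with complete linkage,
# then cut the dendrogram into two clusters.
hc <- hclust(dist(scores, method = "euclidean"), method = "complete")
hc_labels <- cutree(hc, k = 2)

# Cross-tabulate the two sets of labels to compare the partitions.
print(table(kmeans = km$cluster, hclust = hc_labels))
```

Clustering on the principal component scores rather than on all 20 variables shrinks the distance computation, which is the motivation for dimension reduction on larger datasets. On real data the number of components to keep and the linkage method would need to be justified rather than fixed in advance as they are here.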
