## Abstract

Cluster analysis is a set of statistical models and algorithms that attempt to find “natural groupings” of sampling units (e.g., customers, survey respondents, plant or animal species) based on measurements. The observable measurements are sometimes called manifest variables, and cluster membership is called a latent variable. Each sampling unit is assumed to come from one of K clusters or classes, but the cluster identifier cannot be observed directly and can only be inferred from the manifest variables. See Bartholomew and Knott (1999) and Everitt, Landau and Leese (2001) for a broader survey of existing methods for cluster analysis. Many applications in science, engineering, social science, and industry require grouping observations into “types.” Identifying typologies is challenging, especially when the responses (manifest variables) are categorical. The classical approach to cluster analysis for such data is latent class analysis (LCA), in which the manifest variables are assumed to be independent conditional on the cluster identity. For example, Aitkin, Anderson and Hinde (1981) classified 468 teachers into clusters according to their binary responses to 38 teaching style questions. This basic assumption of classical LCA is often violated and seems to have been adopted for convenience rather than because it is reasonable in a wide range of situations. For example, two questions in the teaching styles study are “Do you usually allow your pupils to move around the classroom?” and “Do you usually allow your pupils to talk to one another?” The answers to these questions are most likely correlated even within a class.

## Main Focus

Assume a random sample of *n* observations, where each comes from one of *K* unobserved classes. Random variable *Y* ∈ {1, …, *K*} is the latent variable, specifying the value of class membership. Let *P*(*Y*=*k*) = η_{k} specify the *prior distribution* of class membership, where

$$\sum_{k=1}^{K} \eta_k = 1.$$
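To make the prior distribution of class membership concrete, the following minimal sketch draws latent class labels *Y* for a sample of *n* observations. The prior values in `eta`, the sample size, and the helper names `check_prior` and `sample_classes` are illustrative assumptions, not part of the original text.

```python
import random

# Hypothetical prior over K = 3 latent classes (illustrative values only).
eta = [0.5, 0.3, 0.2]

def check_prior(eta, tol=1e-9):
    """Return True if eta is a valid prior: nonnegative entries summing to 1."""
    return all(e >= 0 for e in eta) and abs(sum(eta) - 1.0) < tol

def sample_classes(eta, n, seed=0):
    """Draw n latent class labels Y in {1, ..., K} with P(Y = k) = eta[k - 1]."""
    rng = random.Random(seed)
    labels = list(range(1, len(eta) + 1))
    return rng.choices(labels, weights=eta, k=n)

y = sample_classes(eta, n=1000)

# Empirical class proportions; for large n these should be close to eta.
proportions = [y.count(k) / len(y) for k in range(1, len(eta) + 1)]
```

In a real analysis the labels `y` are never observed; they are simulated here only to show what the prior describes.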

For each observation *i* = 1, …, *n*, the researcher observes *p* manifest variables *X*_{i} = (*X*_{i1}, …, *X*_{ip})′. Given that an observation comes from class *k* (i.e., *Y* = *k*), the *class-conditional distribution* of *X*, denoted as *f*_{k}(*x*;θ_{k}), is generally assumed to come from a common distribution family. For example, classical LCA assumes that the components of *X* are each multinomial and independent of each other for objects within the same class. If each manifest variable takes only two values, hereafter labeled generically “yes” and “no”, then each *X*_{j} is a Bernoulli trial given *Y*, with probability *P*(*X*_{j} = *x*_{j} | *Y*). Let π_{jk} be the probability that someone in class *k* has a “yes” value for manifest variable *X*_{j}. Then the class-conditional distribution, under the assumption of class-conditional independence, is