Apologies for the rudimentary question, I am new to this form of analysis and have a very limited understanding of the principles so far.
I was just wondering whether the parametric assumptions of multivariate/univariate tests also apply to cluster analysis. Many of the sources I have read on cluster analysis fail to specify any assumptions.
I am particularly interested in the assumption of independence of observations. My understanding is that violation of this assumption (in ANOVA and MANOVA, for example) is serious because it influences estimates of error. From my reading so far, it seems that cluster analysis is largely a descriptive technique (that only involves statistical inference in certain specified cases). Accordingly, are assumptions such as independence and normally distributed data required?
Any recommendations of texts that discuss this issue would be greatly appreciated.
Well, clustering techniques are not limited to distance-based methods, where we seek groups of statistical units that are unusually close to each other in a geometrical sense. There is also a range of techniques relying on density (clusters are seen as “regions” in the feature space) or on probability distributions.
The latter case is also known as model-based clustering; psychometricians use the term Latent Profile Analysis to denote this specific case of a Finite Mixture Model, where we assume that the population is composed of different unobserved groups, or latent classes, and that the joint density of all manifest variables is a mixture of these class-specific densities. Good implementations are available in the mclust package for R (through its Mclust function) and in the Mplus software. Different parameterizations of the class covariance matrices (class-invariant or class-specific) can be used (in fact, Mclust uses the BIC criterion to select the optimal one while varying the number of clusters).
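To make the BIC-based selection a bit more concrete, here is a minimal sketch in plain Python. It is not Mclust: it fits a univariate Gaussian mixture by EM on simulated data and compares BIC across 1 to 3 classes. The function names (`fit_gmm_1d`, `e_step`, `log_normal_pdf`), the simulated data, the starting values, and the variance floor are all my own assumptions for illustration. Since the data are univariate, the choice among covariance structures is not shown.

```python
import math
import random


def log_normal_pdf(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)


def e_step(x, weights, means, variances):
    """Return posterior class probabilities and the total log-likelihood."""
    resp, loglik = [], 0.0
    for v in x:
        logs = [math.log(w) + log_normal_pdf(v, m, s2)
                for w, m, s2 in zip(weights, means, variances)]
        top = max(logs)
        # log-sum-exp keeps the mixture density numerically stable
        total = top + math.log(sum(math.exp(l - top) for l in logs))
        loglik += total
        resp.append([math.exp(l - total) for l in logs])
    return resp, loglik


def fit_gmm_1d(x, k, n_iter=200, var_floor=1e-3):
    """Fit a k-class 1-D Gaussian mixture by EM (hypothetical helper)."""
    n = len(x)
    xs = sorted(x)
    # Deterministic start: means at evenly spaced quantiles, common variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(x) / n
    variances = [sum((v - grand_mean) ** 2 for v in x) / n] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp, _loglik = e_step(x, weights, means, variances)
        for j in range(k):
            # M-step: weighted proportion, mean, and variance of class j
            nk = max(sum(r[j] for r in resp), 1e-10)
            weights[j] = nk / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nk
            variances[j] = max(
                sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nk,
                var_floor,  # assumed floor to avoid degenerate classes
            )
    _, loglik = e_step(x, weights, means, variances)
    return weights, means, variances, loglik


# Simulated data (an assumption): two well-separated latent classes.
random.seed(0)
x = ([random.gauss(0.0, 1.0) for _ in range(150)]
     + [random.gauss(8.0, 1.0) for _ in range(150)])

n = len(x)
fits, bic = {}, {}
for k in (1, 2, 3):
    fits[k] = fit_gmm_1d(x, k)
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    # With this sign convention, lower BIC is better.
    bic[k] = -2.0 * fits[k][3] + n_params * math.log(n)
    print(k, round(bic[k], 1))

best_k = min(bic, key=bic.get)
print("number of classes with the lowest BIC:", best_k)
```

Note that the likelihood in `e_step` is a plain sum over observations, which is where the independence assumption asked about in the question enters this kind of model.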
The standard Latent Class Model also makes the assumption that observed data come from a mixture of g multivariate multinomial distributions. A good overview is available in Model-based cluster analysis: a Defence, by Gilles Celeux.
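For reference, this mixture can be written out as follows, assuming the usual local-independence form of the latent class model. The notation is mine, not taken from the paper: $\pi_k$ are the class proportions, $d$ is the number of manifest variables, and $p_{jkc}$ is the probability that variable $j$ takes category $c$ (out of $C_j$) in class $k$.

$$
f(\mathbf{x}) = \sum_{k=1}^{g} \pi_k \prod_{j=1}^{d} \prod_{c=1}^{C_j} p_{jkc}^{\,\mathbb{1}(x_j = c)}, \qquad \sum_{k=1}^{g} \pi_k = 1 .
$$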
Inasmuch as these methods rely on distributional assumptions, it also becomes possible to use formal tests or goodness-of-fit indices to decide on the number of clusters or classes. This remains a difficult problem in distance-based cluster analysis, but see the following articles, which discuss the issue:
- Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15), 3201-3212.
- Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
- Hennig, C. (2008). Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis, 99, 1154-1176.