the Fisher analysis aims at simultaneously maximising the
between-class separation, while minimising the within-class
dispersion. A useful measure of the discrimination power of a variable
is hence given by the diagonal quantity: $B_{ii}/W_{ii}$.
I understand that the size ($p \times p$) of the between-class ($B$) and within-class ($W$) matrices is given by the number of input variables, $p$. Given this, how can $B_{ii}/W_{ii}$ be a “useful measure of the discrimination power” of a single variable? At least two variables are required to construct the matrices $B$ and $W$, so the respective traces would represent more than one variable.
Update: Am I right in thinking that $B_{ii}/W_{ii}$ is not a trace over a trace, where the sum is implied, but the matrix element $B_{ii}$ divided by $W_{ii}$? Currently that is the only way I can reconcile the expression with the concept.
Here is a short tale about Linear Discriminant Analysis (LDA) as a reply to the question.
When we have one variable and $k$ groups (classes) to discriminate by it, this is ANOVA. The discrimination power of the variable is $SS_{\text{between groups}}/SS_{\text{within groups}}$, or $B/W$.
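For a single variable the ratio can be computed by hand. A minimal sketch with made-up numbers (the three groups below are purely illustrative and not taken from any dataset):

```python
# One variable measured in k = 3 groups (hypothetical values).
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.0, 2.0, 3.0],
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

ss_within = 0.0   # deviations of observations from their own group mean
ss_between = 0.0  # deviations of group means from the grand mean, weighted by group size
for g in groups:
    group_mean = sum(g) / len(g)
    ss_within += sum((x - group_mean) ** 2 for x in g)
    ss_between += len(g) * (group_mean - grand_mean) ** 2

ss_total = sum((x - grand_mean) ** 2 for x in all_values)
b_over_w = ss_between / ss_within

print(ss_between, ss_within, ss_total, b_over_w)  # 54.0 6.0 60.0 9.0
```

The total sum of squares splits exactly into the between and within parts, which is what makes B/W a meaningful ratio.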
When we have $p$ variables, this is MANOVA. If the variables are correlated neither in the total sample nor within groups, then the above discrimination power, $B/W$, is computed analogously and can be written as $\operatorname{trace}(S_b)/\operatorname{trace}(S_w)$, where $S_w$ is the pooled within-group scatter matrix (i.e. the sum of the $k$ $p \times p$ SSCP matrices of the variables, each centered about the respective group’s centroid); $S_b$ is the between-group scatter matrix, $S_b = S_t - S_w$, where $S_t$ is the scatter matrix for the whole data (the SSCP matrix of the variables centered about the grand centroid). (A “scatter matrix” is just a covariance matrix without the division by sample_size − 1.)
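To connect this with the question: the diagonal element $B_{ii}/W_{ii}$ is simply the one-variable ANOVA ratio for variable $i$, while the trace ratio pools all variables. A rough sketch on synthetic data (the data, the seed and the helper name `scatter` are my own assumptions for illustration):

```python
import numpy as np

# Synthetic data (assumed for illustration): k = 3 groups, p = 2 variables.
rng = np.random.default_rng(0)
k, n_per_group = 3, 30
group_means = np.array([[0.0, 0.0], [2.0, 0.5], [4.0, 1.0]])
X = np.vstack([rng.normal(m, 1.0, size=(n_per_group, 2)) for m in group_means])
y = np.repeat(np.arange(k), n_per_group)


def scatter(M):
    """SSCP matrix of the columns of M, centered about their own centroid."""
    D = M - M.mean(axis=0)
    return D.T @ D


St = scatter(X)                                  # total scatter
Sw = sum(scatter(X[y == g]) for g in range(k))   # pooled within-group scatter
Sb = St - Sw                                     # between-group scatter

# Element-wise ratio of the diagonals: one B/W value per variable.
per_variable_ratio = np.diag(Sb) / np.diag(Sw)
# Ratio of traces: a single overall figure.
trace_ratio = np.trace(Sb) / np.trace(Sw)

# Cross-check for variable 0 with plain one-way ANOVA sums of squares.
x0 = X[:, 0]
ss_between = sum(
    (y == g).sum() * (x0[y == g].mean() - x0.mean()) ** 2 for g in range(k)
)
ss_within = sum(((x0[y == g] - x0[y == g].mean()) ** 2).sum() for g in range(k))

print(per_variable_ratio, trace_ratio, ss_between / ss_within)
```

The first entry of `per_variable_ratio` agrees with the hand-computed ANOVA ratio for that variable, so each diagonal ratio does describe a single variable.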
When there is some correlation between the variables – and usually there is – the above $B/W$ is expressed by $S_w^{-1}S_b$, which is no longer a scalar but a matrix. This is simply because there are $p$ discriminative variables hidden behind this “overall” discrimination, partly sharing it.
Now, we may want to go deeper into MANOVA and decompose $S_w^{-1}S_b$ into new, mutually orthogonal latent variables (their number is $\min(p, k-1)$) called discriminant functions or discriminants, the 1st being the strongest discriminator, the 2nd the next strongest, etc., just as we do in principal component analysis. We replace the original correlated variables with uncorrelated discriminants without loss of discriminative power. Because each successive discriminant is weaker, we may accept a small subset of the first $m$ discriminants without great loss of discriminative power (again, similar to how we use PCA). This is the essence of LDA as a dimensionality reduction technique (LDA is also a Bayes classification technique, but that is an entirely separate topic).
LDA thus resembles PCA. PCA decomposes “correlatedness”, LDA decomposes “separatedness”. In LDA, because the above matrix expressing “separatedness” isn’t symmetric, an algebraic workaround is used to find its eigenvalues and eigenvectors (see footnote 1). The eigenvalue of each discriminant function (a latent variable) is its discriminative power $B/W$ that I described in the first paragraph. It is also worth mentioning that the discriminants, albeit uncorrelated, are not geometrically orthogonal as axes drawn in the original variable space.
Some potentially related topics that you might want to read:
LDA is MANOVA “deepened” into analysing latent structure and is a particular case of Canonical correlation analysis (exact equivalence between them as such).
How LDA classifies objects and what are Fisher’s coefficients. (I link only to my own answers currently, as I remember them, but there are many good and better answers from other people on this site as well).
1 LDA extraction phase computations are as follows. The eigenvalues ($L$) of $S_w^{-1}S_b$ are the same as those of the symmetric matrix $(U^{-1})' S_b U^{-1}$, where $U$ is the Cholesky root of $S_w$: an upper-triangular matrix such that $U'U = S_w$. As for the eigenvectors of $S_w^{-1}S_b$, they are given by $V = U^{-1}E$, where $E$ are the eigenvectors of the above matrix $(U^{-1})' S_b U^{-1}$. (Note: $U$, being triangular, can be inverted, using a low-level language, faster than with a standard generic “inv” function of packages.)
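The Cholesky route can be sketched in a few lines of NumPy. This is only an illustration on synthetic data (the data, seed and helper names are assumptions), and it uses the generic `np.linalg.inv` rather than a dedicated triangular solver:

```python
import numpy as np

# Synthetic correlated data (assumed for illustration): k = 3 groups, p = 3 variables.
rng = np.random.default_rng(1)
k, n_per_group, p = 3, 40, 3
group_means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = np.vstack([rng.normal(size=(n_per_group, p)) @ mix + m for m in group_means])
y = np.repeat(np.arange(k), n_per_group)


def scatter(M):
    """SSCP matrix of the columns of M, centered about their own centroid."""
    D = M - M.mean(axis=0)
    return D.T @ D


Sw = sum(scatter(X[y == g]) for g in range(k))
Sb = scatter(X) - Sw

# NumPy returns the lower Cholesky factor; transpose it so that U'U = Sw.
U = np.linalg.cholesky(Sw).T
U_inv = np.linalg.inv(U)

M = U_inv.T @ Sb @ U_inv
M = (M + M.T) / 2                      # remove round-off asymmetry
L, E = np.linalg.eigh(M)               # symmetric eigenproblem
order = np.argsort(L)[::-1]            # strongest discriminant first
L, E = L[order], E[:, order]
V = U_inv @ E                          # eigenvectors of inv(Sw) @ Sb

# Reference: eigenvalues of the non-symmetric matrix computed directly.
direct = np.sort(np.linalg.eigvals(np.linalg.solve(Sw, Sb)).real)[::-1]
print(L, direct)
```

With $k = 3$ groups only $\min(p, k-1) = 2$ eigenvalues are meaningfully nonzero; the third is zero up to rounding. The eigenvectors obtained this way also satisfy $V' S_w V = I$.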
The described workaround eigendecomposition of $S_w^{-1}S_b$ is implemented in some programs (in SPSS, for example), while other programs implement a “quasi ZCA-whitening” method which, being just a little slower, gives the same results and is described elsewhere. To summarize it here: obtain the ZCA-whitening matrix for $S_w$, the symmetric square root $S_w^{-1/2}$ (which is done through eigendecomposition); then the eigendecomposition of $S_w^{-1/2} S_b S_w^{-1/2}$ (which is a symmetric matrix) yields the discriminant eigenvalues $L$ and eigenvectors $A$, whereby the discriminant eigenvectors are $V = S_w^{-1/2}A$. The “quasi ZCA-whitening” method can be rewritten to work via singular value decomposition of the casewise dataset instead of the $S_w$ and $S_b$ scatter matrices; that adds computational precision (which is important in near-singular situations) but sacrifices speed.
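The whitening variant, working from the scatter matrices, might look like the following sketch (synthetic data and helper names are assumptions; the SVD-on-cases version is not shown):

```python
import numpy as np

# Synthetic correlated data (assumed for illustration): k = 3 groups, p = 3 variables.
rng = np.random.default_rng(1)
k, n_per_group, p = 3, 40, 3
group_means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = np.vstack([rng.normal(size=(n_per_group, p)) @ mix + m for m in group_means])
y = np.repeat(np.arange(k), n_per_group)


def scatter(M):
    """SSCP matrix of the columns of M, centered about their own centroid."""
    D = M - M.mean(axis=0)
    return D.T @ D


Sw = sum(scatter(X[y == g]) for g in range(k))
Sb = scatter(X) - Sw

# Symmetric inverse square root of Sw via its eigendecomposition.
w, Q = np.linalg.eigh(Sw)
Sw_inv_sqrt = (Q * w ** -0.5) @ Q.T

M = Sw_inv_sqrt @ Sb @ Sw_inv_sqrt
M = (M + M.T) / 2                      # remove round-off asymmetry
L, A = np.linalg.eigh(M)
order = np.argsort(L)[::-1]            # strongest discriminant first
L, A = L[order], A[:, order]
V = Sw_inv_sqrt @ A                    # discriminant eigenvectors

print(L)
```

The eigenvalues agree with those of $S_w^{-1}S_b$; the eigenvectors can differ from the Cholesky route only by sign, since both are scaled so that $V' S_w V = I$.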
OK, let’s turn to the statistics usually computed in LDA. Canonical correlations corresponding to the eigenvalues are $\Gamma = \sqrt{L/(L+1)}$. Whereas the eigenvalue of a discriminant is $B/W$ of the ANOVA of that discriminant, the squared canonical correlation is $B/T$ ($T$ = total sum of squares) of that ANOVA.
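The link between the two quantities is plain arithmetic, since $T = B + W$. A tiny check with hypothetical sums of squares (the numbers are made up):

```python
import math

# Hypothetical ANOVA sums of squares for one discriminant.
ss_between, ss_within = 54.0, 6.0
ss_total = ss_between + ss_within

eigenvalue = ss_between / ss_within                          # B/W
canonical_corr = math.sqrt(eigenvalue / (eigenvalue + 1.0))  # sqrt(L / (L + 1))
b_over_t = ss_between / ss_total                             # B/T

print(eigenvalue, canonical_corr ** 2, b_over_t)
```

Dividing the numerator and denominator of $L/(L+1)$ by $1/W$ gives $B/(B+W) = B/T$, which is why the squared canonical correlation equals `b_over_t`.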
If you normalize the columns of the eigenvectors $V$ (to SS = 1), these values can be seen as the direction cosines of the rotation of axes-variables into axes-discriminants; so with their help one can plot discriminants as axes on the scatterplot defined by the original variables (the eigenvectors, as axes in that variables’ space, are not orthogonal).
The unstandardized discriminant coefficients or weights are simply the scaled eigenvectors $C = \sqrt{N-k}\,V$ (with $N$ the total sample size and $k$ the number of groups). These are the coefficients of linear prediction of discriminants by the centered original variables. The values of the discriminant functions themselves (discriminant scores) are $XC$, where $X$ is the centered original variables (input multivariate data with each column centered). Discriminants are uncorrelated. And when computed by the formula just above, they also have the property that their pooled within-class covariance matrix is the identity matrix.
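These properties are easy to verify numerically. The sketch below assumes the eigenvectors come from the Cholesky route of the footnote, so that $V' S_w V = I$; the data, seed and helper names are illustrative assumptions:

```python
import numpy as np

# Synthetic correlated data (assumed for illustration): k = 3 groups, p = 3 variables.
rng = np.random.default_rng(2)
k, n_per_group, p = 3, 40, 3
group_means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = np.vstack([rng.normal(size=(n_per_group, p)) @ mix + m for m in group_means])
y = np.repeat(np.arange(k), n_per_group)
N = len(y)


def scatter(M):
    """SSCP matrix of the columns of M, centered about their own centroid."""
    D = M - M.mean(axis=0)
    return D.T @ D


Sw = sum(scatter(X[y == g]) for g in range(k))
Sb = scatter(X) - Sw

# Eigen-extraction via the Cholesky root; keep the min(p, k-1) discriminants.
U_inv = np.linalg.inv(np.linalg.cholesky(Sw).T)
L, E = np.linalg.eigh(U_inv.T @ Sb @ U_inv)
keep = np.argsort(L)[::-1][: min(p, k - 1)]
L, V = L[keep], (U_inv @ E)[:, keep]

C = np.sqrt(N - k) * V                 # unstandardized coefficients
scores = (X - X.mean(axis=0)) @ C      # discriminant scores

within_scatter = sum(scatter(scores[y == g]) for g in range(k))
pooled_within_cov = within_scatter / (N - k)
between_scatter = scatter(scores) - within_scatter
b_over_w = np.diag(between_scatter) / np.diag(within_scatter)

print(np.round(pooled_within_cov, 6))
print(b_over_w, L)
```

The pooled within-class covariance of the scores comes out as the identity, and the ANOVA ratio B/W of each discriminant's scores reproduces its eigenvalue.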
Optional constant terms accompanying the unstandardized coefficients, which allow un-centering the discriminants if the input variables had nonzero means, are $C_0 = -\sum_p \operatorname{diag}(\bar{X})\,C$, where $\operatorname{diag}(\bar{X})$ is the diagonal matrix of the $p$ variables’ means and $\sum_p$ is the sum across the variables.
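The role of the constant is purely algebraic, so it can be illustrated with any coefficient matrix. In this sketch both the data and the matrix `C` are hypothetical stand-ins, not the output of an actual LDA:

```python
import numpy as np

rng = np.random.default_rng(3)
# Raw data with clearly nonzero means (assumed for illustration).
X = rng.normal(loc=[10.0, -3.0, 5.0], scale=1.0, size=(50, 3))
# Hypothetical coefficients for two discriminants (p = 3 rows, 2 columns).
C = np.array([[0.8, -0.2], [0.1, 0.9], [-0.5, 0.3]])

x_bar = X.mean(axis=0)
# diag(x_bar) @ C scales row i of C by the mean of variable i;
# summing over the rows (variables) gives one constant per discriminant.
C0 = -(np.diag(x_bar) @ C).sum(axis=0)

scores_from_centered = (X - x_bar) @ C
scores_from_raw = X @ C + C0

print(np.abs(scores_from_centered - scores_from_raw).max())
```

Applying the coefficients to raw data plus the constant gives the same scores as applying them to centered data; the constant is equivalently `-x_bar @ C`.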
In standardized discriminant coefficients, the contribution of variables to a discriminant is adjusted for the fact that variables have different variances and might be measured in different units: $K = \sqrt{\operatorname{diag}(S_w)}\,V$ (where $\operatorname{diag}(S_w)$ is the diagonal matrix with the diagonal of $S_w$). Despite being “standardized”, these coefficients may occasionally exceed 1 (so don’t be confused). If the input variables were z-standardized within each class separately, standardized coefficients = unstandardized ones. Coefficients may be used to interpret discriminants.
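A hedged sketch of the relation between the two sets of coefficients. To keep the group centroids apart, it rescales each variable by its pooled within-group standard deviation rather than z-scoring every class separately; that simplification, the synthetic data and the helper names are my own assumptions:

```python
import numpy as np

# Synthetic data with unequal scales (assumed for illustration).
rng = np.random.default_rng(4)
k, n_per_group, p = 3, 40, 3
group_means = np.array([[0.0, 0.0, 0.0], [2.0, 10.0, 0.0], [0.0, 20.0, 1.0]])
X = np.vstack(
    [rng.normal(size=(n_per_group, p)) * [1.0, 10.0, 0.5] + m for m in group_means]
)
y = np.repeat(np.arange(k), n_per_group)
N = len(y)


def scatter(M):
    """SSCP matrix of the columns of M, centered about their own centroid."""
    D = M - M.mean(axis=0)
    return D.T @ D


def coefficients(data):
    """Return Sw, unstandardized C and standardized K for the given data."""
    Sw = sum(scatter(data[y == g]) for g in range(k))
    Sb = scatter(data) - Sw
    U_inv = np.linalg.inv(np.linalg.cholesky(Sw).T)
    L, E = np.linalg.eigh(U_inv.T @ Sb @ U_inv)
    keep = np.argsort(L)[::-1][: min(p, k - 1)]
    V = (U_inv @ E)[:, keep]
    C = np.sqrt(N - k) * V
    K = np.sqrt(np.diag(Sw))[:, None] * V   # sqrt(diag(Sw)) @ V, row-wise
    return Sw, C, K


Sw, C, K = coefficients(X)

# Rescale so every variable has pooled within-group variance 1.
pooled_sd = np.sqrt(np.diag(Sw) / (N - k))
_, C_scaled, K_scaled = coefficients(X / pooled_sd)

print(np.round(K, 3))
print(np.abs(C_scaled - K_scaled).max())
```

On the original data, each row of `K` is the matching row of `C` multiplied by that variable's pooled within-group standard deviation; once the variables have unit pooled within-group variance, the two coincide.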
Pooled within-group correlations (“structure matrix”, sometimes called loadings) between variables and discriminants are given by $R = {\sqrt{\operatorname{diag}(S_w)}}^{-1} S_w V$. Correlations are insensitive to collinearity problems and constitute an alternative (to the coefficients) guidance in assessment of variables’ contributions, and in interpreting discriminants.
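The structure matrix formula can be checked against correlations computed directly from group-centered data. As before, this is a sketch on synthetic data with assumed helper names, and it relies on the eigenvectors being scaled so that $V' S_w V = I$:

```python
import numpy as np

# Synthetic correlated data (assumed for illustration): k = 3 groups, p = 3 variables.
rng = np.random.default_rng(5)
k, n_per_group, p = 3, 40, 3
group_means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = np.vstack([rng.normal(size=(n_per_group, p)) @ mix + m for m in group_means])
y = np.repeat(np.arange(k), n_per_group)


def scatter(M):
    """SSCP matrix of the columns of M, centered about their own centroid."""
    D = M - M.mean(axis=0)
    return D.T @ D


Sw = sum(scatter(X[y == g]) for g in range(k))
Sb = scatter(X) - Sw
U_inv = np.linalg.inv(np.linalg.cholesky(Sw).T)
L, E = np.linalg.eigh(U_inv.T @ Sb @ U_inv)
keep = np.argsort(L)[::-1][: min(p, k - 1)]
V = (U_inv @ E)[:, keep]

# Structure matrix: inverse of sqrt(diag(Sw)) times Sw V, done row-wise.
R = (Sw @ V) / np.sqrt(np.diag(Sw))[:, None]

# Direct check: center every group at its own centroid, then correlate
# the centered variables with the correspondingly centered discriminants.
Xw = X.copy()
for g in range(k):
    Xw[y == g] -= X[y == g].mean(axis=0)
Dw = Xw @ V
direct = np.array(
    [[np.corrcoef(Xw[:, i], Dw[:, j])[0, 1] for j in range(V.shape[1])]
     for i in range(p)]
)

print(np.round(R, 3))
```

Unlike the standardized coefficients, the entries of `R` are genuine correlations and therefore always lie between −1 and 1.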
See the complete output of the extraction phase of the discriminant analysis of iris data here.
Read this nice later answer, which explains the same things as I did here a bit more formally and in more detail.
This question deals with the issue of standardizing data before doing LDA.