In R, I use the `lda` function from the library `MASS` to do classification. As I understand LDA, input $x$ will be assigned the label $y$ that maximizes $p(y|x)$, right? But when I fit the model, in which $x = (\text{Lag1}, \text{Lag2})$ and $y = \text{Direction}$, I don't quite understand the output from `lda`.

Edit: to reproduce the output below, first run:
```r
library(MASS)
library(ISLR)
train = subset(Smarket, Year < 2005)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = train)
```
```
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
```
I understand all the info in the above output except one thing: what is LD1? I searched the web for it; is it the linear discriminant score? What is that, and why do I need it?

UPDATE
I read several posts (such as this and this one) and also searched the web for DA, and here is what I now think about DA, or LDA.
It can be used to do classification, and when this is the purpose, I can use the Bayes approach: compute the posterior $p(y|x)$ for each class $y_i$, and then classify $x$ to the class with the highest posterior. By this approach, I don't need to find the discriminants at all, right?
As I read in the posts, DA, or at least LDA, is primarily aimed at dimensionality reduction: for $K$ classes and a $D$-dimensional predictor space, I can project the $D$-dimensional $x$ into a new $(K-1)$-dimensional feature space $z$, that is,
$$\begin{align*}
x &= (x_1, \dots, x_D) \\
z &= (z_1, \dots, z_{K-1}) \\
z_i &= w_i^T x
\end{align*}$$
$z$ can be seen as the transformed feature vector from the original $x$, and each $w_i$ is the vector onto which $x$ is projected.
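For concreteness, a minimal sketch of this projection with the fit from above (assuming the `train` and `lda.fit` objects defined earlier): with $K = 2$ classes there is only $K - 1 = 1$ discriminant, and the projection vector $w$ is stored in `lda.fit$scaling`.

```r
# The (K-1)-dimensional projection z = w^T x for the two-class fit above.
w = lda.fit$scaling              # D x (K-1) matrix of projection vectors
z = predict(lda.fit, train)$x    # n x (K-1) matrix of projected scores
dim(z)                           # one column here, since K = 2 gives K - 1 = 1
```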
Am I right about the above statements? If yes, I have the following questions:
1. What is a discriminant? Is each entry $z_i$ in the vector $z$ a discriminant? Or is it $w_i$?
2. How do I do classification using discriminants?
Answer
If you multiply each value of LD1 (the first linear discriminant) by the corresponding elements of the predictor variables and sum them ($-0.6420190 \times \text{Lag1} - 0.5135293 \times \text{Lag2}$), you get a score for each respondent. This score, along with the prior, is used to compute the posterior probability of class membership (there are a number of different formulas for this). Classification is made based on the posterior probability, with observations predicted to be in the class for which they have the highest probability.
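As a sketch of that arithmetic (assuming the `train` and `lda.fit` objects from the question), the scores can be reproduced by centering the predictors at the prior-weighted average of the group means and applying the LD1 coefficients; the result matches what `predict()` returns:

```r
# Compute each observation's discriminant score by hand.
X   = as.matrix(train[, c("Lag1", "Lag2")])
ctr = colSums(lda.fit$prior * lda.fit$means)            # prior-weighted grand mean
s   = scale(X, center = ctr, scale = FALSE) %*% lda.fit$scaling

# The manual scores agree with those from predict().
all.equal(as.numeric(s), as.numeric(predict(lda.fit, train)$x))  # TRUE
```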
The chart below illustrates the relationship between the score, the posterior probability, and the classification, for the data set used in the question. The basic pattern always holds with two-group LDA: there is a 1-to-1 mapping between the scores and the posterior probabilities, and predictions are equivalent whether made from the posterior probabilities or from the scores.
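A rough version of that chart can be drawn from the fitted model (a sketch, assuming the objects from the question):

```r
# Plot posterior probability of "Up" against the discriminant score.
pred = predict(lda.fit, train)
plot(pred$x[, "LD1"], pred$posterior[, "Up"],
     col = ifelse(pred$class == "Up", "blue", "red"),
     xlab = "Discriminant score (LD1)",
     ylab = "Posterior probability of Up")
abline(h = 0.5, lty = 2)   # classes switch where the posterior crosses 0.5

# The two posteriors are complements, so one score determines everything.
summary(rowSums(pred$posterior))   # identically 1
```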
Answers to the sub-questions and some other comments
- Although LDA can be used for dimension reduction, this is not what is going on in the example. With two groups, the reason only a single score is required per observation is that this is all that is needed: the probability of being in one group is the complement of the probability of being in the other (i.e., they add to 1). You can see this in the chart: scores of less than about -.4 are classified as being in the Down group, and higher scores are predicted to be Up.
- Sometimes the vector of scores is called a *discriminant function*. Sometimes the coefficients are called this. I'm not clear on whether either is correct. I believe that the MASS discriminant refers to the coefficients.
- The MASS package's `lda` function produces coefficients in a different way to most other LDA software. The alternative approach computes one set of coefficients for each group, and each set of coefficients has an intercept. With the discriminant functions (scores) computed using these coefficients, classification is based on the highest score, and there is no need to compute posterior probabilities in order to predict the classification (see the sketch after this list). I have put some LDA code in GitHub which is a modification of the MASS function but produces these more convenient coefficients (the package is called Displayr/flipMultivariates, and if you create an object using `LDA` you can extract the coefficients using `obj$original$discriminant.functions`).
- I have posted the R code for all the concepts in this post here.
- There is no single formula for computing posterior probabilities from the score. The easiest way to understand the options is (for me anyway) to look at the source code, using:
```r
library(MASS)
getAnywhere("predict.lda")
```
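For the default `method = "plug-in"`, the two-group case reduces to a logistic function of the score. Here is a sketch of that one option (assuming the objects from the question; this mirrors, but is not, the MASS internals):

```r
# Reproduce the plug-in posterior from the score alone.
pred = predict(lda.fit, train)
s    = pred$x[, "LD1"]

# Group means on the score scale (centered at the prior-weighted grand mean).
ctr = colSums(lda.fit$prior * lda.fit$means)
m   = (lda.fit$means - rep(ctr, each = 2)) %*% lda.fit$scaling

# Two-group plug-in rule: the posterior is a logistic function of the score.
logit = (m["Up", ] - m["Down", ]) * s -
        (m["Up", ]^2 - m["Down", ]^2) / 2 +
        log(lda.fit$prior["Up"] / lda.fit$prior["Down"])
all.equal(as.numeric(1 / (1 + exp(-logit))),
          as.numeric(pred$posterior[, "Up"]))  # TRUE
```

And here is a sketch of the "one set of coefficients per group, with an intercept" convention described in the list above, computed from the MASS fit (this is not the Displayr package's code, just the standard classification-function algebra):

```r
# Classification functions: c_k = W^{-1} m_k, with intercept
# a_k = -0.5 * m_k' W^{-1} m_k + log(prior_k), W = pooled within-group covariance.
X  = as.matrix(train[, c("Lag1", "Lag2")])
Xc = X - lda.fit$means[as.character(train$Direction), ]
W  = crossprod(Xc) / (nrow(X) - 2)
cf = solve(W, t(lda.fit$means))                                  # D x K coefficients
ic = -0.5 * colSums(t(lda.fit$means) * cf) + log(lda.fit$prior)  # K intercepts

# Highest score wins; no posterior needed, and it matches predict()'s classes.
sc = X %*% cf + rep(ic, each = nrow(X))
all(colnames(sc)[max.col(sc)] == as.character(pred$class))  # TRUE
```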
Attribution
Source: Link, Question Author: avocado, Answer Author: Tim