How to understand the correlation coefficient formula?

Can anyone help me understand the Pearson correlation formula?
The sample r is the mean of the products of the standard scores of variables X and Y.

I kind of understand why X and Y need to be standardized, but how should I understand the product of the two z-scores?

This formula is also called the “product-moment correlation coefficient”, but what’s the rationale for taking the product?
I am not sure if I have made my question clear, but I just want to remember the formula intuitively.

Answer

In the comments, 15 ways to understand the correlation coefficient were suggested:


The 13 ways discussed in the Rodgers and Nicewander article (The American Statistician, February 1988) are

  1. A Function of Raw Scores and Means,

    $$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}.$$

  2. Standardized Covariance,

    $$r = \frac{s_{XY}}{s_X s_Y},$$

    where $s_{XY}$ is the sample covariance and $s_X$ and $s_Y$ are the sample standard deviations.

  3. Standardized Slope of the Regression Line,

    $$r = b_{YX} \frac{s_X}{s_Y} = b_{XY} \frac{s_Y}{s_X},$$

    where $b_{YX}$ and $b_{XY}$ are the slopes of the regression lines.

  4. The Geometric Mean of the Two Regression Slopes,

    $$r = \pm\sqrt{b_{YX}\, b_{XY}}.$$

  5. The Square Root of the Ratio of Two Variances (Proportion of Variability Accounted For),

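    $$r = \sqrt{\frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}} = \sqrt{\frac{SS_{REG}}{SS_{TOT}}} = \frac{s_{\hat{Y}}}{s_Y}.$$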

  6. The Mean Cross-Product of Standardized Variables,

    $$r = \frac{\sum z_X z_Y}{N}.$$

  7. A Function of the Angle Between the Two Standardized Regression Lines. The two regression lines (of Y vs. X and X vs. Y) are symmetric about the diagonal. Let the angle between the two lines be β. Then

    $$r = \sec(\beta) \pm \tan(\beta).$$

  8. A Function of the Angle Between the Two Variable Vectors,

    $$r = \cos(\alpha).$$

  9. A Rescaled Variance of the Difference Between Standardized Scores. Letting $z_Y - z_X$ be the difference between the standardized Y and X variables for each observation,

    $$r = 1 - \frac{s^2_{(z_Y - z_X)}}{2} = \frac{s^2_{(z_Y + z_X)}}{2} - 1.$$

  10. Estimated from the “Balloon” Rule,

    $$r \approx \sqrt{1 - (h/H)^2},$$

    where H is the vertical range of the entire XY scatterplot and h is the range through the “center of the distribution on the X axis” (that is, through the point of means).

  11. In Relation to the Bivariate Ellipses of Isoconcentration,

    $$r = \frac{D^2 - d^2}{D^2 + d^2},$$

    where D and d are the major and minor axis lengths, respectively. r also equals the slope of the tangent line of an isocontour (in standardized coordinates) at the point the contour crosses the vertical axis.

  12. A Function of Test Statistics from Designed Experiments,

    $$r = \frac{t}{\sqrt{t^2 + n - 2}},$$

    where t is the test statistic in a two-independent sample t test for a designed experiment with two treatment conditions (coded as X=0,1) and n is the combined total number of observations in the two treatment groups.

  13. The Ratio of Two Means. Assume bivariate normality and standardize the variables. Select some arbitrarily large value $X_c$ of X. Then

    $$r = \frac{E(Y \mid X > X_c)}{E(X \mid X > X_c)}.$$

(Most of this is verbatim, with very slight changes in some of the notation.)
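
Several of the algebraic identities in the list can be checked numerically. The following is only a rough sketch, not something taken from the article: the synthetic data, the seed, and all variable names are assumptions made for illustration. It also assumes one divisor convention per formula (n - 1 for the sample covariance and standard deviations in items 2 to 5, N for the z-scores and variances in items 6 and 9), since the identities hold exactly only when the convention is applied consistently. Items 4 and 5 recover the magnitude of r, so the sign is taken from the regression slope.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed; synthetic data only
n = 200
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

r_ref = np.corrcoef(x, y)[0, 1]           # reference value from NumPy
dx, dy = x - x.mean(), y - y.mean()       # deviations from the means

checks = {}

# 1. Function of raw scores and means
checks["1 raw scores"] = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# 2. Standardized covariance (n - 1 divisor used consistently)
s_xy = (dx * dy).sum() / (n - 1)
s_x, s_y = x.std(ddof=1), y.std(ddof=1)
checks["2 std covariance"] = s_xy / (s_x * s_y)

# 3. Standardized slope of the regression of Y on X
b_yx = s_xy / s_x**2
b_xy = s_xy / s_y**2
checks["3 std slope"] = b_yx * s_x / s_y

# 4. Geometric mean of the two slopes (sign taken from the slope)
checks["4 geometric mean"] = np.sign(b_yx) * np.sqrt(b_yx * b_xy)

# 5. Ratio of SD of fitted values to SD of Y (sign taken from the slope)
y_hat = y.mean() + b_yx * dx
checks["5 variance ratio"] = np.sign(b_yx) * y_hat.std(ddof=1) / s_y

# 6. Mean cross-product of z-scores (population SD, divide by N)
z_x, z_y = dx / x.std(ddof=0), dy / y.std(ddof=0)
checks["6 mean z product"] = (z_x * z_y).sum() / n

# 8. Cosine of the angle between the centered variable vectors
checks["8 cosine"] = dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy))

# 9. Rescaled variance of the difference / sum of z-scores (divisor N)
checks["9 z difference"] = 1 - np.var(z_y - z_x) / 2
checks["9 z sum"] = np.var(z_y + z_x) / 2 - 1

for name, value in checks.items():
    print(f"{name:<18} {value:.6f}   (reference {r_ref:.6f})")

# 12. Pooled two-sample t statistic, with the group coded as 0/1
g = np.repeat([0, 1], n // 2)
yg = rng.normal(size=n) + 0.5 * g
y0, y1 = yg[g == 0], yg[g == 1]
sp2 = (((y0 - y0.mean())**2).sum() + ((y1 - y1.mean())**2).sum()) / (n - 2)
t = (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / len(y0) + 1 / len(y1)))
r12 = t / np.sqrt(t**2 + n - 2)
r12_ref = np.corrcoef(g, yg)[0, 1]
print(f"{'12 t statistic':<18} {r12:.6f}   (reference {r12_ref:.6f})")
```

Every printed value should agree with its reference up to floating-point error. The approximate and distributional items (10, 11, and 13) are left out because they are not exact sample identities.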

Some other methods (perhaps original to this site) are

  • Via circles. r is the slope of the regression line in standardized coordinates. This line can be characterized in various ways, including geometric ones, such as minimizing the total area of circles drawn between the line and the data points in a scatterplot.

  • By coloring rectangles. Covariance can be assessed by coloring rectangles in a scatterplot (that is, by summing signed areas of rectangles). When the scatterplot is standardized, the net amount of color–the total signed error–is r.
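
The rectangle picture also gives one way to read the product of z-scores asked about in the question. What follows is a minimal sketch under stated assumptions, not the construction from the linked answers: it assumes the rectangles are drawn either from the point of means to each data point, or between every ordered pair of data points, and it standardizes with the population standard deviation (divisor N). The function names and the small data set are hypothetical. A rectangle in the first or third quadrant of the standardized plot has positive signed area and one in the second or fourth has negative signed area, so the average signed area measures how much the points line up along a rising or falling diagonal.

```python
import numpy as np

def standardize(v):
    """Convert to z-scores using the population SD (divisor N)."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std(ddof=0)

def r_from_mean_rectangles(x, y):
    """Average signed area of rectangles spanning the origin (the point
    of means after standardizing) and each data point."""
    zx, zy = standardize(x), standardize(y)
    signed_areas = zx * zy          # width times height, sign included
    return signed_areas.mean()

def r_from_pairwise_rectangles(x, y):
    """Half the average signed area of rectangles spanning every ordered
    pair of data points (including each point paired with itself)."""
    zx, zy = standardize(x), standardize(y)
    widths = zx[:, None] - zx[None, :]
    heights = zy[:, None] - zy[None, :]
    return (widths * heights).mean() / 2

# Hypothetical data, chosen only for illustration
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 4.0, 8.0])

r_ref = np.corrcoef(x, y)[0, 1]
print(r_from_mean_rectangles(x, y), r_from_pairwise_rectangles(x, y), r_ref)
```

All three printed numbers should match up to floating-point error. The factor of 1/2 in the pairwise version arises because averaging (z_i - z_j)(w_i - w_j) over all ordered pairs gives twice the covariance of the standardized variables.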

Attribution
Source : Link , Question Author : Aaron Lu , Answer Author :
5 revs, 2 users 89%
