I’m reading a paper and the author wrote:
The effect of A, B, and C on Y was studied through the use of multiple regression analysis. A, B, and C were entered into the regression equation with Y as the dependent variable. The analysis of variance is presented in Table 3. The effect of B on Y was significant, with B correlating .27 with Y.

English is not my mother tongue, and I got really confused here.
First, he said he would run a regression analysis, then he showed us the analysis of variance. Why?
And then he wrote about the correlation coefficient. Isn't that from a correlation analysis? Or can this word also be used to describe a regression slope?
Answer
First, he said he would run a regression analysis, then he showed us the analysis of variance. Why?
Analysis of variance (ANOVA) is just a technique for comparing the variance explained by the model against the variance left unexplained. Since a regression model has both an explained and an unexplained component, it is natural that ANOVA can be applied to it, and many software packages routinely report an ANOVA table alongside linear regression output. Regression is also a very versatile technique: both the t-test and ANOVA can be expressed in regression form; they are just special cases of regression.
For example, here is a sample regression output. The outcome is miles per gallon of some cars and the independent variable is whether the car was domestic or foreign:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 13.18
Model | 378.153515 1 378.153515 Prob > F = 0.0005
Residual | 2065.30594 72 28.6848048 R-squared = 0.1548
-------------+------------------------------ Adj R-squared = 0.1430
Total | 2443.45946 73 33.4720474 Root MSE = 5.3558
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.foreign | 4.945804 1.362162 3.63 0.001 2.230384 7.661225
_cons | 19.82692 .7427186 26.70 0.000 18.34634 21.30751
------------------------------------------------------------------------------
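As an aside, this looks like output from Stata's example auto dataset; if that guess is right, the table above can presumably be reproduced with something like the following (regress with the factor-variable term i.foreign fits the same dummy-variable model):

. sysuse auto, clear
. regress mpg i.foreign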
You can see the ANOVA table reported at the top left of the regression output. The overall F-statistic is 13.18, with a p-value of 0.0005, indicating that the model is predictive. And here is the output from a standalone ANOVA on the same data:
Number of obs = 74 R-squared = 0.1548
Root MSE = 5.35582 Adj R-squared = 0.1430
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 378.153515 1 378.153515 13.18 0.0005
|
foreign | 378.153515 1 378.153515 13.18 0.0005
|
Residual | 2065.30594 72 28.6848048
-----------+----------------------------------------------------
Total | 2443.45946 73 33.4720474
Notice that you can recover the same F-statistic and p-value there.
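Again assuming the Stata auto data, the standalone ANOVA table above can presumably be obtained with:

. anova mpg foreign

Either route decomposes the same explained and residual sums of squares, which is why the F-statistic and p-value agree.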
And then he wrote about the correlation coefficient. Isn't that from a correlation analysis? Or can this word also be used to describe a regression slope?
Assuming the analysis involved only B and Y, technically I would not agree with the word choice. In most cases, the slope and the correlation coefficient cannot be used interchangeably. There is one special case in which the two are the same: when both the independent and dependent variables are standardized (that is, expressed as z-scores).
For example, let’s correlate miles per gallon and the price of the car:
| price mpg
-------------+------------------
price | 1.0000
mpg | -0.4686 1.0000
And here is the same correlation using the standardized variables; you can see the correlation coefficient remains unchanged:
| sdprice sdmpg
-------------+------------------
sdprice | 1.0000
sdmpg | -0.4686 1.0000
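For what it's worth, sdprice and sdmpg here are just z-scored versions of price and mpg. Assuming the same Stata auto data, they could be created and both correlation tables reproduced with something like:

. correlate price mpg
. egen sdprice = std(price)
. egen sdmpg   = std(mpg)
. correlate sdprice sdmpg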
Now, let's compare two regression models. Here is the one using the original variables:
. reg mpg price
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 20.26
Model | 536.541807 1 536.541807 Prob > F = 0.0000
Residual | 1906.91765 72 26.4849674 R-squared = 0.2196
-------------+------------------------------ Adj R-squared = 0.2087
Total | 2443.45946 73 33.4720474 Root MSE = 5.1464
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
price | -.0009192 .0002042 -4.50 0.000 -.0013263 -.0005121
_cons | 26.96417 1.393952 19.34 0.000 24.18538 29.74297
------------------------------------------------------------------------------
… and here is the one with standardized variables:
. reg sdmpg sdprice
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 20.26
Model | 16.0295482 1 16.0295482 Prob > F = 0.0000
Residual | 56.9704514 72 .791256269 R-squared = 0.2196
-------------+------------------------------ Adj R-squared = 0.2087
Total | 72.9999996 73 .999999994 Root MSE = .88953
------------------------------------------------------------------------------
sdmpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sdprice | -.4685967 .1041111 -4.50 0.000 -.6761384 -.2610549
_cons | -7.22e-09 .1034053 -0.00 1.000 -.2061347 .2061347
------------------------------------------------------------------------------
As you can see, the slope with the original variables is -0.0009192, while the slope with the standardized variables is -0.4686, which is exactly the correlation coefficient.
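The reason is the standard link between the slope and the correlation in a simple regression of Y on X:

slope = r × SD(Y) / SD(X)

so once both variables are standardized (each SD equals 1), the slope is simply r. As a rough check, if this is indeed the Stata auto data, its sample SDs are about 5.79 for mpg and about 2949 for price, and -0.4686 × 5.79 / 2949 ≈ -0.00092, which matches the unstandardized slope above.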
So, unless A, B, C, and Y are standardized, I would not agree with the article’s “correlating.” Instead, I would simply say that a one-unit increase in B is associated with the mean of Y being 0.27 units higher.
In more complicated situations, where more than one independent variable is involved, the phenomenon described above no longer holds: with several predictors, each standardized coefficient is a partial coefficient and will generally differ from the simple correlation with Y.
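To make that concrete (still assuming the Stata auto data), you could add a second standardized predictor, say weight, which is also in that dataset:

. egen sdweight = std(weight)
. regress sdmpg sdprice sdweight

In that model the coefficient on sdprice is a standardized partial coefficient, adjusted for sdweight, so it will generally no longer equal the simple correlation of -0.4686.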
Attribution
Source: Link, Question Author: yue86231, Answer Author: Penguin_Knight