I am using rlm() from the R MASS package to fit a multiple linear regression model. It works well for a number of datasets, but for one particular model I am getting quasi-null coefficients:
    Call:
    rlm(formula = Y ~ X1 + X2 + X3 + X4, data = mymodel, maxit = 50,
        na.action = na.omit)

    Residuals:
           Min         1Q     Median         3Q        Max
    -7.981e+01 -6.022e-03 -1.696e-04  8.458e-03  7.706e+01

    Coefficients:
                Value    Std. Error t value
    (Intercept)   0.0002   0.0001     1.8418
    X1            0.0004   0.0000    13.4478
    X2           -0.0004   0.0000   -23.1100
    X3           -0.0001   0.0002    -0.5511
    X4            0.0006   0.0001     8.1489

    Residual standard error: 0.01086 on 49052 degrees of freedom
      (83 observations deleted due to missingness)
For comparison, these are the coefficients calculated by lm():
    Call:
    lm(formula = Y ~ X1 + X2 + X3 + X4, data = mymodel, na.action = na.omit)

    Residuals:
        Min      1Q  Median      3Q     Max
    -76.784  -0.459   0.017   0.538  78.665

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.016633   0.011622  -1.431    0.152
    X1           0.046897   0.004172  11.240  < 2e-16 ***
    X2          -0.054944   0.002184 -25.155  < 2e-16 ***
    X3           0.022627   0.019496   1.161    0.246
    X4           0.051336   0.009952   5.159  2.5e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.574 on 49052 degrees of freedom
      (83 observations deleted due to missingness)
    Multiple R-squared: 0.0182,    Adjusted R-squared: 0.01812
    F-statistic: 227.3 on 4 and 49052 DF,  p-value: < 2.2e-16
The lm() diagnostic plot doesn't show any particularly extreme outlier, as measured by Cook's distance.
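In case it helps to reproduce that check, a minimal sketch is shown below (the fitted-object name fit_lm is mine, not from the original post):

    # OLS fit, as in the question
    fit_lm <- lm(Y ~ X1 + X2 + X3 + X4, data = mymodel, na.action = na.omit)

    # Cook's distance for each observation; list the largest values
    d <- cooks.distance(fit_lm)
    head(sort(d, decreasing = TRUE))

    # Built-in diagnostic plots: Cook's distance and residuals vs. leverage
    plot(fit_lm, which = 4)
    plot(fit_lm, which = 5)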
EDIT

For reference, and after confirming the results based on the answer provided by Macro, the R command to set the tuning parameter $k$ in the Huber estimator ($k = 100$ in this case) is:

    rlm(y ~ x, psi = psi.huber, k = 100)
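A minimal sketch of that comparison, reusing the model formula from the question (the fit object names are mine), would be:

    library(MASS)

    fit_lm  <- lm(Y ~ X1 + X2 + X3 + X4, data = mymodel, na.action = na.omit)
    # Huber M-estimator with a very large tuning constant: essentially no
    # observation is downweighted, so the fit should approach OLS
    fit_rlm <- rlm(Y ~ X1 + X2 + X3 + X4, data = mymodel, na.action = na.omit,
                   maxit = 50, psi = psi.huber, k = 100)

    # Coefficients side by side
    cbind(OLS = coef(fit_lm), Huber_k100 = coef(fit_rlm))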
Answer
The difference is that rlm() fits models using your choice of a number of different $M$-estimators, while lm() uses ordinary least squares (OLS).
In general, the $M$-estimator for a regression coefficient vector minimizes
$$ \sum_{i=1}^{n} \rho \left( \frac{Y_i - {\bf X}_{i} {\boldsymbol \beta}}{\sigma} \right) $$
as a function of ${\boldsymbol \beta}$, where $Y_i$ is the $i$'th response and ${\bf X}_{i}$ is the vector of predictors for individual $i$. Least squares is the special case where
$$ \rho(x) = x^2. $$
However, the default setting for rlm(), which you appear to be using, is the Huber $M$-estimator, which uses
$$
\rho(x) = \begin{cases} \frac{1}{2} x^2 &\mbox{if } |x| \leq k\\
k |x| - \frac{1}{2} k^2 & \mbox{if } |x| > k, \end{cases}
$$
where $k$ is a tuning constant. The default in rlm() is $k = 1.345$. These two estimators minimize different criteria, so it is no surprise that the estimates are different.
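To make the criterion concrete, here is a small illustrative implementation of the Huber $\rho$ above (just the formula written out in R, not MASS's internal code):

    # Huber loss: quadratic for small residuals, linear beyond the cutoff k
    rho_huber <- function(x, k = 1.345) {
      ifelse(abs(x) <= k, 0.5 * x^2, k * abs(x) - 0.5 * k^2)
    }

    # Compare with the squared-error loss used by lm(): near zero the two
    # agree up to the factor 1/2, but far from zero the Huber loss grows only
    # linearly, so large residuals have far less influence on the fit
    curve(x^2, from = -5, to = 5, ylab = "loss")
    curve(rho_huber(x), add = TRUE, lty = 2)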
Edit: From the QQ plot shown above, it looks like you have a very long-tailed error distribution. This is exactly the kind of situation the Huber $M$-estimator is designed for, and in that situation it can give quite different estimates.

When the errors are normally distributed, the two sets of estimates will be pretty similar, since under the normal distribution most observations fall in the $|x| \leq k$ region of the Huber $\rho$ function, where it is equivalent to least squares. In the long-tailed situation you have, many observations fall in the $|x| > k$ region, where the criterion departs from OLS; that explains the discrepancy.
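As a quick illustration (simulated data, not the questioner's), heavy-tailed errors are enough to pull the two estimators apart:

    library(MASS)
    set.seed(1)

    n <- 1000
    x <- rnorm(n)
    # true intercept 1 and slope 1, with errors from a t distribution on
    # 1.5 degrees of freedom, i.e. a very long-tailed error distribution
    y <- 1 + x + rt(n, df = 1.5)

    coef(lm(y ~ x))                    # OLS: pulled around by the extreme errors
    coef(rlm(y ~ x, psi = psi.huber))  # Huber M-estimate: much more stable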
Attribution
Source: Link, Question Author: Robert Kubrick, Answer Author: Macro