Let’s say I have the following data and am running a regression model:
df=data.frame(income=c(5,3,47,8,6,5), won=c(0,0,1,1,1,0), age=c(18,18,23,50,19,39), home=c(0,0,1,0,0,1))
On one hand, I run a linear model to predict on income:
md1 = lm(income ~ age + home + home, data=df)
Second, I run a logit model to predict on the won variable:
md2 = glm(factor(won) ~ age + home, data=df, family=binomial(link="logit"))
For both models, I wonder how I can generate a table or data frame with the predictor response category, fitted value, and the model predicted value.
So for the linear model, something like:
age fitted_income predicted_income 18 3 5 23 3 3 50 4 2 19 5 5 39 6 4 home fitted_income predicted_income 0 5 6 1 3 9
Or perhaps it should be for each data point. So for x_i data point, the fitted and predicted values are:
id age fitted_income predicted_income 1 18 3 5 2 23 3 3 3 50 4 2 4 19 5 5 5 39 6 4
From a statistical standpoint, is such an undertaking useful? Why or why not?
How can this be done in R? (looked at names(md1) and found what I can pull from the model, but haven’t proceeded past that)
Thanks!
Answer
You have to be a bit careful with model objects in R. For example, whilst the fitted values and the predictions of the training data should be the same in the glm()
model case, they are not the same when you use the correct extractor functions:
R> fitted(md2)
1 2 3 4 5 6
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112
R> predict(md2)
1 2 3 4 5 6
-0.3192480 -0.3192480 -0.3252830 0.9818840 -0.2785876 0.3252830
That is because the default for predict.glm()
is to return predictions on the scale of the linear predictor. To get the fitted values we want to apply the inverse of the link function to those values. fitted()
does that for us, and we can get the correct values using predict()
as well:
R> predict(md2, type = "response")
1 2 3 4 5 6
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112
Likewise with residuals()
(or resid()
); the values stored in md2$residuals
are the working residuals are are unlikely to be what you want. The resid()
method allows you to specify the type of residual you want and has a useful default.
For the glm()
model, something like this will suffice:
R> data.frame(Age = df$age, Won = df$won, Fitted = fitted(md2))
Age Won Fitted
1 18 0 0.4208590
2 18 0 0.4208590
3 23 1 0.4193888
4 50 1 0.7274819
5 19 1 0.4308001
6 39 0 0.5806112
Something similar can be done for the lm()
model:
R> data.frame(Age = df$age, Income = df$income, Fitted = fitted(md1))
Age Income Fitted
1 18 5 7.893273
2 18 3 7.893273
3 23 47 28.320749
4 50 8 -1.389725
5 19 6 7.603179
6 39 5 23.679251
Attribution
Source : Link , Question Author : ATMathew , Answer Author : Gavin Simpson