How to model a bounded target variable?

I have 5 variables and I’m trying to predict my target variable, which must be within the range 0 to 70.

How do I use this piece of information to model my target better?

Answer

You don’t necessarily have to do anything. It’s possible the predictor will work fine. Even if the predictor extrapolates to values outside the range, clamping the predictions to the range (that is, using $\max(0, \min(70, \hat{y}))$ instead of $\hat{y}$) may do well. Cross-validate the model to see whether this works.
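
As a minimal sketch of that check, the following compares cross-validated error with and without clamping. The ordinary least-squares predictor, the synthetic data, and the helper names (`fit_ols`, `predict`, `cv_rmse`) are all assumptions made for illustration; substitute your own model and data.

```python
import numpy as np

LOW, HIGH = 0.0, 70.0  # bounds stated in the question

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X, clamp=False):
    """Linear prediction, optionally clamped to [LOW, HIGH]."""
    y_hat = coef[0] + X @ coef[1:]
    return np.clip(y_hat, LOW, HIGH) if clamp else y_hat

def cv_rmse(X, y, k=5, clamp=False, seed=0):
    """k-fold cross-validated root-mean-square error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        coef = fit_ols(X[train], y[train])
        sq_err.append((predict(coef, X[test], clamp) - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_err))))

# Synthetic stand-in for the 5 predictors and the bounded target (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
signal = 35 + X @ np.array([12.0, -8.0, 5.0, 3.0, -2.0])
y = np.clip(signal + rng.normal(scale=5, size=300), LOW, HIGH)

rmse_raw = cv_rmse(X, y, clamp=False)
rmse_clamped = cv_rmse(X, y, clamp=True)
print(f"raw: {rmse_raw:.3f}  clamped: {rmse_clamped:.3f}")
```

Because every observed $y$ lies in the range, clamping can never move a prediction farther from its target, so the clamped error is at most the raw error on the same folds. Whether the improvement is large enough to stop there is what the cross-validation tells you.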

However, the restricted range raises the possibility of a nonlinear relationship between the dependent variable ($y$) and the independent variables ($x_i$). Some additional indicators of this include:

  • Greater variation in residual values when $\hat{y}$ is in the middle of its range, compared to variation in residuals at either end of the range.

  • Theoretical reasons for specific non-linear relationships.

  • Evidence of model mis-specification (obtained in the usual ways).

  • Significance of quadratic or high-order terms in the $x_i$.

Consider a nonlinear re-expression of $y$ in case any of these conditions hold.
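
The first indicator in the list can be checked roughly by comparing the spread of the residuals across bins of the fitted values. This is only a sketch: the least-squares fit, the synthetic data (generated from a logistic curve so that the bound matters), and the helper names `ols_fit` and `spread_by_fitted` are assumptions for illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Return fitted values and residuals from an ordinary least-squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    return fitted, y - fitted

def spread_by_fitted(fitted, resid, bins=3):
    """Standard deviation of residuals within equal-count bins of fitted value."""
    order = np.argsort(fitted)
    return [float(np.std(resid[chunk])) for chunk in np.array_split(order, bins)]

# Hypothetical data: a linear signal pushed through a logistic curve onto (0, 70).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
eta = X @ np.array([1.0, -0.8, 0.6, 0.4, -0.3]) + rng.normal(scale=0.5, size=600)
y = 70.0 / (1.0 + np.exp(-eta))

fitted, resid = ols_fit(X, y)
low, middle, high = spread_by_fitted(fitted, resid)
print(f"residual SD by fitted value: low={low:.2f} middle={middle:.2f} high={high:.2f}")
```

A middle value noticeably larger than the two end values is the pattern described in the first bullet. A plot of residuals against fitted values shows the same thing with more detail.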

There are many ways to re-express $y$ to create more linear relationships with the $x_i$. For instance, any increasing function $f$ defined on the interval $[0,70]$ can be “folded” to create a symmetric increasing function via $y \to f(y) - f(70-y)$. If $f$ becomes arbitrarily large and negative as its argument approaches $0$, the folded version of $f$ will map $[0,70]$ into all the real numbers. Examples of such functions include the logarithm and any negative power with its sign reversed so that it is increasing (for example, $-1/y$). Using the logarithm is equivalent to the “logit link” recommended by @user603, because $\log(y) - \log(70-y) = \log\frac{y/70}{1-y/70}$. Another way is to let $G$ be the inverse CDF of any probability distribution and define $f(y) = G(y/70)$. Using a Normal distribution gives the “probit” transformation.
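
These re-expressions are short to write down. The sketch below implements the folding construction, the folded logarithm with its inverse, and the probit; the function names are hypothetical, and the standard library's `statistics.NormalDist` is used for the Normal inverse CDF.

```python
import numpy as np
from statistics import NormalDist

UPPER = 70.0  # upper bound from the question; the lower bound is 0

def fold(f, y, upper=UPPER):
    """Folded version of f: y -> f(y) - f(upper - y)."""
    y = np.asarray(y, dtype=float)
    return f(y) - f(upper - y)

def folded_log(y, upper=UPPER):
    """Folded logarithm, equal to the logit of y / upper."""
    return fold(np.log, y, upper)

def inv_folded_log(z, upper=UPPER):
    """Inverse of folded_log: a logistic curve rescaled to (0, upper)."""
    return upper / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def probit(y, upper=UPPER):
    """Normal inverse CDF applied to y / upper."""
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(np.asarray(y, dtype=float) / upper)

y = np.array([1.0, 17.5, 35.0, 52.5, 69.0])
z = folded_log(y)
print(z)
print(inv_folded_log(z))  # recovers y up to rounding
print(probit(y))
```

Both transformations are undefined (infinite) at exactly $0$ and $70$, so observations sitting on a bound have to be handled separately, for instance by nudging them slightly inward. That adjustment is an assumption about your data, not part of the transformation itself.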

One way to exploit families of transformations is to experiment: try a likely transformation, perform a quick regression of the transformed $y$ against the $x_i$, and test the residuals: they should appear to be independent of the predicted values of $y$ (homoscedastic and uncorrelated). These are signs of a linear relationship with the independent variables. It helps, too, if the residuals of the back-transformed predicted values tend to be small. This indicates the transformation has improved the fit. To resist the effects of outliers, use robust regression methods such as iteratively reweighted least squares.
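
The experiment described above might look like the following sketch. It assumes the folded logarithm as the trial transformation and Huber weights for the iteratively reweighted least squares; the tuning constant 1.345, the synthetic data with a few planted outliers, and all function names are illustrative choices, not part of the original answer.

```python
import numpy as np

UPPER = 70.0  # upper bound from the question; the lower bound is 0

def to_logit(y):
    """Folded log: maps (0, UPPER) onto the real line."""
    return np.log(y) - np.log(UPPER - y)

def from_logit(z):
    """Inverse of to_logit."""
    return UPPER / (1.0 + np.exp(-z))

def design(X):
    """Design matrix with an intercept column first."""
    return np.column_stack([np.ones(len(X)), X])

def irls_huber(X, z, c=1.345, n_iter=100, tol=1e-10):
    """Robust linear fit by iteratively reweighted least squares (Huber weights)."""
    A = design(X)
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)  # start from ordinary least squares
    for _ in range(n_iter):
        resid = z - A @ coef
        # Robust scale: normalized median absolute deviation of the residuals.
        scale = max(np.median(np.abs(resid - np.median(resid))) / 0.6745, 1e-12)
        # Weight 1 for small standardized residuals, c/|u| for large ones.
        w = c / np.maximum(np.abs(resid) / scale, c)
        sw = np.sqrt(w)
        new_coef, *_ = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef

# Hypothetical data: linear on the logit scale, with 20 gross outliers.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
true_coef = np.array([0.2, 1.0, -0.8, 0.6, 0.4, -0.3])
z_true = design(X) @ true_coef + rng.normal(scale=0.2, size=n)
z_true[:20] += 6.0
y = from_logit(z_true)

z = to_logit(y)                # 1. try a likely transformation
coef = irls_huber(X, z)        # 2. quick (robust) regression
fitted = design(X) @ coef
resid = z - fitted
# 3. rough check that residual spread does not depend on the fitted value
spread_corr = np.corrcoef(fitted, np.abs(resid))[0, 1]
# 4. size of the residuals after back-transforming the predictions
rmse_back = np.sqrt(np.mean((y - from_logit(fitted)) ** 2))
print(coef.round(2), round(float(spread_corr), 3), round(float(rmse_back), 2))
```

Repeating steps 1 to 4 with a different transformation (or with none) and comparing the diagnostics is the experiment; a residual plot is usually more informative than the single correlation computed here.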

Attribution
Source: Link, Question Author: user333, Answer Author: whuber
