# Linear regression when Y is bounded and discrete

The question is straightforward: Is it appropriate to use linear regression when Y is bounded and discrete (e.g. a test score 1~100, some pre-defined ranking 1~17)? In this case, is it “not good” to use linear regression, or is it totally wrong to use it?

When a response or outcome $$Y$$ is bounded, various questions arise in fitting a model, including the following:

1. Any model that could predict values for the response outside those bounds is in principle dubious. Hence a linear model might be problematic as there are no bounds on $$\hat Y = Xb$$ for predictors $$X$$ and coefficients $$b$$ whenever the $$X$$ are themselves unbounded in one or both directions. However, the relationship might be weak enough for this not to bite and/or predictions might well remain within bounds over the observed or plausible range of the predictors. At one extreme, if the response is some mean $$+$$ noise it hardly matters which model one fits.

2. As the response can’t exceed its bounds, a nonlinear relationship is often more plausible with predicted responses tailing off to approach bounds asymptotically. Sigmoid curves or surfaces such as those predicted by logit or probit models are attractive in this regard and are now not difficult to fit. A response such as literacy (or fraction adopting any new idea) often shows such a sigmoid curve in time and plausibly with almost any other predictor.

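3. A bounded response can’t have the variance properties expected in plain or vanilla regression, notably constant error variance. Necessarily, as the mean response approaches either the lower or the upper bound, the variance approaches zero: for a response confined to $$[a, b]$$ with mean $$\mu$$, the variance can be at most $$(\mu - a)(b - \mu)$$.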
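Points 1 and 3 can be illustrated with a small simulation. This is only a sketch under invented assumptions: the data-generating process (a score out of 100 whose expected fraction is logistic in a single predictor `x`), the slope 1.5, the sample sizes and all variable names are made up for illustration and do not come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed data-generating process (illustration only): a score out of 100
# whose expected fraction follows a logistic curve in one predictor x.
n_max = 100
x = rng.uniform(-3, 3, size=2000)
p = 1.0 / (1.0 + np.exp(-1.5 * x))
y = rng.binomial(n_max, p)  # discrete scores in 0..100

# Point 1: a least-squares straight line is unbounded in x, so predictions
# far enough from the centre of the data leave the interval [0, 100].
slope, intercept = np.polyfit(x, y, deg=1)
x_new = np.array([-6.0, 0.0, 6.0])
linear_pred = intercept + slope * x_new
print("linear predictions at x = -6, 0, 6:", np.round(linear_pred, 1))

# Point 3: the variance of the score shrinks as its mean nears either bound.
variances = {}
for p_fixed in (0.02, 0.50, 0.98):
    draws = rng.binomial(n_max, p_fixed, size=10_000)
    variances[p_fixed] = draws.var()
    print(f"mean {draws.mean():6.2f}  variance {draws.var():6.2f}")
```

With this setup the fitted line predicts a score below 0 at `x = -6` and above 100 at `x = 6`. The simulated variances should sit near the binomial values $$100\,p(1-p)$$, that is about 25 at a mean of 50 and about 2 at means of 2 or 98. Whether extrapolation that far matters depends, as noted in point 1, on the plausible range of the predictors.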

A model should be chosen according to what works and knowledge of the underlying generating process. Whether the client or audience knows about particular model families may also guide practice.

Note that I am deliberately avoiding blanket judgments such as good/not good, appropriate/not appropriate, right/wrong. All models are approximations at best and which approximation appeals, or is good enough for a project, isn’t so easy to predict. I typically favour logit models as first choice for bounded responses myself, but even that preference is based partly on habit (e.g. my avoiding probit models for no very good reason) and partly on where I will report results, usually to readerships that are, or should be, statistically well informed.

Your examples of discrete scales are scores 1-100 (in assignments I mark, 0 is certainly possible!) or rankings 1-17. For scales like that, I would usually think of fitting continuous models to responses scaled to [0, 1]. There are, however, practitioners of ordinal regression models who would happily fit such models to scales with a fairly large number of discrete values. I would be happy for them to reply if they are so minded.
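One way to act on the scaling idea is sketched below, again under assumptions of my own rather than anything stated in the question. The rescaling `(score - lo) / (hi - lo)` is one possible choice (with `lo = 1`, `hi = 17` for the ranking example), the simulated data and the coefficients 0.3 and 1.2 are invented, and `fit_fractional_logit` is a hypothetical helper written here with plain Newton-Raphson steps, not a library function. In practice one would more likely call a GLM routine with a logit link from a statistics package.

```python
import numpy as np

def fit_fractional_logit(X, y, n_iter=25):
    """Hypothetical helper: Newton-Raphson for a logit mean model
    with responses y in [0, 1] (Bernoulli quasi-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))  # fitted means in (0, 1)
        w = mu * (1.0 - mu)                     # variance function
        grad = X.T @ (y - mu)                   # score vector
        hess = X.T @ (X * w[:, None])           # X' W X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)

# Invented example data: scores 0..100 driven by one predictor x.
x = rng.uniform(-3, 3, size=2000)
score = rng.binomial(100, 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * x))))

# Rescale the bounded score to [0, 1] using its known bounds.
lo, hi = 0, 100
y = (score - lo) / (hi - lo)

X = np.column_stack([np.ones_like(x), x])  # intercept and slope columns
beta = fit_fractional_logit(X, y)
print("estimated coefficients:", np.round(beta, 3))

# Predictions mapped back to the original scale cannot leave [lo, hi].
x_new = np.array([-6.0, 0.0, 6.0])
pred_score = lo + (hi - lo) / (1.0 + np.exp(-(beta[0] + beta[1] * x_new)))
print("predicted scores at x = -6, 0, 6:", np.round(pred_score, 1))
```

Unlike a straight line, the back-transformed predictions stay within the bounds however far `x_new` extends, which is the attraction of sigmoid models noted in point 2. The sketch treats the rescaled score as continuous, so it says nothing about the ordinal regression alternative.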