I’m working on a project using logistic regression to predict student retention. The data were collected through three self-report instruments. We are trying to find out which predictors are powerful enough to identify at-risk students. I came across some articles saying a balanced sample (50% stayers, 50% dropouts) is desirable for such a study, e.g. Glynn, J.G., Sauer, P.L., & Miller, T.E. (2003). Signaling Student Retention With Prematriculation Data, NASPA Journal, 41 (1), 41-67:
A problem, however, is that the distribution of the dependent variable is likely to be highly skewed toward persistence. For example, if 85% of the analysis sample were persistors, a classification model that classified every student as a persistor would have a success rate of 85%, or would classify 85% of students correctly. To resolve this issue, the maintenance of relative balance between the number of dropouts and the number of persistors (about 50% each) in the analysis sample was desirable.
Is this true? Our sample only has about 25%-30% dropout students. Will this affect the results?
This is not so much a problem with logistic regression per se as it is a problem with classification accuracy as a performance measure. Note that balancing the data set is not the only valid approach, and it is not always the right one. If one of the classes is actually much more common in the population (and not merely in your sample), a naive model (classifying everything as belonging to the most common category) really is a good guess. If the error costs are not symmetric, balancing the data set might lead you to err in the wrong direction (the more costly one).
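To see why raw accuracy misleads here, consider a minimal sketch (the 75/25 split is invented to match the asker's sample; no model is actually fit):

```python
# With 75% persisters, a "model" that labels every student "persist"
# scores 75% accuracy yet never flags a single at-risk student.
y_true = [1] * 75 + [0] * 25   # 1 = persist, 0 = dropout (illustrative 75/25 split)
y_pred = [1] * 100             # naive majority-class classifier

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
dropout_recall = sum(1 for t, p in zip(y_true, y_pred)
                     if t == 0 and p == 0) / 25

print(accuracy)        # 0.75 -- looks respectable
print(dropout_recall)  # 0.0  -- but no dropout is ever identified
```

This is exactly the 85%-persistor scenario from the quoted article, just with the asker's numbers; per-class measures such as dropout recall expose what accuracy hides.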
The problem also often comes up the other way around: training and evaluating on an artificially balanced data set, then deploying the resulting model in a strongly unbalanced situation (think detecting fraud or diagnosing a rare disease), where the usefulness of the model is not nearly as high as the raw accuracy would suggest. It all depends on your objectives and your cost structure.
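One way to encode that cost structure without rebalancing at all is to keep the logistic regression's predicted probabilities and move the decision threshold. A hedged sketch (the costs and predicted probabilities below are invented for illustration):

```python
# Assumed, illustrative error costs: missing a true dropout is taken to be
# five times as costly as flagging a student who would have persisted.
cost_missed_dropout = 5.0
cost_false_alarm = 1.0

# Cost-minimizing threshold on P(dropout): flag when the expected cost of
# not flagging, p * cost_missed_dropout, exceeds that of flagging,
# (1 - p) * cost_false_alarm.
threshold = cost_false_alarm / (cost_false_alarm + cost_missed_dropout)

p_dropout = [0.10, 0.20, 0.35, 0.60, 0.90]  # hypothetical model outputs
flags = [p >= threshold for p in p_dropout]

print(round(threshold, 3))  # 0.167 -- well below the default 0.5
print(flags)
```

With asymmetric costs the optimal cutoff moves away from 0.5, so students with only a modest predicted dropout probability still get flagged; the model itself is fit on the data as they are.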