Is it mandatory to subset your data to validate a model?

I’m having a hard time getting on the same page as my supervisor when it comes to validating my model. I have analyzed the residuals (observed values against fitted values) and used this as an argument to discuss the results obtained by my model. However, my supervisor insists that the only way to validate a model is to take a random subset of my data, fit the model on 70% of it, and then apply the model to the remaining 30%.

The thing is, my response variable is zero-inflated (85% zeros, to be more precise), and I would prefer not to subset the data, as it is already very difficult to get the model to converge.

So, my question is: what are the possible (and scientifically acceptable) ways to validate a model? Is subsetting the data the only way? If possible, support your answers with articles or books so I can use them as arguments when presenting my alternatives.

Answer

To start, I would suggest that it is usually good to be wary of statements that there is only one way to do something. Splitting a sample into “training” and “testing” sets is a common approach in many machine learning and data science applications. Often these modeling approaches are less interested in hypothesis testing about an underlying data-generating process, which is to say they tend to be somewhat atheoretical. In fact, most of the time these training/testing splits are just meant to check whether the model is over-fitting in terms of predictive performance. Of course, it is also possible to use a training/testing approach to see whether a given model replicates in terms of which parameters are “significant,” or whether the parameter estimates fall within expected ranges in both sets.
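To make that over-fitting check concrete, here is a minimal sketch, with invented data and a deliberately simple least-squares model standing in for whatever model you actually use: fit on a random 70% of the sample, then compare predictive error on the held-out 30%.

```python
# Sketch of a 70/30 train/test split to check for over-fitting.
# The data and the OLS model here are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y depends linearly on x plus noise.
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)

# Random 70/30 split of the row indices.
idx = rng.permutation(n)
train, test = idx[: int(0.7 * n)], idx[int(0.7 * n):]

# Fit (intercept + slope) on the training set only.
X_train = np.column_stack([np.ones(train.size), x[train]])
beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Compare mean squared error on train vs. held-out data;
# a large gap would suggest over-fitting.
X_test = np.column_stack([np.ones(test.size), x[test]])
mse_train = np.mean((y[train] - X_train @ beta) ** 2)
mse_test = np.mean((y[test] - X_test @ beta) ** 2)
print(f"train MSE: {mse_train:.3f}, test MSE: {mse_test:.3f}")
```

With a well-specified model and enough data, the two errors should be comparable; the split only tells you about predictive stability, not about whether the model is “true.”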

In theory, validating or invalidating models is what science, writ large, is supposed to be doing. Independent researchers separately examining, generating, and testing hypotheses that support or refute arguments about a theory for why, or under what circumstances, an observable phenomenon occurs – that is the scientific enterprise in a nutshell (or at least in one overly long sentence). So to answer your question: to me, even training/testing splits are not “validating” a model. That is something that takes the weight of years of evidence amassed by multiple independent researchers studying the same set of phenomena. I will grant that this take may be partly a difference in semantics between what I view model validation to mean and what the term has come to mean in applied settings… but let me get back to the root of your question more directly.

Depending on your data and modeling approach, it may not always be appropriate from a statistical standpoint to split your sample into training and testing sets. For instance, small samples may be particularly ill-suited to this approach. Additionally, some distributions have properties that make them difficult to model even with relatively large samples; your zero-inflated case likely fits this description. If the goal is to approximate the “truth” about a set of relations or underlying processes thought to account for some phenomenon, you will not be well served by knowingly taking an under-powered approach to testing a given hypothesis. So perhaps the first step is to perform a power analysis to see whether you would even be likely to replicate the finding of interest in the subsetted data. If the split is not appropriately powered, that could be an argument against the training/testing approach.
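One way to make that power argument concrete is a small Monte Carlo sketch. Everything below is an invented stand-in for your actual situation – the 85% zero-inflation rate matches your description, but the effect size, sample sizes, and the crude normal-approximation test on group means are placeholders – yet the logic carries over: simulate data like yours at the reduced sample size and count how often the effect of interest is detected.

```python
# Monte Carlo power sketch: would the effect still be detectable if the
# model were fit on only 70% of the data? All numbers are invented.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def simulate_zip(n, p_zero, lam):
    """Zero-inflated Poisson draws: a structural zero with prob p_zero."""
    counts = rng.poisson(lam, size=n)
    return np.where(rng.random(n) < p_zero, 0, counts)

def two_sample_p(a, b):
    """Two-sided p-value from a normal approximation to the mean difference."""
    se = sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (b.mean() - a.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def power(n, n_sims=500, alpha=0.05):
    """Fraction of simulations in which the group effect is detected at size n."""
    hits = 0
    for _ in range(n_sims):
        a = simulate_zip(n // 2, p_zero=0.85, lam=1.0)  # "control"
        b = simulate_zip(n // 2, p_zero=0.85, lam=2.0)  # "treatment"
        if two_sample_p(a, b) < alpha:
            hits += 1
    return hits / n_sims

p_full = power(400)  # power with the full (invented) sample
p_sub = power(280)   # power with only 70% of it
print(f"power at n=400: {p_full:.2f}; at 70% (n=280): {p_sub:.2f}")
```

If the 70% subsample falls well short of conventional power (say, 0.80), you have a quantitative case that the split would mostly be testing your luck, not your model.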

Another option is to specify several models to see if they “better” explain the observed data. The goal here would be to identify the best model among a set of reasonable alternatives. This is a relative, not an absolute, argument you’d be making about your model. Essentially, you are admitting that there may be other models that could be posited to explain your data, but your model is the best of the tested set of alternatives (at least you hope so). All models in the set, including your hypothesized model, should be theoretically grounded; otherwise you run the risk of setting up a bunch of statistical straw men.
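Here is a minimal sketch of that relative comparison: simulated zero-inflated counts, two intercept-only candidates (an ordinary Poisson and a zero-inflated Poisson), and AIC as the yardstick. The data and the crude grid-search fit are invented for illustration; in practice you would compare theoretically motivated models fit to your own data.

```python
# Relative model comparison via AIC: ordinary Poisson vs. zero-inflated
# Poisson, both intercept-only, on simulated zero-inflated counts.
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)

# Simulated outcome: 85% structural zeros, Poisson(2) otherwise.
n = 500
y = np.where(rng.random(n) < 0.85, 0, rng.poisson(2.0, size=n))

log_fact = np.array([lgamma(k + 1) for k in range(y.max() + 1)])

def poisson_loglik(lam):
    return np.sum(y * np.log(lam) - lam - log_fact[y])

def zip_loglik(p, lam):
    # P(0) = p + (1 - p) e^{-lam};  P(k > 0) = (1 - p) * Poisson(k; lam)
    zero = y == 0
    ll0 = np.log(p + (1 - p) * np.exp(-lam)) * zero.sum()
    llk = np.sum(np.log(1 - p) + y[~zero] * np.log(lam) - lam - log_fact[y[~zero]])
    return ll0 + llk

# Poisson MLE is just the sample mean; the ZIP is fit by a coarse grid search.
aic_pois = 2 * 1 - 2 * poisson_loglik(y.mean())
grid_p = np.linspace(0.01, 0.99, 99)
grid_l = np.linspace(0.1, 5.0, 50)
best = max(zip_loglik(p, l) for p in grid_p for l in grid_l)
aic_zip = 2 * 2 - 2 * best

print(f"AIC Poisson: {aic_pois:.1f}, AIC ZIP: {aic_zip:.1f}")
```

A lower AIC marks the better model within the compared set – which is exactly the relative (not absolute) claim described above.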

There are also Bayes factors, with which you can quantify the weight of evidence your data provide for your model, or for a specific hypothesis, relative to alternative scenarios.
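If you do not want to specify full priors, one common shortcut is the BIC approximation to the Bayes factor, BF01 ≈ exp((BIC1 − BIC0)/2). A tiny sketch with placeholder BIC values (substitute the BICs from your own fitted models):

```python
# Rough Bayes factor between two models from their BICs, using the
# standard approximation BF_01 ≈ exp((BIC_1 - BIC_0) / 2).
from math import exp

def approx_bayes_factor(bic_m0, bic_m1):
    """Approximate Bayes factor in favor of model M0 over model M1."""
    return exp((bic_m1 - bic_m0) / 2)

# Invented placeholder BIC values; BF > 1 favors M0.
bf = approx_bayes_factor(bic_m0=1012.4, bic_m1=1020.0)
print(f"approximate BF_01: {bf:.1f}")  # prints roughly 44.7
```

By the usual rules of thumb, a Bayes factor this large would count as strong evidence for M0 over M1, given this pair of candidates.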

This is far from an exhaustive list of options, but I hope it helps. I’ll step down from the soapbox now. Just remember that every model in every published study about human behavior is incorrect: there are almost always relevant omitted variables, unmodeled interactions, imperfectly sampled populations, and plain old sampling error at play, obscuring the underlying truth.

Attribution
Source: Link, Question Author: Eric Lino, Answer Author: Matt Barstead