In general, I have often heard “splines” described as “old models”, criticized as prone to overfitting the data, and considered an improvement only over “higher-order polynomials”.
However, splines still seem to be used despite the shortcomings mentioned above.
My Question: Are there any ideal use cases for splines? Are there any industries/types of problems where spline based models are seen as the “dominant methodology” (e.g. survival analysis models for time-to-event data)?
I have heard that, at times, spline models are able to “interpolate” data better than other types of statistical models. I have also seen a few authors attempt to use splines to model the “rate parameter function” in a Poisson process.
Can anyone please provide comments on this?
You have to define what you mean by “ideal” or “best” in this question, but I will give you my two cents nonetheless.
Are there any ideal use cases for splines?
My (very radical) opinion is that when data are plentiful, a spline makes the most sense to use in the absence of any other information. Frank Harrell provides an excellent rationale for this perspective, which I will repeat now.
There are 2 possibilities (use splines, or don’t) and 2 latent truths (the effect of the variable truly is non-linear, or it is linear). There are then 4 outcomes to consider:
1. We don’t use splines and the effect is linear. In this case, a simple model will suffice and we benefit from lower variance fits.
2. We use splines and the effect is linear. In this case, we spend extra degrees of freedom unnecessarily, increasing the variance of our fits, but the effect of the variable is appropriately estimated.
3. We don’t use splines and the effect is non-linear. This has the potential to be catastrophic! Imagine that a variable has an effect on y that looks like y = x^2. If x is mean centered, then the linear fit could estimate the effect to be 0 or too small, depending on the distribution of x.
4. We use splines and the effect is non-linear. This would be the best case scenario, where we appropriately spend our degrees of freedom.
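To make the third outcome concrete, here is a minimal sketch of the y = x^2 case. The grid of x values and the noise-free response are my own assumptions for illustration; the only point is that a straight line fitted to a symmetric, mean-centered x sees essentially no effect at all.

```python
import numpy as np

# Assumed toy data: x is symmetric about zero (hence mean centered), y = x^2 exactly.
x = np.linspace(-3.0, 3.0, 201)
y = x ** 2

# Ordinary least-squares straight line: y ~ intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)

# Share of the variance in y that the straight line explains.
residuals = y - (intercept + slope * x)
r_squared = 1.0 - residuals.var() / y.var()

print(f"slope = {slope:.3g}, R^2 = {r_squared:.3g}")
```

Both the slope and R^2 come out as zero up to floating-point error, even though y is completely determined by x. With a skewed distribution of x the slope would not vanish, but it would still misrepresent the shape of the effect.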
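From this scenario analysis it seems that it is always best to fit a spline to data (again, I will concede to the weaker position that this really depends on the availability of data). Best of all, if we initially fit a spline we can always perform an F test to determine whether the spline model explains significantly more variance than a linear effect, thereby informing future models fit to similar data. Because the linear model is nested within the spline model, the spline fit never explains less variance; the test asks whether the improvement is larger than chance alone would produce.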
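As a rough sketch of that F test, the snippet below compares a straight-line fit with a cubic spline fit on simulated data. Everything specific here is an assumption of mine: the sine-shaped truth, the noise level, the three knots at the quartiles, and the truncated-power basis (chosen because it is easy to write by hand; restricted cubic splines or B-splines would be the more usual choice in practice). The helper names `cubic_spline_basis` and `rss` are made up for this example.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power cubic spline basis: 1, x, x^2, x^3, then (x - k)^3_+ per knot."""
    columns = [np.ones_like(x), x, x ** 2, x ** 3]
    columns += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(columns)

def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

# Assumed simulated data with a clearly non-linear effect.
rng = np.random.default_rng(0)
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=n)

# The linear design (intercept and x) is nested inside the spline design.
linear_design = np.column_stack([np.ones_like(x), x])
knots = np.quantile(x, [0.25, 0.5, 0.75])
spline_design = cubic_spline_basis(x, knots)

rss_linear = rss(linear_design, y)
rss_spline = rss(spline_design, y)

# Nested-model F statistic: drop in RSS per extra parameter, relative to
# the residual variance of the larger (spline) model.
df_num = spline_design.shape[1] - linear_design.shape[1]
df_den = n - spline_design.shape[1]
f_stat = ((rss_linear - rss_spline) / df_num) / (rss_spline / df_den)

print(f"F({df_num}, {df_den}) = {f_stat:.1f}")
```

The p-value is the upper tail of an F distribution with (df_num, df_den) degrees of freedom, which `scipy.stats.f.sf(f_stat, df_num, df_den)` would give if SciPy is available. A large statistic says the extra spline terms are earning their degrees of freedom; a small one suggests a linear term may be enough for similar data.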
That’s my opinion, generally. Now, let me offer an example in which splines are really clearly the better choice. Suppose you are modelling some function over the course of a day (maybe it is traffic to a website or something). The effect has a strong chance of being non-linear since our lives are largely affected by time (we sleep during some periods, thereby leading to lower activity on the website, and are active in other periods). Not only is the effect non-linear, it is cyclic (the activity at 1 minute prior to midnight is approximately the same as the activity 1 minute post midnight). We could potentially model this with a trig function, but a better option is to use a cyclic spline, which can a) accommodate the cyclic nature of the phenomenon, and b) avoid model bias in having to specify a functional form of the phenomenon a priori.
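Here is a minimal sketch of the idea, with loud caveats. To keep it to plain NumPy I use a cyclic linear spline (piecewise-linear "hat" functions whose knots wrap around midnight) rather than the cyclic cubic splines that packages such as mgcv offer via `bs = "cc"`. The simulated traffic curve, the noise level, the choice of 8 knots, and the names `cyclic_linear_basis` and `predict` are all assumptions for illustration.

```python
import numpy as np

PERIOD = 24.0  # hours in a day

def cyclic_linear_basis(hours, n_knots):
    """Hat-function basis on a circle: one column per knot, each row sums to 1."""
    spacing = PERIOD / n_knots
    knots = np.arange(n_knots) * spacing
    h = np.atleast_1d(np.asarray(hours, dtype=float)) % PERIOD
    diff = np.abs(h[:, None] - knots[None, :])
    circular_distance = np.minimum(diff, PERIOD - diff)  # wraps around midnight
    return np.clip(1.0 - circular_distance / spacing, 0.0, None)

# Assumed simulated website traffic: quiet at night, busy during the day.
rng = np.random.default_rng(1)
hours = rng.uniform(0.0, PERIOD, size=500)
visits = 100.0 + 60.0 * np.sin(2.0 * np.pi * (hours - 9.0) / PERIOD)
visits += rng.normal(scale=10.0, size=hours.size)

# Least-squares fit of the spline coefficients (the fitted value at each knot).
n_knots = 8
design = cyclic_linear_basis(hours, n_knots)
coef, *_ = np.linalg.lstsq(design, visits, rcond=None)

def predict(h):
    """Fitted traffic at hour(s) h; accepts values outside [0, 24) as well."""
    return cyclic_linear_basis(h, n_knots) @ coef

one_minute = 1.0 / 60.0
before_midnight = predict(PERIOD - one_minute)[0]
after_midnight = predict(one_minute)[0]
print(f"23:59 -> {before_midnight:.1f}, 00:01 -> {after_midnight:.1f}")
```

The fit at 23:59 and at 00:01 is nearly identical by construction, because the basis functions near the end of the day are the same ones used at the start. Nothing about the daily shape was specified in advance beyond its period; the knots let the data decide where traffic rises and falls.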