# Cross validation with nonparametric smoothing regressions

When I use regression models I feel leery of defaulting to an assumption of linear association. Instead I like to explore the functional form of relationships between dependent and explanatory variables using nonparametric smoothing regression (e.g. generalized additive models, loess/lowess, running line smoothers, etc.) before estimating a parametric model, using nonlinear least squares regression where appropriate to estimate parameters for the functions suggested by the nonparametric model.

What is a good way to think about performing cross validation in the nonparametric smoothing regression phase of such an approach? I wonder if I might encounter a situation where in random holdout sample A a relationship approximated by a “broken stick” linear hinge function might be evident, while holdout sample B suggests a relationship that would be better approximated by a parabolic threshold hinge function.

Would one take a non-exhaustive approach: hold back some randomly selected portion of the data, perform the nonparametric regression, interpret plausible functional forms for the result, repeat this a small (human-manageable) number of times, and mentally tally the plausible functional forms?

Or would one take an exhaustive approach (e.g. LOOCV), use some algorithm to ‘smooth all the smooths’, and use that smoothest of smooths to inform plausible functional forms? (Although, on reflection, I think LOOCV is quite unlikely to result in very different functional relationships, since a functional form on a large enough sample is unlikely to be altered by a single data point.)

My applications will typically entail human-manageable numbers of predictor variables (a handful to a few dozen, say), but my sample sizes will range from a few hundred to a few hundred thousand. My aim is to produce an intuitively communicated and easily translated model that people could use to make predictions with data sets other than mine, data sets that do not include the outcome variables.

References in answers very welcome.

It seems to me there are two confusions in your question:

• First, linear (least-squares) regression does not require the relationship to be linear in the independent variables, only in the parameters.

Thus $y=a + b \cdot x e^{-x} + c \cdot \frac{z}{1 + x^2}$ can be estimated by ordinary least squares ($y$ is a linear function of parameters $a$, $b$, $c$), while $y = a + b \cdot x + b^2 \cdot z$ cannot ($y$ is not linear in parameter $b$).
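To make the first point concrete, here is a minimal sketch that simulates data from the first formula and recovers $a$, $b$, $c$ with ordinary least squares. The parameter values, noise level, sample size and variable ranges are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated regressors (ranges are arbitrary, for illustration only)
x = rng.uniform(0.0, 3.0, n)
z = rng.uniform(-1.0, 1.0, n)

# Assumed "true" parameters of y = a + b*x*exp(-x) + c*z/(1 + x^2)
a_true, b_true, c_true = 1.0, 2.0, -3.0
y = (a_true
     + b_true * x * np.exp(-x)
     + c_true * z / (1.0 + x**2)
     + rng.normal(0.0, 0.05, n))

# Each column is a nonlinear function of x and z,
# but y is a linear combination of the columns.
X = np.column_stack([np.ones(n), x * np.exp(-x), z / (1.0 + x**2)])

# Ordinary least squares on the transformed regressors
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b_hat, c_hat = coef
print(a_hat, b_hat, c_hat)
```

The only "trick" is building the design matrix from transformed regressors; the fitting step is plain linear algebra. The second formula cannot be handled this way because $b$ and $b^2$ are tied together, so it needs a nonlinear optimizer instead.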

• Second, how do you determine a “correct” functional model from a smoother, i.e. how do you go from step 1 to step 2?

As far as I know, there is no way to infer “which functions of regressors to use” from smoothing techniques such as splines, neural nets, etc. Except maybe by plotting the smoothed outputs, and determining relationships by intuition, but that doesn’t sound very robust to me, and it seems one doesn’t need smoothing for this, just scatterplots.

If your final goal is a linear regression model, and your problem is that you don’t know exactly what functional form of the regressors should be used, you would be better off directly fitting a regularized linear regression model (such as LASSO) with a large basis expansion of the original regressors (such as polynomials of the regressors, exponentials, logs, …). The regularization procedure should then eliminate the unneeded regressors, leaving you with a (hopefully good) parametric model. You can use cross-validation to determine the optimal penalization parameter (which determines the actual degrees of freedom of the model).
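A rough sketch of this idea with scikit-learn's `LassoCV`, which picks the penalty by k-fold cross-validation. The simulated data, the particular set of basis functions and the train/test split are assumptions made for the example, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
x = rng.uniform(0.5, 3.0, n)
z = rng.uniform(0.5, 3.0, n)

# Assumed data-generating process: y depends on log(x) only; z is irrelevant
y = 1.0 + 2.0 * np.log(x) + rng.normal(0.0, 0.1, n)

# Basis expansion of the original regressors (an illustrative choice)
basis = {
    "x": x, "x^2": x**2, "x^3": x**3, "log(x)": np.log(x), "exp(-x)": np.exp(-x),
    "z": z, "z^2": z**2, "z^3": z**3, "log(z)": np.log(z), "exp(-z)": np.exp(-z),
}
names = list(basis)
X = np.column_stack([basis[k] for k in names])

# Simple holdout split: first 450 rows for fitting, the rest for checking
train = np.arange(n) < 450
test = ~train

# Standardize the columns so the L1 penalty treats them comparably,
# then let 5-fold cross-validation choose the penalization parameter.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=50000))
model.fit(X[train], y[train])

lasso = model.named_steps["lassocv"]
selected = [nm for nm, c in zip(names, lasso.coef_) if abs(c) > 1e-8]
r2_test = model.score(X[test], y[test])

print("chosen penalty:", lasso.alpha_)
print("retained basis functions:", selected)
print("holdout R^2:", r2_test)
```

Because basis functions such as `x`, `x^2` and `log(x)` are strongly correlated, the LASSO may retain a combination of them rather than the single "true" term, so the retained set is best read as a candidate parametric form, not a definitive one.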

You can always use nonparametric regressions as a benchmark for generalization error, as a way to check that your regularized linear model predicts outside data just as well as a nonparametric smoother.
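One way such a benchmark might look: fit a candidate parametric model and a simple kernel smoother on the same training data, then compare their errors on held-out data. This sketch uses a hand-written Nadaraya-Watson smoother with a fixed bandwidth; the smoother, the bandwidth of 0.15, the candidate form $a + b\log x$ and the simulated data are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600
x = rng.uniform(0.5, 3.0, n)
y = 1.0 + 2.0 * np.log(x) + rng.normal(0.0, 0.1, n)

# Random train/holdout split
idx = rng.permutation(n)
tr, te = idx[:450], idx[450:]

# Parametric candidate: y = a + b*log(x), fitted by ordinary least squares
A_tr = np.column_stack([np.ones(tr.size), np.log(x[tr])])
coef, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
pred_par = coef[0] + coef[1] * np.log(x[te])


def kernel_smooth(x_train, y_train, x_new, bandwidth):
    """Nadaraya-Watson estimate: Gaussian-weighted average of y_train."""
    d = (x_new[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * d**2)
    return (w @ y_train) / w.sum(axis=1)


# Nonparametric benchmark (bandwidth fixed by hand here, not tuned)
pred_np = kernel_smooth(x[tr], y[tr], x[te], bandwidth=0.15)

# Holdout mean squared errors of the two approaches
mse_par = np.mean((y[te] - pred_par) ** 2)
mse_np = np.mean((y[te] - pred_np) ** 2)
print("parametric holdout MSE:   ", mse_par)
print("nonparametric holdout MSE:", mse_np)
```

If the parametric model's holdout error is close to the smoother's, the simpler model is probably not missing much structure. A clearly larger parametric error suggests the chosen functional form is too restrictive.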