Efficient nonparametric estimation of confidence intervals and p-values for nonlinear regression

I’m estimating parameters for a complex, “implicit” nonlinear model f(\mathbf{x}, \boldsymbol{\theta}). It’s “implicit” in the sense that I don’t have an explicit formula for f: its value is the output of a complex computational fluid dynamics (CFD) code. After nonlinear least squares (NLS) regression, I looked at the residuals, and they do not look normally distributed. I’m also having trouble estimating the variance-covariance matrix of the parameter estimates: the methods available in nlstools fail with an error.

I suspect that the assumption of normally distributed parameter estimators is not valid, so I would like to use a nonparametric method to estimate confidence intervals, p-values and confidence regions for the three parameters of my model. I thought of the bootstrap, but other approaches are welcome, as long as they don’t rely on normality of the parameter estimators. Would the following work?

  1. Given the data set D=\{P_i=(\mathbf{x}_i,f_i)\}_{i=1}^N, generate data sets D_1,\dots,D_m by sampling with replacement from D.
  2. For each D_i, use NLS (Nonlinear Least Squares) to estimate the model parameters \boldsymbol{\theta}^*_i=(\theta^*_{1i},\theta^*_{2i},\theta^*_{3i}).
  3. I now have an empirical distribution for the NLS parameter estimator. The sample mean of this distribution would be the bootstrap estimate of my parameters; the 2.5% and 97.5% quantiles would give me confidence intervals. I could also make scatterplot matrices of each parameter against the others, and get an idea of the correlation among them. This is the part I like the most, because I believe that one parameter is weakly correlated with the others, while the remaining two are very strongly correlated with each other.
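
The three steps above can be sketched as follows. This is a minimal sketch in Python using only the standard library, not a definitive implementation: `fit` is a hypothetical placeholder for the NLS routine that wraps the CFD code, and the function names (`pairs_bootstrap`, `percentile_intervals`, `correlation`) are assumptions made for illustration. The percentile interval is only one of several bootstrap intervals.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One observation P_i = (x_i, f_i): inputs x_i and measured response f_i.
Point = Tuple[Sequence[float], float]


def pairs_bootstrap(data: Sequence[Point],
                    fit: Callable[[Sequence[Point]], Sequence[float]],
                    m: int, seed: int = 0) -> List[List[float]]:
    """Return m bootstrap estimates theta*_1, ..., theta*_m.

    `fit` is a hypothetical placeholder for the expensive NLS routine:
    it takes a data set and returns the estimated parameter vector.
    """
    rng = random.Random(seed)
    n = len(data)
    draws = []
    for _ in range(m):
        # Step 1: resample N points from D with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # Step 2: refit the model on the resampled data set.
        draws.append(list(fit(resample)))
    return draws


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated q-th percentile, with 0 <= q <= 100."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def percentile_intervals(draws: Sequence[Sequence[float]],
                         level: float = 0.95) -> List[Tuple[float, float]]:
    """Step 3: one (lower, upper) percentile interval per parameter."""
    tail = 100.0 * (1.0 - level) / 2.0
    columns = list(zip(*draws))  # one tuple of m draws per parameter
    return [(percentile(c, tail), percentile(c, 100.0 - tail))
            for c in columns]


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between the draws of two parameters."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5
```

Calling `pairs_bootstrap(D, fit, m)` and passing the result to `percentile_intervals` gives one interval per parameter; `correlation` applied to two columns of the draws gives the number that a scatterplot matrix would show visually.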

Is this correct? Then how do I compute the p-values? What is the null hypothesis for nonlinear regression models? For example, for parameter \theta_{3}, is it that \theta_{3}=0, and the other two are not? How would I compute the p-value for such a hypothesis from my bootstrap sample \boldsymbol{\theta}^*_1,\dots,\boldsymbol{\theta}^*_m? I don’t see the connection with the null…
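
One possible connection between the bootstrap sample and a null such as \theta_{3}=0 is to invert the percentile interval: the p-value is then the smallest significance level at which the interval excludes the null value. The sketch below assumes that the percentile interval is adequate for the estimator, which is not guaranteed here. The name `percentile_p_value` and the add-one correction are illustrative choices, not something taken from this post.

```python
from typing import Sequence


def percentile_p_value(draws: Sequence[float], theta0: float = 0.0) -> float:
    """Two-sided p-value for H0: theta = theta0 by percentile inversion.

    `draws` holds the m bootstrap estimates of a single parameter,
    for example the third component of each theta*_i.
    """
    m = len(draws)
    below = sum(1 for t in draws if t <= theta0)
    above = sum(1 for t in draws if t >= theta0)
    # Add-one correction: with finite m the p-value is never exactly 0.
    tail = min((below + 1) / (m + 1), (above + 1) / (m + 1))
    return min(1.0, 2.0 * tail)
```

With m draws the smallest value this can return is 2/(m+1), so m limits how small a p-value can be resolved.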

Also, each NLS fit takes quite some time (let’s say a few hours), because I need to run my fluid dynamics code p\times N times, where N is the size of D and p is about 40 in my case. The total CPU time for the bootstrap is then 40\times N \times m times the time of a single CFD run, which is a lot. I would need a faster way. What can I do? I thought of building a metamodel for my CFD code (for example, a Gaussian Process model) and using that for bootstrapping instead of the CFD code. What do you think? Would that work?
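
To make the cost explicit, write t_{\text{CFD}} for the time of a single CFD run (a symbol introduced here for illustration). The last line uses hypothetical figures, a fit time of 3 hours and m = 1000 resamples, neither of which is stated above.

```latex
T_{\text{fit}} = p \, N \, t_{\text{CFD}}, \qquad
T_{\text{boot}} = m \, T_{\text{fit}} = p \, N \, m \, t_{\text{CFD}}

% Hypothetical example: T_fit = 3 h and m = 1000
T_{\text{boot}} = 1000 \times 3\,\text{h} = 3000\,\text{h} = 125\,\text{days}
```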

EDIT I don’t think the NLS regression problem is convex. NLS is being used to find the calibration parameters of a 1D CFD (Computational Fluid Dynamics) code that agree better with the data. If that helps, a plot of residuals can be seen here. I can add other plots (QQ plot?) if needed.

I have no theoretical guarantee that there is only a single parameter vector \boldsymbol{\theta} which minimizes the RSS. One may wonder why I use NLS, then. The main reason is pragmatic: calibrating the code is slow. A tool which can quickly compute an estimate \boldsymbol{\theta}^* such that \text{RSS}(\boldsymbol{\theta}^*)<\text{RSS}(\boldsymbol{\theta}_0), together with a reliable measure of the uncertainty in my estimates, would be better than nothing. NLS is fast compared with, say, Bayesian inference with MCMC. However, since I then have to use the bootstrap to get a reliable uncertainty estimate, I admit the advantage is somewhat reduced. I still think that the computational effort is lower, but if you believe I'm using the wrong approach and should do something totally different, I'm open to suggestions.

EDIT 2 The setting is exactly the same as here. I'd be glad to provide any other details you need.


Source : Link , Question Author : DeltaIV , Answer Author : Community
