# How best to deal with a left-censored predictor (because of detection limits) in a linear model?

Context: I’m new to Bayesian stats and am trying to fit a multiple regression with `rstan`. All variables are continuous and there is no hierarchical structure.

One of my predictors is left-censored because it falls below the detection limit for a chemical assay. What is the best way to deal with this in a multiple regression? So far, I can see a few possibilities:

1. A substitution rule, such as ‘replace all values below the detection limit with a constant such as detection limit/2’. This is clearly not rigorous.
2. Multiple imputation, but (i) I don’t know how to deal with the fact that values above the detection limit are likely to be generated by the imputation process, which I will know with high probability to be false, and (ii) I’m not sure how well multiple imputation plays with Bayesian approaches, since I can’t think of a good way to aggregate the posterior distributions from fits to the different imputed datasets
3. Simulate values data from a distribution that makes sense based on prior knowledge and the data, and randomly assign values below the detection limit to the relevant points. This suffers from similar problems to #2, since I would have to simulate many sets of values, model them separately, and then figure out how to integrate the posteriors.

Am I missing better options? Are there useful Bayesian tricks that can help deal with this problem? I’m also open to non-Bayesian options.

The histogram below shows the distribution of values. The plot is on a log scale because that is most natural for this variable. For visual clarity, I have treated values below the detection limit (~25% of the data) as being 1/10 of the detection limit, and added a red line to separate them from the remaining points. Note that the red line is not the precise detection limit; the smallest quantified values to the right of the red line are at the putative limit. The fact that there are very few values exactly at the limit suggests that there may have been some variation in the detection limit between measurements, but I don’t mind if that is ignored for the purposes of this question.

UPDATE:

Here’s my `rstan` code, in case that is helpful. Betas 1 through 4 represent main effects, 5 & 6 are interaction terms (between 1 & 3 and 2 & 4). The censored predictor is therefore present in an interaction term as well, which is a complication I neglected to mention earlier.

``````data {
int<lower=0> n;       // number of data items
int<lower=0> k;       // number of predictors
vector[n] Y;          // outcome vector
matrix[n,k] X;        // predictor matrix
int n2;               //the size of the new_X matrix
matrix[n2,k] new_X;   //the matrix for the predicted values
}
parameters {
real alpha; // intercept
vector[k] beta; // coefficients for predictors
real<lower=0> sigma; // error scale (cauchy truncated at zero)
}
model {
beta[1] ~ normal(-0.75, 1);   //prior for beta
beta[2] ~ normal(0, 3);   //prior for beta
beta[3] ~ normal(0, 3);   //prior for beta
beta[4] ~ normal(0, 3);   //prior for beta
beta[5] ~ normal(0, 3);   //prior for beta
beta[6] ~ normal(0, 3);   //prior for beta
sigma ~ cauchy (0, 2.5);  //prior for sigma

Y ~ normal(alpha + X * beta, sigma); // likelihood
}
generated quantities {
vector[n2] y_pred;
y_pred = new_X * beta; //the y values predicted by the model
}
``````

rstan provides you with all the tools you need to solve this problem with Bayesian inference. In addition to the usual regression model of response $$y$$ in terms of predictors $$x$$, you should include a model of $$x$$ in the Stan code. This model should include the left-censoring. The Stan user manual chapter on censoring explains two different ways to do this in the Stan language. The first way is easier to incorporate into a regression model. The model for $$x$$ would look something like this (omitting the definition of N_obs and such):

``````data {
real x_obs[N_obs];
}
parameters {
real<upper=DL> x_cens[N_cens];
real x[N];
}
model {
x_obs ~ normal(mu, sigma);
x_cens ~ normal(mu, sigma);
x = append_array(x_obs, x_cens);
}
``````

The key idea is that the censored data is represented by parameters whose upper limit is the detection limit. The censored data will be sampled alongside the other parameters in the model, so the posteriors you get will automatically integrate out the censored data.