How best to deal with a left-censored predictor (because of detection limits) in a linear model?

Context: I’m new to Bayesian stats and am trying to fit a multiple regression with rstan. All variables are continuous and there is no hierarchical structure.

One of my predictors is left-censored because it falls below the detection limit for a chemical assay. What is the best way to deal with this in a multiple regression? So far, I can see a few possibilities:

  1. A substitution rule, such as ‘replace all values below the detection limit with a constant such as detection limit/2’. This is clearly not rigorous.
  2. Multiple imputation, but (i) I don’t know how to deal with the fact that values above the detection limit are likely to be generated by the imputation process, which I will know with high probability to be false, and (ii) I’m not sure how well multiple imputation plays with Bayesian approaches, since I can’t think of a good way to aggregate the posterior distributions from fits to the different imputed datasets
  3. Simulate values data from a distribution that makes sense based on prior knowledge and the data, and randomly assign values below the detection limit to the relevant points. This suffers from similar problems to #2, since I would have to simulate many sets of values, model them separately, and then figure out how to integrate the posteriors.

Am I missing better options? Are there useful Bayesian tricks that can help deal with this problem? I’m also open to non-Bayesian options.

The histogram below shows the distribution of values. The plot is on a log scale because that is most natural for this variable. For visual clarity, I have treated values below the detection limit (~25% of the data) as being 1/10 of the detection limit, and added a red line to separate them from the remaining points. Note that the red line is not the precise detection limit; the smallest quantified values to the right of the red line are at the putative limit. The fact that there are very few values exactly at the limit suggests that there may have been some variation in the detection limit between measurements, but I don’t mind if that is ignored for the purposes of this question.
Histogram of


Here’s my rstan code, in case that is helpful. Betas 1 through 4 represent main effects, 5 & 6 are interaction terms (between 1 & 3 and 2 & 4). The censored predictor is therefore present in an interaction term as well, which is a complication I neglected to mention earlier.

data {
  int<lower=0> n;       // number of data items
  int<lower=0> k;       // number of predictors
  vector[n] Y;          // outcome vector
  matrix[n,k] X;        // predictor matrix
  int n2;               //the size of the new_X matrix
  matrix[n2,k] new_X;   //the matrix for the predicted values
parameters {
  real alpha; // intercept
  vector[k] beta; // coefficients for predictors
  real<lower=0> sigma; // error scale (cauchy truncated at zero)
model {
  beta[1] ~ normal(-0.75, 1);   //prior for beta
  beta[2] ~ normal(0, 3);   //prior for beta
  beta[3] ~ normal(0, 3);   //prior for beta
  beta[4] ~ normal(0, 3);   //prior for beta
  beta[5] ~ normal(0, 3);   //prior for beta
  beta[6] ~ normal(0, 3);   //prior for beta
  sigma ~ cauchy (0, 2.5);  //prior for sigma

  Y ~ normal(alpha + X * beta, sigma); // likelihood
generated quantities {
  vector[n2] y_pred;
  y_pred = new_X * beta; //the y values predicted by the model


rstan provides you with all the tools you need to solve this problem with Bayesian inference. In addition to the usual regression model of response $y$ in terms of predictors $x$, you should include a model of $x$ in the Stan code. This model should include the left-censoring. The Stan user manual chapter on censoring explains two different ways to do this in the Stan language. The first way is easier to incorporate into a regression model. The model for $x$ would look something like this (omitting the definition of N_obs and such):

data {
  real x_obs[N_obs];
parameters {
  real<upper=DL> x_cens[N_cens];
  real x[N];
model {
  x_obs ~ normal(mu, sigma);
  x_cens ~ normal(mu, sigma);
  x = append_array(x_obs, x_cens);

The key idea is that the censored data is represented by parameters whose upper limit is the detection limit. The censored data will be sampled alongside the other parameters in the model, so the posteriors you get will automatically integrate out the censored data.

Source : Link , Question Author : mkt , Answer Author : Tom Minka

Leave a Comment