# What exactly are censored data?

I have read different descriptions of censored data:

A) As explained in this thread, unquantified data below or above a certain threshold is censored. Unquantified means the data is known to lie above or below a certain threshold, but we do not know the exact value. The data is then recorded at the low or high threshold value in the regression model. It matches the description in this presentation, which I’ve found very clear (2nd slide on first page). In other words, $Y$ is capped at a minimum value, a maximum value, or both, because we do not know the true value outside of that range.

B) A friend told me that we can apply a censored data model to partially unknown $Y$ observations, provided we have at least some limit information about the unknown $Y_i$ outcomes. For example, we want to estimate the final price for a mix of silent and open auctions based on some qualitative criteria (type of goods, country, bidders' wealth, etc.). While for the open auctions we know all final prices $Y_i$, for the silent auctions we only know the first bid (say, \$1,000) but not the final price. I was told that in this case the data is censored from above and a censored regression model should be applied.

C) Finally, there is the definition given by Wikipedia, where $Y$ is missing altogether but the predictors are available. I’m not sure how this example is different from truncated data.

So what exactly are censored data?

Consider the following data on an outcome $y$ and a covariate $x$:

| user | $y$ | $x$ |
|------|-----|-----|
| 1 | 10 | 2 |
| 2 | (-∞,5] | 3 |
| 3 | [4,+∞) | 5 |
| 4 | [8,9] | 7 |
| 5 | . | . |

For user 1, we have complete data. For everyone else, the data are incomplete. Users 2, 3, and 4 are all censored: the outcome corresponding to known values of the covariate is either not observed or not observed exactly (they are left-, right-, and interval-censored, respectively). Sometimes this is an artifact of privacy considerations in survey design. At other times, it happens for other reasons. For instance, we don’t observe any wages below the minimum wage, or the actual demand for concert tickets above the arena capacity.

User 5 is truncated: both the outcome and the covariate are missing. This usually happens because we only collect data on people who did something. For instance, we only survey people who bought something ($y>0$), so we exclude anyone with $y=0$ along with their $x$s. We may not even have a row for this type of user in our data, though we know they exist because we know the rule that was used to generate our sample. Another example is incidental truncation: we only observe wage offers for people who are in the workforce, because we assume that the wage offer is the wage when you are working. The truncation is incidental since it depends not on $y$, but on another variable.

In short, truncation implies a greater information loss than censoring (points A & B). Both of these types of “missingness” are systematic.
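To make the difference in information concrete, here is a sketch of the likelihood contribution of each kind of row. It assumes a parametric conditional density $f(y \mid x;\theta)$ with CDF $F(y \mid x;\theta)$ and a continuous outcome; this notation is introduced here for illustration and is not part of the question.

$$
L_i(\theta) =
\begin{cases}
f(y_i \mid x_i;\theta) & \text{exact value observed (user 1)} \\
F(c_i \mid x_i;\theta) & \text{left-censored, } y_i \le c_i \text{ (user 2)} \\
1 - F(c_i \mid x_i;\theta) & \text{right-censored, } y_i \ge c_i \text{ (user 3)} \\
F(b_i \mid x_i;\theta) - F(a_i \mid x_i;\theta) & \text{interval-censored, } a_i \le y_i \le b_i \text{ (user 4)}
\end{cases}
$$

Under truncation (say, only rows with $y > 0$ are sampled), rows like user 5 contribute nothing at all, and each row that is observed has to be renormalised by the probability of making it into the sample:

$$
L_i(\theta) = \frac{f(y_i \mid x_i;\theta)}{1 - F(0 \mid x_i;\theta)}, \qquad y_i > 0 .
$$

A censored row still contributes a probability that depends on its $x_i$, whereas a truncated row leaves no trace other than this correction to the rows that remain.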

Working with this type of data typically involves making a strong distributional assumption about the error term and modifying the likelihood to account for the censoring or truncation. More flexible semi-parametric approaches are also possible. This is implicit in your point B.
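As a rough illustration of what "modifying the likelihood" can look like, here is a minimal sketch in Python. It assumes a linear model $y = a + b x + \varepsilon$ with normal errors, which is only one possible choice, and it stores each outcome as a `[lower, upper]` pair like the table above. The simulated data, the thresholds (3, 9, and the band from 5 to 6), and the function name `neg_loglik` are all made up for the example.

```python
import numpy as np
from scipy import optimize, stats


def neg_loglik(params, x, lower, upper):
    """Negative log-likelihood for y = a + b*x + e, e ~ N(0, sigma^2),
    when y is only known to lie in [lower, upper]."""
    a, b, log_sigma = params
    sigma = np.exp(log_sigma)          # optimise log(sigma) so sigma stays positive
    mu = a + b * x

    exact = lower == upper             # value observed exactly
    left = np.isneginf(lower)          # only an upper bound is known
    right = np.isposinf(upper)         # only a lower bound is known
    interval = ~(exact | left | right)

    ll = np.zeros_like(mu)
    ll[exact] = stats.norm.logpdf(lower[exact], mu[exact], sigma)
    ll[left] = stats.norm.logcdf(upper[left], mu[left], sigma)
    ll[right] = stats.norm.logsf(lower[right], mu[right], sigma)
    p = (stats.norm.cdf(upper[interval], mu[interval], sigma)
         - stats.norm.cdf(lower[interval], mu[interval], sigma))
    ll[interval] = np.log(np.clip(p, 1e-300, None))  # guard against log(0)
    return -ll.sum()


# Simulated data (assumed truth: a = 1, b = 2, sigma = 1).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)

# Hide part of the outcome, mimicking the three kinds of censoring.
lower, upper = y.copy(), y.copy()
low = y < 3
lower[low], upper[low] = -np.inf, 3.0        # left-censored
high = y > 9
lower[high], upper[high] = 9.0, np.inf       # right-censored
band = (y > 5) & (y < 6)
lower[band], upper[band] = 5.0, 6.0          # interval-censored

# Starting values from least squares on the exactly observed rows only.
obs = lower == upper
b0, a0 = np.polyfit(x[obs], lower[obs], 1)
resid = lower[obs] - (a0 + b0 * x[obs])
start = np.array([a0, b0, np.log(resid.std())])

fit = optimize.minimize(neg_loglik, start, args=(x, lower, upper),
                        method="Nelder-Mead")
a_hat, b_hat = fit.x[:2]
sigma_hat = np.exp(fit.x[2])
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}, sigma = {sigma_hat:.2f}")
```

Exactly observed rows enter through the density, while censored rows enter through the probability of the region they are known to fall in. Dropping the censored rows and running least squares on the rest (which is what the starting values do) would generally give biased estimates, because the retained rows are selected on $y$.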