How will you deal with “don’t know” and “missing data” in survey data?

As the title says, I am thinking of merging both into “missing data”, i.e. coding them as NA in R, since I don’t see that it makes much sense (or even any sense) to separate out the “don’t know” row and compare it with the other rows.

Is it OK for me to do so?


Well, you should also consider that “don’t know” is at least some kind of answer, whereas non-response is a purely missing value. We often allow a “don’t know” response in surveys precisely to avoid forcing people to provide an answer anyway (which might bias the results). For example, in the National Health and Nutrition Examination Survey, the two are coded differently but subsequently discarded from the analysis.

You could try analyzing the data both ways: (1) treat “don’t know” as a specific response category and handle the full set of responses with some kind of multivariate data analysis (e.g. multiple correspondence analysis or multiple factor analysis for mixed data, see the FactoMineR package), and (2) if that doesn’t show any evidence of distortion in the item distributions, just merge it with the missing values.
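As a first look before any multivariate analysis, you can simply tabulate an item both ways and see how much of the “missing” mass is actually “don’t know”. The sketch below is in plain Python rather than R and is far cruder than the MCA/MFA mentioned above; the label `"don't know"`, the use of `None` for missing values, and the toy item are all assumptions made for illustration.

```python
from collections import Counter

DONT_KNOW = "don't know"  # hypothetical label used for this sketch


def tabulate(responses, merge_dont_know=False):
    """Relative frequency of each response; None marks a missing value."""
    if merge_dont_know:
        # Recode "don't know" as missing before counting.
        responses = [None if r == DONT_KNOW else r for r in responses]
    counts = Counter(responses)
    n = len(responses)
    return {answer: count / n for answer, count in counts.items()}


# Made-up answers to a single item, for illustration only.
item = ["yes", "no", DONT_KNOW, None, "yes",
        "yes", DONT_KNOW, "no", None, "yes"]

separate = tabulate(item)                      # "don't know" kept as a category
merged = tabulate(item, merge_dont_know=True)  # "don't know" folded into missing

for label, table in (("separate", separate), ("merged", merged)):
    print(label, table)
```

If “don’t know” makes up a large share of an item, or its share varies a lot from item to item, that is a hint that merging may hide something.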

For (2), I would also suggest checking that “don’t know” and missing values (MV) are at least missing at random (MAR), or that they are not specific to one group of respondents (e.g. male/female, age class, SES, etc.).
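One rough way to check the second point is to cross-tabulate a “don’t know or missing” indicator against a grouping variable and compute a Pearson chi-square statistic. The sketch below uses only the Python standard library; the group labels, the `"don't know"` label and the data are made up, and the p-value helper is only valid for two groups (1 degree of freedom). Note that this only looks at association with an observed covariate; it cannot establish MAR on its own.

```python
import math
from collections import defaultdict

DONT_KNOW = "don't know"  # hypothetical label used for this sketch


def nonsubstantive_by_group(groups, responses):
    """Per group, count [substantive answers, don't-know-or-missing]."""
    table = defaultdict(lambda: [0, 0])
    for group, answer in zip(groups, responses):
        is_nonsubstantive = answer is None or answer == DONT_KNOW
        table[group][1 if is_nonsubstantive else 0] += 1
    return dict(table)


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x 2 table.

    Assumes both column totals are non-zero.
    """
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / total
            stat += (row[j] - expected) ** 2 / expected
    return stat, len(rows) - 1


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(stat / 2))


# Made-up data: 10 male and 10 female respondents answering one item.
groups = ["male"] * 10 + ["female"] * 10
responses = (["yes"] * 5 + ["no"] * 3 + [DONT_KNOW] + [None]
             + ["yes"] * 2 + ["no"] * 2 + [DONT_KNOW] * 4 + [None] * 2)

table = nonsubstantive_by_group(groups, responses)
stat, df = chi_square(table)
print(table, round(stat, 3), df, round(p_value_df1(stat), 3))
```

With cell counts this small the chi-square approximation is shaky, so on real data with sparse cells an exact test would be the safer choice.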

Source: Link, Question Author: lokheart, Answer Author: chl
