Two worlds collide: Using ML for complex survey data

I am stuck on a seemingly easy problem, and I haven’t found a suitable solution for several weeks now.

I have quite a lot of poll/survey data (tens of thousands of respondents, say 50k per dataset), coming from what I hope is called a complex survey design, with weights, stratification, specific routing and so on. For each respondent, there are hundreds of variables: demographics (age, region…) and then mostly binary (at most, categorical) variables.

I come from more of a computer science/machine learning background, and I had to learn a lot about classical survey statistics and methodology. Now I want to apply classical machine learning to those data (e.g. predicting some missing values for a subset of respondents, basically a classification task). But, lo and behold, I cannot find a suitable way to do that. How should I incorporate the strata, weights or routing (like: if question 1 is answered with option 2, ask question 3, otherwise skip it)?

Simply applying my models (trees, logistic regression, SVM, XGBoost…) seems dangerous (and they fail in most cases), since they usually assume the data come from a simple random sample or are iid.

A lot of methods at least accept weights, but that doesn’t help much. Furthermore, it is unclear how I should combine weights for imbalanced classes with the weights given by the survey definition, not to mention the stratification. In addition, the resulting models should be well calibrated: the predicted distribution should be very close to the original one. Good predictive performance isn’t the only criterion here. I changed the optimisation metric to take this into account as well (such as distance of the predicted distribution from the true distribution + accuracy/MCC), and it helped in some cases while crippling the performance in others.
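To make the kind of combined metric described above concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: total variation distance stands in for the unspecified "distance" between distributions, survey weights are applied to both the accuracy and the class distributions, and the mixing parameter alpha is hypothetical.

```python
def weighted_distribution(labels, weights, classes):
    """Survey-weighted share of each class among the given labels."""
    total = sum(weights)
    return {
        c: sum(w for lab, w in zip(labels, weights) if lab == c) / total
        for c in classes
    }


def total_variation(p, q):
    """Total variation distance between two distributions on the same classes."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


def weighted_accuracy(y_true, y_pred, weights):
    """Share of the total survey weight that is classified correctly."""
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / sum(weights)


def combined_score(y_true, y_pred, weights, alpha=0.5):
    """Blend of weighted accuracy and distribution closeness (higher is better).

    alpha is a hypothetical tuning parameter: 1.0 means accuracy only,
    0.0 means only closeness of the predicted class distribution.
    """
    classes = sorted(set(y_true) | set(y_pred))
    p_true = weighted_distribution(y_true, weights, classes)
    p_pred = weighted_distribution(y_pred, weights, classes)
    closeness = 1.0 - total_variation(p_true, p_pred)
    return alpha * weighted_accuracy(y_true, y_pred, weights) + (1 - alpha) * closeness


# Toy data: four respondents with survey weights.
y_true = [1, 0, 0, 1]
y_pred = [1, 0, 1, 1]
weights = [2.0, 1.0, 1.0, 4.0]
score = combined_score(y_true, y_pred, weights, alpha=0.5)
```

Other distances (for example KL divergence) or MCC in place of accuracy would slot into the same structure.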

Is there a canonical way to deal with this problem? It seems to me a heavily underappreciated area of research. IMO many surveys could benefit from ML’s power, but there are no sources. It is as if these are two worlds not interacting with each other.

What I have found so far:

Related CV questions, but none of them contains a usable answer on how to approach this (either no answer, not what I am asking for, or misleading recommendations):


(Update: There isn’t very much work yet on “modern” ML methods with complex survey data, but the most recent issue of Statistical Science has a couple of review articles.
See especially Breidt and Opsomer (2017), “Model-Assisted Survey Estimation with Modern Prediction Techniques”.

Also, based on the Toth and Eltinge paper you mentioned, there is now an R package rpms implementing CART for complex-survey data.)

Now I want to apply classical machine learning to those data (e.g. predicting some missing values for a subset of respondents, basically a classification task).

I’m not fully clear on your goal.
Are you primarily trying to impute missing observations, just to have a “complete” dataset to give someone else? Or do you have complete data already, and you want to build a model to predict/classify new observations’ responses? Do you have a particular question to answer with your model(s), or are you data-mining more broadly?

In either case, complex-sample-survey / survey-weighted logistic regression is a reasonable, pretty well-understood method. There’s also ordinal regression for more than 2 categories. These will account for strata and survey weights. Do you need a fancier ML method than this?

For example, you could use svyglm in R’s survey package. Even if you don’t use R, the package author, Thomas Lumley, also wrote a useful book, “Complex Surveys: A Guide to Analysis Using R”, which covers both logistic regression and missing data for surveys.
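For readers outside R, the point-estimate part of survey-weighted logistic regression can be sketched in plain Python. This is not svyglm’s implementation, only an illustration of the idea: coefficients are chosen to maximise a log-likelihood in which each respondent’s contribution is multiplied by their survey weight. The function name, learning rate and iteration count are assumptions made for this sketch, and it produces no standard errors; design-based standard errors also need the strata and cluster information, which is what a dedicated survey package handles.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weighted_logistic(X, y, w, n_iter=2000, lr=1.0):
    """Fit logistic regression by gradient ascent on the weighted log-likelihood.

    X: list of feature rows (include a constant 1.0 column for the intercept)
    y: list of 0/1 outcomes
    w: list of survey weights, one per respondent
    """
    n_features = len(X[0])
    beta = [0.0] * n_features
    total_w = sum(w)
    for _ in range(n_iter):
        grad = [0.0] * n_features
        for xi, yi, wi in zip(X, y, w):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j, xij in enumerate(xi):
                # Each respondent's score contribution is scaled by its weight.
                grad[j] += wi * (yi - p) * xij
        beta = [b + lr * g / total_w for b, g in zip(beta, grad)]
    return beta


# Intercept-only toy example: the fitted probability should match the
# survey-weighted proportion of y (0.6), not the unweighted one (0.4).
X = [[1.0]] * 5
y = [1, 0, 0, 1, 0]
w = [10.0, 1.0, 1.0, 2.0, 6.0]
beta = fit_weighted_logistic(X, y, w)
fitted_p = sigmoid(beta[0])
```

The intercept-only case is a useful sanity check, because there the weighted fit must reproduce the weighted proportion of positive responses.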

(For imputation, I hope you’re already familiar with general issues around missing data. If not, look into approaches like multiple imputation to help you account for how the imputation step affects your estimates/predictions.)

Question routing is indeed an additional problem. I’m not sure how best to deal with it. For imputation, perhaps you can impute one “step” in the routing at a time. E.g. using a global model, first impute everyone’s answer to “How many kids do you have?”; then run a new model on the relevant sub-population (people with more than 0 kids) to impute the next step of “How old are your kids?”
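A rough sketch of this step-by-step idea in plain Python follows. Everything here is an assumption for illustration: the field names, the "skipped" marker, and especially the weighted-mode fill, which is only a stand-in for whatever real model would be fitted at each step.

```python
from collections import defaultdict

SKIPPED = "skipped"  # hypothetical marker: question not asked because of routing


def weighted_mode(values, weights):
    """Most common value by total survey weight (placeholder for a real model)."""
    totals = defaultdict(float)
    for v, w in zip(values, weights):
        totals[v] += w
    return max(totals, key=totals.get)


def impute_routed(rows, gate, follow_up, asks_follow_up, weight="weight"):
    """Impute the gate question for everyone, then the follow-up only where routed.

    rows: list of dicts; None marks a missing answer
    asks_follow_up: function mapping a gate answer to True if the follow-up is asked
    """
    out = [dict(r) for r in rows]  # work on copies, leave the input untouched

    # Step 1: "global" model for the gate question, fitted on all observed answers.
    seen = [r for r in out if r[gate] is not None]
    if seen:
        fill = weighted_mode([r[gate] for r in seen], [r[weight] for r in seen])
        for r in out:
            if r[gate] is None:
                r[gate] = fill

    # Step 2: model for the follow-up, fitted only on the routed sub-population.
    routed = [r for r in out if r[gate] is not None and asks_follow_up(r[gate])]
    seen = [r for r in routed if r[follow_up] is not None]
    fill = None
    if seen:
        fill = weighted_mode([r[follow_up] for r in seen], [r[weight] for r in seen])
    for r in out:
        if r[gate] is not None and not asks_follow_up(r[gate]):
            r[follow_up] = SKIPPED  # structurally missing, not imputed
        elif r[follow_up] is None:
            r[follow_up] = fill
    return out


rows = [
    {"kids": 2, "kid_age": "6-12", "weight": 1.5},
    {"kids": 0, "kid_age": None, "weight": 2.0},
    {"kids": None, "kid_age": None, "weight": 1.0},
    {"kids": 2, "kid_age": "0-5", "weight": 1.0},
    {"kids": 1, "kid_age": None, "weight": 0.5},
]
completed = impute_routed(rows, "kids", "kid_age", lambda kids: kids > 0)
```

Keeping a separate marker for routed-out questions distinguishes "not asked" from "asked but unanswered", so the second-step model is never trained or applied outside its sub-population.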

Source : Link , Question Author : kotrfa , Answer Author : civilstat
