Sampling model for crowdsourced data?

I’m working on an open health survey application, intended for use in a developing country.

The basic idea is that survey interviews are crowdsourced: they are performed by unorganized volunteers who use their mobile devices to submit form data from the interviews they conduct, and each submission is accompanied by the GPS coordinates of the interview location.

Traditional surveys compiled by government agencies are usually implemented using some standard sampling model, typically a probability sampling model. This requires a lot of centralized planning that is not always feasible. (I mention this to put my question in the right context.)

We can say that each volunteer will carry out a convenience sample around their own area, interviewing an arbitrary number of people they can reach.

The basic problem is: how can we understand and characterize the overall sampling model of this surveying system? Are there any methodologies or composite models that deal with such cases?


Short answer: This is a convenience sample. There is nothing you can do to justify it.

A somewhat longer answer: you are in the same boat as many social networks that run their internal surveys without having much idea as to who would respond to a one-question survey that appears randomly on Facebook or Google+… except that, unlike these giants, you don’t have any data on those who did not respond. The survey and public opinion research community generally frowns upon this type of work, as it is not at all clear how the results of these heavily biased samples can be generalized to the total population (if at all). You can attempt to reweight according to known demographics, but then you will end up with weights ranging from 1, for a person who only represents themselves, to 1,000,000, assigned to the only 70+ male in the population who knows how to use a computer (and who is likely not representative of the remaining 1,000,000 70+ males anyway).
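To make the weight-variation problem concrete, here is a minimal sketch in Python. It assumes simple post-stratification (each respondent's weight is the population count of their demographic cell divided by the number of respondents in that cell) and uses Kish's approximation for the effective sample size, (Σw)² / Σw². The cell names and counts are made up for illustration, and the function names are hypothetical; none of this comes from a real survey.

```python
from collections import Counter


def poststratification_weights(sample_cells, population_counts):
    """Weight per respondent: population count of their cell
    divided by the number of respondents observed in that cell."""
    sample_counts = Counter(sample_cells)
    return [population_counts[cell] / sample_counts[cell] for cell in sample_cells]


def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical sample: 99 young urban respondents and a single 70+ male.
sample_cells = ["young_urban"] * 99 + ["male_70_plus"]

# Hypothetical population counts for the same two cells.
population_counts = {"young_urban": 99_000, "male_70_plus": 1_000_000}

weights = poststratification_weights(sample_cells, population_counts)
n_eff = kish_effective_sample_size(weights)

print("smallest weight:", min(weights))
print("largest weight:", max(weights))
print("effective sample size:", n_eff)
```

With these made-up numbers, 100 interviews carry about as much information as a little more than one interview, because a single respondent dominates the weighted total. Reweighting can match the demographic margins on paper, but it cannot fix the fact that this one respondent may not resemble the people they are standing in for.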

Additional reading: “How To Lie With Statistics” opens with a chapter on biased samples. If you can read it and not weep in frustration about your sample design, you can move on. If you rely on volunteers, your sample will be biased towards young and urban populations with better access to electronic gadgets. Likewise, the “What is a Survey” booklet put together by Fritz Scheuren, past president of the American Statistical Association, opens with a picture of Harry Truman, whose victory could not have been predicted by the biased polling techniques that existed at the time.

There is some research on hard-to-reach populations. One well-known project was a study of the number of excess deaths in Iraq, in which geographical areas were sampled and, in each area, the local doctor would try to solicit interviews from every household in the city block. There has been mounting criticism of this design, but however compromised it was, it still had a sampling component. See papers in The Lancet (as you probably know, you cannot get any more prestigious in the medical world) and

Source : Link , Question Author : al-Amjad Tawfiq Isstaif , Answer Author : StasK
