I’m interviewing people for an algorithm developer/researcher position in a statistics/machine-learning/data-mining context.

I’m looking for questions to ask to determine, specifically, a candidate’s familiarity with, understanding of, and fluency in the underlying theory, e.g. basic properties of expectation and variance, some common distributions, etc.

My current go-to question is: “There is an unknown quantity X which we would like to estimate. To this end we have estimators Y₁, Y₂, …, Yₙ which, given X, are all unbiased and independent, and each has a known variance σᵢ², different for each one. Find the optimal estimator Y = f(Y₁, …, Yₙ) which is unbiased and has minimal variance.”

I’d expect any serious candidate to handle it with ease (given some time to work out the calculations), and yet I’m surprised at how many candidates who supposedly come from relevant fields fail to make even the smallest bit of progress. I thus consider it a good, discriminative question. The only problem is that it is just one question.
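(For reference, the intended answer is the inverse-variance-weighted mean, Y = Σᵢ (Yᵢ/σᵢ²) / Σᵢ (1/σᵢ²). A small simulation, with X and the σᵢ values made up purely for illustration, checks both the theoretical variance of that combination and the claim that it beats the naive average:)

```python
import random

random.seed(0)

X = 10.0                  # the unknown quantity to estimate (made-up value)
sigmas = [1.0, 2.0, 4.0]  # illustrative standard deviations, one per estimator
weights = [1.0 / s**2 for s in sigmas]
total_w = sum(weights)

# Theoretical variance of the inverse-variance-weighted combination:
# Var(Y) = 1 / sum(1/sigma_i^2)
opt_var = 1.0 / total_w
# Theoretical variance of the naive equal-weight average:
# Var(mean) = sum(sigma_i^2) / n^2
naive_var = sum(s**2 for s in sigmas) / len(sigmas) ** 2

# Empirical check: simulate many rounds of unbiased, independent estimates
trials = 20000
combined = []
for _ in range(trials):
    draws = [random.gauss(X, s) for s in sigmas]
    combined.append(sum(w * y for w, y in zip(weights, draws)) / total_w)

mean_est = sum(combined) / trials
emp_var = sum((e - mean_est) ** 2 for e in combined) / trials

print(opt_var < naive_var)  # the weighted estimator has strictly lower variance
```

The empirical variance lands close to the theoretical optimum, and the estimator stays unbiased, which is exactly what a candidate should be able to derive with Lagrange multipliers or by completing the square.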

What other questions can be used for this? Alternatively, where can I find a collection of such questions?

**Answer**

What do you want your statistical developer to do?

The US Army says “train as you will fight, because you will fight as you were trained.” Test candidates on what you want them to do all day long. Really, you want them to “create value” or “make money” for the company.

**Boss 101**

Think “show me the money.”

- Money grows on trees called employees. You put in a “dime” (their wages) and they pay you a “quarter” (their value).
- If you can’t relate their job to how they make money for the company then neither you nor they are doing their job correctly.

Note: If your symbolic manipulation question doesn’t cleanly connect to the “money” then you might be asking the wrong question.

There are 3 things every employee has to do to be an employee:

- Be actually able to do the job
- Work well with the team
- Be willing/motivated to actually do the job

If you don’t get these down rock solid, no other answer is going to do you any good.

If you can replace them with a good piece of software or a well-trained teenager, then you will eventually have to do it, and it will cost you.

**Data 101**

What they should be able to do:

- use your internal flavors of software (network, OS, office, presentation, and analysis)
- use some industry-standard flavors of software (Excel, R, JMP, MatLab, pick_three)
- get the data themselves. They should know the basic data sets for basic tasks. They should know the repositories. They should know which famous data is used for which task: Fisher Iris, Pearson Crab, … there are perhaps 20 elements that should go here. UCI, NIST, NOAA.
- know the rules of handling data. Binary data (T/F) has very different information content than categorical (A, B, C, D) or continuous data. Proper handling of the data by data type is important.
- cover the basic statistical tasks. A few of these include: are these two the same or different (aka cluster/classify); how does this relate to that (regression/fitting, including linear models, GLM, radial basis, difference equations); is it true that “x” (hypothesis testing); how many samples do I need (acceptance sampling); how do I get the most data from few/cheap/efficient experiments (statistical design of experiments). *Disclaimer: I’m an engineer, not a statistician.* You might ask them the question “what are the different fundamental tasks, and how do you test that the statistician can do them efficiently and correctly?”
- access/use the data themselves. This is about formats and tools. They should be able to read from CSV, XLSX (Excel), SQL, and pictures (HDF5, RData). If you have a custom format, they should be able to read through it and work with the tools quickly and efficiently. They should know the strengths and weaknesses of each format: CSV is quick to use, has been around forever, and is fast to prototype with, but it is bloated, inefficient, and slow to run.
- process the data properly, using best practices, and without committing sins. Don’t throw away data, ever. Don’t fit binomial data with a continuous line. Don’t defy physics.
- come up with results that are repeatable and reproducible. Some folks say “there are lies, damned lies, and statistics,” but not at my company. The same good input gives the same good output. The output isn’t a number; it is always a business decision that informs a technical action and results in a business result. Different tests may set the dial at 5.5 or 6.5, but the capability is always above 1.33.
- present findings in the language, and at the level, that the decision makers, and/or minion-developers, and/or themselves in a year, can understand with the least errors. A beautiful thing is being able to explain it so your grandma gets it. This (link) is my answer, but I like it.
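As a tiny illustration of the “are these two the same or different” task, here is a Welch t-statistic computed with nothing but the standard library (the two samples below are made up for the sketch):

```python
import math
import statistics

# Hypothetical measurements from two production lines (made-up numbers)
a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [5.6, 5.4, 5.7, 5.5, 5.8, 5.3, 5.6, 5.5]

def welch_t(x, y):
    """Welch's t-statistic: compares two means without assuming equal variances."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx / len(x) + vy / len(y))

t = welch_t(a, b)
print(abs(t) > 2)  # a large |t| is evidence that the two lines differ
```

A candidate at this level should be able to both compute such a statistic from scratch and explain to a decision maker, in plain language, what the result does and does not imply.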

**Analytic zingers:**

I think impossible questions are great. They are impossible for a reason. Being able to recognize out of the gate that something is impossible is a good thing. Knowing why, having some ways of engaging with it, or being able to ask a different question can be better.

Other CV questions. (link)

On reddit. (link)

Others. (link)

BTW: this was a good question. I might have to update this answer over time.

**Attribution**
*Source: Link, Question Author: Community, Answer Author: 7 revs*