# What are some good interview questions for statistical algorithm developer candidates?

I’m interviewing people for a position of algorithm developer/researcher in a statistics/machine learning/data mining context.

I’m looking for questions to ask to determine, specifically, a candidate’s familiarity, understanding and fluency with the underlying theory, e.g. basic properties of expectation and variance, some common distributions, etc.

My current go-to question is: “There is an unknown quantity $X$ which we would like to estimate. To this end we have estimators $Y_1, Y_2, \ldots, Y_n$ which, given $X$, are all unbiased and independent, and each has a known variance $\sigma_i^2$, different for each one. Find the optimal estimator $Y=f(Y_1,\ldots, Y_n)$ which is unbiased and has minimal variance.”

I’d expect any serious candidate to handle it with ease (given some time to work out the calculations), and yet I’m surprised at how many candidates who supposedly come from relevant fields fail to make even the smallest bit of progress. I therefore consider it a good, discriminative question. Its only drawback is that it is just one question.
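For reference, the expected answer is inverse-variance weighting: take $Y = \sum_i w_i Y_i$ with $\sum_i w_i = 1$ (for unbiasedness), minimize $\mathrm{Var}(Y) = \sum_i w_i^2 \sigma_i^2$, and a Lagrange multiplier gives $w_i \propto 1/\sigma_i^2$, so the minimal variance is $1/\sum_i (1/\sigma_i^2)$. A quick simulation sketch (assuming NumPy; the particular values of $X$ and the $\sigma_i$ are arbitrary) confirms the result against the naive average:

```python
import numpy as np

rng = np.random.default_rng(0)

x = 3.0                              # the unknown quantity X
sigma = np.array([1.0, 2.0, 4.0])    # known std. deviations of the estimators
n_trials = 200_000

# n_trials independent draws of (Y_1, ..., Y_n), each unbiased for x
samples = x + sigma * rng.standard_normal((n_trials, sigma.size))

# Optimal weights: w_i proportional to 1/sigma_i^2, normalized to sum to 1
w = (1.0 / sigma**2) / np.sum(1.0 / sigma**2)

optimal = samples @ w            # inverse-variance weighted estimator
naive = samples.mean(axis=1)     # plain average, for comparison

# Both are unbiased, but the weighted estimator attains the theoretical
# minimum variance 1 / sum(1/sigma_i^2), well below the naive average's.
print("optimal:", optimal.mean(), optimal.var())
print("naive:  ", naive.mean(), naive.var())
print("theory: ", 1.0 / np.sum(1.0 / sigma**2))
```

A candidate who reaches the weights but cannot justify the unbiasedness constraint, or vice versa, is still telling you something useful.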

What other questions can be used for this? Alternatively, where can I find a collection of such questions?

What do you want your statistical developer to do?

The US Army says “train as you will fight, because you will fight as you were trained.” Test candidates on what you want them to do all day long. Ultimately, you want them to “create value” or “make money” for the company.

Boss 101

Think “show me the money.”

• Money grows on trees called employees. You put in a “dime” (their wages) and they pay you a “quarter” (their value).
• If you can’t relate their job to how they make money for the company then neither you nor they are doing their job correctly.

Note: If your symbolic manipulation question doesn’t cleanly connect to the “money” then you might be asking the wrong question.

There are 3 things every employee has to do to be an employee:

• Be actually able to do the job
• Work well with the team
• Be willing/motivated to actually do the job

If you don’t get these down rock solid, no other answer is going to do you any good.

If you can replace them with a good piece of software or a well-trained teenager, then you will eventually have to do it, and it will cost you.

Data 101

What they should be able to do:

• use your internal flavors of software (network, OS, office, presentation, and analysis)
• use some industry-standard flavors of software (Excel, R, JMP, MATLAB, pick_three)
• get the data themselves. They should know the basic data sets for basic tasks, know the repositories, and know which famous data set is used for which task: Fisher Iris, Pearson Crab, … there are perhaps 20 elements that should go here. UCI, NIST, NOAA.
• know the rules of handling data. Binary data (T/F) has very different information content than categorical (A, B, C, D) or continuous data. Proper handling of the data by data type is important.
• handle a few basic statistical tasks: are these two the same or different (aka cluster/classify), how does this relate to that (regression/fitting, including linear models, GLM, radial basis, difference equations), is it true that “x” (hypothesis testing), how many samples do I need (acceptance sampling), how do I get the most data from few/cheap/efficient experiments (statistical Design of Experiments)? Disclaimer: I’m an engineer, not a statistician. You might ask: how do you test that the statistician can do these efficiently and correctly?
• access/use the data themselves. This is about formats and tools. They should be able to read from CSV, xlsx (Excel), SQL, and pictures (HDF5, Rdata). If you have a custom format, they should be able to read through it and work with the tools quickly and efficiently. They should know the strengths and weaknesses of each format: CSV is quick to use, has been around forever, and is fast to prototype with, but it is bloated, inefficient, and slow to run.
• process the data properly, using best practices, and not committing
sins. Don’t throw away data, ever. Don’t fit binomial data with a
continuous line. Don’t defy physics.
• come up with results that are repeatable and reproducible. Some folks say “there are lies, damn lies, and statistics” — but not at my company. The same good input gives the same good output. The output isn’t a number; it is always a business decision that informs a technical action and results in a business result. Different tests may set the dial at 5.5 or 6.5, but the process capability is always above 1.33.
• present findings in the language, and at the level, that the decision makers, and/or minion-developers, and/or themselves in a year, can understand with the least errors.
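An interview-style check of the data-handling points above could be as simple as the following sketch (standard library only; the column names and values are made up for illustration): read a small CSV and summarize each column according to its type — a proportion for binary, counts for categorical, mean and standard deviation for continuous.

```python
import csv
import io
import statistics

# Hypothetical mini data set: one binary, one categorical, one continuous column.
raw = """passed,grade,height_cm
T,A,171.2
F,B,165.0
T,A,180.3
T,C,176.8
F,B,169.5
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Binary (T/F): the right summary is a proportion, not a mean of arbitrary codes.
p_passed = sum(r["passed"] == "T" for r in rows) / len(rows)

# Categorical (A/B/C): counts per level; an "average grade letter" is meaningless.
grade_counts = {}
for r in rows:
    grade_counts[r["grade"]] = grade_counts.get(r["grade"], 0) + 1

# Continuous: mean and sample standard deviation are appropriate summaries.
heights = [float(r["height_cm"]) for r in rows]
mean_h = statistics.mean(heights)
sd_h = statistics.stdev(heights)

print(p_passed, grade_counts, mean_h, round(sd_h, 2))
```

A candidate who reaches for the same summary statistic for all three columns is exactly the failure mode the data-type bullet warns about.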

Analytic zingers:

I think impossible questions are great. They are impossible for a reason. Being able to recognize out of the gate that something is impossible is a good thing. Knowing why, having some ways of engaging with it, or being able to ask a different question is even better.