Application of machine learning techniques in small sample clinical studies

What do you think about applying machine learning techniques, like Random Forests or penalized regression (with L1 or L2 penalty, or a combination thereof) in small sample clinical studies when the objective is to isolate interesting predictors in a classification context? It is not a question about model selection, nor am I asking about how to find optimal estimates of variable effect/importance. I don’t plan to do strong inference but just to use multivariate modeling, hence avoiding testing each predictor against the outcome of interest one at a time, and taking their interrelationships into account.

I was just wondering whether such an approach has already been applied in this particular extreme case, say 20-30 subjects with data on 10-15 categorical or continuous variables. It is not exactly the $n\ll p$ case, and I think the problem here is related to the number of classes we try to explain (which are often not well balanced) and the (very) small n. I am aware of the huge literature on this topic in the context of bioinformatics, but I didn’t find any reference related to biomedical studies with psychometrically measured phenotypes (e.g. through neuropsychological questionnaires).

Any hint or pointers to relevant papers?


I am open to any other solutions for analyzing this kind of data, e.g. C4.5 algorithm or its derivatives, association rules methods, and any data mining techniques for supervised or semi-supervised classification.


I haven’t seen this used outside of bioinformatics/machine learning either, but maybe you can be the first one 🙂

As a good representative of small-sample methods from bioinformatics, logistic regression with L1 regularization can give a good fit when the number of parameters is exponential in the number of observations, and non-asymptotic confidence intervals can be constructed using Chernoff-type inequalities (see, for example, Dudik (2004)). Trevor Hastie has done some work applying these methods to identifying gene interactions. In the paper below, he uses it to identify significant effects from a model with 310,637 adjustable parameters fit to a sample of 2,200 observations:

“Genome-wide association analysis by lasso penalized logistic regression.”
Authors: Hastie, T.; Sobel, E.; Wu, T. T.; Chen, Y. F.; Lange, K.
Bioinformatics, Vol. 25, Issue 6, ISSN 1367-4803, 03/2009, pp. 714–721

Related presentation by Victoria Stodden: “Model Selection with Many More Variables than Observations”
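
To make the L1 idea concrete at the sample size from the question, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are synthetic (30 subjects, 12 predictors, only the first two related to the outcome), the function names `fit_l1_logistic` and `soft_threshold` are made up, the penalty values are arbitrary, and the fitting method is simple proximal gradient descent rather than whatever was used in the paper above. For real work an established implementation such as glmnet in R or scikit-learn in Python would be the better choice.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of t * |v|: shrink v toward zero by t, clip at zero.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Minimise mean log-loss + lam * ||w||_1 by proximal gradient descent.

    The intercept b is left unpenalised. Returns (w, b).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            # Residual: predicted probability minus observed label.
            r = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += r / n
            for j in range(p):
                grad_w[j] += r * xi[j] / n
        b -= step * grad_b
        # Gradient step on the smooth part, then soft-thresholding for the L1 part.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Synthetic toy data (hypothetical, not from any real study).
random.seed(0)
n, p = 30, 12
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
# Assumed ground truth: only predictors 0 and 1 influence the outcome.
y = [1 if random.random() < sigmoid(2.0 * xi[0] - 2.0 * xi[1]) else 0
     for xi in X]

for lam in (0.02, 0.1, 10.0):
    w, b = fit_l1_logistic(X, y, lam)
    selected = [j for j, wj in enumerate(w) if wj != 0.0]
    print(f"lambda={lam}: selected predictors {selected}")
```

With the largest penalty every coefficient is shrunk exactly to zero, and smaller penalties let more predictors in. With n this small the selected set is likely to move around from one resample to the next, so refitting on bootstrap samples and looking at how often each predictor is retained is probably more informative than any single fit.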

Source: Link, Question Author: chl, Answer Author: Yaroslav Bulatov
