Correlating continuous clinical variables and gene expression data

In linear-kernel SVM classification analyses of a gene expression dataset (~400 variables/genes) with ~25 cases and ~25 controls, I find that the gene expression-based classifiers have very good performance characteristics. The cases and controls do not differ significantly for a number of categorical and continuous clinical/demographic variables (as per Fisher’s exact or t tests), but they do differ significantly for age.

Is there a way to show that the classification analysis results are or are not influenced by age?

I am thinking of reducing the gene expression data to principal components, and doing a Spearman correlation analysis of the components against age.

Is this a reasonable approach? Alternatively, can I check for correlation between age and the class-membership probability values obtained in the SVM analysis?



There are at least two possibilities for this data. One possibility is that your microarrays contain no disease markers whatsoever, but they do contain information about age. Since the sick and control populations in your case differ in age, you get the illusion of good classification performance. Another possibility is that the microarrays do contain disease markers and, moreover, these markers are exactly what the SVM focuses on.

It seems like the principal components of the data may be correlated with age in both of these possibilities. In the first case it will be because age is what the data expresses. In the second case it will be because disease is what the data expresses, and this disease is itself correlated with age (for your dataset). I don’t think there is an easy way to look at the correlation value and conclude which case it is.
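For reference, the correlation in question is easy to compute, even though, as noted, its value alone cannot tell the two cases apart. The following is a minimal sketch, assuming an expression matrix `X` (samples by genes) and an `age` vector. The function name `pc_age_correlations` and the synthetic data are illustrative assumptions, not the poster's data.

```python
import numpy as np
from scipy.stats import spearmanr

def pc_age_correlations(X, age, n_components=5):
    """Spearman correlation of the leading principal component scores with age."""
    Xc = X - X.mean(axis=0)                           # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # PC scores per sample
    results = []
    for k in range(n_components):
        rho, p = spearmanr(scores[:, k], age)
        results.append((k + 1, rho, p))
    return results

# Synthetic stand-in data (an assumption): 50 samples, 400 genes,
# 40 of which drift linearly with age.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 400
age = rng.uniform(30, 80, size=n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, :40] += 0.1 * (age[:, None] - age.mean())

results = pc_age_correlations(X, age)
for k, rho, p in results:
    print(f"PC{k}: rho={rho:+.2f}, p={p:.3g}")
```

A strong correlation here only says that age-related variation is present in the expression data. It does not say whether the classifier relies on it.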

I can think of several other ways to assess the effect. One option is to split your training set into groups of similar age and work within each group, so that age cannot be what separates the classes. In this case, for ‘young’ ages the normal class will have more training examples than the disease class, and vice versa for the older ages. But as long as there are enough examples, this should not be a problem. Another option is to do the same with the test sets, i.e. see whether the classifier tends to say ‘sick’ more often for older patients. Both of these options could be difficult since you don’t have that many examples.
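The test-set version of this check can be sketched as follows, assuming you already have held-out SVM decision scores (positive meaning ‘sick’), true labels coded 0/1, and ages. The helper names `auc` and `by_age_stratum`, the age bin edges, and the synthetic data are all hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def auc(y_true, scores):
    """AUC via the Mann-Whitney rank-sum statistic (ties get average ranks)."""
    y_true = np.asarray(y_true)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")                  # undefined with a single class
    ranks = rankdata(scores)
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def by_age_stratum(y_true, scores, age, edges):
    """Per age bin: class counts, AUC, and fraction predicted 'sick' (score > 0)."""
    y_true, scores, age = map(np.asarray, (y_true, scores, age))
    bins = np.digitize(age, edges)           # bin index for each sample
    rows = []
    for b in np.unique(bins):
        m = bins == b
        rows.append({
            "bin": int(b),
            "n_cases": int(y_true[m].sum()),
            "n_controls": int((1 - y_true[m]).sum()),
            "auc": auc(y_true[m], scores[m]),
            "frac_pred_sick": float((scores[m] > 0).mean()),
        })
    return rows

# Synthetic stand-in (an assumption): cases are older on average, and the
# scores track disease status rather than age.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 25)
age = rng.normal(50, 8, size=50) + 8 * y
scores = 1.5 * (2 * y - 1) + rng.normal(size=50)
for row in by_age_stratum(y, scores, age, edges=[50, 60]):
    print(row)
```

If the classifier only picks up age, the AUC within a narrow age bin should fall toward 0.5. With about 50 samples the per-bin estimates will be noisy, so treat them as a rough indication.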

One more option is to train two classifiers. In the first, the only feature is age. Based on your comment, this has an AUC of 0.82. In the second, the features are age and the microarray data. (It seems that you currently train a different classifier, which uses only the microarray data and gives an AUC of 0.95. Adding the age feature explicitly is likely to improve performance, so the AUC will be even higher.) If the second classifier performs better than the first, this indicates that age is not the only thing of interest in this data. Based on your comment, the improvement in AUC is 0.13 or more (0.95 versus 0.82), which seems fair.
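A minimal sketch of this two-classifier comparison, assuming scikit-learn and a linear-kernel SVM evaluated by cross-validation. The helper `cv_auc`, the fold count, `C=1.0`, and the synthetic data are assumptions made for illustration, not the setup used in the question.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cv_auc(features, y, seed=0):
    """AUC of a linear SVM from pooled out-of-fold decision scores."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_predict(model, features, y, cv=cv,
                               method="decision_function")
    return roc_auc_score(y, scores)

# Synthetic stand-in (an assumption): 25 controls, 25 cases, 400 genes.
# Cases are older on average and 30 genes carry a disease signal.
rng = np.random.default_rng(0)
n_per_class, n_genes = 25, 400
y = np.repeat([0, 1], n_per_class)
age = rng.normal(50, 10, size=y.size) + 8 * y
X = rng.normal(size=(y.size, n_genes))
X[:, :30] += 1.5 * y[:, None]

auc_age = cv_auc(age.reshape(-1, 1), y)             # classifier 1: age only
auc_both = cv_auc(np.column_stack([age, X]), y)     # classifier 2: age + genes
print(f"age only: {auc_age:.2f}, age + expression: {auc_both:.2f}")
```

Pooling decision scores across folds is a simplification, and with this few samples the AUC estimates have wide uncertainty. If any gene selection is done, it should happen inside each training fold so the held-out samples stay untouched.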

Source: Link, Question Author: user4045, Answer Author: SheldonCooper
