Automated procedure for selecting subset of data points w/ strongest correlation?

Is there some standard procedure (such that one might cite it as a reference) for selecting the subset of data points from a larger pool with the strongest correlation (along just two dimensions)?

For instance, say you have 100 data points. You want a subset of 40 points with the strongest correlation possible along the X and Y dimensions.

I realize that writing code to do this would be relatively straightforward, but I’m wondering if there’s any source to cite for it?


I would say that your method fits into the general category described in this Wikipedia article, which also has other references if you need something more than just Wikipedia. Some of the links within that article would also apply.

Other terms that could apply (if you want to do some more searching) include “Data Dredging” and “Torturing the data until it confesses”.

Note that you can always get a perfect correlation (+1 or -1) if you choose just 2 points that differ in both their x and y values. There was an article in Chance magazine a few years back showing that when you have an x and a y variable with essentially no correlation, you can find a way to bin the x’s and average the y’s within the bins to show either an increasing or a decreasing trend (Chance 2006, “Visual Revelations: Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect”, pp. 49-52). Also, with a full dataset showing a moderate positive correlation, it is possible to choose a subset that shows a negative correlation. Given these, even if you have a legitimate reason for doing what you propose, you are giving any skeptics a lot of arguments to use against any conclusions that you come up with.
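To make the concern concrete, here is a rough Python sketch of what such a selection could do to pure noise. It is only an illustration under my own assumptions: the greedy drop-one-point rule, the helper names `pearson` and `greedy_subset`, the random seed, and the simulated data are all made up for this example, and the greedy rule is a heuristic that is not guaranteed to find the best possible subset of 40.

```python
import math
import random


def pearson(points):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


def greedy_subset(points, k):
    """Hypothetical helper: repeatedly drop the single point whose removal
    leaves the highest correlation, until only k points remain."""
    kept = list(points)
    while len(kept) > k:
        drop = max(
            range(len(kept)),
            key=lambda i: pearson(kept[:i] + kept[i + 1:]),
        )
        del kept[drop]
    return kept


# Assumption: 100 simulated points where x and y are independent noise.
rng = random.Random(0)
noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]

subset = greedy_subset(noise, 40)
r_full = pearson(noise)      # correlation of all 100 points
r_subset = pearson(subset)   # correlation of the hand-picked 40

# Any two points that differ in both x and y are perfectly correlated.
r_pair_up = pearson([(0.0, 3.0), (1.0, 5.0)])
r_pair_down = pearson([(0.0, 5.0), (1.0, 3.0)])

print(f"all 100 points: r = {r_full:.2f}")
print(f"selected 40:    r = {r_subset:.2f}")
print(f"two points:     r = {r_pair_up:.0f} and r = {r_pair_down:.0f}")
```

Comparing `r_full` with `r_subset` shows how much apparent correlation the selection step alone can manufacture from data that have no real relationship, which is exactly the argument a skeptic would raise.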

Source: Link, Question Author: Julie, Answer Author: Greg Snow