(I have no real idea what to tag this with because I’m no statistician and I don’t know what field this falls into. Feel free to add more suitable tags.)
I work for a company that produces data analysis software, and we need a decent set of data to test and demo our latest product with. We can’t just fill the database with the output of a random number generator because the program’s outputs would become nonsensical. One of the simplest ways to get such data is from a client; we have a large body of data from a trial we ran. Now, obviously we can’t publish a client’s actual data, so we need to alter it a bit, but we still need it to behave like real data.
The aim here is to take their set of data, and apply a “fuzz” to it so that it can’t be recognised as specifically theirs. My memory of statistical theory is itself a little fuzzy, so I’d like to run this by you guys:
Essentially, the data we have (from the client) is itself a sample of all the data that exists (in the country, or the world). What I’d like to know is what type of operations can be applied to make the sample no longer strongly representative of the client’s sample population, while still keeping it roughly representative of the world’s population.
For reference, as far as we’re aware the data we have generally follows a rough normal (Gaussian) distribution.
The original dataset isn’t widely available, but could theoretically be recognised from some regionally-specific characteristics (we don’t know what those characteristics are, and it’s doubtful whether anyone does to a sufficient level, but we know that variations exist from place to place). Anyways, I’m more interested in the theory of this than the practice – I want to know whether an operation makes it impossible (or at least difficult) to identify the source dataset by parameter X, whether or not anyone has or could work out parameter X in the first place.
The approach I’ve come up with is to separate the readings into their various types (without giving too much away, let’s say a group might be “length” or “time taken to do X”). For each group, calculate the standard deviation. Then, to each value, add a random value drawn from between -(n * stddev) and +(n * stddev), where n is some fraction I can use to tune the result until the data is sufficiently “fuzzed”. I didn’t want to simply apply a static range (say, a random value between 90% and 110% of the original) because some measurements vary much more than others – in some, being 10% over the mean is barely noticeable, but in others it makes you a serious outlier.
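The per-group, stddev-scaled noise described above can be sketched as follows (a minimal sketch in numpy; the function name `fuzz` and the tuning fraction `n` are just illustrative labels for the scheme in the question):

```python
import numpy as np

rng = np.random.default_rng(42)

def fuzz(values, n=0.25, rng=rng):
    """Add uniform noise scaled by the group's standard deviation.

    `n` is the tuning fraction from the question (name is illustrative);
    each value gets a random offset drawn from [-n * stddev, +n * stddev].
    """
    values = np.asarray(values, dtype=float)
    stddev = values.std(ddof=1)  # sample standard deviation of the group
    noise = rng.uniform(-n * stddev, n * stddev, size=values.shape)
    return values + noise

# Example: fuzz a hypothetical "length" column
lengths = rng.normal(loc=170.0, scale=8.0, size=1000)
fuzzed = fuzz(lengths, n=0.25)
```

Because the noise is symmetric around zero, the group mean is roughly preserved while individual values drift; turning `n` up widens the drift relative to that group’s natural spread.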
Is this sufficient to mask the source of the original data? If not, by which statistical measures would the data still be identifiable, and how would I mask those while still keeping the resultant data vaguely realistic?
Here are some suggestions:
- Convert it to dimensionless form. If it goes from 0 to 1 and doesn’t have units like furlongs per fortnight or tons of coal attached then it is harder to recognize.
- Add a small random number to it. When you convolve a Gaussian with a Gaussian, you just get another Gaussian, so it doesn’t change the essence of the data, but moving away from the exact values stops someone from Googling the numbers to try and figure out what they are.
- I like the idea of rotating it. You could take a lag of some number of time steps to create a 2-D data set from the 1-D data set, then use PCA or SVD (after centering and scaling) to determine a rotation. Once the data is rotated appropriately, you have changed the variance and confounded the information with itself. You can report one of the rotated coordinate axes as the “sample data”.
- You could mix it with strongly formed data from some other source. So if your sample data is stock market data, you could add perturbations based on the weather, or on the pitch variations in your favorite Beatles soundtrack. Whether or not people can make sense of Nasdaq, they will have trouble making sense of Nasdaq + Beatles.
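The lag-and-rotate suggestion can be sketched like this (a minimal numpy sketch, assuming a 1-D time series; `rotate_series` and the toy random-walk input are illustrative, and the centering/scaling step also covers the dimensionless-form suggestion):

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_series(series, lag=1):
    """Pair the series with a lagged copy of itself, center and scale,
    rotate onto its principal axes via SVD, and report one rotated axis."""
    x = np.asarray(series, dtype=float)
    pairs = np.column_stack([x[:-lag], x[lag:]])  # 2-D lag embedding
    pairs -= pairs.mean(axis=0)                   # center each column
    pairs /= pairs.std(axis=0, ddof=1)            # scale to unit variance
    _, _, vt = np.linalg.svd(pairs, full_matrices=False)
    rotated = pairs @ vt.T                        # rotate onto principal axes
    return rotated[:, 0]                          # one coordinate as "sample data"

series = np.cumsum(rng.normal(size=500))  # toy random walk standing in for client data
disguised = rotate_series(series, lag=1)
```

Each output value is a linear mix of two original values, dimensionless and zero-mean, so the original scale and units are gone even though the broad temporal character survives.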