I have a sample dataset of 5000 points with many features, and I need to generate a dataset of, say, 1 million data points from it. This is like oversampling the sample data to produce many synthetic out-of-sample points, where the out-of-sample data must reflect the distributions of the sample data. The data is of the telecom type, with various usage data from users. Are there any techniques available for this? Can SMOTE be applied to this problem?
I am trying to answer my own question after a few initial experiments. I tried the SMOTE technique to generate new synthetic samples, and the results are encouraging: it generates synthetic data whose characteristics are very similar to those of the sample data. The code, from http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 by Karsten Jeschkies, is as follows:
import numpy as np
from random import choice
from sklearn.neighbors import NearestNeighbors

def SMOTE(T, N, k):
    """
    Returns (N/100) * n_minority_samples synthetic minority samples.

    Parameters
    ----------
    T : array-like, shape = [n_minority_samples, n_features]
        Holds the minority samples.
    N : percentage of new synthetic samples:
        n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    k : int
        Number of nearest neighbours.

    Returns
    -------
    S : array, shape = [(N/100) * n_minority_samples, n_features]
    """
    n_minority_samples, n_features = T.shape

    if N < 100:
        # Create synthetic samples only for a subset of T.
        # TODO: select random minority samples
        N = 100

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or a multiple of 100")

    N = N // 100  # integer division
    n_synthetic_samples = N * n_minority_samples
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in range(n_minority_samples):
        nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)[0]
        for n in range(N):
            nn_index = choice(nn)
            # NOTE: nn includes T[i]; we don't want to select it
            while nn_index == i:
                nn_index = choice(nn)

            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i, :] + gap * dif[:]

    return S
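To sanity-check the idea independently, here is a minimal, self-contained sketch of the same interpolation scheme in pure NumPy (the function name `smote_like` and the toy exponential data are my own; the real telecom sample would be used in practice). Each synthetic point is a random step from a randomly chosen base point toward one of its k nearest neighbours:

```python
import numpy as np

def smote_like(T, n_new, k=5, seed=0):
    """SMOTE-style interpolation: pick a base point at random, then step a
    random fraction of the way toward one of its k nearest neighbours."""
    rng = np.random.RandomState(seed)
    T = np.asarray(T, dtype=float)
    base = rng.randint(0, len(T), size=n_new)          # base rows, with replacement
    # Pairwise distances from each chosen base point to every sample
    d = np.linalg.norm(T[base][:, None, :] - T[None, :, :], axis=2)
    d[np.arange(n_new), base] = np.inf                 # never pick the point itself
    nn = np.argsort(d, axis=1)[:, :k]                  # k nearest neighbour indices
    pick = nn[np.arange(n_new), rng.randint(0, k, size=n_new)]
    gap = rng.random_sample((n_new, 1))                # interpolation fraction in [0, 1)
    return T[base] + gap * (T[pick] - T[base])

# Toy stand-in data with a long right tail, loosely like usage data
T = np.random.RandomState(1).exponential(300.0, size=(500, 2))
S = smote_like(T, n_new=1000)
print(S.shape)  # (1000, 2)
```

Because every synthetic point lies on a segment between two real points, each coordinate stays inside the observed range, which is consistent with the slightly smaller max values seen after oversampling.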
I got the following results with a small dataset of 4999 samples with 2 features.
Description of the small sample data:
               After         Before
count    4999.000000    4999.000000
mean      350.577866     391.757958
std       566.065273     693.179718
min         0.000000       0.000000
25%        52.975000      93.991500
50%       183.388000     226.027000
75%       414.599000     453.261167
max     10980.004000   27028.158333
The histogram is as follows:
After using the SMOTE technique to generate twice the number of samples, I get the following:
               After         Before
count    9998.000000    9998.000000
mean      350.042946     389.020419
std       556.334086     652.886148
min         0.000000       0.000000
25%        53.074959      94.885295
50%       184.067407     226.802912
75%       414.955448     454.008691
max     10685.308012   26688.626042
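The summary tables above look like the output of pandas' DataFrame.describe(). As a sketch of how such a before/after comparison can be produced (the column names "After"/"Before" follow the tables above, but the data here is randomly generated for illustration, not the real telecom sample):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Toy stand-in for the original sample: two skewed usage columns
original = pd.DataFrame({
    "After": rng.exponential(350.0, size=4999),
    "Before": rng.exponential(390.0, size=4999),
})

# Stand-in for the SMOTE output; here we simply resample with replacement
synthetic = original.sample(n=4999, replace=True, random_state=1).reset_index(drop=True)

combined = pd.concat([original, synthetic], ignore_index=True)
print(combined.describe())  # count row shows 9998 for each column
```

Comparing `original.describe()` with `combined.describe()` gives exactly the kind of side-by-side statistics shown above.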
The histogram is as follows: