Generate synthetic data to match sample data

Suppose I have a sample dataset of 5000 points with many features, and I need to generate a dataset of, say, 1 million data points from that sample. It is like oversampling the sample data to generate many synthetic out-of-sample data points. The out-of-sample data must reflect the distributions satisfied by the sample data. The data here is telecom usage data, with various usage measurements per user. Are there any techniques available for this? Can SMOTE be applied to this problem?

Answer

I am trying to answer my own question after doing a few initial experiments. I tried the SMOTE technique to generate new synthetic samples, and the results are encouraging: it generates synthetic data whose characteristics are very similar to those of the sample data. The code below is from http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 by Karsten Jeschkies:

import numpy as np
from random import choice
from sklearn.neighbors import NearestNeighbors

def SMOTE(T, N, k):
    """
    Returns (N/100) * n_minority_samples synthetic minority samples.

    Parameters
    ----------
    T : array-like, shape = [n_minority_samples, n_features]
        Holds the minority samples
    N : percentage of new synthetic samples:
        n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    k : int. Number of nearest neighbours.

    Returns
    -------
    S : array, shape = [(N/100) * n_minority_samples, n_features]
    """
    n_minority_samples, n_features = T.shape

    if N < 100:
        # Create synthetic samples only for a subset of T.
        # TODO: select random minority samples
        N = 100

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or a multiple of 100")

    N = N // 100  # number of synthetic samples per original sample
    n_synthetic_samples = N * n_minority_samples
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in range(n_minority_samples):
        # kneighbors expects a 2-D array, hence the reshape
        nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)
        for n in range(N):
            nn_index = choice(nn[0])
            # NOTE: nn includes T[i] itself; we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])

            # Interpolate at a random point on the segment between
            # T[i] and the chosen neighbour
            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i, :] + gap * dif[:]

    return S
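
As a quick sanity check, here is a minimal usage sketch. The two-feature array below is a made-up stand-in for the real 4999-row telecom sample, and the parameter values N=100 and k=5 are arbitrary choices:

import numpy as np

# Hypothetical stand-in for the telecom sample: two skewed usage features.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=[350.0, 390.0], size=(4999, 2))

# N=100 generates one synthetic point per original point; stacking the
# synthetic points onto the originals doubles the dataset.
synthetic = SMOTE(sample, N=100, k=5)
augmented = np.vstack([sample, synthetic])
print(augmented.shape)  # (9998, 2)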

I got the following results with a small dataset of 4999 samples having 2 features.

Summary statistics of the original sample (the two features are labelled After and Before):

          After        Before
count   4999.000000   4999.000000
mean     350.577866    391.757958
std      566.065273    693.179718
min        0.000000      0.000000
25%       52.975000     93.991500
50%      183.388000    226.027000
75%      414.599000    453.261167
max    10980.004000  27028.158333

Histogram is as follows:

[histogram of the two features]

Scatter plot to see the joint distribution is as follows:

[scatter plot of the joint distribution]

After using the SMOTE technique to generate twice the number of samples, I get the following:

          After        Before
count   9998.000000   9998.000000
mean     350.042946    389.020419
std      556.334086    652.886148
min        0.000000      0.000000
25%       53.074959     94.885295
50%      184.067407    226.802912
75%      414.955448    454.008691
max    10685.308012  26688.626042

Histogram is as follows:

[histogram of the two features after SMOTE]

Scatter plot to see the joint distribution is as follows:

[scatter plot of the joint distribution after SMOTE]
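
For reference, the summary tables above are in the style of pandas' DataFrame.describe(), and the plots can be reproduced with matplotlib. A rough sketch of such a before/after comparison, reusing sample and synthetic from the earlier sketch (the compare helper is my own):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def compare(original, synthetic, columns=("After", "Before")):
    # Summary statistics before and after augmentation
    orig_df = pd.DataFrame(original, columns=list(columns))
    aug_df = pd.DataFrame(np.vstack([original, synthetic]),
                          columns=list(columns))
    print("Original sample:\n", orig_df.describe())
    print("Original + synthetic:\n", aug_df.describe())

    # Histograms of each feature and a scatter plot of the joint distribution
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, col in zip(axes[:2], columns):
        ax.hist(aug_df[col], bins=50)
        ax.set_title(col)
    axes[2].scatter(aug_df[columns[0]], aug_df[columns[1]], s=2)
    axes[2].set_xlabel(columns[0])
    axes[2].set_ylabel(columns[1])
    plt.show()

compare(sample, synthetic)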

Attribution
Source: Link, Question Author: prashanth, Answer Author: prashanth
