Suppose I have a sample dataset of 5000 points with many features, and I have to generate a dataset of, say, 1 million data points from it. This is like oversampling the sample data to generate many synthetic out-of-sample data points, where the out-of-sample data must reflect the distributions of the sample data. The data here is telecom-type data: various usage measurements per user. Are there any techniques available for this? Can SMOTE be applied to this problem?

**Answer**

I am trying to answer my own question after doing a few initial experiments. I tried the SMOTE technique to generate new synthetic samples, and the results are encouraging: it generates synthetic data with characteristics very similar to those of the sample data. The code is from http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 by Karsten Jeschkies, and (updated for Python 3 and current scikit-learn) is as below:

```
import numpy as np
from random import choice
from sklearn.neighbors import NearestNeighbors

def SMOTE(T, N, k):
    """
    Returns (N/100) * n_minority_samples synthetic minority samples.

    Parameters
    ----------
    T : array-like, shape = [n_minority_samples, n_features]
        Holds the minority samples.
    N : int
        Percentage of new synthetic samples:
        n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    k : int
        Number of nearest neighbours.

    Returns
    -------
    S : array, shape = [(N/100) * n_minority_samples, n_features]
    """
    T = np.asarray(T)
    n_minority_samples, n_features = T.shape

    if N < 100:
        # Create synthetic samples only for a subset of T.
        # TODO: select random minority samples
        N = 100
    if (N % 100) != 0:
        raise ValueError("N must be < 100 or a multiple of 100")
    N = N // 100  # integer division; N / 100 would give a float in Python 3

    n_synthetic_samples = N * n_minority_samples
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in range(n_minority_samples):
        # kneighbors expects a 2-D array, hence the reshape
        nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)
        for n in range(N):
            nn_index = choice(nn[0])
            # NOTE: nn includes T[i] itself; we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])
            # Interpolate at a random point between T[i] and its neighbour
            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i, :] + gap * dif
    return S
```
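For reference, the same interpolation step can also be written in vectorized NumPy, which scales better when generating many samples. This is a minimal sketch, not the author's code; the exponential stand-in data and all parameter values (`k = 5`, two synthetic points per sample, i.e. N=200) are assumptions for illustration:

```
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-in for the telecom usage data: 500 points, 2 features
T = rng.exponential(scale=300.0, size=(500, 2))

k = 5
# Ask for k+1 neighbours because each point's nearest neighbour is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(T)
idx = nn.kneighbors(T, return_distance=False)[:, 1:]  # drop the self-column

n_new = 2  # synthetic points per original sample (equivalent to N=200)
rows = np.repeat(np.arange(len(T)), n_new)            # each sample repeated n_new times
picks = idx[rows, rng.integers(0, k, size=len(rows))] # one random neighbour per row
gaps = rng.random((len(rows), 1))                     # random interpolation factors
S = T[rows] + gaps * (T[picks] - T[rows])             # interpolate along each segment
print(S.shape)  # (1000, 2)
```

Because every synthetic point is a convex combination of two original points, the generated data stays inside the convex hull of the sample.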

I got the following results with a small dataset of 4999 samples having 2 features.

Description (summary statistics) of the sample (small) data:

```
After Before
count 4999.000000 4999.000000
mean 350.577866 391.757958
std 566.065273 693.179718
min 0.000000 0.000000
25% 52.975000 93.991500
50% 183.388000 226.027000
75% 414.599000 453.261167
max 10980.004000 27028.158333
```
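A summary table like the one above can be produced with pandas' `describe()`. A sketch below; the column names `After` and `Before` come from the table, but the exponential stand-in data and its scale parameters are assumptions, not the real telecom data:

```
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for the two usage features; real data would be loaded instead
df = pd.DataFrame({
    "After": rng.exponential(scale=350.0, size=4999),
    "Before": rng.exponential(scale=390.0, size=4999),
})

# count / mean / std / min / quartiles / max, as in the table above
print(df.describe())
```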

Histogram is as follows *(image not reproduced here)*.

Scatter plot to see the joint distribution is as follows *(image not reproduced here)*.

After using the SMOTE technique to generate twice the number of samples, I get the following:

```
After Before
count 9998.000000 9998.000000
mean 350.042946 389.020419
std 556.334086 652.886148
min 0.000000 0.000000
25% 53.074959 94.885295
50% 184.067407 226.802912
75% 414.955448 454.008691
max 10685.308012 26688.626042
```

Histogram is as follows *(image not reproduced here)*.
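As another technique worth trying for this kind of distribution-matching oversampling, kernel density estimation can fit the sample and then draw arbitrarily many new points from the fitted density. A sketch using scikit-learn's `KernelDensity`; the stand-in data and the bandwidth value are assumptions that would need tuning (e.g. by cross-validation) on real data:

```
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Stand-in for the 4999-point, 2-feature sample data
T = rng.exponential(scale=300.0, size=(4999, 2))

# Fit a Gaussian KDE; bandwidth controls how much the samples are smoothed
kde = KernelDensity(kernel="gaussian", bandwidth=10.0).fit(T)

# Draw 1 million synthetic points from the estimated density
new_points = kde.sample(n_samples=1_000_000, random_state=0)
print(new_points.shape)  # (1000000, 2)
```

One caveat: a Gaussian KDE can produce negative values, so for non-negative usage data the samples may need clipping or a log transform before fitting.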

**Attribution**
*Source : Link , Question Author : prashanth , Answer Author : prashanth*