I have a highly imbalanced test set: 100 positive cases and 1500 negative cases. On the training side, I have a larger candidate pool: 1200 positive cases and 12000 negative cases. For this kind of scenario, I see two options:
1) Use a weighted SVM on the whole training set (P: 1200, N: 12000).
2) Use an SVM on a sampled training set (P: 1200, N: 1200), where the 1200 negative cases are sampled from the 12000 available.
Is there any theoretical guidance on deciding which approach is better? Since the test data set is highly imbalanced, should I use the imbalanced training set as well?
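To make the two options concrete, here is a minimal sketch in plain Python using the counts above. The helper names (`balanced_class_weights`, `undersample_negatives`) are hypothetical, and the inverse-frequency weighting is only one common heuristic, not the only way to weight an SVM. The integer IDs stand in for real training cases.

```python
import random

def balanced_class_weights(counts):
    """Inverse-frequency weights: n_total / (n_classes * n_class).

    This is an assumed heuristic; other weighting schemes are possible.
    """
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {label: n_total / (n_classes * n) for label, n in counts.items()}

def undersample_negatives(positives, negatives, seed=0):
    """Draw as many negatives as there are positives, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return positives, rng.sample(negatives, k=len(positives))

# Option 1: keep all 13200 training cases and weight the classes.
train_counts = {"P": 1200, "N": 12000}
weights = balanced_class_weights(train_counts)

# Option 2: keep all 1200 positives and sample 1200 of the 12000 negatives.
# Placeholder IDs: 0..1199 are positives, 1200..13199 are negatives.
pos_ids = list(range(1200))
neg_ids = list(range(1200, 13200))
pos_kept, neg_sample = undersample_negatives(pos_ids, neg_ids)

print(weights)
print(len(pos_kept), len(neg_sample))
```

With these counts the positive class gets ten times the weight of the negative class under option 1, while option 2 discards 10800 negative cases. If you use scikit-learn, a weight dictionary like this can be passed to `SVC` through its `class_weight` parameter.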
The reply by datapraxis in a recent post on Reddit may be of interest.