Since you didn't provide the dataset, I'm using mock data generated with make_blobs. Your question doesn't make clear how many test samples there should be, so I've set test_samples = 50000; change this value to fit your needs.
    from sklearn import datasets

    train_samples = 5000
    test_samples = 50000
    # The total must cover both the train and the test samples
    total_samples = train_samples + test_samples
    X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)
The following snippet splits data into train and test with balanced classes:
    from sklearn.model_selection import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(train_size=train_samples, n_splits=1,
                                 test_size=test_samples, random_state=0)

    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
Demo:
    In [54]: from scipy import stats

    In [55]: stats.itemfreq(y_train)
    Out[55]:
    array([[   0, 2500],
           [   1, 2500]], dtype=int64)

    In [56]: stats.itemfreq(y_test)
    Out[56]:
    array([[    0, 25000],
           [    1, 25000]], dtype=int64)
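Note that stats.itemfreq is no longer available in recent SciPy releases. A minimal sketch of an equivalent check using np.unique with return_counts=True follows; the helper name class_counts and the toy labels are made up for illustration:

```python
import numpy as np

def class_counts(labels):
    # Hypothetical helper: returns rows of [label, count],
    # the same two-column layout that itemfreq produced
    values, counts = np.unique(labels, return_counts=True)
    return np.column_stack((values, counts))

y_demo = np.array([0, 1, 1, 0, 1])  # toy labels, not the data from this answer
print(class_counts(y_demo))
```

Calling class_counts(y_train) and class_counts(y_test) should reproduce the tables shown in the demo.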
EDIT
As @geompalik correctly pointed out, if your dataset is unbalanced, StratifiedShuffleSplit won't yield balanced splits. In that case you might find this function useful:
    import numpy as np

    def stratified_split(y, train_ratio):
        # Split the indices of one class: the first n_train go to train, the rest to test
        def split_class(y, label, train_ratio):
            indices = np.flatnonzero(y == label)
            n_train = int(indices.size*train_ratio)
            train_index = indices[:n_train]
            test_index = indices[n_train:]
            return (train_index, test_index)

        # Apply the same ratio to every class, then merge the per-class indices
        idx = [split_class(y, label, train_ratio) for label in np.unique(y)]
        train_index = np.concatenate([train for train, _ in idx])
        test_index = np.concatenate([test for _, test in idx])
        return train_index, test_index
Demo:
I previously generated mock data with the number of samples per class you indicated (code not shown here).
    In [153]: y
    Out[153]: array([1, 0, 1, ..., 0, 0, 1])

    In [154]: y.size
    Out[154]: 55000

    In [155]: train_ratio = float(train_samples)/(train_samples + test_samples)

    In [156]: train_ratio
    Out[156]: 0.09090909090909091

    In [157]: train_index, test_index = stratified_split(y, train_ratio)

    In [158]: y_train = y[train_index]

    In [159]: y_test = y[test_index]

    In [160]: y_train.size
    Out[160]: 5000

    In [161]: y_test.size
    Out[161]: 50000

    In [162]: stats.itemfreq(y_train)
    Out[162]:
    array([[   0, 2438],
           [   1, 2562]], dtype=int64)

    In [163]: stats.itemfreq(y_test)
    Out[163]:
    array([[    0, 24380],
           [    1, 25620]], dtype=int64)
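One thing to keep in mind: stratified_split takes the first n_train indices of each class in their original order, so any ordering present in the data carries over into the split. If that matters for your data, a shuffled variant could look roughly like the sketch below. The name stratified_split_shuffled, the seed parameter and the toy labels are assumptions for illustration, not part of the function above:

```python
import numpy as np

def stratified_split_shuffled(y, train_ratio, seed=0):
    # Hypothetical variant of stratified_split: shuffles each class before cutting it
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        # Randomly permute the indices belonging to this class
        indices = rng.permutation(np.flatnonzero(y == label))
        n_train = int(indices.size * train_ratio)
        train_parts.append(indices[:n_train])
        test_parts.append(indices[n_train:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Toy labels (60 of class 0, 40 of class 1), not the data from this answer
y_demo = np.array([0] * 60 + [1] * 40)
train_index, test_index = stratified_split_shuffled(y_demo, 0.5)
print(np.bincount(y_demo[train_index]), np.bincount(y_demo[test_index]))
```

The per-class counts are the same as with the unshuffled version; only which samples end up in each part changes with the seed.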