12

I have a dataset with binary class labels. I want to extract samples with balanced classes from my data set. Code I have written below gives me imbalanced dataset.

    sss = StratifiedShuffleSplit(train_size=5000, n_splits=1, test_size=50000, random_state=0)
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(itemfreq(y_train))

As you can see, class 0 has 2438 samples and class 1 has 2562.

    [[  0.00000000e+00   2.43800000e+03]
     [  1.00000000e+00   2.56200000e+03]]

How should I proceed to get 2500 samples each of class 0 and class 1 in my training set (and 25000 of each in the test set)?

1
  • What is the actual size of your X? Commented Mar 7, 2017 at 12:54

4 Answers

8

As you didn't provide us with the dataset, I'm using mock data generated by means of make_blobs. It remains unclear from your question how many test samples there should be. I've defined test_samples = 50000 but you can change this value to fit your needs.

    from sklearn import datasets

    train_samples = 5000
    test_samples = 50000
    total_samples = train_samples + test_samples
    X, y = datasets.make_blobs(n_samples=total_samples, centers=2, random_state=0)

The following snippet splits data into train and test with balanced classes:

    from sklearn.model_selection import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(train_size=train_samples, n_splits=1,
                                 test_size=test_samples, random_state=0)

    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

Demo:

    In [54]: from scipy import stats

    In [55]: stats.itemfreq(y_train)
    Out[55]:
    array([[   0, 2500],
           [   1, 2500]], dtype=int64)

    In [56]: stats.itemfreq(y_test)
    Out[56]:
    array([[    0, 25000],
           [    1, 25000]], dtype=int64)
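A side note on the demo: `stats.itemfreq` has been removed from recent SciPy releases, so the session above may not run on a current install. A minimal sketch of the same count using `np.unique` with `return_counts=True` follows; the small `y_train` array here is made up for illustration and stands in for the real labels.

```python
import numpy as np

# Hypothetical label vector standing in for y_train.
y_train = np.array([0, 1, 1, 0, 1, 0, 0, 1])

# Unique labels and the number of times each one occurs.
labels, counts = np.unique(y_train, return_counts=True)

# Collect the result as {label: count} for easy inspection.
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print(class_counts)
```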

EDIT

As @geompalik correctly pointed out, if your dataset is unbalanced StratifiedShuffleSplit won't yield balanced splits. In that case you might find this function useful:

    import numpy as np

    def stratified_split(y, train_ratio):
        """Split indices so each class sends train_ratio of its samples to train."""

        def split_class(y, label, train_ratio):
            # Positions of all samples carrying this label
            indices = np.flatnonzero(y == label)
            n_train = int(indices.size * train_ratio)
            train_index = indices[:n_train]
            test_index = indices[n_train:]
            return (train_index, test_index)

        # One (train, test) pair of index arrays per class
        idx = [split_class(y, label, train_ratio) for label in np.unique(y)]
        train_index = np.concatenate([train for train, _ in idx])
        test_index = np.concatenate([test for _, test in idx])
        return train_index, test_index

Demo:

I have previously generated mock data with the number of samples per class you indicated (code not shown here).

    In [153]: y
    Out[153]: array([1, 0, 1, ..., 0, 0, 1])

    In [154]: y.size
    Out[154]: 55000

    In [155]: train_ratio = float(train_samples)/(train_samples + test_samples)

    In [156]: train_ratio
    Out[156]: 0.09090909090909091

    In [157]: train_index, test_index = stratified_split(y, train_ratio)

    In [158]: y_train = y[train_index]

    In [159]: y_test = y[test_index]

    In [160]: y_train.size
    Out[160]: 5000

    In [161]: y_test.size
    Out[161]: 50000

    In [162]: stats.itemfreq(y_train)
    Out[162]:
    array([[   0, 2438],
           [   1, 2562]], dtype=int64)

    In [163]: stats.itemfreq(y_test)
    Out[163]:
    array([[    0, 24380],
           [    1, 25620]], dtype=int64)
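One caveat: `stratified_split` takes the first `n_train` indices of each class in stored order, so any ordering in the data carries over into the split. Below is a minimal sketch of a shuffled variant. The name `stratified_split_shuffled` and the `random_state` parameter are my own additions for illustration. Like the original, it preserves the class proportions rather than balancing them.

```python
import numpy as np

def stratified_split_shuffled(y, train_ratio, random_state=0):
    """Per-class split like stratified_split, but shuffles indices first."""
    rng = np.random.default_rng(random_state)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        # Shuffle the positions of this class before cutting them in two.
        indices = rng.permutation(np.flatnonzero(y == label))
        n_train = int(indices.size * train_ratio)
        train_parts.append(indices[:n_train])
        test_parts.append(indices[n_train:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Hypothetical labels sorted by class: 60 zeros followed by 40 ones.
y = np.array([0] * 60 + [1] * 40)
train_index, test_index = stratified_split_shuffled(y, train_ratio=0.25)

# Per-class counts in each part; the 60/40 ratio carries over to both.
print(np.bincount(y[train_index]), np.bincount(y[test_index]))
```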

2 Comments

My dataset is imbalanced. How do I obtain balanced classes from an imbalanced dataset?
The question is slightly different. This strategy would not achieve balanced splits in an unbalanced dataset by definition.
3

The problem is that the StratifiedShuffleSplit method you use preserves the class percentages in each split by definition (that is what stratification means).

A straightforward way to achieve what you want while still using StratifiedShuffleSplit is to subsample the dominant class first, so that the initial dataset is balanced, and then split as before. This is easy to accomplish with numpy. That said, the splits you describe are already almost balanced.
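A minimal sketch of that idea, assuming numpy arrays `X` and `y`: every class is randomly cut down to the size of the rarest one, and the balanced subset is then passed to StratifiedShuffleSplit. The helper name `undersample_to_balance`, the mock data and the split sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def undersample_to_balance(y, random_state=0):
    """Return sorted indices keeping an equal number of samples per class."""
    rng = np.random.default_rng(random_state)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the rarest class
    keep = [rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
            for label in labels]
    return np.sort(np.concatenate(keep))

# Hypothetical imbalanced data: 700 samples of class 0, 300 of class 1.
y = np.array([0] * 700 + [1] * 300)
X = np.random.default_rng(0).normal(size=(y.size, 2))

# Step 1: subsample the dominant class, leaving 300 samples per class.
balanced_idx = undersample_to_balance(y)
X_bal, y_bal = X[balanced_idx], y[balanced_idx]

# Step 2: a stratified split of balanced data yields balanced parts.
sss = StratifiedShuffleSplit(n_splits=1, train_size=100, test_size=500,
                             random_state=0)
train_index, test_index = next(sss.split(X_bal, y_bal))
X_train, X_test = X_bal[train_index], X_bal[test_index]
y_train, y_test = y_bal[train_index], y_bal[test_index]

print(np.bincount(y_train), np.bincount(y_test))
```

Note that undersampling discards data from the dominant class, so the requested train and test sizes together cannot exceed twice the size of the rarest class.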

3

There are many ways to achieve balanced data.

Here is one simple way that doesn't require sklearn.

    positives = []
    negatives = []

    # training_data is an iterable of (text, label) pairs
    for text, label in training_data:
        if label == 1:
            positives.append((text, label))
        else:
            negatives.append((text, label))

    min_rows = min(len(positives), len(negatives))

    # Finally, create a balanced data set using an equal number of positive and negative samples.
    balanced_data = positives[0:min_rows]
    balanced_data.extend(negatives[0:min_rows])

For more advanced techniques, consider checking out imbalanced-learn. It is a library that closely mirrors sklearn in many ways but is specifically focused on dealing with imbalanced data. For example, they provide a bunch of code for undersampling or oversampling your data.

1

Here's a wrapper around pandas.DataFrame.sample that uses the weights parameter to perform the balancing. It works for more than 2 classes and multiple features.

    import pandas as pd

    def pd_sample_balanced(X, y, n_times, random_state=None):
        """ Resample X and y with equal classes in y """
        assert y.shape[0] == X.shape[0]
        assert (y.index == X.index).all()
        c = y.value_counts()
        n_samples = c.max() * c.shape[0] * n_times
        # Weight each row by the inverse frequency of its class, so that every
        # class is equally likely to be drawn. Mapping y through the per-class
        # Series avoids relying on the column names that reset_index()
        # produces, which differ between pandas versions.
        weights = y.map(1 / (c / y.shape[0]))
        X = X.sample(n=n_samples, weights=weights, random_state=random_state, replace=True)
        y = y.loc[X.index]
        X = X.reset_index(drop=True)
        y = y.reset_index(drop=True)
        return X, y

Example usage

    import numpy as np
    import pandas as pd

    y1 = pd.Series([0, 0, 1, 1, 1, 1, 1, 1, 1, 2])
    X1 = pd.DataFrame({"f1": np.arange(len(y1)), "f2": np.arange(len(y1))})
    X2, y2 = pd_sample_balanced(X1, y1, 100)

    print("before, y:")
    print(y1.value_counts())
    print("")
    print("before, X:")
    print(X1.value_counts())
    print("")
    print("after, y:")
    print(y2.value_counts())
    print("")
    print("after, X:")
    print(X2.value_counts())

Output of example

    before, y:
    1    7
    0    2
    2    1
    dtype: int64

    before, X:
    f1  f2
    9   9     1
    8   8     1
    7   7     1
    6   6     1
    5   5     1
    4   4     1
    3   3     1
    2   2     1
    1   1     1
    0   0     1
    dtype: int64

    after, y:
    2    720
    0    691
    1    689
    Name: 0, dtype: int64

    after, X:
    f1  f2
    9   9     720
    1   1     361
    0   0     330
    7   7     110
    6   6     104
    4   4      98
    3   3      98
    8   8      97
    5   5      94
    2   2      88
    dtype: int64
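As the output shows, weighted sampling balances the classes only approximately (720 / 691 / 689). If exactly equal counts are needed, a different technique is to draw a fixed number of rows per class with pandas' group-wise `sample`. This is a sketch under assumptions, not a drop-in replacement for the function above: `n_per_class` is a hypothetical target, and sampling is done with replacement because the rarest class has a single row.

```python
import numpy as np
import pandas as pd

y = pd.Series([0, 0, 1, 1, 1, 1, 1, 1, 1, 2])
X = pd.DataFrame({"f1": np.arange(len(y)), "f2": np.arange(len(y))})

n_per_class = 700  # hypothetical number of rows to draw from each class

# Group the labels by their own value and draw n_per_class index labels
# from every group; replace=True allows drawing the same row repeatedly.
sampled_index = (
    y.groupby(y)
     .sample(n=n_per_class, replace=True, random_state=0)
     .index
)

# Select the sampled rows from both objects, then renumber them.
X_bal = X.loc[sampled_index].reset_index(drop=True)
y_bal = y.loc[sampled_index].reset_index(drop=True)

print(y_bal.value_counts().sort_index())
```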
