Training and test sets

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

Why split?

Helps detect overfitting
Evaluate performance on a holdout set
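
A minimal sketch of why a holdout set is useful. The synthetic dataset from make_classification and the DecisionTreeClassifier are assumptions chosen for illustration, not part of the course material. An unpruned tree can memorize noisy training data, so its training accuracy says little about how it handles unseen rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption, for illustration only)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# Hold out part of the data; the model never sees it during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned decision tree keeps splitting until it fits the training rows
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A gap between the two scores is the signal to look for: the training score alone would hide it.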

Splitting up your dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

   X_train y_train
0      1.0       n
1      4.0       n
       ...
5      5.0       n
6      6.0       n

   X_test y_test
0     9.0      y
1     1.0      n
2     4.0      n
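
A small runnable sketch of the same call. The ten-row toy data below is hypothetical and is not the dataset behind the tables above. It shows the sizes that test_size=0.2 produces on ten rows and that features and labels stay aligned after the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature column and a y/n label
X = pd.DataFrame({"feature": [1.0, 4.0, 2.0, 7.0, 3.0, 5.0, 6.0, 9.0, 1.0, 4.0]})
y = pd.Series(["n", "n", "n", "y", "n", "n", "n", "y", "n", "n"], name="label")

# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)

# Rows keep their original index labels, so X and y remain paired
aligned = (X_train.index == y_train.index).all() and (X_test.index == y_test.index).all()
print(aligned)  # True
```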

Stratified sampling

Dataset of 100 samples: 80 class 1 and 20 class 2
Training set of 75 samples: 60 class 1 and 15 class 2
Test set of 25 samples: 20 class 1 and 5 class 2
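
The point of these numbers is that the class ratio is the same in every subset. A quick check of that arithmetic, using only the counts listed above:

```python
from fractions import Fraction

# (class 1 count, class 2 count) for each subset in the example
splits = {"dataset": (80, 20), "training set": (60, 15), "test set": (20, 5)}

# Share of class 1 in each subset, kept as an exact fraction
shares = {name: Fraction(c1, c1 + c2) for name, (c1, c2) in splits.items()}

for name, share in shares.items():
    print(f"{name}: class 1 share = {float(share):.0%}")
```

Each subset comes out at 80% class 1, which is what a stratified split is meant to preserve.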

Stratified sampling

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

y["labels"].value_counts()

class1    80
class2    20
Name: labels, dtype: int64

Stratified sampling

y_train["labels"].value_counts()

class1    60
class2    15
Name: labels, dtype: int64

y_test["labels"].value_counts()

class1    20
class2     5
Name: labels, dtype: int64
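
A self-contained sketch that reproduces these counts. The X and y frames are hypothetical stand-ins built to match the 80/20 class balance, and the label column is passed to stratify explicitly as a Series.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data matching the class balance above: 80 class1, 20 class2
y = pd.DataFrame({"labels": ["class1"] * 80 + ["class2"] * 20})
X = pd.DataFrame({"feature": range(100)})

# Default test_size is 0.25, so 100 rows split into 75 train and 25 test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y["labels"], random_state=42
)

train_counts = y_train["labels"].value_counts()
test_counts = y_test["labels"].value_counts()
print(train_counts)
print(test_counts)
```

Dropping the stratify argument still gives 75 and 25 rows, but the per-class counts then depend on the shuffle rather than on the original proportions.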

Let's practice!
