Training and test sets

Preprocessing for Machine Learning in Python

James Chapman

Curriculum Manager, DataCamp

Why split?

 

  1. Helps detect overfitting
  2. Evaluates performance on a holdout set
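
As a rough illustration of both points, the sketch below fits a model that can memorize its training data and then compares its accuracy on the training set with its accuracy on a holdout set. The synthetic dataset, the noise level (`flip_y=0.2`), and the choice of `DecisionTreeClassifier` are assumptions made for this example, not part of the course material.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y), so memorizing does not generalize
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=42)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree can fit the training rows almost perfectly
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A large gap between the two scores is the usual sign of overfitting; without the holdout set, only the optimistic training score would be visible.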

Splitting up your dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
   X_train y_train            
0      1.0       n
1      4.0       n
       ...
5      5.0       n
6      6.0       n

   X_test y_test
0     9.0      y
1     1.0      n
2     4.0      n
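
For a version of the split above that runs end to end, here is a minimal sketch. The toy `X` and `y` are hypothetical (100 random rows with a single `feature` column), chosen only so the 80/20 proportions are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one numeric feature, a binary label
rng = np.random.default_rng(42)
X = pd.DataFrame({"feature": rng.normal(size=100)})
y = pd.Series(rng.choice(["n", "y"], size=100), name="labels")

# test_size=0.2 sends 20% of the rows to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)

# Features and labels are shuffled together, so their indexes still match
print(X_train.index.equals(y_train.index))
```

Setting `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.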

Stratified sampling

 

  • Dataset of $100$ samples: $80$ class 1 and $20$ class 2
  • Training set of $75$ samples: $60$ class 1 and $15$ class 2
  • Test set of $25$ samples: $20$ class 1 and $5$ class 2
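
The counts above follow from applying the same split fraction to each class separately. The sketch below reproduces the arithmetic, assuming a test fraction of 0.25 (the default used by `train_test_split` when no size is given); the `class_counts` dictionary is a hypothetical stand-in for the dataset.

```python
# Hypothetical class counts matching the example: 80 of class 1, 20 of class 2
class_counts = {"class1": 80, "class2": 20}
test_size = 0.25  # assumed split fraction (train_test_split's default)

split = {}
for label, n in class_counts.items():
    n_test = round(n * test_size)   # this class's share of the test set
    n_train = n - n_test            # the remainder goes to training
    split[label] = (n_train, n_test)
    print(label, "train:", n_train, "test:", n_test)
```

Both sets keep the original 80:20 class ratio. When the per-class products are not whole numbers, scikit-learn has its own allocation rule, so real counts may differ by one from simple rounding.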

Stratified sampling

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
y["labels"].value_counts()
class1    80
class2    20
Name: labels, dtype: int64

Stratified sampling

y_train["labels"].value_counts()
class1    60
class2    15
Name: labels, dtype: int64
y_test["labels"].value_counts()
class1    20
class2     5
Name: labels, dtype: int64
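
To reproduce these counts, here is a self-contained sketch. The `X` and `y` DataFrames are hypothetical stand-ins built to match the 80/20 class balance shown above, and the label column is passed to `stratify` directly as a Series.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data matching the example: 80 rows of class1, 20 of class2
y = pd.DataFrame({"labels": ["class1"] * 80 + ["class2"] * 20})
X = pd.DataFrame({"feature": np.arange(100)})

# No test_size is given, so the default 75/25 split applies;
# stratify keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y["labels"], random_state=42
)

print(y_train["labels"].value_counts())
print(y_test["labels"].value_counts())
```

Without `stratify`, the split is purely random, so the minority class could end up over- or under-represented in the test set, which matters most for small or imbalanced datasets.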

Let's practice!
