Supervised learning pipelines

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Labeled data

  • Feature variables (shorthand: X)
  • Labels or class (shorthand: y)
credit_scoring.head(4)
  checking_status  duration  ...  foreign_worker class
0            '<0'         6  ...             yes  good
1      '0<=X<200'        48  ...             yes   bad
2   'no checking'        12  ...             yes  good
3            '<0'        42  ...             yes  good
Designing Machine Learning Workflows in Python

Feature engineering

  • Most classifiers expect numeric features
  • Need to convert string columns to numbers

Preprocess using LabelEncoder from sklearn.preprocessing:

le = LabelEncoder()
le.fit_transform(credit_scoring['checking_status'])[:4]
array([1, 0, 3, 1])
Designing Machine Learning Workflows in Python

Model fitting

  • .fit(features, labels)
  • .predict(features)
features, labels = credit_scoring.drop('class', 1), credit_scoring['class']

model_nb = GaussianNB() model_nb.fit(features, labels) model_nb.predict(features.head(5))
['good' 'bad' 'good' 'bad' 'good']

60% accuracy on first 5 examples.

Designing Machine Learning Workflows in Python

Model selection

  • .fit() optimizes the parameters of the given model
  • What about other models?

AdaBoostClassifier outperforms GaussianNB on first five data points:

model_ab = AdaBoostClassifier()
model_ab.fit(features, labels)
model_ab.predict(features.head(5))
numpy.array(labels[0:5])
['good' 'bad' 'good' 'good' 'bad']
['good' 'bad' 'good' 'good' 'bad']
Designing Machine Learning Workflows in Python

Performance assessment

Larger sample sizes $\Rightarrow$ better accuracy estimates:

from sklearn.metrics import accuracy_score
accuracy_score(labels, model_nb.predict(features)) # naive bayes
0.706
accuracy_score(labels, model_ab.predict(features)) # adaboost
0.802

What is wrong with this calculation?

Designing Machine Learning Workflows in Python

Overfitting and data splitting

Overfitting: a model will always perform better on the data it was trained on than on unseen data.

Train on X_train, y_train, assess accuracy on X_test, y_test:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

GaussianNB().fit(X_train, y_train).predict(X_test)
Designing Machine Learning Workflows in Python

A standard supervised learning workflow starting with feature engineering, data splitting into train and test, followed by model assessment and model selection. The standard pipeline is sometimes not enough for real-world problems, however.

Designing Machine Learning Workflows in Python

So, what is this course about?

  1. Scalable ways to tune your pipeline.
  2. Making sure your predictions are relevant by involving domain experts.
  3. Making sure your model continues to perform well over time.
  4. Fitting models when you don't have enough labels.
Designing Machine Learning Workflows in Python

Could you have prevented the mortgage crisis?

Designing Machine Learning Workflows in Python

Preparing Video For Download...