Measuring model performance

Supervised Learning with scikit-learn

George Boorman

Core Curriculum Manager, DataCamp

Measuring model performance

  • In classification, accuracy is a commonly used metric

  • Accuracy = correct predictions / total observations
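
As a minimal sketch of this definition, accuracy can be computed by hand from two lists of labels. The labels are made up for illustration, and the helper name `accuracy` is hypothetical, not a scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: 4 of the 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 0.8
```

For classifiers, scikit-learn's `score` method reports this same quantity (mean accuracy).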

Measuring model performance

  • How do we measure accuracy?

  • Could compute accuracy on the data used to fit the classifier

  • This is NOT indicative of the model's ability to generalize to unseen data
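
To see why, here is a small sketch using a hand-written 1-nearest-neighbor rule in plain Python (not scikit-learn). The one-dimensional data and the helper names `predict_1nn` and `accuracy` are invented for illustration. One training label is deliberately noisy. The model memorizes it, so training accuracy looks perfect while accuracy on unseen points is lower.

```python
def predict_1nn(train_points, train_labels, x):
    """Return the label of the training point closest to x."""
    distances = [abs(p - x) for p in train_points]
    return train_labels[distances.index(min(distances))]

def accuracy(points, labels, train_points, train_labels):
    """Share of points whose 1-NN prediction matches the true label."""
    hits = sum(
        predict_1nn(train_points, train_labels, x) == y
        for x, y in zip(points, labels)
    )
    return hits / len(points)

# Made-up data: the label is 1 when x > 3.5, except for one noisy training point at 2.0
train_points = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_labels = [0, 1, 0, 1, 1, 1]
test_points = [1.8, 2.2, 3.2, 4.4, 5.6]
test_labels = [0, 0, 0, 1, 1]

train_acc = accuracy(train_points, train_labels, train_points, train_labels)
test_acc = accuracy(test_points, test_labels, train_points, train_labels)
print(train_acc)  # 1.0, each training point is its own nearest neighbor
print(test_acc)   # 0.6, the noisy point misleads two nearby test points
```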

Computing accuracy

  • Split the data into a training set and a test set

  • Fit the classifier on the training set

  • Calculate accuracy using the test set

Train/test split

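from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hold out 30% of the data for testing; stratify=y keeps class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
# score() returns the accuracy on the held-out test set
print(knn.score(X_test, y_test))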
0.8800599700149925
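
The effect of `stratify=y` can be checked with a small sketch. The `X` and `y` arrays here are synthetic stand-ins, not the dataset used on the slide. The class balance is deliberately uneven so the stratification is visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (an assumption): 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Share of class 1 in the full data, the training set, and the test set
print(y.mean(), y_train.mean(), y_test.mean())
```

With stratification, the share of class 1 should match across all three, which keeps the test accuracy from being skewed by an unrepresentative split.
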

Model complexity

  • Larger k: less complex model, which can cause underfitting

  • Smaller k: more complex model, which can lead to overfitting

[Figure: scatter plots showing the KNN decision boundary for k=1, k=9, and k=18]
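
The two extremes can be sketched on synthetic data. The dataset comes from `make_classification` and is an assumption for illustration, not the data behind the figure. With k=1 the model memorizes the training set. With k equal to the number of training points, every prediction collapses to the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (an assumption), roughly 70% class 0
X, y = make_classification(
    n_samples=300, n_features=4, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# k = 1: each training point is its own nearest neighbor (most complex model)
knn_small = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("k=1 training accuracy:", knn_small.score(X_train, y_train))

# k = all training points: every vote is the same, so one class is always predicted
knn_large = KNeighborsClassifier(n_neighbors=len(X_train)).fit(X_train, y_train)
print("k=n predicted classes:", np.unique(knn_large.predict(X_test)))
```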

Model complexity and over/underfitting

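import numpy as np
from sklearn.neighbors import KNeighborsClassifier

train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)  # k = 1, 2, ..., 25

for neighbor in neighbors:
    # Fit a fresh classifier for each value of k
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    # Record accuracy on both the training set and the test set
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)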

Plotting our results

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbors")
# Convert the dict views to lists so matplotlib receives plain sequences
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()

Model complexity curve

[Figure: line plot of accuracy vs. number of neighbors. Both training and test set accuracy decrease as k increases; an arrow points to 13 neighbors as producing the best test set accuracy.]
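
Rather than reading the best k off the plot, it can be looked up from the accuracy dictionary. A minimal sketch, using invented accuracy values in place of the real `test_accuracies` results:

```python
# Hypothetical test accuracies keyed by k (values invented for illustration)
test_accuracies = {1: 0.79, 5: 0.85, 9: 0.86, 13: 0.87, 17: 0.86, 21: 0.85, 25: 0.84}

# max() with key=dict.get returns the key whose value is largest
best_k = max(test_accuracies, key=test_accuracies.get)
print(best_k, test_accuracies[best_k])  # 13 0.87
```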

Let's practice!
