Measuring model performance

Supervised Learning with scikit-learn

George Boorman

Core Curriculum Manager, DataCamp

Measuring model performance

In classification, accuracy is a commonly used metric
Accuracy:

correct predictions divided by total observations

Measuring model performance

How do we measure accuracy?
Could compute accuracy on the data used to fit the classifier
NOT indicative of ability to generalize

Computing accuracy

data being split into a training set and test set

Computing accuracy

fitting a classifier on the training set

Computing accuracy

calculate accuracy using the test set

Train/test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=21, stratify=y)

knn = KNeighborsClassifier(n_neighbors=6)

knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))

0.8800599700149925

Model complexity

Larger k = less complex model = can cause underfitting
Smaller k = more complex model = can lead to overfitting

the scatter plots showing a KNN decision boundary for k=1, k=9, and k=18

Model complexity and over/underfitting

train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)

for neighbor in neighbors:

    knn = KNeighborsClassifier(n_neighbors=neighbor)

    knn.fit(X_train, y_train)

    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)

Plotting our results

plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()

Model complexity curve

line plot of accuracy vs number of neighbors, where both the training and test set accuracy decrease as k increases

Model complexity curve

arrow in the line plot pointing to 13 neighbors as producing the best test set accuracy

Let's practice!

Supervised Learning with scikit-learn