Logistic regression and the ROC curve

Supervised Learning with scikit-learn

George Boorman

Core Curriculum Manager, DataCamp

Logistic regression for binary classification

  • Logistic regression is used for classification problems

  • Logistic regression outputs probabilities

  • If the probability $p > 0.5$:

    • The data is labeled 1
  • If the probability $p < 0.5$:

    • The data is labeled 0
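
The labeling rule above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's internals: the helpers `sigmoid` and `label_from_probability` and the example scores are hypothetical, and assigning a probability of exactly 0.5 to class 0 is an assumption made here.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def label_from_probability(p, threshold=0.5):
    """Return 1 if p exceeds the threshold, otherwise 0."""
    return 1 if p > threshold else 0

# Hypothetical linear scores (w . x + b) for three observations
scores = [-2.0, 0.3, 4.0]
probabilities = [sigmoid(z) for z in scores]
labels = [label_from_probability(p) for p in probabilities]

for z, p, y in zip(scores, probabilities, labels):
    print(f"score={z:+.1f}  p={p:.3f}  label={y}")
```

Negative scores map to probabilities below 0.5 and positive scores to probabilities above it, which is why the decision boundary in feature space is linear.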

Linear decision boundary

Scatter plot of feature 1 vs. feature 2, with a straight-line decision boundary for predicting churn running left to right

Logistic regression in scikit-learn

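from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X and y: feature matrix and binary target, assumed to be already defined
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)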

Predicting probabilities

y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:1])
[0.08961376]
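
To see why the slide indexes `[:, 1]`, it may help to look at the full output of `predict_proba`. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the course's data (an assumption), so the numbers will differ from those shown on the slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course's churn data (an assumption)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

probs = logreg.predict_proba(X_test)  # one column per class
y_pred_probs = probs[:, 1]            # probability of the positive class

print(probs.shape)                          # (n_test_samples, 2)
print(logreg.classes_)                      # column order of predict_proba
print(np.allclose(probs.sum(axis=1), 1.0))  # each row sums to 1
```

The columns follow the order of `logreg.classes_`, so the second column holds the probability of class 1 for each test observation.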

Probability thresholds

  • By default, logistic regression threshold = 0.5

  • Not specific to logistic regression

    • KNN classifiers also have thresholds
  • What happens if we vary the threshold?
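
A small sketch of what varying the threshold does to the predicted labels. The probabilities are made up for illustration, and `predict_with_threshold` is a hypothetical helper, not a scikit-learn function.

```python
def predict_with_threshold(probs, threshold):
    """Label an observation 1 when its probability is at least the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

# Made-up positive-class probabilities, for illustration only
y_pred_probs = [0.05, 0.20, 0.45, 0.55, 0.75, 0.95]

results = {}
for threshold in (0.3, 0.5, 0.7):
    labels = predict_with_threshold(y_pred_probs, threshold)
    results[threshold] = labels
    print(threshold, labels, "positives:", sum(labels))
```

Lowering the threshold labels more observations as positive, and raising it labels fewer. That trade-off is what the ROC curve summarizes.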

The ROC curve

Plot of true positive rate vs. false positive rate, with a dotted diagonal line running from bottom left to top right.

  • A threshold of 0 is highlighted in the top right
  • A threshold of 1 is highlighted in the bottom left
  • Dots, then a line, curve up and to the right above the dotted line, representing different thresholds
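
The two extreme points of the curve can be checked by hand. The sketch below computes the true positive rate and false positive rate at a few thresholds on made-up labels and probabilities; `tpr_fpr` is a hypothetical helper, and treating a probability equal to the threshold as positive is an assumption.

```python
def tpr_fpr(y_true, probs, threshold):
    """Compute (true positive rate, false positive rate) at one threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, yhat in zip(y_true, preds) if t == 1 and yhat == 1)
    fp = sum(1 for t, yhat in zip(y_true, preds) if t == 0 and yhat == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

# Made-up labels and positive-class probabilities
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.10, 0.30, 0.40, 0.60, 0.80, 0.90]

for threshold in (0.0, 0.5, 1.0):
    tpr, fpr = tpr_fpr(y_true, probs, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At a threshold of 0 every observation is predicted positive, so both rates are 1 (top right). At a threshold of 1 nothing in this sample is predicted positive, so both rates are 0 (bottom left).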

Plotting the ROC curve

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
plt.plot([0, 1], [0, 1], 'k--')  # dotted diagonal: chance-level model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()

Plotting the ROC curve

ROC curve plot for the churn dataset, with a line moving up and to the right from the bottom left

ROC AUC

ROC curve plot for the churn dataset, with a line moving up and to the right from the bottom left, with p=0.67 annotated

ROC AUC in scikit-learn

from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_test, y_pred_probs))
0.6700964152663693
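
One way to read this score: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive observation receives the higher predicted probability, with ties counted as half. The sketch below computes that fraction directly on made-up data; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def pairwise_auc(y_true, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for t, p in zip(y_true, probs) if t == 1]
    neg = [p for t, p in zip(y_true, probs) if t == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and positive-class probabilities
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.10, 0.30, 0.40, 0.60, 0.80, 0.90]

auc = pairwise_auc(y_true, probs)
print(auc)
```

In this toy sample, 8 of the 9 pairs are ranked correctly. A model that guesses at random would score about 0.5, which corresponds to the dotted diagonal on the ROC plot, and a perfect ranking scores 1.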

Let's practice!
