Loss functions Part I

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

The KDD '99 Cup dataset

kdd.iloc[0]
duration                         51
protocol_type                   tcp
service                        smtp
flag                             SF
src_bytes                      1169
dst_bytes                       332
land                              0
...
dst_host_rerror_rate              0
dst_host_srv_rerror_rate          0
label                          good

False positives vs. false negatives

Binarize the label:

kdd['label'] = kdd['label'] == 'bad'

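Train a Gaussian Naive Bayes classifier:

import pandas as pd
from sklearn.naive_bayes import GaussianNB

# X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split
clf = GaussianNB().fit(X_train, y_train)
predictions = clf.predict(X_test)
# Put true labels and predictions side by side
results = pd.DataFrame({
    'actual': y_test,
    'predicted': predictions
})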
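
The slide does not show where X_train, y_train, X_test and y_test come from. A minimal end-to-end sketch follows, assuming a synthetic stand-in for the numeric KDD features (make_classification and the split settings are illustrative assumptions, not the course's data or setup):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the numeric KDD features (assumption, not the real data)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0)
y = y == 1  # boolean label, mirroring kdd['label'] == 'bad'

# Hold out 30% of the rows for testing, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = GaussianNB().fit(X_train, y_train)
predictions = clf.predict(X_test)

# One row per test example: true label next to predicted label
results = pd.DataFrame({'actual': y_test, 'predicted': predictions})
print(results.head())
```

With boolean labels, the classifier's predictions are boolean too, so the two columns can be compared directly.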

There are four possible combinations of label and prediction: both True, both False, label True with prediction False, and label False with prediction True. The last combination, a false positive, is highlighted on the slide.

Now the combination of label True and prediction False, a false negative, is highlighted.

The two cases where the prediction matches the label, true positives and true negatives, are now highlighted.
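
The highlighted groups can also be picked out with boolean masks on a table shaped like results. The sketch below uses a small hypothetical table (the six rows are made up for illustration) with the same 'actual' and 'predicted' columns:

```python
import pandas as pd

# Hypothetical toy results table with the same columns as on the slide
results = pd.DataFrame({
    'actual':    [True, True, False, False, True, False],
    'predicted': [True, False, False, True, True, False],
})

# Label False, prediction True
false_positives = results[~results['actual'] & results['predicted']]
# Label True, prediction False
false_negatives = results[results['actual'] & ~results['predicted']]
# Prediction matches the label (true positives and true negatives)
correct = results[results['actual'] == results['predicted']]

print(len(false_positives), len(false_negatives), len(correct))
```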

The confusion matrix

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(
    ground_truth, predictions)

array([[9477,   19],
       [ 397, 2458]])

tn, fp, fn, tp = conf_mat.ravel()
(fp, fn)

(19, 397)

A confusion matrix counts the number of cases in each of the four combinations above. For this dataset there are 19 false positives and 397 false negatives.
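
The unpacking order tn, fp, fn, tp follows from scikit-learn's layout: rows are true labels, columns are predicted labels, and labels are sorted so that False comes before True. A small sketch on hypothetical hand-made labels (the eight values are illustrative only):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, small enough to count by hand
ground_truth = [False, False, False, True, True, True, True, False]
predictions  = [False, True,  False, True, False, True, True, False]

conf_mat = confusion_matrix(ground_truth, predictions)
# Row 0: actual False -> [tn, fp]; row 1: actual True -> [fn, tp]
print(conf_mat)

# ravel() flattens row by row, which gives the order tn, fp, fn, tp
tn, fp, fn, tp = conf_mat.ravel()
print(tn, fp, fn, tp)
```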

Scalar performance metrics

accuracy = 1-(fp + fn)/len(ground_truth)

recall = tp/(tp+fn)

fpr = fp/(tn+fp)

precision = tp/(tp+fp)

f1 = 2*(precision*recall)/(precision+recall)

from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score)

accuracy_score(ground_truth, predictions)
recall_score(ground_truth, predictions)
precision_score(ground_truth, predictions)
f1_score(ground_truth, predictions)
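
As a worked example, the formulas above can be evaluated directly on the counts from the confusion matrix shown earlier (tn=9477, fp=19, fn=397, tp=2458). This sketch only redoes the arithmetic; it does not rerun the classifier:

```python
# Counts taken from the confusion matrix on the earlier slide
tn, fp, fn, tp = 9477, 19, 397, 2458
n = tn + fp + fn + tp  # plays the role of len(ground_truth)

accuracy = 1 - (fp + fn) / n           # share of correct predictions
recall = tp / (tp + fn)                # share of actual positives that were found
fpr = fp / (tn + fp)                   # share of actual negatives flagged as positive
precision = tp / (tp + fp)             # share of predicted positives that are correct
f1 = 2 * (precision * recall) / (precision + recall)

for name, value in [('accuracy', accuracy), ('recall', recall),
                    ('fpr', fpr), ('precision', precision), ('f1', f1)]:
    print(f'{name}: {value:.4f}')
```

Precision is very high while recall is noticeably lower, which reflects the imbalance between the 19 false positives and the 397 false negatives.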

False positives vs. false negatives

Classifier A:

tn, fp, fn, tp = confusion_matrix(
    ground_truth, predictions_A).ravel()
(fp,fn)

(3, 3)

cost = 10 * fp + fn

Classifier B:

tn, fp, fn, tp = confusion_matrix(
    ground_truth, predictions_B).ravel()
(fp,fn)

(0, 26)

cost = 10 * fp + fn

Which classifier is better?
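
Under the cost on the slide, where a false positive counts ten times as much as a false negative, the comparison reduces to simple arithmetic. A sketch using the (fp, fn) counts shown above; the helper name total_cost is introduced here for illustration only:

```python
def total_cost(fp, fn, fp_weight=10, fn_weight=1):
    """Weighted error cost: cost = 10 * fp + fn with the default weights."""
    return fp_weight * fp + fn_weight * fn

# (fp, fn) counts from the slide
cost_A = total_cost(fp=3, fn=3)    # 10 * 3 + 3
cost_B = total_cost(fp=0, fn=26)   # 10 * 0 + 26

print(cost_A, cost_B)  # prints: 33 26
```

Classifier B makes more errors in total (26 versus 6) but has the lower cost under this weighting, so the answer depends on how the two error types are priced.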
