Labels, weak labels and truth

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Labels are not always perfect

Degrees of truth:

  • Ground truth
    • the computer crashes and a message asks for ransom money
  • Human expert labeling
    • the analyst inspects the computer logs and identifies unauthorized behaviors
  • Heuristic labeling
    • too many ports received traffic in a very small period of time

Labels are not always perfect

Noiseless or strong labels:

  • Ground truth
  • Human expert labeling

Noisy or weak labels:

  • Heuristic labeling

Feature engineering:

  • Features used in heuristics

Features and heuristics

Average number of unique ports visited by infected hosts:

np.mean(X[y]['unique_ports'])
15.11

Average number of unique ports visited per host, disregarding labels:

np.mean(X['unique_ports'])
11.23
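
The comparison above can be reproduced on a toy example. This is a minimal sketch, assuming `X` is a pandas DataFrame with a `unique_ports` column and `y` is a boolean array marking infected hosts; the values below are made up for illustration and are not the course dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the course dataset)
X = pd.DataFrame({'unique_ports': [3, 20, 5, 18, 4, 22]})
y = np.array([False, True, False, True, False, True])  # True = infected host

# Mean over the rows labeled as infected (boolean mask selects rows)
mean_infected = X.loc[y, 'unique_ports'].mean()

# Mean over every row, ignoring the labels
mean_all = X['unique_ports'].mean()

print(mean_infected, mean_all)
```

If the mean for the labeled hosts is clearly higher than the overall mean, the feature is a candidate for a labeling heuristic.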

From features to labels

Convert a feature into a labeling heuristic:

X_train, X_test, y_train, y_test = train_test_split(X, y)
y_weak_train = X_train['unique_ports'] > 15
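
Before relying on a heuristic, it can help to measure how often it agrees with the trusted labels. The sketch below is an illustration on made-up data, using the same threshold of 15; the metric choices (accuracy, precision, recall from scikit-learn) are an assumption and not part of the original slides.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy training data (hypothetical values)
X_train = pd.DataFrame({'unique_ports': [3, 20, 5, 18, 4, 22, 16, 9]})
y_train = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])  # trusted labels, 1 = infected

# Heuristic (weak) labels: flag hosts that visited more than 15 unique ports
y_weak_train = (X_train['unique_ports'] > 15).astype(int)

# How closely do the weak labels track the trusted ones?
agreement = accuracy_score(y_train, y_weak_train)
precision = precision_score(y_train, y_weak_train)
recall = recall_score(y_train, y_weak_train)

print(agreement, precision, recall)
```

A heuristic that agrees with the trusted labels most of the time, but not always, is what the slides call a noisy or weak label.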


From features to labels

Stack two copies of the feature matrix on top of each other. The first copy keeps the labels produced by domain experts; the second copy is labeled using the heuristic.

X_train_aug = pd.concat([X_train, X_train])
y_train_aug = pd.concat([pd.Series(y_train), pd.Series(y_weak_train)])
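
A small sketch of the stacking step on made-up data follows. One detail worth noting: `pd.concat` keeps the original row index by default, so the stacked frame ends up with duplicated index values. Passing `ignore_index=True`, as done here, is an optional choice (not in the original slides) that gives the stacked rows a fresh 0..n-1 index.

```python
import pandas as pd

# Toy training data (hypothetical values)
X_train = pd.DataFrame({'unique_ports': [3, 20, 16]})
y_train = pd.Series([False, True, False])          # trusted labels
y_weak_train = X_train['unique_ports'] > 15        # heuristic labels

# Same features twice: first block pairs with trusted labels,
# second block pairs with heuristic labels
X_train_aug = pd.concat([X_train, X_train], ignore_index=True)
y_train_aug = pd.concat([y_train, y_weak_train], ignore_index=True)

print(X_train_aug.shape, y_train_aug.tolist())
```

The row order matters: the first `len(y_train)` rows carry the trusted labels and the remaining rows carry the weak ones.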

The data are stacked in the same way as on the previous slide, but a weight of 1.0 is given to the original labels and a weight of 0.1 to the ones produced by the heuristic.

weights = [1.0]*len(y_train) + [0.1]*len(y_weak_train)

Accuracy using ground truth only:

accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
0.91

Ground truth and weak labels without weights:

accuracy_score(y_test, clf.fit(X_train_aug, y_train_aug).predict(X_test))
0.93

Add weights:

accuracy_score(y_test, clf.fit(X_train_aug, y_train_aug, sample_weight=weights).predict(X_test))
0.95
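
The three steps (heuristic labels, stacking, weighting) can be put together in one runnable sketch. Everything about the data here is an assumption: the synthetic `unique_ports` feature, the extra `noise` column, and the choice of `LogisticRegression` as `clf` are illustrative stand-ins, since the slides do not say which classifier or dataset was used. Accuracy on this synthetic data will not match the figures above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): infected hosts tend to visit more ports
rng = np.random.default_rng(0)
n = 400
y = rng.random(n) < 0.3
unique_ports = np.where(y, rng.poisson(18, n), rng.poisson(9, n))
X = pd.DataFrame({'unique_ports': unique_ports, 'noise': rng.normal(size=n)})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak labels from the heuristic
y_weak_train = X_train['unique_ports'] > 15

# Stack trusted and weak copies of the training data
X_train_aug = pd.concat([X_train, X_train], ignore_index=True)
y_train_aug = pd.concat(
    [pd.Series(y_train), pd.Series(y_weak_train)], ignore_index=True
)

# Full weight for trusted labels, reduced weight for weak labels
weights = [1.0] * len(y_train) + [0.1] * len(y_weak_train)

clf = LogisticRegression()  # stand-in classifier (assumption)
clf.fit(X_train_aug, y_train_aug, sample_weight=weights)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

The weight on the weak labels is a tuning knob: a value near 0 ignores the heuristic, while a value of 1.0 trusts it as much as the original labels.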

Labels do not need to be perfect!
