Labels, weak labels and truth

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Labels are not always perfect

Degrees of truth:

Ground truth
- the computer crashes and a message asks for ransom money
Human expert labeling
- the analyst inspects the computer logs and identifies unauthorized behaviors
Heuristic labeling
- too many ports received traffic in a very small period of time

Labels are not always perfect

Noiseless or strong labels:

Ground truth
Human expert labeling

Noisy or weak labels:

Heuristic labeling

Feature engineering:

Features used in heuristics

Features and heuristics

Average number of unique ports visited by infected hosts:

np.mean(X[y]['unique_ports'])

15.11

Average number of unique ports visited per host, disregarding labels:

np.mean(X['unique_ports'])

11.23
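The comparison above can be tried on a small example. This is a minimal sketch on hypothetical toy data: the six hosts, their port counts, and their labels are invented for illustration and are not the course dataset. It assumes `y` is a boolean Series aligned with the rows of `X`, which is what makes `X[y]` select only the infected hosts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six hosts, two of them labeled as infected.
X = pd.DataFrame({'unique_ports': [3, 5, 20, 4, 18, 4]})
y = pd.Series([False, False, True, False, True, False])

# A boolean mask keeps only the rows where y is True (infected hosts).
mean_infected = np.mean(X[y]['unique_ports'])
# No mask: average over every host, regardless of label.
mean_all = np.mean(X['unique_ports'])

print(mean_infected)  # 19.0
print(mean_all)       # 9.0
```

A gap like this between the two averages is what suggests the feature could serve as a labeling heuristic.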

From features to labels

Convert a feature into a labeling heuristic:

X_train, X_test, y_train, y_test = train_test_split(X, y)
y_weak_train = X_train['unique_ports'] > 15
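Before relying on a heuristic, it can help to check how often it agrees with the trusted labels. The sketch below does this on hypothetical toy data (the eight hosts are invented for illustration); the names `agreement` and `false_alarms` are introduced here and do not appear in the slides.

```python
import pandas as pd

# Hypothetical training data: eight hosts with expert labels.
X_train = pd.DataFrame({'unique_ports': [3, 5, 20, 4, 18, 16, 30, 2]})
y_train = pd.Series([False, False, True, False, True, False, True, False])

# The heuristic from the slide: many unique ports suggests an infected host.
y_weak_train = X_train['unique_ports'] > 15

# Fraction of hosts where the heuristic matches the expert label.
agreement = (y_weak_train == y_train).mean()
# Hosts flagged by the heuristic that the expert marked as clean.
false_alarms = (y_weak_train & ~y_train).sum()

print(agreement)     # 0.875
print(false_alarms)  # 1
```

The single disagreement (the host with 16 ports) is the kind of noise that makes these labels weak rather than strong.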

From features to labels

Two copies of the feature matrix are stacked on top of each other. The first copy carries the labels produced by domain experts; the second is labeled using the heuristic.

X_train_aug = pd.concat([X_train, X_train])
y_train_aug = pd.concat([pd.Series(y_train), pd.Series(y_weak_train)])
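To make the stacking concrete, here is a minimal sketch on a hypothetical three-row training set (values invented for illustration). One choice in it goes beyond the slide: `ignore_index=True` is passed to `pd.concat` so that the stacked rows get a fresh, unique index. Without it, each original index value would appear twice.

```python
import pandas as pd

# Hypothetical training data: three hosts with expert labels.
X_train = pd.DataFrame({'unique_ports': [3, 20, 16]})
y_train = pd.Series([False, True, False])
y_weak_train = X_train['unique_ports'] > 15

# Stack two copies of the features; rows 0-2 pair with expert labels,
# rows 3-5 pair with heuristic labels.
X_train_aug = pd.concat([X_train, X_train], ignore_index=True)
y_train_aug = pd.concat([y_train, y_weak_train], ignore_index=True)

print(X_train_aug.shape)     # (6, 1)
print(y_train_aug.tolist())  # [False, True, False, False, True, True]
```

The third host appears once as negative (expert label) and once as positive (heuristic label), so the classifier sees conflicting evidence for it. Weighting is one way to tell it which copy to trust more.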

The data are stacked in the same way as on the previous slide, but a weight of 1.0 is given to the original labels, and a weight of 0.1 to the ones produced by the heuristic.

weights = [1.0]*len(y_train) + [0.1]*len(y_weak_train)

Accuracy using ground truth only:

0.91

Ground truth and weak labels without weights:

accuracy_score(y_test, clf.fit(X_train_aug, y_train_aug).predict(X_test))

0.93

Add weights:

accuracy_score(y_test, clf.fit(X_train_aug, y_train_aug, sample_weight=weights).predict(X_test))

0.95
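The three experiments can be run end to end with the sketch below. Everything in it is an assumption made for illustration: the data are synthetic, the `packets` feature is invented, and `LogisticRegression` stands in for `clf`, whose type the slides do not state. It is not expected to reproduce the accuracies shown above, and on other data weak labels may help less or even hurt.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): infected hosts tend to visit more unique ports.
rng = np.random.default_rng(0)
n = 400
y = pd.Series(rng.random(n) < 0.3)
X = pd.DataFrame({
    'unique_ports': rng.poisson(np.where(y, 18, 9)),
    'packets': rng.normal(np.where(y, 1.0, 0.0), 1.0),  # hypothetical feature
})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_weak_train = X_train['unique_ports'] > 15  # heuristic (weak) labels

# Stack expert-labeled and heuristic-labeled copies of the training set.
X_train_aug = pd.concat([X_train, X_train], ignore_index=True)
y_train_aug = pd.concat([y_train, y_weak_train], ignore_index=True)
# Trust expert labels fully and heuristic labels much less.
weights = [1.0] * len(y_train) + [0.1] * len(y_weak_train)

clf = LogisticRegression(max_iter=1000)  # stand-in classifier (assumption)
results = {
    'expert labels only': accuracy_score(
        y_test, clf.fit(X_train, y_train).predict(X_test)),
    'expert + weak, unweighted': accuracy_score(
        y_test, clf.fit(X_train_aug, y_train_aug).predict(X_test)),
    'expert + weak, weighted': accuracy_score(
        y_test,
        clf.fit(X_train_aug, y_train_aug, sample_weight=weights).predict(X_test)),
}
for name, acc in results.items():
    print(f'{name}: {acc:.2f}')
```

The test set is always scored against the original labels `y_test`; the weak labels are used only during training.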

Labels do not need to be perfect!
