Distance-based learning

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Distance and similarity

from sklearn.neighbors import DistanceMetric as dm
dist = dm.get_metric('euclidean')

X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)
array([[0.        , 2.82842712, 5.        ],
       [2.82842712, 0.        , 3.60555128],
       [5.        , 3.60555128, 0.        ]])
import numpy as np
X = np.array(X)
np.sqrt(np.sum(np.square(X[0,:] - X[1,:])))
2.82842712
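
The single-pair calculation above extends to the whole matrix. A minimal sketch, assuming plain NumPy broadcasting in place of DistanceMetric (the names diffs and D are illustrative, not from the slides):

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [0, 6]], dtype=float)

# Differences between every pair of rows: shape (3, 3, 2).
diffs = X[:, None, :] - X[None, :, :]

# Euclidean distance: square root of the summed squared differences.
D = np.sqrt((diffs ** 2).sum(axis=-1))

print(np.round(D, 8))
```

The result should match the dist.pairwise(X) output shown on this slide.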

Non-Euclidean Local Outlier Factor

from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(
    novelty=True, metric='chebyshev')
clf.fit(X_train)
y_pred = clf.predict(X_test)
dist = dm.get_metric('chebyshev')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)
array([[0., 2., 5.],
       [2., 0., 3.],
       [5., 3., 0.]])
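
The Chebyshev distance between two rows is the largest absolute difference across their coordinates. A minimal sketch of that definition in plain NumPy, reusing the same three points (diffs and D are illustrative names):

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [0, 6]], dtype=float)

# Differences between every pair of rows: shape (3, 3, 2).
diffs = X[:, None, :] - X[None, :, :]

# Chebyshev distance: the largest absolute coordinate difference.
D = np.abs(diffs).max(axis=-1)

print(D)
```

The result should match the dist.pairwise(X) output for 'chebyshev' above.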

Two clusters of black points with some isolated red points.
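
A setting like the one in the figure can be imitated with synthetic data. This is only a sketch: the cluster centres, spreads, random seed and test points are all hypothetical choices, not the data behind the figure.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical training data: two tight clusters of inliers.
X_train = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
])

# Hypothetical test points: one at each cluster centre, one far from both.
X_test = np.array([[0.0, 0.0], [5.0, 5.0], [2.5, -4.0]])

clf = LocalOutlierFactor(novelty=True, metric='chebyshev')
clf.fit(X_train)

# predict returns +1 for inliers and -1 for outliers.
y_pred = clf.predict(X_test)
print(y_pred)
```

With these choices, the two centre points should be labelled as inliers and the isolated point as an outlier.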


Are all metrics similar?

Hamming distance matrix:

dist = dm.get_metric('hamming')
X = [[0,1], [2,3], [0,6]]
dist.pairwise(X)
array([[0. , 1. , 0.5],
       [1. , 0. , 1. ],
       [0.5, 1. , 0. ]])
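
The Hamming distance ignores how far apart two values are and only counts the fraction of coordinates that differ. A minimal sketch of that definition in plain NumPy (differs and D are illustrative names):

```python
import numpy as np

X = np.array([[0, 1], [2, 3], [0, 6]])

# True wherever two rows disagree on a coordinate: shape (3, 3, 2).
differs = X[:, None, :] != X[None, :, :]

# Hamming distance: the fraction of coordinates that disagree.
D = differs.mean(axis=-1)

print(D)
```

Rows 0 and 2 share their first coordinate, so their distance is 0.5; every other pair differs in both coordinates.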

Are all metrics similar?

from scipy.spatial.distance import pdist

X = [[0,1], [2,3], [0,6]]
pdist(X, 'cityblock')
array([4., 5., 5.])
from scipy.spatial.distance import squareform
squareform(pdist(X, 'cityblock'))
array([[0., 4., 5.],
       [4., 0., 5.],
       [5., 5., 0.]])
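
pdist returns a condensed vector with one entry per pair of rows rather than a full matrix. A short sketch of how those entries line up with row pairs, assuming itertools.combinations as a convenient way to enumerate them (pairs, condensed and D are illustrative names):

```python
from itertools import combinations

from scipy.spatial.distance import pdist, squareform

X = [[0, 1], [2, 3], [0, 6]]
condensed = pdist(X, 'cityblock')

# pdist lists the pairs (i, j) with i < j in row-major order.
pairs = list(combinations(range(len(X)), 2))
for (i, j), d in zip(pairs, condensed):
    print(f"rows {i} and {j}: {d}")

# squareform mirrors those entries into a symmetric matrix
# with a zero diagonal.
D = squareform(condensed)
print(D)
```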

A real-world example

The Hepatitis dataset:

   Class   AGE  SEX  STEROID    ...      
0    2.0  40.0  0.0      0.0    ...      
1    2.0  30.0  0.0      0.0    ...      
2    1.0  47.0  0.0      1.0    ...      
Source: https://archive.ics.uci.edu/ml/datasets/Hepatitis

A real-world example

Euclidean distance:

squareform(pdist(X_hep, 'euclidean'))
[[  0.  127.   64.1]
 [127.    0.  128.2]
 [ 64.1 128.2   0. ]]
  • Row 0 is nearest to row 2 (1st and 3rd patients): wrong class

Hamming distance:

squareform(pdist(X_hep, 'hamming'))
[[0.  0.5 0.7]
 [0.5 0.  0.6]
 [0.7 0.6 0. ]]
  • Row 0 is nearest to row 1 (1st and 2nd patients): right class
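
The nearest-neighbour comparison can be read off the two matrices programmatically. A minimal sketch, using only the distances and Class values printed on these slides; nearest_neighbour is a hypothetical helper, not part of SciPy or scikit-learn:

```python
import numpy as np

# Distance matrices and class labels copied from the slides above.
D_euclidean = np.array([[0.0, 127.0, 64.1],
                        [127.0, 0.0, 128.2],
                        [64.1, 128.2, 0.0]])
D_hamming = np.array([[0.0, 0.5, 0.7],
                      [0.5, 0.0, 0.6],
                      [0.7, 0.6, 0.0]])
classes = np.array([2.0, 2.0, 1.0])

def nearest_neighbour(D):
    """Index of each row's closest other row (hypothetical helper)."""
    masked = D.copy()
    np.fill_diagonal(masked, np.inf)  # a row must not match itself
    return masked.argmin(axis=1)

for name, D in [('euclidean', D_euclidean), ('hamming', D_hamming)]:
    nn = nearest_neighbour(D)
    same = classes[nn[0]] == classes[0]
    print(name, 'row 0 -> row', nn[0], 'same class:', same)
```

For row 0, the Euclidean matrix picks a neighbour from the other class, while the Hamming matrix picks one from the same class.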

A bigger toolbox
