Outlier-robust feature scaling

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Euclidean distance

import numpy as np

A = np.array([9, 1, 6])
B = np.array([25, 44, 85])

# Squared coordinate-wise differences
diffs = (B - A) ** 2

# Square root of the summed squares
dist_AB = np.sqrt(np.sum(diffs))

print(dist_AB)

91.35644476444998

The formula to calculate Euclidean distance and a visual to illustrate it.

Euclidean in SciPy

from scipy.spatial.distance import euclidean

dist_AB = euclidean(A, B)
dist_AB

91.35644476444998
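As a quick cross-check, the same distance can also be obtained from NumPy's `np.linalg.norm`, which by default returns the Euclidean (L2) norm of a vector. The sketch below reuses the points A and B from above; the variable names `manual`, `via_norm`, and `via_scipy` are only illustrative.

```python
import numpy as np
from scipy.spatial.distance import euclidean

A = np.array([9, 1, 6])
B = np.array([25, 44, 85])

# Three routes to the same Euclidean distance
manual = np.sqrt(np.sum((B - A) ** 2))   # formula written out by hand
via_norm = np.linalg.norm(B - A)         # L2 norm of the difference vector
via_scipy = euclidean(A, B)              # SciPy helper

print(np.isclose(manual, via_norm), np.isclose(manual, via_scipy))
```

Because every coordinate contributes its raw squared difference, features measured on larger scales tend to dominate the distance, which is one reason scaling matters for distance-based methods.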


Standardization

The mean is subtracted from each feature, and the result is divided by the standard deviation (STD)
Result: each feature has a mean of zero and an STD of 1
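The two statements above can be verified by hand. This is a minimal sketch on a small made-up array (`values` is hypothetical and not part of the course data), using the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

# Hypothetical toy feature, not taken from the course dataset
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Subtract the mean, then divide by the standard deviation
standardized = (values - values.mean()) / values.std()

# Mean is (numerically) zero and STD is one after standardization
print(standardized.mean(), standardized.std())
```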

StandardScaler

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()


# Extract features and target
# (males: the body measurements DataFrame, loaded beforehand)
X = males.drop("weightkg", axis=1)
y = males[['weightkg']]


# Fit
ss.fit(X)

Transforming

X_transformed = ss.transform(X)

X_transformed[:5]

array([[-1.05174523],
       [-0.29289108],
       [ 1.3446363 ],
       [-1.21654894],
       [0.056451235]])

fit_transform

ss = StandardScaler()

X_transformed = ss.fit_transform(X)
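To make the shortcut concrete, the sketch below checks that calling fit and transform separately gives the same result as `fit_transform`. The array `X_demo` is randomly generated stand-in data, an assumption made so the example runs without the course dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 100 rows, 2 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=170, scale=10, size=(100, 2))

# Two-step version: learn mean/STD, then apply them
two_step = StandardScaler().fit(X_demo).transform(X_demo)

# One-step version on the same data
one_step = StandardScaler().fit_transform(X_demo)

print(np.allclose(two_step, one_step))
```

Keeping the two steps separate is still useful when the scaler is fitted on training data and later applied to new data with `transform` only.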

QuantileTransformer

from sklearn.preprocessing import QuantileTransformer


# Initialize the transformer
qt = QuantileTransformer()

# Extract features and target
X = males.drop("weightkg", axis=1)
y = males[['weightkg']]


X_transformed = qt.fit_transform(X)

X_transformed.shape

(4082, 94)
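To illustrate why a quantile-based transform is considered outlier-robust, the sketch below scales one synthetic feature containing a single extreme value with both scalers. The data, the outlier value, and `n_quantiles=200` are assumptions chosen for the demonstration, not settings from the course.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Hypothetical feature: 200 ordinary values plus one extreme outlier
rng = np.random.default_rng(42)
feature = rng.normal(loc=50, scale=5, size=200)
feature[0] = 5000.0
X_demo = feature.reshape(-1, 1)

# StandardScaler uses the mean and STD, which the outlier distorts
ss_out = StandardScaler().fit_transform(X_demo)

# QuantileTransformer uses ranks/quantiles, so output stays within [0, 1]
qt_out = QuantileTransformer(n_quantiles=200).fit_transform(X_demo)

print("StandardScaler range:     ", ss_out.min(), ss_out.max())
print("QuantileTransformer range:", qt_out.min(), qt_out.max())
```

With StandardScaler the outlier ends up many standard deviations from the rest and squeezes the ordinary values together, while the quantile transform maps it to the top of a bounded range.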

Preserving column names

qt = QuantileTransformer()

X.loc[:, :] = qt.fit_transform(X)

X.head()

The first five rows of a transformed version of the ANSUR male body measurements dataset.
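An alternative that leaves the original DataFrame untouched is to wrap the transformed array in a new DataFrame with the same columns and index. This is a sketch on a made-up two-column frame (`X_demo` and its values are hypothetical); it may also sidestep dtype warnings that some pandas versions raise when floats are assigned into integer columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Hypothetical stand-in for the body measurement features
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "footlength": rng.integers(220, 320, size=100),
    "stature": rng.integers(1500, 2000, size=100),
})

qt = QuantileTransformer(n_quantiles=100)

# Build a new DataFrame so column names and index are preserved
X_scaled = pd.DataFrame(
    qt.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

print(X_scaled.head())
```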

Uniform histogram

import matplotlib.pyplot as plt

plt.hist(X['footlength'], color='red')

plt.xlabel("Foot length")
plt.title("Histogram of foot lengths")

plt.show()

A histogram of the foot length column that shows a uniform distribution

Normal histogram

qt = QuantileTransformer(
  output_distribution='normal')

# Rebuild the feature DataFrame, which was overwritten above
X = males.drop("weightkg", axis=1)
X.loc[:, :] = qt.fit_transform(X)

plt.hist(X['footlength'], color='r')
plt.xlabel("Foot length")
plt.title("Histogram of foot lengths")

plt.show()

A histogram of the foot length column that shows a close-to-normal distribution
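The effect of `output_distribution='normal'` can also be checked numerically instead of visually. The sketch below applies it to a deliberately skewed synthetic feature (exponentially distributed, an assumption for illustration) and compares the skewness before and after using `scipy.stats.skew`.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

# Hypothetical right-skewed feature
rng = np.random.default_rng(7)
skewed = rng.exponential(scale=2.0, size=(1000, 1))

qt = QuantileTransformer(n_quantiles=1000, output_distribution="normal")
gaussian_like = qt.fit_transform(skewed)

# Skewness near zero indicates a roughly symmetric, bell-like shape
skew_before = skew(skewed.ravel())
skew_after = skew(gaussian_like.ravel())

print("Skewness before:", skew_before)
print("Skewness after: ", skew_after)
print("Mean / STD after:", gaussian_like.mean(), gaussian_like.std())
```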

Let's practice!
