Data distributions and transformations

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Different distributions

Data distributions

1 https://www.researchgate.net/figure/Bias-Training-and-test-data-sets-are-drawn-from-different-distributions_fig22_330485084

Train/test split

train, test = train_test_split(X, test_size=0.3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

sns.pairplot() --> plot matrix of distributions and scatterplots
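A minimal sketch of checking that a split leaves train and test with similar distributions. The synthetic `X` and `y`, the seed values, and the use of `stratify=y` are assumptions made for illustration; they are not from the slides. A numeric summary stands in for the visual check that `sns.pairplot()` would give.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 3 numeric features, binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y = (X[:, 0] + rng.normal(size=1000) > 5.0).astype(int)

# Passing X and y together returns four arrays;
# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Compare simple summaries of the two splits
print("shapes:", X_train.shape, X_test.shape)
print("train feature means:", X_train.mean(axis=0).round(2))
print("test feature means: ", X_test.mean(axis=0).round(2))
print("positive class share:", y_train.mean().round(3), y_test.mean().round(3))
```

If the summaries differ noticeably between the splits, the test set may not be representative of the data the model was trained on.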


Data transformation


1 https://www.researchgate.net/figure/Example-of-the-effect-of-a-log-transformation-on-the-distribution-of-the-dataset_fig20_308007227
Box-Cox Transformations

scipy.stats.boxcox(data, lmbda= )

| lmbda (p) | $x^p$ | transform |
|---|---|---|
| -2 | $x^{-2} = 1/x^2$ | reciprocal square |
| -1 | $x^{-1} = 1/x$ | reciprocal |
| -0.5 | $x^{-1/2} = 1/\sqrt{x}$ | reciprocal square root |
| 0.0 | $\log(x)$ | log |
| 0.5 | $x^{1/2} = \sqrt{x}$ | square root |
| 1 | $x^1 = x$ | no transform |
| 2 | $x^2$ | square |
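A short sketch of how `scipy.stats.boxcox` behaves for a few `lmbda` values. The log-normal sample and the seed are assumptions chosen for illustration. Note that for nonzero `lmbda` SciPy returns the shifted and scaled form $(x^p - 1)/p$ rather than the bare power $x^p$; the shape of the resulting distribution is the same. The input must be strictly positive.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, strictly positive sample
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# lmbda=0 is the log transform
log_data = stats.boxcox(data, lmbda=0.0)

# lmbda=0.5 is the square-root transform, shifted and scaled: (sqrt(x) - 1) / 0.5
sqrt_data = stats.boxcox(data, lmbda=0.5)

# With lmbda omitted, boxcox also returns the lambda that maximizes the log-likelihood
fitted, best_lmbda = stats.boxcox(data)

print("lmbda=0 matches np.log:", np.allclose(log_data, np.log(data)))
print("lmbda=0.5 matches (sqrt(x) - 1) / 0.5:",
      np.allclose(sqrt_data, (np.sqrt(data) - 1) / 0.5))
print("fitted lmbda:", best_lmbda)
print("skewness before:", stats.skew(data), "after:", stats.skew(fitted))
```

Because the sample here is log-normal, the fitted `lmbda` should land close to 0, which corresponds to the log row of the table.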

Let's practice!
