Using Databricks for machine learning

Databricks Concepts

Kevin Barlow

Data Practitioner

Machine Learning Lifecycle

¹ https://www.datacamp.com/blog/machine-learning-lifecycle-explained

Planning and preparation

ML Lifecycle - EDA

Planning for machine learning

What do I have?

Data availability
Business requirements
Data scientists/data analysts

Data team and resources

What do I want?

Use cases
Legal and security compliance
Business outcomes

ML Runtime

Extension of Databricks compute
Optimized for machine learning applications
Contains most common libraries and frameworks
- scikit-learn, SparkML, TensorFlow
- MLflow
Works with cluster library management
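
A quick way to see which of the libraries listed above are present on a given cluster is to query the installed package metadata. This is a minimal sketch using only the Python standard library; the package names passed in are the PyPI distribution names and are an assumption about how the runtime installs them.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to your environment.
report = installed_versions(["scikit-learn", "pyspark", "tensorflow", "mlflow"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Anything reported as not installed could then be added through cluster library management.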

Databricks ML Runtime

Exploratory Data Analysis

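import pandas as pd

# pandas DataFrame: summary statistics for the numeric columns
df.describe()

# Spark DataFrame: summary() returns a new DataFrame, so show it
df.summary().show()

# Databricks notebook utility: interactive profile of a DataFrame
dbutils.data.summarize(df)

# bamboolib: after the import, displaying a pandas DataFrame opens its UI
import bamboolib as bam
df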
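
To make the pandas part concrete outside a notebook, here is a small sketch that builds a DataFrame from the book data used in the Raw Data table of this deck and computes two basic summaries. The variable names (`books`, `numeric_summary`, `category_counts`) are illustrative only.

```python
import pandas as pd

# Values copied from the Raw Data table in this deck.
books = pd.DataFrame({
    "count": [4, 6, 12, 31, 23, 18, 19, 7, 37],
    "category": ["horror", "romance", "sci-fi", "romance", "fantasy",
                 "horror", "cooking", "fantasy", "sci-fi"],
    "price": [12.50, 13.99, 16.50, 9.99, 24.99, 19.99, 17.50, 12.99, 14.99],
    "shelf_loc": ["end", "top", "bottom", "bottom", "top",
                  "end", "end", "top", "bottom"],
    "rating": [3, 4.5, 5, 3.5, 4, 2.5, 4.5, 3, 5],
})

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
numeric_summary = books.describe()
print(numeric_summary)

# Frequency of each value in a categorical column.
category_counts = books["category"].value_counts()
print(category_counts)
```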

EDA in Databricks

Feature tables and feature stores

Raw Data

count	category	price	shelf_loc	rating
4	horror	12.50	end	3
6	romance	13.99	top	4.5
12	sci-fi	16.50	bottom	5
31	romance	9.99	bottom	3.5
23	fantasy	24.99	top	4
18	horror	19.99	end	2.5
19	cooking	17.50	end	4.5
7	fantasy	12.99	top	3
37	sci-fi	14.99	bottom	5

Feature table

count	category	price	shelf_loc	rating
4	1	12.50	1	3
6	2	13.99	2	4.5
12	3	16.50	3	5
31	2	9.99	3	3.5
23	4	24.99	2	4
18	1	19.99	1	2.5
19	5	17.50	1	4.5
7	4	12.99	2	3
37	3	14.99	3	5
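
The step from the raw data to the feature table above can be sketched in pandas. This is one possible encoding, not necessarily the one used to build the slide: it assumes each text value is replaced by an integer assigned in order of first appearance, starting at 1, which reproduces the numbers shown. The helper name `encode_by_first_appearance` is hypothetical.

```python
import pandas as pd

# Values copied from the Raw Data table.
raw = pd.DataFrame({
    "count": [4, 6, 12, 31, 23, 18, 19, 7, 37],
    "category": ["horror", "romance", "sci-fi", "romance", "fantasy",
                 "horror", "cooking", "fantasy", "sci-fi"],
    "price": [12.50, 13.99, 16.50, 9.99, 24.99, 19.99, 17.50, 12.99, 14.99],
    "shelf_loc": ["end", "top", "bottom", "bottom", "top",
                  "end", "end", "top", "bottom"],
    "rating": [3, 4.5, 5, 3.5, 4, 2.5, 4.5, 3, 5],
})


def encode_by_first_appearance(series):
    """Map each distinct value to 1, 2, 3, ... in order of first appearance."""
    codes, _uniques = pd.factorize(series)  # codes start at 0
    return codes + 1


features = raw.copy()
for column in ["category", "shelf_loc"]:
    features[column] = encode_by_first_appearance(features[column])

print(features)
```

Integer codes like these imply an ordering that the categories do not have, so one-hot encoding may suit some models better.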

Databricks Feature Store

Centralized storage for featurized datasets
Easily discover and re-use features for machine learning models
Upstream and downstream lineage

Databricks Feature Store

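from databricks import feature_store

# Client for creating and reading feature tables
fs = feature_store.FeatureStoreClient()

# table_name and features_df (a Spark DataFrame) are assumed to be defined earlier
fs.create_table(
    name=table_name,
    primary_keys=["wine_id"],     # column(s) that identify each row
    df=features_df,               # data written to the new table
    schema=features_df.schema,
    description="wine features"
)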
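
Because the table above is keyed on `wine_id`, it can help to check the key column before writing. The sketch below is plain pandas and does not call the Feature Store API; it assumes the key should be non-null and unique per row. The helper `check_primary_key` and the sample columns are hypothetical.

```python
import pandas as pd


def check_primary_key(df, key_columns):
    """Return a list of problems found in the key columns (empty if none)."""
    missing = [c for c in key_columns if c not in df.columns]
    if missing:
        raise KeyError(f"key columns not in DataFrame: {missing}")

    problems = []
    if df[key_columns].isnull().any().any():
        problems.append("null key values")
    if df.duplicated(subset=key_columns).any():
        problems.append("duplicate key values")
    return problems


# Hypothetical sample data, for illustration only.
good = pd.DataFrame({"wine_id": [1, 2, 3], "alcohol": [9.4, 9.8, 10.1]})
bad = pd.DataFrame({"wine_id": [1, 1, None], "alcohol": [9.4, 9.8, 10.1]})

print(check_primary_key(good, ["wine_id"]))
print(check_primary_key(bad, ["wine_id"]))
```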

Let's practice!
