Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
customer_id | email | state | created_at
---|---|---|---
1 | jane.doe@theweb.com | New York | 2019-01-01 07:00:00
customer_id | email | username | domain
---|---|---|---
1 | jane.doe@theweb.com | jane.doe | theweb.com
```python
customer_df  # pandas DataFrame with customer data

# Split the email column into two columns on the '@' symbol
split_email = customer_df.email.str.split("@", expand=True)

# At this point, split_email has two columns: the first with
# everything before the '@', the second with everything after it.

# Create two new columns from the resulting DataFrame
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)
```
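To see the split end to end, here is a minimal runnable sketch with a single made-up row (the real course data differs):

```python
import pandas as pd

# Hypothetical sample data for illustration only
customer_df = pd.DataFrame({
    "customer_id": [1],
    "email": ["jane.doe@theweb.com"],
})

# Split the email column into two columns on the '@' symbol
split_email = customer_df.email.str.split("@", expand=True)

# Attach the two parts as new columns
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)

print(customer_df[["customer_id", "username", "domain"]])
```

With `expand=True`, `str.split` returns a DataFrame rather than a Series of lists, which is what lets us index the two halves by column position.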
Extract data into PySpark
```python
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()

spark.read.jdbc(
    "jdbc:postgresql://localhost:5432/pagila",
    "customer",
    properties={"user": "repl", "password": "password"},
)
```
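The extract step itself is not Spark-specific. As a lightweight analogue that runs without Spark or PostgreSQL, here is the same "pull a database table into a DataFrame" idea using sqlite3 and pandas, with a made-up `customer` table:

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database standing in for the Pagila database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Jane'), (2, 'Joe')")
conn.commit()

# Extract the table into a DataFrame, as spark.read.jdbc does above
customer_df = pd.read_sql("SELECT * FROM customer", conn)
print(customer_df)
```

The shape is the same in both cases: point a reader at a connection plus a table, get a DataFrame back.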
A new ratings table
customer_id | film_id | rating
---|---|---
1 | 2 | 1
2 | 1 | 5
2 | 2 | 3
... | ... | ...
The customer table
customer_id | first_name | last_name | ...
---|---|---|---
1 | Jane | Doe | ...
2 | Joe | Doe | ...
... | ... | ... | ...
The customer_id column overlaps with the ratings table, so we can join the two on it.
```python
customer_df  # PySpark DataFrame with customer data
ratings_df   # PySpark DataFrame with ratings data

# Group by customer and compute the mean rating
ratings_per_customer = ratings_df.groupBy("customer_id").mean("rating")

# Join on customer ID
customer_df.join(
    ratings_per_customer,
    customer_df.customer_id == ratings_per_customer.customer_id,
)
```
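The same group-by-then-join transformation can be sketched in pandas, using the made-up ratings from the tables above, which makes it easy to run without a Spark session:

```python
import pandas as pd

# Sample data matching the tables above
customer_df = pd.DataFrame({
    "customer_id": [1, 2],
    "first_name": ["Jane", "Joe"],
})
ratings_df = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "film_id": [2, 1, 2],
    "rating": [1, 5, 3],
})

# Mean rating per customer
# (PySpark equivalent: ratings_df.groupBy("customer_id").mean("rating"))
ratings_per_customer = (
    ratings_df.groupby("customer_id", as_index=False)["rating"].mean()
)

# Join on customer ID
# (PySpark equivalent: customer_df.join(ratings_per_customer, ...))
result = customer_df.merge(ratings_per_customer, on="customer_id")
print(result)
```

Customer 2 rated two films (5 and 3), so their mean rating comes out as 4.0; customer 1's single rating of 1 stays 1.0.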