Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
customer_id | email | state | created_at
---|---|---|---
1 | jane.doe@theweb.com | New York | 2019-01-01 07:00:00
customer_id | email | username | domain
---|---|---|---
1 | jane.doe@theweb.com | jane.doe | theweb.com
```python
customer_df  # pandas DataFrame with customer data

# Split the email column into two columns on the '@' symbol
split_email = customer_df.email.str.split("@", expand=True)

# At this point, split_email has two columns: the first with
# everything before the '@', the second with everything after it.

# Create two new columns from the resulting DataFrame
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)
```
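To see the split end to end, here is a minimal runnable sketch with a single made-up row (the real course data differs):

```python
import pandas as pd

# Hypothetical sample data for illustration only
customer_df = pd.DataFrame({
    "customer_id": [1],
    "email": ["jane.doe@theweb.com"],
})

# Split the email column into two columns on the '@' symbol
split_email = customer_df.email.str.split("@", expand=True)

# Attach the two parts as new columns
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)

print(customer_df[["customer_id", "username", "domain"]])
```

With `expand=True`, `str.split` returns a DataFrame rather than a Series of lists, which is what lets us index the two halves by column position.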
Extract data into PySpark
```python
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()

spark.read.jdbc(
    "jdbc:postgresql://localhost:5432/pagila",
    "customer",
    properties={"user": "repl", "password": "password"},
)
```
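The extract step itself is not Spark-specific. As a lightweight analogue that runs without Spark or PostgreSQL, here is the same "pull a database table into a DataFrame" idea using sqlite3 and pandas, with a made-up `customer` table:

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database standing in for the Pagila database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Jane'), (2, 'Joe')")
conn.commit()

# Extract the table into a DataFrame, as spark.read.jdbc does above
customer_df = pd.read_sql("SELECT * FROM customer", conn)
print(customer_df)
```

The shape is the same in both cases: point a reader at a connection plus a table, get a DataFrame back.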
A new ratings table
customer_id | film_id | rating
---|---|---
1 | 2 | 1
2 | 1 | 5
2 | 2 | 3
... | ... | ...
The customer table
customer_id | first_name | last_name | ...
---|---|---|---
1 | Jane | Doe | ...
2 | Joe | Doe | ...
... | ... | ... | ...
The customer_id column overlaps with the ratings table, so we can join the two on it.
```python
customer_df  # PySpark DataFrame with customer data
ratings_df   # PySpark DataFrame with ratings data

# Group by customer and compute the mean rating
ratings_per_customer = ratings_df.groupBy("customer_id").mean("rating")

# Join on customer ID
customer_df.join(
    ratings_per_customer,
    customer_df.customer_id == ratings_per_customer.customer_id,
)
```
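The same group-by-then-join transformation can be sketched in pandas, using the made-up ratings from the tables above, which makes it easy to run without a Spark session:

```python
import pandas as pd

# Sample data matching the tables above
customer_df = pd.DataFrame({
    "customer_id": [1, 2],
    "first_name": ["Jane", "Joe"],
})
ratings_df = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "film_id": [2, 1, 2],
    "rating": [1, 5, 3],
})

# Mean rating per customer
# (PySpark equivalent: ratings_df.groupBy("customer_id").mean("rating"))
ratings_per_customer = (
    ratings_df.groupby("customer_id", as_index=False)["rating"].mean()
)

# Join on customer ID
# (PySpark equivalent: customer_df.join(ratings_per_customer, ...))
result = customer_df.merge(ratings_per_customer, on="customer_id")
print(result)
```

Customer 2 rated two films (5 and 3), so their mean rating comes out as 4.0; customer 1's single rating of 1 stays 1.0.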