Transform

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Kinds of transformations

customer_id	email	state	created_at
1	[email protected]	New York	2019-01-01 07:00:00

Selecting an attribute (e.g. "email")
Translating code values (e.g. "New York" to "NY")
Validating data (e.g. the date in "created_at")
Splitting a column into multiple columns
Joining data from multiple sources
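
Two of the transformations listed above, translating code values and validating a date, can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the `STATE_CODES` mapping, the `transform_row` helper, and the timestamp format are hypothetical and chosen to match the sample row.

```python
from datetime import datetime

# Hypothetical lookup table; a real pipeline would cover every state.
STATE_CODES = {"New York": "NY", "California": "CA"}

# Assumed timestamp layout, matching the sample row above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def transform_row(row):
    """Return a transformed copy of one customer record (a dict)."""
    result = dict(row)
    # Translate code values: full state name -> abbreviation.
    # Unknown states are left unchanged rather than dropped.
    result["state"] = STATE_CODES.get(row["state"], row["state"])
    # Validate the date: strptime raises ValueError on malformed input.
    result["created_at"] = datetime.strptime(row["created_at"], TIMESTAMP_FORMAT)
    return result


row = {"customer_id": 1, "state": "New York", "created_at": "2019-01-01 07:00:00"}
clean = transform_row(row)
print(clean["state"], clean["created_at"].year)
```

Raising on a malformed date is one possible design choice; a pipeline could instead log the row and continue.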

An example: split (Pandas)

customer_id	email	username	domain
1	[email protected]	Jane Doe	theweb.com

customer_df # Pandas DataFrame with customer data

# Split email column into 2 columns on the '@' symbol
split_email = customer_df.email.str.split("@", expand=True)

# At this point, split_email will have 2 columns, a first
# one with everything before @, and a second one with
# everything after @

# Create 2 new columns using the resulting DataFrame.
customer_df = customer_df.assign(
  username=split_email[0],
  domain=split_email[1],
)
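
To try the split end to end, the same steps can be run on a tiny DataFrame. This is a sketch with made-up data: the address `jane.doe@example.com` is hypothetical, and `n=1` is an addition not in the slide code, used here so that the split happens only at the first "@".

```python
import pandas as pd

# Hypothetical sample data; the address is made up for illustration.
customer_df = pd.DataFrame(
    {"customer_id": [1], "email": ["jane.doe@example.com"]}
)

# n=1 splits only on the first '@', so the result has exactly two columns.
split_email = customer_df["email"].str.split("@", n=1, expand=True)

# Attach the two parts as new columns.
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)
print(customer_df)
```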

Transforming in PySpark

Extracting data into PySpark

import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()


spark.read.jdbc("jdbc:postgresql://localhost:5432/pagila",
                "customer",
                properties={"user":"repl","password":"password"})

An example: join

A new ratings table

customer_id	film_id	rating
1	2	1
2	1	5
2	2	3
...	...	...

The customer table

customer_id	first_name	last_name	...
1	Jane	Doe	...
2	Joe	Doe	...
...	...	...	...

customer_id overlaps with the ratings table

An example: join (PySpark)

customer_df # PySpark DataFrame with customer data
ratings_df # PySpark DataFrame with ratings data


# Groupby ratings
ratings_per_customer = ratings_df.groupBy("customer_id").mean("rating")


# Join on customer ID
customer_df.join(
  ratings_per_customer,
  customer_df.customer_id==ratings_per_customer.customer_id
)
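
The same aggregate-then-join logic can be checked locally without a Spark session. The sketch below is a pandas analogue, not the PySpark code itself; it uses the sample rows from the two tables above, and the column name `avg_rating` is an assumption introduced for readability.

```python
import pandas as pd

# Sample rows taken from the ratings and customer tables above.
ratings_df = pd.DataFrame(
    {"customer_id": [1, 2, 2], "film_id": [2, 1, 2], "rating": [1, 5, 3]}
)
customer_df = pd.DataFrame(
    {"customer_id": [1, 2], "first_name": ["Jane", "Joe"], "last_name": ["Doe", "Doe"]}
)

# Aggregate: mean rating per customer (one row per customer_id).
ratings_per_customer = (
    ratings_df.groupby("customer_id", as_index=False)["rating"]
    .mean()
    .rename(columns={"rating": "avg_rating"})
)

# Join: attach the mean rating to each customer via the shared key.
joined = customer_df.merge(ratings_per_customer, on="customer_id", how="inner")
print(joined)
```

Aggregating before joining keeps one row per customer; joining the raw ratings instead would repeat each customer once per rating.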

Let's practice!
