Transform

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Kinds of transformations

customer_id	email	state	created_at
1	jane.doe@theweb.com	New York	2019-01-01 07:00:00

Selecting an attribute (e.g. "email")
Translating code values (e.g. "New York" -> "NY")
Validating data (e.g. checking that "created_at" holds a valid date)
Splitting one column into several
Joining data from multiple sources
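The first three kinds of transformation can be sketched in Pandas as follows. This is a minimal illustration, not the course's own code: the one-row DataFrame is modeled on the table above, and the STATE_CODES lookup is a hypothetical mapping introduced only for this example.

```python
import pandas as pd

# Hypothetical one-row sample modeled on the customer table above
customer_df = pd.DataFrame({
    "customer_id": [1],
    "email": ["jane.doe@theweb.com"],
    "state": ["New York"],
    "created_at": ["2019-01-01 07:00:00"],
})

# Attribute selection: keep only the "email" column
emails = customer_df["email"]

# Code translation: map full state names to abbreviations
# (STATE_CODES is an illustrative lookup, not part of the course material)
STATE_CODES = {"New York": "NY"}
customer_df["state"] = customer_df["state"].map(STATE_CODES)

# Data validation: parse "created_at"; values that cannot be parsed become NaT
customer_df["created_at"] = pd.to_datetime(customer_df["created_at"], errors="coerce")
is_valid_date = customer_df["created_at"].notna()
```

Rows where is_valid_date is False would be the ones to reject or inspect before loading.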

Example: splitting (Pandas)

customer_id	email	username	domain
1	jane.doe@theweb.com	jane.doe	theweb.com

customer_df # Pandas DataFrame with customer data

# Split the email column into 2 columns on the "@" symbol
split_email = customer_df.email.str.split("@", expand=True)

# At this point, split_email will have 2 columns:
# the first with everything before @, the second with
# everything after @

# Create 2 new columns using the resulting DataFrame.
customer_df = customer_df.assign(
  username=split_email[0],
  domain=split_email[1],
)
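To see the split end to end, here is a small self-contained sketch that runs the same steps on a sample row modeled on the table above. The sample DataFrame is an assumption made for illustration, and the n=1 argument is an addition not present in the slide: it caps the split at the first "@", so the result has at most two columns even if an address contains more than one "@".

```python
import pandas as pd

# Hypothetical sample row modeled on the slide's table
customer_df = pd.DataFrame({
    "customer_id": [1],
    "email": ["jane.doe@theweb.com"],
})

# Split on the first "@" only; expand=True returns a DataFrame
# whose columns are labeled 0 and 1
split_email = customer_df["email"].str.split("@", n=1, expand=True)

# Attach the two parts as new, named columns
customer_df = customer_df.assign(
    username=split_email[0],
    domain=split_email[1],
)

print(customer_df)
```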

Transforming in PySpark

Extracting data into PySpark

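import pyspark.sql

# Get the active Spark session, or create one if none exists
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# Read the "customer" table from the pagila PostgreSQL database over JDBC
spark.read.jdbc("jdbc:postgresql://localhost:5432/pagila",
                "customer",
                properties={"user": "repl", "password": "password"})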

Example: joining

A new ratings table

customer_id	film_id	rating
1	2	1
2	1	5
2	2	3
...	...	...

Customer table

customer_id	first_name	last_name	...
1	Jane	Doe	...
2	Joe	Doe	...
...	...	...	...

customer_id is shared with the ratings table

Example: joining (PySpark)

customer_df # PySpark DataFrame with customer data
ratings_df # PySpark DataFrame with ratings data

# Group by customer and compute the mean rating
ratings_per_customer = ratings_df.groupBy("customer_id").mean("rating")

# Join on customer ID
customer_df.join(
  ratings_per_customer,
  customer_df.customer_id==ratings_per_customer.customer_id
)
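The same aggregate-then-join logic can be checked on the sample rows from the two tables above. The sketch below is a Pandas analogue, not PySpark, chosen so it runs without a Spark session; the DataFrames hold only the rows visible on the slides, and the names mean_rating and customer_with_ratings are introduced here for illustration.

```python
import pandas as pd

# Sample rows taken from the customer and ratings tables on the slides
customer_df = pd.DataFrame({
    "customer_id": [1, 2],
    "first_name": ["Jane", "Joe"],
    "last_name": ["Doe", "Doe"],
})
ratings_df = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "film_id": [2, 1, 2],
    "rating": [1, 5, 3],
})

# Aggregate: one row per customer with the mean of their ratings
ratings_per_customer = (
    ratings_df.groupby("customer_id", as_index=False)["rating"]
    .mean()
    .rename(columns={"rating": "mean_rating"})
)

# Join on the shared customer_id column
customer_with_ratings = customer_df.merge(
    ratings_per_customer, on="customer_id", how="inner"
)

print(customer_with_ratings)
```

Aggregating before joining keeps one row per customer in the result; joining the raw ratings first would repeat each customer once per rating.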

Let's practice!
