Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
Analytics

Applications

Massively Parallel Processing Databases

Load data from a file into a columnar storage format
# pandas .to_parquet() method
df.to_parquet("s3://path/to/bucket/customer.parquet")
# PySpark .write.parquet() method
df.write.parquet("s3://path/to/bucket/customer.parquet")
COPY customer
FROM 's3://path/to/bucket/customer.parquet'
FORMAT as parquet
...
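As a rough illustration of why a columnar format suits analytical queries, the sketch below pivots row-oriented records into column lists in plain Python. The sample rows, the field names, and the `to_columns` helper are hypothetical and only stand in for what Parquet does at the file level; this is not how Parquet itself is implemented.

```python
# Hypothetical sample rows, as they might be read from a source file.
rows = [
    {"customer_id": 1, "country": "BE", "amount": 20.0},
    {"customer_id": 2, "country": "NL", "amount": 35.5},
    {"customer_id": 3, "country": "BE", "amount": 12.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate query only needs to read the one column it touches,
# instead of scanning every field of every row.
total_amount = sum(columns["amount"])
print(total_amount)
```

Storing values of the same column together is what lets an analytical database skip the columns a query does not use.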
pandas.to_sql()
# Transformation on data
recommendations = transform_find_recommendations(ratings_df)
# Load into PostgreSQL database
recommendations.to_sql("recommendations",
                       db_engine,
                       schema="store",
                       if_exists="replace")
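To try the load step without a PostgreSQL server, a minimal sketch can use an in-memory SQLite database instead. The DataFrame contents and column names are hypothetical, SQLite is swapped in for the PostgreSQL engine, and the `schema` argument is omitted in this sketch.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the output of the transformation step.
recommendations = pd.DataFrame({"user_id": [1, 2], "course_id": [10, 20]})

# In-memory SQLite connection, used here in place of the PostgreSQL engine.
conn = sqlite3.connect(":memory:")

# First load: creates the table and writes two rows.
recommendations.to_sql("recommendations", conn,
                       if_exists="replace", index=False)

# Second load: if_exists="replace" drops the table before writing,
# so only the single new row remains afterwards.
recommendations.head(1).to_sql("recommendations", conn,
                               if_exists="replace", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM recommendations").fetchone()[0]
print(row_count)
```

With `if_exists="append"` the second call would add to the existing rows instead, so the choice depends on whether each run should overwrite or accumulate results.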