Loading

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Analytics or applications databases

Analytics

Icon representing analytics

  • Aggregate queries
  • Online analytical processing (OLAP)

Applications

Icon representing applications

  • Lots of transactions
  • Online transaction processing (OLTP)
Introduction to Data Engineering

Column- and row-oriented

Analytics

  • Column-oriented

Example of column-oriented storage

  • Queries about subset of columns
  • Parallelization

Applications

  • Row-oriented

Example of row-oriented storage

  • Stored per record
  • Added per transaction
  • E.g. adding customer is fast
Introduction to Data Engineering

MPP Databases

Massively Parallel Processing Databases

Diagram of MPP database

 

  • Amazon Redshift
  • Azure SQL Data Warehouse
  • Google BigQuery
Introduction to Data Engineering

An example: Redshift

Load from file to columnar storage format

# Pandas .to_parquet() method
df.to_parquet("./s3://path/to/bucket/customer.parquet")
# PySpark .write.parquet() method
df.write.parquet("./s3://path/to/bucket/customer.parquet")
COPY customer
FROM 's3://path/to/bucket/customer.parquet'
FORMAT as parquet
...
Introduction to Data Engineering

Load to PostgreSQL

pandas.to_sql()

# Transformation on data
recommendations = transform_find_recommendatins(ratings_df)

# Load into PostgreSQL database
recommendations.to_sql("recommendations",
                       db_engine,
                       schema="store",
                       if_exists="replace")
Introduction to Data Engineering

Let's practice!

Introduction to Data Engineering

Preparing Video For Download...