Data warehouses and data lakes

Understanding Data Engineering

Hadrien Lacroix

Content Developer

Warehouses with a stunning view of the lake

pipeline

Data lakes and data warehouses

Data lake

  • Stores all the raw data
  • Can reach petabytes in size (1 PB = 1 million GB)
  • Stores data of any structure: structured, semi-structured, or unstructured
  • Cost-effective
  • Difficult to analyze
  • Requires an up-to-date data catalog
  • Used by data scientists
  • Big data, real-time analytics

Data warehouse

  • Specific data for specific use
  • Relatively small
  • Stores mainly structured data
  • More costly to update
  • Optimized for data analysis
  • Also used by data analysts and business analysts
  • Ad-hoc, read-only queries
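The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary directory stands in for the data lake, an in-memory SQLite table stands in for the data warehouse, and the event records, file names, and function names are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    {"customer": "ana", "action": "purchase", "amount": 30.0},
    {"customer": "bob", "action": "purchase", "amount": 12.5},
    {"customer": "ana", "action": "login"},  # no amount: the lake stores it anyway
    {"customer": "ana", "action": "purchase", "amount": 20.0},
]


def write_to_lake(lake_dir):
    """Lake side: store everything as-is, whatever its structure."""
    events_file = lake_dir / "events.jsonl"  # semi-structured data
    events_file.write_text("\n".join(json.dumps(e) for e in RAW_EVENTS))
    notes_file = lake_dir / "support_call.txt"  # unstructured data
    notes_file.write_text("Customer asked about delivery times.")
    return [events_file, notes_file]


def load_warehouse(events_file):
    """Warehouse side: keep only purchases, in a fixed schema built for analysis."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    for line in events_file.read_text().splitlines():
        event = json.loads(line)
        if event["action"] == "purchase":
            conn.execute(
                "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
                (event["customer"], event["amount"]),
            )
    conn.commit()
    return conn


def revenue_per_customer(conn):
    """An ad-hoc analytical query that only reads from the warehouse."""
    query = (
        "SELECT customer, SUM(amount) FROM purchases "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


def run_demo():
    """Write raw data to the lake, load the warehouse, and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        files = write_to_lake(Path(tmp))
        conn = load_warehouse(files[0])
        try:
            revenue = revenue_per_customer(conn)
        finally:
            conn.close()
        return [f.name for f in files], revenue
```

The lake accepts the text note and the login event without complaint, while the warehouse table holds only the rows and columns the revenue question needs, which is why the query on it is short.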

Data catalog for data lakes

  • What is the source of this data?
  • Where is this data used?
  • Who is the owner of the data?
  • How often is this data updated?
  • Good practice in terms of data governance
  • Ensures reproducibility
  • Without a catalog, a data lake turns into a data swamp
  • Good practice for any data storage solution, supporting:
    • Reliability
    • Autonomy
    • Scalability
    • Speed
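As a rough sketch, a catalog entry can be modeled as a small record that answers the four questions above. The class, its field names, and the helper functions below are hypothetical and chosen for illustration; they are not taken from any particular catalog tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset in the lake, described by the four catalog questions."""
    name: str
    source: str            # What is the source of this data?
    used_by: tuple         # Where is this data used?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?


def build_catalog(entries):
    """Index catalog entries by dataset name for quick lookup."""
    return {entry.name: entry for entry in entries}


def uncataloged(datasets_in_lake, catalog):
    """List datasets present in the lake but missing from the catalog."""
    return sorted(set(datasets_in_lake) - set(catalog))
```

A check like `uncataloged(["events", "support_calls"], catalog)` gives a simple early warning: every name it returns is data nobody has documented, which is the first step toward a data swamp.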

Database vs. data warehouse

  • Database:
    • General term
    • Loosely defined as organized data stored and accessed on a computer
  • Data warehouse is a type of database

Summary

  • Data lakes
  • Data warehouses
  • Databases
  • Data catalog

Let's practice!
