Data warehouses and data lakes

Understanding Data Engineering

Hadrien Lacroix

Content Developer

Warehouses with a stunning view of the lake

[Figure: the data pipeline]

Data lakes and data warehouses

Data lake

  • Stores all the raw data
  • Can reach petabytes (1 petabyte = 1 million GB)
  • Stores any data structure (structured, semi-structured, unstructured)
  • Cost-effective
  • Difficult to analyze
  • Requires an up-to-date data catalog
  • Used by data scientists
  • Big data, real-time analytics

Data warehouse

  • Specific data for specific use
  • Relatively small
  • Stores mainly structured data
  • More costly to update
  • Optimized for data analysis
  • Also used by data analysts and business analysts
  • Ad-hoc, read-only queries
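
As a rough illustration of the contrast above, the sketch below lands raw events unchanged in a directory standing in for a data lake, then loads only the fields needed for one report into a SQLite table standing in for a data warehouse. The event fields, file layout, and function names are all hypothetical; real lakes and warehouses typically run on object storage and dedicated analytical databases rather than a local folder and SQLite.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events; note that the records do not share one schema.
RAW_EVENTS = [
    {"user_id": 1, "event": "purchase", "amount": 20.0, "device": {"os": "ios"}},
    {"user_id": 2, "event": "page_view", "url": "/home"},
    {"user_id": 1, "event": "purchase", "amount": 5.5},
]


def land_in_lake(events, lake_dir):
    """Write every event as-is, one JSON object per line (the 'lake')."""
    path = Path(lake_dir) / "events.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return path


def load_warehouse(lake_file, conn):
    """Keep only the structured purchase data needed for one specific report."""
    conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
    with Path(lake_file).open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["event"] == "purchase":
                conn.execute(
                    "INSERT INTO purchases VALUES (?, ?)",
                    (event["user_id"], event["amount"]),
                )
    conn.commit()


def revenue_per_user(conn):
    """An ad-hoc, read-only analytical query against the warehouse table."""
    rows = conn.execute(
        "SELECT user_id, SUM(amount) FROM purchases "
        "GROUP BY user_id ORDER BY user_id"
    )
    return rows.fetchall()


with tempfile.TemporaryDirectory() as lake_dir:
    lake_file = land_in_lake(RAW_EVENTS, lake_dir)
    conn = sqlite3.connect(":memory:")
    load_warehouse(lake_file, conn)
    result = revenue_per_user(conn)
    conn.close()

print(result)
```

The lake file keeps everything, including the nested `device` field and the page view that no report uses yet, while the warehouse table holds only what the revenue query needs.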

Data catalog for data lakes

  • What is the source of this data?
  • Where is this data used?
  • Who is the owner of the data?
  • How often is this data updated?
  • Good practice in terms of data governance
  • Ensures reproducibility
  • No catalog --> data swamp
  • Good practice for any data storage solution
    • Reliability
    • Autonomy
    • Scalability
    • Speed
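
The four catalog questions above can be sketched as a small record type plus a lookup table. This is only a minimal illustration, not a real catalog tool: `CatalogEntry`, `DataCatalog`, the dataset names, and the owner are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Answers the catalog questions for one dataset (all fields hypothetical)."""
    name: str
    source: str            # What is the source of this data?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?
    used_by: list = field(default_factory=list)  # Where is this data used?


class DataCatalog:
    """A toy in-memory catalog keyed by dataset name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name):
        # Raises KeyError when the dataset has never been documented.
        return self._entries[name]

    def undocumented(self, dataset_names):
        """Return the datasets that exist in storage but not in the catalog."""
        return sorted(n for n in dataset_names if n not in self._entries)


catalog = DataCatalog()
catalog.register(
    CatalogEntry(
        name="raw/events",
        source="web app logs",
        owner="data-eng team",
        update_frequency="hourly",
        used_by=["revenue dashboard"],
    )
)

# Datasets actually present in the lake; one was never catalogued.
datasets_in_lake = ["raw/events", "raw/old_export"]
missing = catalog.undocumented(datasets_in_lake)
print(missing)
```

Datasets reported by `undocumented` are the ones at risk of turning the lake into a data swamp: they exist, but nobody can say where they came from or who owns them.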

Database vs. data warehouse

  • Database:
    • General term
    • Loosely defined as organized data stored and accessed on a computer
  • Data warehouse is a type of database

Summary

  • Data lakes
  • Data warehouses
  • Databases
  • Data catalog

Let's practice!
