Data warehouses and data lakes
Understanding Data Engineering
Hadrien Lacroix
Content Developer
Warehouses with a stunning view of the lake
Data lakes and data warehouses
Data lake
Stores all the raw data
Can reach petabytes (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze
Requires an up-to-date data catalog
Used by data scientists
Big data, real-time analytics
Data warehouse
Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Also used by data analysts and business analysts
Ad-hoc, read-only queries
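The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite database stands in for a real data warehouse, and the event records, the `purchases` table, and the helper names are all hypothetical. Raw events of mixed shape are kept as-is (the lake), while only the structured subset needed for one use case is loaded into a table and queried read-only (the warehouse).

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a data lake:
# mixed shapes, stored exactly as they arrive.
raw_events = [
    '{"customer": "ana", "action": "purchase", "amount": 30.0}',
    '{"customer": "ben", "action": "purchase", "amount": 12.5}',
    '{"customer": "ana", "action": "page_view", "url": "/pricing"}',
    '{"customer": "ana", "action": "purchase", "amount": 7.5}',
]


def load_purchases(conn, lines):
    """Load only the structured subset needed for one use case: purchases."""
    conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
    rows = []
    for line in lines:
        event = json.loads(line)
        if event.get("action") == "purchase":
            rows.append((event["customer"], event["amount"]))
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()


def total_per_customer(conn):
    """Ad-hoc, read-only query against the warehouse table."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM purchases "
        "GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


# SQLite in memory is an assumption standing in for a real warehouse.
conn = sqlite3.connect(":memory:")
load_purchases(conn, raw_events)
totals = total_per_customer(conn)
print(totals)
```

The page-view event never reaches the table: it stays in the raw store, available to anyone who later needs it, while the warehouse holds only what the purchase analysis requires.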
Data catalog for data lakes
What is the source of this data?
Where is this data used?
Who is the owner of the data?
How often is this data updated?
Good practice in terms of data governance
Ensures reproducibility
No catalog --> data swamp
Good practice for any data storage solution
Reliability
Autonomy
Scalability
Speed
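A catalog entry can be as simple as a record that answers the four questions above. The following is a minimal sketch, not a real catalog tool: the `CatalogEntry` class, the `register` and `uncataloged` helpers, and the dataset names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    source: str            # What is the source of this data?
    used_by: list          # Where is this data used?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?


# Hypothetical in-memory catalog keyed by dataset name.
catalog = {}


def register(entry):
    """Add or replace the catalog entry for a dataset."""
    catalog[entry.name] = entry


def uncataloged(datasets):
    """Return the datasets that have no catalog entry, sorted by name."""
    return sorted(d for d in datasets if d not in catalog)


register(CatalogEntry(
    name="raw_clickstream",
    source="website tracking events",
    used_by=["recommendation model"],
    owner="web analytics team",
    update_frequency="hourly",
))
register(CatalogEntry(
    name="crm_export",
    source="CRM system",
    used_by=["sales dashboard"],
    owner="sales operations",
    update_frequency="daily",
))

# Datasets assumed to be present in the lake.
datasets_in_lake = ["raw_clickstream", "crm_export", "sensor_dump"]
missing = uncataloged(datasets_in_lake)
print(missing)
```

Any dataset reported by `uncataloged` is data nobody can trace back to a source or an owner, which is how a lake starts turning into a swamp.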
Database vs. data warehouse
Database:
General term
Loosely defined as organized data stored and accessed on a computer
Data warehouse is a type of database
Summary
Data lakes
Data warehouses
Databases
Data catalog
Let's practice!