Storing data

Database Design

Lis Sulmont

Curriculum Manager

Structuring data

1. Structured data  

  • Follows a schema
  • Defined data types & relationships

_e.g., SQL, tables in a relational database _

2. Unstructured data  

  • Schemaless
  • Makes up most of data in the world

e.g., photos, chat logs, MP3

3. Semi-structured data

  • Does not follow larger schema
  • Self-describing structure

e.g., NoSQL, XML, JSON

# Example of a JSON file
"user": {
     "profile_use_background_image": true, 
     "statuses_count": 31, 
     "profile_background_color": "C0DEED", 
     "followers_count": 3066, 
     ...
Database Design

Structuring data

Diagram showing the spectrum between structured and unstructured data

1 Flower by Sam Oth and Database Diagram by Nick Jenkins via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Languages_xml.png
Database Design

Storing data beyond traditional databases

  • Traditional databases
    • For storing real-time relational structured data ? OLTP
  • Data warehouses
    • For analyzing archived structured data ? OLAP
  • Data lakes
    • For storing data of all structures = flexibility and scalability
    • For analyzing big data
Database Design

Data warehouses

  • Optimized for analytics - OLAP
    • Organized for reading/aggregating data
    • Usually read-only
  • Contains data from multiple sources
  • Massively Parallel Processing (MPP)
  • Typically uses a denormalized schema and dimensional modeling

Data marts

  • Subset of data warehouses
  • Dedicated to a specific topic

Amazon Redshift, Google Bog Query and Azure SQL Data Warehouse

Data mart as a subset of a data warehouse

Database Design

Data lakes

  • Store all types of data at a lower cost:
    • e.g., raw, operational databases, IoT device logs, real-time, relational and non-relational
  • Retains all data and can take up petabytes
  • Schema-on-read as opposed to schema-on-write
  • Need to catalog data otherwise becomes a data swamp
  • Run big data analytics using services such as Apache Spark and Hadoop
    • Useful for deep learning and data discovery because activities require so much data

Amazon, Google and Microsoft all offer Data Lakes solutions

Database Design

ETL

ELT

Database Design

Let's practice!

Database Design

Preparing Video For Download...