Data structures

Understanding Data Engineering

Hadrien Lacroix

Content Developer at DataCamp

Structured data

  • Easy to search and organize
  • Consistent model, rows and columns
  • Defined types
  • Can be grouped to form relations
  • Stored in relational databases
  • About 20% of the data is structured
  • Created and queried using SQL

Employee table

index  last_name   first_name  role                team              full_time  office
0      Thien       Vivian      Data Engineer       Data Science      1          Belgium
1      Huong       Julian      Data Scientist      Data Science      1          Belgium
2      Duplantier  Norbert     Software Developer  Infrastructure    1          United Kingdom
3      McColgan    Jeff        Business Developer  Sales             1          United States
4      Sanchez     Rick        Support Agent       Customer Service  0          United States
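
As a rough illustration of "defined types" and "created and queried using SQL", the employee table above could be built and queried as follows. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table name `employees`, the column name `id` (standing in for `index`, which is a reserved word in SQL), and the example query are assumptions, not part of the slides.

```python
import sqlite3

# In-memory database, used here only for illustration
conn = sqlite3.connect(":memory:")

# Every column has a declared type: this is what makes the data structured
conn.execute("""
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,  -- "index" on the slide
        last_name  TEXT,
        first_name TEXT,
        role       TEXT,
        team       TEXT,
        full_time  INTEGER,              -- 1 = full time, 0 = part time
        office     TEXT
    )
""")

rows = [
    (0, "Thien", "Vivian", "Data Engineer", "Data Science", 1, "Belgium"),
    (1, "Huong", "Julian", "Data Scientist", "Data Science", 1, "Belgium"),
    (2, "Duplantier", "Norbert", "Software Developer", "Infrastructure", 1, "United Kingdom"),
    (3, "McColgan", "Jeff", "Business Developer", "Sales", 1, "United States"),
    (4, "Sanchez", "Rick", "Support Agent", "Customer Service", 0, "United States"),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Example query: number of full-time employees per team
full_time_by_team = conn.execute(
    "SELECT team, COUNT(*) FROM employees "
    "WHERE full_time = 1 GROUP BY team ORDER BY team"
).fetchall()
print(full_time_by_team)
# [('Data Science', 2), ('Infrastructure', 1), ('Sales', 1)]
```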

Relational database

office   address         number  city      zipcode
Belgium  Martelarenlaan  38      Leuven    3010
UK       Old Street      207     London    EC1V 9NR
USA      5th Ave         350     New York  10118

Relational database

index  last_name   first_name  office   address         number  city      zipcode
0      Thien       Vivian      Belgium  Martelarenlaan  38      Leuven    3010
1      Huong       Julian      Belgium  Martelarenlaan  38      Leuven    3010
2      Duplantier  Norbert     UK       Old Street      207     London    EC1V 9NR
3      McColgan    Jeff        USA      5th Ave         350     New York  10118
4      Sanchez     Rick        USA      5th Ave         350     New York  10118
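
The combined table above is what a SQL join on the shared `office` column produces. The sketch below reproduces it with sqlite3; the table names `employees` and `offices` and the column name `id` are assumptions, and the office values in the employee rows use the short spellings (UK, USA) so that they match the office table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, office TEXT
    );
    CREATE TABLE offices (
        office TEXT PRIMARY KEY, address TEXT, number INTEGER,
        city TEXT, zipcode TEXT
    );
""")

conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (0, "Thien", "Vivian", "Belgium"),
    (1, "Huong", "Julian", "Belgium"),
    (2, "Duplantier", "Norbert", "UK"),
    (3, "McColgan", "Jeff", "USA"),
    (4, "Sanchez", "Rick", "USA"),
])
conn.executemany("INSERT INTO offices VALUES (?, ?, ?, ?, ?)", [
    ("Belgium", "Martelarenlaan", 38, "Leuven", "3010"),
    ("UK", "Old Street", 207, "London", "EC1V 9NR"),
    ("USA", "5th Ave", 350, "New York", "10118"),
])

# The relation between the two tables is the shared "office" column
joined = conn.execute("""
    SELECT e.id, e.last_name, e.first_name,
           o.office, o.address, o.number, o.city, o.zipcode
    FROM employees AS e
    JOIN offices AS o ON e.office = o.office
    ORDER BY e.id
""").fetchall()

for row in joined:
    print(row)
```

Keeping addresses in their own table means each office address is stored once, rather than being repeated for every employee.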

Semi-structured data

  • Relatively easy to search and organize
  • Consistent model, less rigid implementation: different observations can have different sizes
  • Values can be of different types
  • Can be grouped, but needs more work
  • Stored in NoSQL databases, typically as JSON, XML or YAML

Favorite artists JSON file

{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  },
  ...
}
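
To see what "different observations have different sizes" means in practice, the three users shown above can be loaded with Python's json module. This is only a sketch: the variable names are assumptions, and the elided entries ("...") are left out because they are not known.

```python
import json

# The three users from the slide, written as valid JSON
raw = """
{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  }
}
"""

users = json.loads(raw)

# Same model for every user, but the list of artists varies in length
artist_counts = {
    user_id: len(user["favorite_artists"]) for user_id, user in users.items()
}
print(artist_counts)
# {'user_1645156': 4, 'user_5913764': 2, 'user_8436791': 3}
```

In a table, a varying number of artists per user would need either a separate table or many mostly empty columns, which is why this kind of data is usually kept in a semi-structured format.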

Unstructured data

  • Does not follow a model, can't be contained in rows and columns
  • Difficult to search and organize
  • Usually text, sound, pictures or videos
  • Usually stored in data lakes, can appear in data warehouses or databases
  • Most of the data is unstructured
  • Can be extremely valuable

lyrics


song spectrum


album cover


music video


Adding some structure

  • Use AI to search and organize unstructured data
  • Add information to make it semi-structured
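
One way to picture the second point: wrapping an unstructured item in a small record of metadata makes it searchable by that metadata, even though the content itself stays unstructured. The sketch below is purely illustrative; the `tag` helper, the field names `kind` and `genre`, and the placeholder contents are all hypothetical.

```python
def tag(content, **metadata):
    """Wrap unstructured content in a record that carries searchable metadata."""
    return {"metadata": metadata, "content": content}


# Placeholder contents stand in for real lyrics or image data
catalog = [
    tag("<lyrics text>", kind="lyrics", genre="metal"),
    tag("<album cover image>", kind="image", genre="metal"),
    tag("<lyrics text>", kind="lyrics", genre="pop"),
]

# The content is still free-form, but the metadata can now be filtered on
metal_lyrics = [
    item for item in catalog
    if item["metadata"].get("kind") == "lyrics"
    and item["metadata"].get("genre") == "metal"
]
print(len(metal_lyrics))
# 1
```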

Summary

  • Structured data
  • Semi-structured data
  • Unstructured data
  • Differences between the three
  • Give examples

Let's practice!
