Data structures

Understanding Data Engineering

Hadrien Lacroix

Content Developer at DataCamp

Structured data

  • Easy to search and organize
  • Consistent model, rows and columns
  • Defined types
  • Can be grouped to form relations
  • Stored in relational databases
  • About 20% of the data is structured
  • Created and queried using SQL

Employee table

index  last_name   first_name  role                team              full_time  office
0      Thien       Vivian      Data Engineer       Data Science      1          Belgium
1      Huong       Julian      Data Scientist      Data Science      1          Belgium
2      Duplantier  Norbert     Software Developer  Infrastructure    1          United Kingdom
3      McColgan    Jeff        Business Developer  Sales             1          United States
4      Sanchez     Rick        Support Agent       Customer Service  0          United States
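
As a minimal sketch of "created and queried using SQL", the Employee table above could be built and searched as follows. It uses Python's built-in sqlite3 module with an in-memory database. The table name `employee`, the column types, and the column name `id` (used instead of `index`, which is a reserved word in SQL) are assumptions; the slide only shows the values.

```python
import sqlite3

# In-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")

# Every column has a defined type (types are assumed from the values shown).
conn.execute(
    """
    CREATE TABLE employee (
        id         INTEGER PRIMARY KEY,
        last_name  TEXT,
        first_name TEXT,
        role       TEXT,
        team       TEXT,
        full_time  INTEGER,  -- 1 or 0, as in the table
        office     TEXT
    )
    """
)

# Rows copied from the Employee table.
employees = [
    (0, "Thien", "Vivian", "Data Engineer", "Data Science", 1, "Belgium"),
    (1, "Huong", "Julian", "Data Scientist", "Data Science", 1, "Belgium"),
    (2, "Duplantier", "Norbert", "Software Developer", "Infrastructure", 1, "United Kingdom"),
    (3, "McColgan", "Jeff", "Business Developer", "Sales", 1, "United States"),
    (4, "Sanchez", "Rick", "Support Agent", "Customer Service", 0, "United States"),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?, ?)", employees)

# Because the model is consistent, searching is a single query.
data_science_team = conn.execute(
    "SELECT last_name, first_name FROM employee WHERE team = ? ORDER BY id",
    ("Data Science",),
).fetchall()

print(data_science_team)
```

The query returns the two members of the Data Science team, which illustrates why structured data is easy to search and organize.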

Relational database

office   address         number  city      zipcode
Belgium  Martelarenlaan  38      Leuven    3010
UK       Old Street      207     London    EC1V 9NR
USA      5th Ave         350     New York  10118

Relational database

index  last_name   first_name  office   address         number  city      zipcode
0      Thien       Vivian      Belgium  Martelarenlaan  38      Leuven    3010
1      Huong       Julian      Belgium  Martelarenlaan  38      Leuven    3010
2      Duplantier  Norbert     UK       Old Street      207     London    EC1V 9NR
3      McColgan    Jeff        USA      5th Ave         350     New York  10118
4      Sanchez     Rick        USA      5th Ave         350     New York  10118
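
The combined table above is what a join between the employee and office tables produces when both share the `office` column. A minimal sketch with Python's built-in sqlite3 module follows. The table names, column types, and the column name `id` (in place of `index`) are assumptions, and the office values use the short forms (UK, USA) shown in the combined table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two tables related through the shared "office" column (types are assumed).
conn.executescript(
    """
    CREATE TABLE office (
        office  TEXT PRIMARY KEY,
        address TEXT,
        number  INTEGER,
        city    TEXT,
        zipcode TEXT  -- text, because UK postcodes contain letters
    );
    CREATE TABLE employee (
        id         INTEGER PRIMARY KEY,
        last_name  TEXT,
        first_name TEXT,
        office     TEXT REFERENCES office(office)
    );
    """
)

conn.executemany(
    "INSERT INTO office VALUES (?, ?, ?, ?, ?)",
    [
        ("Belgium", "Martelarenlaan", 38, "Leuven", "3010"),
        ("UK", "Old Street", 207, "London", "EC1V 9NR"),
        ("USA", "5th Ave", 350, "New York", "10118"),
    ],
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (0, "Thien", "Vivian", "Belgium"),
        (1, "Huong", "Julian", "Belgium"),
        (2, "Duplantier", "Norbert", "UK"),
        (3, "McColgan", "Jeff", "USA"),
        (4, "Sanchez", "Rick", "USA"),
    ],
)

# Join each employee to the address of their office.
joined = conn.execute(
    """
    SELECT e.id, e.last_name, e.first_name,
           e.office, o.address, o.number, o.city, o.zipcode
    FROM employee AS e
    JOIN office AS o ON e.office = o.office
    ORDER BY e.id
    """
).fetchall()

for row in joined:
    print(row)
```

Each office address is stored once and reused for every employee in that office, which is the practical benefit of grouping tables into relations.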

Semi-structured data

  • Relatively easy to search and organize
  • Consistent model, less-rigid implementation: different observations have different sizes
  • Different types
  • Can be grouped, but needs more work
  • Stored in NoSQL databases; common formats are JSON, XML and YAML

Favorite artists JSON file

{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  },
  ...
}
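
To illustrate "consistent model, different sizes", the file above can be loaded with Python's built-in json module. This is a minimal sketch: the variable names are hypothetical, and the embedded text contains only the three users shown on the slide, with the elided entries left out so that it parses as valid JSON.

```python
import json

# The three users shown on the slide (elided entries are omitted).
favorite_artists_json = """
{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  }
}
"""

users = json.loads(favorite_artists_json)

# Every user follows the same model (same keys), but the number of
# favorite artists differs from one observation to the next.
artist_counts = {
    user_id: len(user["favorite_artists"]) for user_id, user in users.items()
}

print(artist_counts)
```

Every user has the same three keys, yet the list lengths differ (4, 2 and 3), which a fixed set of table columns could not hold without extra work.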

Unstructured data

  • Does not follow a model, can't be contained in rows and columns
  • Difficult to search and organize
  • Usually text, sound, pictures or videos
  • Usually stored in data lakes, can appear in data warehouses or databases
  • Most of the data is unstructured
  • Can be extremely valuable

Examples of unstructured data

  • Lyrics (text)
  • Song spectrum (sound)
  • Album cover (picture)
  • Music video (video)

Adding some structure

  • Use AI to search and organize unstructured data
  • Add information to make it semi-structured
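
One way to read "add information to make it semi-structured" is to wrap the raw content in a record with descriptive fields. The sketch below does this for a piece of text using only Python's standard library. The lyrics are a placeholder rather than a real song, and the field names (`type`, `line_count`, `word_count`, `text`) are hypothetical.

```python
import json

# Unstructured input: free text with no model (placeholder, not a real song).
raw_lyrics = "first line of the song\nsecond line of the song"

# Add descriptive information around the raw text.
# The field names below are hypothetical.
record = {
    "type": "lyrics",
    "line_count": len(raw_lyrics.splitlines()),
    "word_count": len(raw_lyrics.split()),
    "text": raw_lyrics,
}

# Serialized as JSON, the record can now be searched by its fields.
semi_structured = json.dumps(record, indent=2)
print(semi_structured)
```

The text itself is unchanged; only the surrounding fields make it easier to search and organize.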

Summary

  • Structured data
  • Semi-structured data
  • Unstructured data
  • Differences between the three
  • Give examples

Let's practice!
