Processing data

Understanding Data Engineering

Hadrien Lacroix

Content Developer at DataCamp

data pipeline

moving data to the data lake

checking for corrupted data
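
One common way to check for corrupted data is to compare a checksum of the received bytes against a checksum recorded at the source. The sketch below assumes the source provides a SHA-256 digest alongside each file; the function name `is_corrupted` and the sample bytes are hypothetical and only illustrate the idea.

```python
import hashlib


def is_corrupted(data: bytes, expected_sha256: str) -> bool:
    """Return True when the data does not match the checksum recorded at the source."""
    actual = hashlib.sha256(data).hexdigest()
    return actual != expected_sha256.lower()


# Hypothetical payload standing in for a song file.
original = b"example audio bytes"
checksum = hashlib.sha256(original).hexdigest()

print(is_corrupted(original, checksum))       # False: bytes arrived intact
print(is_corrupted(original[:-1], checksum))  # True: one byte was lost in transit
```

A file that fails the check can be rejected before it moves further down the pipeline.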

A general definition

  • Data processing: converting raw data into meaningful information

Data processing value

Conceptually

  • Remove unwanted data
  • Optimize memory, processing, and network costs
  • Convert data from one type to another

At Spotflix

  • No long-term need for feature-testing data
  • Can't afford to store and stream files this big

data pipeline

Data processing value

Conceptually

  • Remove unwanted data
  • To save memory
  • Convert data from one type to another
  • Organize data
  • To fit into a schema/structure
  • Increase productivity

At Spotflix

  • No need for lossless format
  • Can't afford to store files this big
  • Convert songs from .flac to .ogg
  • Reorganize data from the data lake to data warehouses
  • Employee table example
  • Enable data scientists
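
The conceptual steps above (removing unwanted data, converting types, and fitting records into a schema) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `Song` schema, the field names, and the sample records are hypothetical and not taken from Spotflix's actual pipeline.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical target schema; the field names are assumptions for illustration.
@dataclass
class Song:
    title: str
    artist: str
    duration_s: int


def process(raw: dict) -> Optional[Song]:
    """Turn one raw record into a Song, or return None if it is unusable."""
    # Remove unwanted data: keep only the fields the schema needs.
    wanted = {key: raw.get(key) for key in ("title", "artist", "duration")}
    if not wanted["title"] or not wanted["artist"]:
        return None
    # Convert data from one type to another: duration arrives as text.
    try:
        duration_s = int(wanted["duration"])
    except (TypeError, ValueError):
        return None
    # Organize data: fit it into the schema.
    return Song(
        title=wanted["title"].strip(),
        artist=wanted["artist"].strip(),
        duration_s=duration_s,
    )


raw_records = [
    {"title": " Song A ", "artist": "Artist X", "duration": "215", "test_flag": "b"},
    {"title": "", "artist": "Artist Y", "duration": "180"},
    {"title": "Song C", "artist": "Artist Z", "duration": "n/a"},
]

songs = [s for s in (process(r) for r in raw_records) if s is not None]
print(songs)
```

Only the first record survives: the second has no title and the third has a duration that cannot be converted to an integer.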

How data engineers process data

  • Data manipulation, cleaning, and tidying tasks
    • that can be automated
    • that will always need to be done
  • Store data in a sanely structured database
  • Create views on top of the database tables
  • Optimize the performance of the database
  • Rejecting corrupt song files
  • Deciding what happens with missing metadata
  • Separate artists and albums tables...
  • ...but provide view combining them
  • Indexing
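
The database-related tasks above (separate artists and albums tables, a view combining them, an index, and a policy for missing metadata) can be sketched with Python's built-in `sqlite3` module. The table, column, view, and index names are hypothetical, and falling back to `'unknown'` for a missing genre is just one possible policy, assumed here for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Separate artists and albums tables...
    CREATE TABLE artists (
        artist_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE albums (
        album_id  INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL REFERENCES artists(artist_id),
        title     TEXT NOT NULL,
        genre     TEXT NOT NULL
    );
    -- ...but provide a view combining them.
    CREATE VIEW album_details AS
        SELECT al.album_id, al.title, al.genre, ar.name AS artist_name
        FROM albums AS al
        JOIN artists AS ar ON ar.artist_id = al.artist_id;
    -- Indexing: speed up lookups of albums by artist.
    CREATE INDEX idx_albums_artist ON albums(artist_id);
    """
)


def insert_album(conn, artist_id, title, genre=None):
    """Insert an album; missing genre metadata falls back to 'unknown' (assumed policy)."""
    conn.execute(
        "INSERT INTO albums (artist_id, title, genre) VALUES (?, ?, ?)",
        (artist_id, title, genre or "unknown"),
    )


conn.execute("INSERT INTO artists (artist_id, name) VALUES (1, 'Artist X')")
insert_album(conn, 1, "Album One", "jazz")
insert_album(conn, 1, "Album Two")  # genre metadata is missing

rows = conn.execute(
    "SELECT title, genre, artist_name FROM album_details ORDER BY album_id"
).fetchall()
print(rows)
```

Keeping the tables separate avoids repeating artist details on every album row, while the view gives analysts a single place to query the combined data.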

Note: The difference between batch and stream processing will be explained in the next lesson!

Apache Spark logo

Summary

  • What data processing is
  • Why it's necessary
  • What it consists of
  • How we process data at Spotflix

Let's practice!
