Processing data

Understanding Data Engineering

Hadrien Lacroix

Content Developer at DataCamp

data pipeline

moving data to the data lake

checking for corrupted data

A general definition

  • Data processing: converting raw data into meaningful information

Data processing value

Conceptually

  • Remove unwanted data
  • Optimize memory, processing, and network costs
  • Convert data from one type to another

At Spotflix

  • No long-term need for feature-testing data
  • Can't afford to store and stream files this big
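
To make the first two points concrete, here is a minimal sketch of removing unwanted data before storage. It assumes a raw event that carries a hypothetical `test_feature_payload` field with no long-term use. Every field name and value is an illustrative assumption, not Spotflix's actual data.

```python
import json

# Hypothetical raw event; every field name here is an illustrative assumption.
raw_event = {
    "user_id": 42,
    "song_id": "s-001",
    "seconds_played": 187,
    "test_feature_payload": "x" * 500,  # only needed while a feature is being tested
}

# Fields assumed to have no long-term value.
UNWANTED_FIELDS = {"test_feature_payload"}


def remove_unwanted(event: dict) -> dict:
    """Return a copy of the event without the unwanted fields."""
    return {key: value for key, value in event.items() if key not in UNWANTED_FIELDS}


def serialized_size(event: dict) -> int:
    """Size in bytes of the event once encoded as UTF-8 JSON."""
    return len(json.dumps(event).encode("utf-8"))


clean_event = remove_unwanted(raw_event)
size_before = serialized_size(raw_event)
size_after = serialized_size(clean_event)
```

Comparing `size_before` with `size_after` shows how much smaller each stored or streamed record becomes once the unwanted field is gone.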

Data processing value

Conceptually

  • Remove unwanted data
  • To save memory
  • Convert data from one type to another
  • Organize data
  • To fit into a schema/structure
  • Increase productivity

At Spotflix

  • No need for lossless format
  • Can't afford to store files this big
  • Convert songs from .flac to .ogg
  • Reorganize data from the data lake to data warehouses
  • Employee table example
  • Enable data scientists
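
The "organize data to fit a schema" idea can be sketched with an employee table. This is only an illustration: the `Employee` schema, the raw field names, and the sample records are all made up for the example. It shows raw, string-only records being converted into typed rows.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Employee:
    """Hypothetical target schema for a warehouse employee table."""
    employee_id: int
    name: str
    hire_date: date


# Raw records as they might sit in a data lake (made-up sample data):
# every value is a string and some carry stray whitespace.
raw_records = [
    {"id": "1", "Name": " Ada Lovelace ", "hired": "2019-03-01"},
    {"id": "2", "Name": "Alan Turing", "hired": "2020-11-15"},
]


def to_employee(record: dict) -> Employee:
    """Convert one raw record into a typed row that fits the schema."""
    return Employee(
        employee_id=int(record["id"]),                   # string -> integer
        name=record["Name"].strip(),                     # tidy whitespace
        hire_date=date.fromisoformat(record["hired"]),   # string -> date
    )


employees = [to_employee(record) for record in raw_records]
```

Once every row has the same typed fields, downstream users such as data scientists can query the table without repeating this cleanup themselves.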

How data engineers process data

  • Data manipulation, cleaning, and tidying tasks
    • that can be automated
    • that will always need to be done
    • Examples: rejecting corrupt song files, deciding what happens with missing metadata
  • Store data in a sanely structured database
    • Example: separate artists and albums tables...
  • Create views on top of the database tables
    • Example: ...but provide a view combining them
  • Optimize the performance of the database
    • Example: indexing
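
The database-related points above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not how Spotflix actually stores its data: the table, view, and index names, the sample records, and the "Unknown Artist" placeholder for missing metadata are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Separate artists and albums tables...
    CREATE TABLE artists (
        artist_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE albums (
        album_id  INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
    );
    -- ...but provide a view combining them.
    CREATE VIEW album_details AS
        SELECT albums.album_id, albums.title, artists.name AS artist_name
        FROM albums
        JOIN artists ON artists.artist_id = albums.artist_id;
    -- Index to speed up lookups of albums by artist.
    CREATE INDEX idx_albums_artist_id ON albums(artist_id);
    """
)

# Made-up raw records; the second one has missing artist metadata.
raw_albums = [
    {"title": "First Light", "artist": "The Examples"},
    {"title": "Second Wind", "artist": None},
    {"title": "Third Rail", "artist": "The Examples"},
]

# One possible policy for missing metadata (an assumption): use a placeholder.
UNKNOWN_ARTIST = "Unknown Artist"


def load_album(connection: sqlite3.Connection, record: dict) -> None:
    """Insert one album, creating its artist row if it does not exist yet."""
    artist_name = record["artist"] or UNKNOWN_ARTIST
    row = connection.execute(
        "SELECT artist_id FROM artists WHERE name = ?", (artist_name,)
    ).fetchone()
    if row is None:
        artist_id = connection.execute(
            "INSERT INTO artists (name) VALUES (?)", (artist_name,)
        ).lastrowid
    else:
        artist_id = row[0]
    connection.execute(
        "INSERT INTO albums (title, artist_id) VALUES (?, ?)",
        (record["title"], artist_id),
    )


for record in raw_albums:
    load_album(conn, record)

# Consumers query the view and never need to write the join themselves.
rows = conn.execute(
    "SELECT title, artist_name FROM album_details ORDER BY album_id"
).fetchall()
```

Filling in a placeholder is only one way to handle missing metadata. Rejecting the record or routing it to a review queue would be equally valid choices, depending on what downstream users need.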

Note: The difference between batch and stream processing will be explained in the next lesson!

Apache Spark logo

Summary

  • What data processing is
  • Why it's necessary
  • What it consists of
  • How we process data at Spotflix

Let's practice!
