Data Processing

Understanding Modern Data Architecture

Miller Trujillo

Senior Software Engineer

What is data processing?

  • Exploration
  • Data quality: Checks and transformations
  • Analysis
  • Aggregations
  • Transformations
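
The steps listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a recommended pipeline design: the sample records and the helper names `is_valid`, `transform`, and `aggregate` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical sample records; the second one fails the quality check.
records = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": None},
    {"user": "a", "amount": "4.5"},
]


def is_valid(record):
    """Data quality check: the amount must be present."""
    return record["amount"] is not None


def transform(record):
    """Transformation: cast the amount from string to float."""
    return {"user": record["user"], "amount": float(record["amount"])}


def aggregate(rows):
    """Aggregation: total amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


clean = [transform(r) for r in records if is_valid(r)]
totals = aggregate(clean)
print(totals)  # {'a': 15.0}
```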

Batch processing

  • Batch and streaming
  • Fixed amount of data

Batch processing with multiple open source frameworks and cloud services
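
As a rough sketch of what "fixed amount of data" means in practice, the job below reads a bounded input completely and then produces its result in one run. The CSV content, the column names `day` and `sales`, and the function `run_batch` are assumptions made for illustration; a real batch job would typically run on one of the frameworks or cloud services mentioned above.

```python
import csv
import io

# Hypothetical bounded input: the whole dataset exists before the job starts.
RAW = "day,sales\n2024-01-01,100\n2024-01-01,50\n2024-01-02,70\n"


def run_batch(source):
    """Read the whole bounded input once and return total sales per day."""
    totals = {}
    for row in csv.DictReader(source):
        totals[row["day"]] = totals.get(row["day"], 0) + int(row["sales"])
    return totals


daily_totals = run_batch(io.StringIO(RAW))
print(daily_totals)  # {'2024-01-01': 150, '2024-01-02': 70}
```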

Streaming processing

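  • Fixed time window: consecutive, non-overlapping intervals of equal length
  • Sliding time window: intervals of equal length that can overlap, with a new window starting every period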

1 https://beam.apache.org/documentation/programming-guide/#windowing
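
To make the two window types concrete, the sketch below assigns a single event timestamp to its windows using plain Python. It does not use the Apache Beam API; the function names `fixed_windows` and `sliding_windows` and the parameters `size` and `period` (all in seconds) are hypothetical names chosen for this example.

```python
def fixed_windows(timestamp, size):
    """Return the single [start, end) fixed window that contains timestamp."""
    start = timestamp - (timestamp % size)
    return [(start, start + size)]


def sliding_windows(timestamp, size, period):
    """Return every [start, end) sliding window that contains timestamp.

    A new window starts every `period` seconds and lasts `size` seconds.
    """
    windows = []
    start = timestamp - (timestamp % period)  # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


print(fixed_windows(65, size=60))               # [(60, 120)]
print(sliding_windows(65, size=60, period=30))  # [(30, 90), (60, 120)]
```

With a fixed window each event belongs to exactly one window, while with a sliding window the same event can fall into several overlapping windows.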

Streaming processing concepts

  • When data was generated
  • When data arrived
  • Watermarks
  • Late data
  • Trigger new processing
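
The concepts above can be tied together in a small sketch. It assumes a simple heuristic watermark (the largest event time seen so far minus an allowed delay), which is one possible strategy and not necessarily what a given streaming framework does. The names `Event`, `WatermarkTracker`, and `max_delay` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_time: int    # when the data was generated
    arrival_time: int  # when the data arrived


class WatermarkTracker:
    """Heuristic watermark: max event time seen so far minus an allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def observe(self, event):
        """Classify the event against the current watermark, then advance it."""
        watermark = self.watermark
        is_late = watermark is not None and event.event_time < watermark
        if self.max_event_time is None or event.event_time > self.max_event_time:
            self.max_event_time = event.event_time
        return "late" if is_late else "on_time"


# Events listed in arrival order; the last one was generated long before it arrived.
events = [Event(10, 11), Event(20, 21), Event(12, 22), Event(5, 23)]
tracker = WatermarkTracker(max_delay=10)
statuses = [tracker.observe(e) for e in events]
print(statuses)  # ['on_time', 'on_time', 'on_time', 'late']
```

In a real pipeline, an event classified as late could either be dropped or trigger new processing of the window it belongs to.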

Processing technologies

Use case | Solution | Cloud solution
Batch/streaming, big data, cluster | Apache Spark, Flink, Beam | AWS EMR, AWS Glue, Google Dataproc, Google Dataflow
Batch/streaming, big data, serverless (servers are fully managed by the provider) | Apache Spark, Beam | AWS Glue, Google Dataflow
Individual events, simple processing, 24/7 support without servers running | General programming languages: Python, JavaScript, C#, Java, Go | AWS Lambda, Google Cloud Functions
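
For the last use case, a function that handles one event at a time might look like the sketch below. The `lambda_handler(event, context)` signature follows the usual convention for Python functions on AWS Lambda, but the `temperature_c` payload field and the conversion logic are assumptions made up for this example.

```python
import json


def lambda_handler(event, context):
    """Process one event; 'temperature_c' is a hypothetical payload field."""
    celsius = float(event["temperature_c"])
    result = {"temperature_f": celsius * 9 / 5 + 32}
    return {"statusCode": 200, "body": json.dumps(result)}


# Local call with a hand-written event; no cloud resources are involved here.
response = lambda_handler({"temperature_c": 100}, None)
print(response["body"])  # {"temperature_f": 212.0}
```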

Let's practice!
