Data Processing

Understanding Modern Data Architecture

Miller Trujillo

Senior Software Engineer

What is data processing?

  • Exploration
  • Data quality: Checks and transformations
  • Analysis
  • Aggregations
  • Transformations
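
As an illustration of the steps above, here is a minimal sketch in plain Python that applies a data quality check, a transformation, and an aggregation to a handful of records. The `raw_orders` data, the field names, and the helper functions are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical raw records; None and negative amounts model bad input.
raw_orders = [
    {"country": "us", "amount": 20.0},
    {"country": "US", "amount": 5.5},
    {"country": "de", "amount": None},
    {"country": "DE", "amount": 12.0},
    {"country": "de", "amount": -3.0},
]


def is_valid(order):
    """Data quality check: the amount must be present and non-negative."""
    return order["amount"] is not None and order["amount"] >= 0


def normalize(order):
    """Transformation: use a single spelling for the country code."""
    return {"country": order["country"].upper(), "amount": order["amount"]}


def total_by_country(orders):
    """Aggregation: sum the amounts per country."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["country"]] += order["amount"]
    return dict(totals)


clean_orders = [normalize(o) for o in raw_orders if is_valid(o)]
totals = total_by_country(clean_orders)
print(totals)
```

At scale the same steps would typically run on a framework rather than on in-memory lists, but the order of operations stays the same: check, transform, then aggregate.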

Batch processing

  • Batch and streaming
  • Fixed amount of data

Batch processing with multiple open source frameworks and cloud services
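
The difference between the two modes can be sketched in plain Python. In this hypothetical example, the batch function sees a fixed amount of data and computes one result, while the streaming function receives readings one at a time and emits an updated result after each one. Function and variable names are assumptions made for illustration.

```python
def batch_average(readings):
    """Batch: the whole bounded dataset is available, so compute once."""
    return sum(readings) / len(readings)


def streaming_averages(readings):
    """Streaming: readings arrive one by one; yield an updated average each time."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]
batch_result = batch_average(readings)
# iter() stands in for an unbounded source that is consumed incrementally.
stream_results = list(streaming_averages(iter(readings)))
print(batch_result, stream_results)
```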


Streaming processing

  • Fixed time window
  • Sliding time window

1 https://beam.apache.org/documentation/programming-guide/#windowing
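
To make the two window types concrete, the sketch below assigns an event timestamp to its windows using plain Python. It is not the Apache Beam API; the function names are hypothetical, timestamps are assumed to be integer seconds, and windows are half-open intervals `[start, end)` aligned to time zero.

```python
def fixed_window(timestamp, size):
    """Return the single fixed window [start, end) that contains timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every sliding window [start, end) that contains timestamp.

    Windows are `size` seconds long and a new one starts every `period` seconds,
    so an event belongs to several windows when size > period.
    """
    windows = []
    # Latest window start that is not after the timestamp.
    start = timestamp - (timestamp % period)
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


print(fixed_window(75, 60))
print(sliding_windows(75, 60, 30))
```

With a fixed window each event lands in exactly one window, whereas with a sliding window of 60 seconds that starts every 30 seconds each event lands in two overlapping windows.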

Streaming processing concepts

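  • Event time: when the data was generated
  • Processing time: when the data arrived
  • Watermarks: the system's estimate of how far event time has progressed
  • Late data: events that arrive after the watermark has passed their window
  • Triggers: when to emit new or updated results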
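
The concepts above can be sketched with a deliberately simple heuristic: the watermark is the largest generation time seen so far minus an allowed delay, and an event is late when its generation time falls behind that watermark. Real streaming engines estimate watermarks in more sophisticated ways; the `Event` class, the `classify` function, and the `allowed_delay` parameter are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_time: int       # when the data was generated (seconds)
    processing_time: int  # when the data arrived (seconds)


def classify(events, allowed_delay):
    """Label each event "on_time" or "late" in arrival order.

    Heuristic watermark: max event_time seen so far minus allowed_delay.
    """
    labels = []
    max_event_time = None
    for event in sorted(events, key=lambda e: e.processing_time):
        if max_event_time is None:
            is_late = False  # no watermark exists before the first event
        else:
            watermark = max_event_time - allowed_delay
            is_late = event.event_time < watermark
        labels.append("late" if is_late else "on_time")
        if max_event_time is None or event.event_time > max_event_time:
            max_event_time = event.event_time
    return labels


events = [
    Event(event_time=100, processing_time=101),
    Event(event_time=105, processing_time=106),
    Event(event_time=90, processing_time=107),   # generated long before it arrived
    Event(event_time=103, processing_time=108),  # out of order, but within the delay
]
labels = classify(events, allowed_delay=10)
print(labels)
```

What happens to late events (dropping them, or triggering a new computation of an already emitted result) is a separate policy decision that this sketch leaves out.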

Processing technologies

Use case: Batch/streaming, big data, cluster
  • Solution: Apache Spark, Flink, Beam
  • Cloud solution: AWS EMR, AWS Glue, Google Dataproc, Google Dataflow

Use case: Batch/streaming, big data, serverless (servers are fully managed by the provider)
  • Solution: Apache Spark, Beam
  • Cloud solution: AWS Glue, Google Dataflow

Use case: Individual events, simple processing, 24/7 availability without servers running
  • Solution: General-purpose programming languages: Python, JavaScript, C#, Java, Go
  • Cloud solution: AWS Lambda, Google Cloud Functions
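
For the individual-events case, a function-as-a-service handler processes one event per invocation. The sketch below follows the Python handler shape that AWS Lambda expects, a function taking `event` and `context`. The payload fields (`user_id`, `amount`) and the `statusCode`/`body` response layout are assumptions chosen for illustration; the real shapes depend on what triggers the function.

```python
import json


def lambda_handler(event, context):
    """Handle a single event; nothing runs between invocations."""
    # Hypothetical payload: {"user_id": "...", "amount": <number>}
    amount = event.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid amount"})}

    # Simple per-event transformation: convert the amount to cents.
    result = {"user_id": event.get("user_id"), "amount_cents": round(amount * 100)}
    return {"statusCode": 200, "body": json.dumps(result)}


# Local invocation for testing; the context object is not used in this sketch.
response = lambda_handler({"user_id": "u1", "amount": 12.5}, None)
print(response)
```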

Let's practice!
