Popular streaming systems

Streaming Concepts

Mike Metzger

Data engineer

Streaming tools

  • Various tools available depending on needs
  • Lets designers choose the best tool for the job
  • Common systems include:
    • Celery
    • Kafka
    • Spark Streaming

Celery

  • Distributed task queue / FIFO
  • Used primarily as a job or task queue
  • Often used for asynchronous tasks
    • Sending password reset emails
    • Fulfilling digital orders
    • Resizing images
  • Allows for real-time processing of a significant quantity of messages
  • Provides functionality for management and scaling (vertical & horizontal)
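
The task-queue pattern described above can be sketched with the Python standard library alone. This is not Celery's API: `resize_image` and `run_tasks` are hypothetical names, and the in-process `queue.Queue` stands in for the message broker a real Celery deployment would use.

```python
import queue
import threading


def resize_image(name):
    # Hypothetical placeholder for slow work such as resizing an image.
    return f"resized:{name}"


def run_tasks(names, num_workers=2):
    """Push tasks onto a FIFO queue and let worker threads drain it."""
    tasks = queue.Queue()          # FIFO: tasks are handed out in arrival order
    results = []
    lock = threading.Lock()        # protects the shared results list

    def worker():
        while True:
            name = tasks.get()
            if name is None:       # sentinel: no more work for this worker
                tasks.task_done()
                break
            output = resize_image(name)
            with lock:
                results.append(output)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for name in names:
        tasks.put(name)
    for _ in threads:
        tasks.put(None)            # one sentinel per worker
    tasks.join()                   # wait until every queued item is processed
    for t in threads:
        t.join()
    return sorted(results)         # sorted because completion order can vary


print(run_tasks(["a.png", "b.png", "c.png"]))
```

Raising `num_workers` is a small-scale analogue of horizontal scaling: more workers drain the same queue without any change to the code that submits tasks.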

Celery project

Apache Kafka

  • Distributed event streaming system
  • Designed to send events between producers and consumers
    • Producers create events on a topic
    • Topics are named streams that group related events
    • Consumers receive new events
  • Different consumers can handle events as needed (logging, transformations, relaying, etc.)
  • Stores events according to a configured retention policy
  • Extremely powerful, but can be tricky to set up
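
The producer / topic / consumer relationship above can be illustrated with a small in-memory model. This is a sketch only and does not use any Kafka client library: `MiniBroker` and its methods are hypothetical, and real Kafka adds partitions, consumer groups, replication, and durable storage that are omitted here.

```python
from collections import defaultdict


class MiniBroker:
    """Hypothetical in-memory stand-in for an event streaming broker."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only event log
        self.offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def produce(self, topic, event):
        # Producers append events to the end of a topic's log.
        self.topics[topic].append(event)

    def consume(self, consumer, topic):
        # Each consumer tracks its own position, so reading does not
        # remove events or affect what other consumers see.
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = MiniBroker()
broker.produce("orders", {"id": 1})
broker.produce("orders", {"id": 2})

print(broker.consume("logger", "orders"))    # the logger sees both events
print(broker.consume("reporter", "orders"))  # so does the reporter, independently

broker.produce("orders", {"id": 3})
print(broker.consume("logger", "orders"))    # only the event it has not yet read
```

Because the log is retained rather than drained, a consumer added later can still read earlier events, which is the property that makes this model useful for backups and migrations.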

Apache Kafka

Kafka applications

  • Best used for passing data between multiple systems
    • Single source of truth
    • Change data capture
    • Data backups
    • Data system migrations

Spark Streaming

  • Part of Apache Spark
  • Designed to process streaming data
  • Builds on Spark's ability to process data with Scala, Python, SQL, and other languages
  • Useful for processing large amounts of data and in machine learning scenarios
  • Able to transition from batch to stream processing fairly easily
  • Not designed to store or log events, but primarily to process or modify the data
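
The move from batch to stream processing can be sketched in plain Python by reusing one transformation for both modes. This does not use Spark or its API: `transform`, `micro_batches`, and `process_stream` are hypothetical names, and the micro-batch loop is only an assumption about how such a pipeline might be modeled in miniature.

```python
from itertools import islice


def transform(records):
    """Shared logic: keep positive readings and double them."""
    return [r * 2 for r in records if r > 0]


def micro_batches(stream, size):
    """Cut an iterable of unknown length into small lists of up to `size` items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_stream(stream, size=2):
    """Apply the same transform to each micro-batch as it arrives."""
    out = []
    for batch in micro_batches(stream, size):
        out.extend(transform(batch))
    return out


readings = [1, -2, 3, 4]
print(transform(readings))             # batch mode: all data at once
print(process_stream(iter(readings)))  # streaming mode: small batches over time
```

Both calls produce the same result, because the processing logic is unchanged and only the way data is fed to it differs. Results are returned to the caller rather than stored, in keeping with the point that the tool processes data rather than logging it.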

Apache Spark

Let's practice!
