ArqConf
April 25, 2019

Meetup 2 2019 - Juan Pampliega - Implementing Stream Processing Systems

Transcript

  1. Juan Pampliega

    Information Engineering @ ITBA
    Co-Founder @ Mutt Data
    Professor @ ITBA - Especialización en Ciencia de Datos

    • Working on data projects since 2010 at ITBA, Globant (Google), Despegar, Socialmetrix, Jampp, Claro, etc.
    • Co-founder @ Mutt Data, a company specialized in developing projects using Big Data and Data Science.
    • Developed first production-ready stream processing system in 2015.

    @juanpampliega | [email protected]
  2. Event-Command vs Event-Driven

    For organizations to become real time, they must become event driven.

    In the Event-Command pattern (as in REST applications), the endpoint is known, the method being called is known, and the call usually returns a value.

    In the Event-Driven pattern, services communicate only by generating events that any service in the system can reuse, which leads to less coupling.
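
The contrast between the two patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `EventBus`, `charge_card`, and the `order_placed` topic are hypothetical names that do not come from the talk, and a real system would use a message broker rather than an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List


# Event-command style: the caller knows the endpoint (this function),
# knows the method being called, and expects a return value.
def charge_card(order_id: str, amount: float) -> str:
    return f"charged {order_id}: {amount:.2f}"


class EventBus:
    """Hypothetical in-memory bus, standing in for a real message broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer does not know who consumes the event, and gets nothing back.
        for handler in self._subscribers[topic]:
            handler(event)


# Event-driven style: two independent services reuse the same event.
bus = EventBus()
shipped: List[str] = []
emails: List[str] = []
bus.subscribe("order_placed", lambda e: shipped.append(e["order_id"]))
bus.subscribe("order_placed", lambda e: emails.append(f"receipt for {e['order_id']}"))
bus.publish("order_placed", {"order_id": "o-1", "amount": 25.0})
```

Adding a third consumer only requires another `subscribe` call; the producer code does not change, which is the reduced coupling the slide refers to.
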
  3. Event Streams

    Data is generally born as continuous event streams.

    Batch processing => bounded datasets
    Stream processing => unbounded datasets

    Stream processing means computing on data directly as it is produced or received.
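
A minimal sketch of the bounded/unbounded distinction, using only the Python standard library. The function names `batch_total` and `running_totals` are illustrative assumptions, and `itertools.count` stands in for a source that never ends.

```python
import itertools
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch: the dataset is bounded, so one final answer can be computed."""
    return sum(events)


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Stream: the dataset may never end, so a result is emitted per event."""
    total = 0
    for value in events:
        total += value
        yield total


bounded = [3, 1, 4]
unbounded = itertools.count(start=1)  # 1, 2, 3, ... with no last element

# Only a prefix of an unbounded stream can ever be observed.
first_five = list(itertools.islice(running_totals(unbounded), 5))
```

Calling `sum` on `unbounded` would never return, which is why stream processing has to produce results incrementally.
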
  4. Batch Processing

    Traditional data processing stores streams of data in databases and processes them later using ETLs. It is still needed when you need to explore data and are not yet sure what you want to do with it.
  5. Stream Processing

    In stream processing, the application logic, analytics, and queries exist continuously, and data flows through them continuously.
  6. To produce results in real time, a system must continuously compute and update results with each new event.

    Modern applications and microservices should operate in an event-driven fashion: their logic and computation are triggered by events.

    Unified event-driven applications and real-time analytics.
  7. Stateful Stream Processing

    Stateful stream processing is a subset of stream processing in which the computation maintains contextual state. This state is used to store information derived from previously seen events.
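
As a rough illustration of contextual state, the sketch below keeps a count and a sum per key so that each new event can update a running average. The function name `running_average` and the sensor readings are assumptions made for the example, not material from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_average(events: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Emit the updated average for a key every time that key receives an event."""
    # State derived from previously seen events: key -> (count, total).
    state: Dict[str, Tuple[int, float]] = defaultdict(lambda: (0, 0.0))
    for key, value in events:
        count, total = state[key]
        count, total = count + 1, total + value
        state[key] = (count, total)
        yield key, total / count


readings = [("sensor-a", 10.0), ("sensor-b", 4.0), ("sensor-a", 20.0)]
averages = list(running_average(readings))
```

A stateless operator could only transform each reading on its own; the average requires remembering what came before.
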
  8. Windows

    Windows partition a stream of data into discrete batches. They are needed to compute metrics that require context (average per minute, partial counts, etc.).
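
The "average by minute" case can be sketched with fixed (tumbling) 60-second windows. This is a simplified example: the timestamps are assumed to be integer seconds, the helper names are hypothetical, and the function processes a finite sample, whereas a real stream processor would emit each window as it closes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 60  # assumed window size for "average by minute"


def window_start(timestamp: int, size: int = WINDOW_SECONDS) -> int:
    """Map a timestamp to the start of the fixed window that contains it."""
    return timestamp - (timestamp % size)


def average_per_window(events: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group (timestamp, value) events by window and average each group."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in events:
        buckets[window_start(timestamp)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


events = [(0, 10.0), (59, 20.0), (60, 30.0), (125, 50.0)]
averages = average_per_window(events)
```
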
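  9. Event Time vs Processing Time

    The gap between event time (when an event occurred) and processing time (when the system observes it) depends on:

    • Processing and transport technologies.
    • Characteristics of the data source (distribution, throughput, burstiness, etc.).
    • Hardware (network, RAM, CPUs, etc.).
    • Failures and retries.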
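
A small sketch of how the two notions of time can diverge. The `Event` class, the timestamps, and the delayed event `b` are made up for illustration; the delay stands in for something like a retry after a failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    event_time: int       # when the event happened at the source (seconds)
    processing_time: int  # when the system observed it (seconds)


def skew(event: Event) -> int:
    """Lag between the event occurring and the system seeing it."""
    return event.processing_time - event.event_time


# Arrival order; "b" happened second but arrived last (assumed retry delay).
arrivals = [Event("a", 100, 101), Event("c", 130, 131), Event("b", 110, 190)]

by_processing_time = [e.name for e in sorted(arrivals, key=lambda e: e.processing_time)]
by_event_time = [e.name for e in sorted(arrivals, key=lambda e: e.event_time)]
skews = {e.name: skew(e) for e in arrivals}
```

Results grouped by processing time would place `b` after `c`, while results grouped by event time place it where it actually occurred.
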
  10. Fault Tolerant Local State

    Local state is a fundamental primitive in stream processing. It can be indexed and accessed in a variety of rich ways. Local, in-process data access is much faster, and it is easier to isolate. It is implemented with in-memory hash tables, Bloom filters, bitmaps, RocksDB-like systems, etc.
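
One common way to make local state fault tolerant is to record every update in a durable log and replay it on restart. The slide does not say which mechanism is meant, so the sketch below is an assumption: `LocalCounterState` is a hypothetical class, and a plain Python list stands in for the durable changelog.

```python
from typing import Dict, List, Tuple


class LocalCounterState:
    """In-memory hash table backed by a changelog (hypothetical sketch)."""

    def __init__(self, changelog: List[Tuple[str, int]]) -> None:
        self.changelog = changelog  # stands in for a durable, replicated log
        self.counts: Dict[str, int] = {}
        # Recovery: replay the log to rebuild the in-memory table.
        for key, value in changelog:
            self.counts[key] = value

    def increment(self, key: str) -> int:
        value = self.counts.get(key, 0) + 1
        self.changelog.append((key, value))  # record the update first
        self.counts[key] = value             # then apply it locally
        return value


durable_log: List[Tuple[str, int]] = []
state = LocalCounterState(durable_log)
for page in ["home", "cart", "home"]:
    state.increment(page)

# Simulate a crash: the in-memory table is lost, the log survives.
recovered = LocalCounterState(durable_log)
```

Reads and writes stay in-process and fast; only recovery touches the log.
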
  11. Critical Questions in Stream Processing

    Tyler Akidau (tech lead for internal streaming data processing systems @ Google) defined 4 critical questions any stream processing system should be able to answer:

    • What results are calculated?
    • Where in event time are results calculated?
    • When in processing time are results materialized?
    • How do refinements of results relate?
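
The four questions can be mapped onto a toy pipeline. This is a heavily simplified sketch, not how any particular engine works: `windowed_sums` is a hypothetical function, "what" is a sum, "where" is a fixed 60-second event-time window, "when" is a trigger that fires on every element, and "how" is a flag choosing between accumulating and discarding refinements.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def windowed_sums(
    events: Iterable[Tuple[int, int]],
    window_size: int = 60,
    accumulating: bool = True,
) -> List[Tuple[int, int]]:
    """Take (event_time, value) pairs in arrival order and emit (window_start, result) panes."""
    totals: Dict[int, int] = defaultdict(int)
    panes: List[Tuple[int, int]] = []
    for event_time, value in events:
        # Where: assign the element to a fixed event-time window.
        start = event_time - (event_time % window_size)
        # What: the aggregation is a sum per window.
        totals[start] += value
        # How: an accumulating pane carries the full sum so far,
        # a discarding pane carries only the new delta.
        result = totals[start] if accumulating else value
        # When: this toy trigger materializes a pane on every element.
        panes.append((start, result))
    return panes


# The last event belongs to window 0 but arrives after a window-60 event.
events = [(5, 3), (70, 4), (20, 2)]
accumulated = windowed_sums(events, accumulating=True)
discarded = windowed_sums(events, accumulating=False)
```

With accumulating panes a consumer overwrites the previous result for a window; with discarding panes it has to add the deltas itself.
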
  12. Apache Beam

    [Diagram: Beam Java, Beam Python, and other-language SDKs feed the Beam Model for pipeline construction; the Beam Model Fn Runners execute pipelines on Apache Flink, Apache Spark, and Cloud Dataflow.]

    1. The Beam Model: What / Where / When / How
    2. SDKs for writing Beam pipelines -- starting with Java
    3. Runners for existing distributed processing backends
       ◦ Apache Apex
       ◦ Apache Flink
       ◦ Apache Spark
       ◦ Google Cloud Dataflow
       ◦ Local (in-process) runner for testing
  13. Ben Stopford, The Data Dichotomy: Rethinking the Way We Treat Data and Services
    https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/

    Tyler Akidau, Streaming 101: The world beyond batch
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

    Tyler Akidau, Streaming 102: The world beyond batch
    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

    Ververica, What is Stream Processing?
    https://www.ververica.com/what-is-stream-processing

    Apache Flink Documentation
    https://ci.apache.org/projects/flink/flink-docs-release-1.8/

    Apache Beam
    https://beam.apache.org/
  14. Thank you!

    Implementing Stream Processing Systems
    By Juan Martín Pampliega

    We are hiring Data Engineers! [email protected]