
Introduction to Apache Flink (Current Meetup 2022)

Robert Metzger
February 29, 2024

Transcript

1. What is Apache Flink?
   • Data processing engine for low-latency, high-throughput stream processing
   • Open source at the Apache Software Foundation, one of the biggest projects there
   • Wide industry adoption, various hosted services available: https://flink.apache.org/poweredby.html
2. Common use-cases for Apache Flink
   Categories:
   • Real-time reporting / dashboards
   • Low-latency alerting, notifications, promotions, etc.
   • Materialized view maintenance, caches
   • Real-time cross-database sync, lookup joins, windowed joins, aggregations
   • Machine learning: model serving, feature engineering
   • Change data capture, data integration
   Examples:
   • Stripe: CDC use-cases such as Stripe Dashboard, Stripe Search, Financial Reporting
   • Uber: Ads on Uber Eats
   • Netflix: Real-time product metrics (viewing session duration, clickpaths, …)
   https://flink.apache.org/poweredby.html
3. Properties of Apache Flink
   [Overview diagram listing Flink's properties:] low cost, low latency, high throughput, efficient real-time processing; easy to build applications (declarative APIs, SQL, Python, joins, aggregations, debuggability, observability); integrations and connectors (Kafka, Kubernetes, AWS Kinesis, AWS S3, Pulsar, change data capture; Avro, JSON, Parquet formats); easy to operate; community & documentation
4. Focus of the upcoming slides: Properties of Apache Flink
   [Same property diagram as slide 3]
5. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • Events flow through a dataflow: Data Source → Data Aggregation (partition by key) → Data Sink
   • Flink takes care of efficient, parallel execution in a cluster
6. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   [Diagram: the same dataflow expanded into parallel subtasks — Src, Src → Agg, Agg (partitioned by key) → Sink; see the sketch below]
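   A minimal sketch of this dataflow in Flink's DataStream API, with a toy in-memory source and a stdout sink standing in for real connectors (all names and values here are illustrative, not from the talk):

      import org.apache.flink.api.common.typeinfo.Types;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class DataflowSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.setParallelism(2); // each operator runs as parallel subtasks: Src, Src -> Agg, Agg -> Sink

              env.fromElements("login", "logout", "login", "update")  // Data Source (toy stand-in)
                  .map(op -> Tuple2.of(op, 1L))
                  .returns(Types.TUPLE(Types.STRING, Types.LONG))     // lambdas need explicit type info
                  .keyBy(t -> t.f0)  // "partition by key": same key always reaches the same Agg subtask
                  .sum(1)            // Data Aggregation: running count per key
                  .print();          // Data Sink (stdout here; Kafka, S3, ... in practice)

              env.execute("dataflow-sketch");
          }
      }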
7. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • State, per operator in the dataflow:
     ◦ Data Source: current Kafka reader offsets
     ◦ Data Aggregation: current aggregates (e.g. count by key)
     ◦ Data Sink: pending Kafka transaction data
   • Flink guarantees that state is always available, by backing it up via checkpointing to cheap, durable storage (S3)
   • Flink guarantees exactly-once semantics for state (a configuration sketch follows below)
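   A minimal configuration sketch for the checkpointing described above; the 60-second interval and the S3 bucket path are assumptions:

      import org.apache.flink.streaming.api.CheckpointingMode;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class CheckpointingSketch {
          public static void main(String[] args) {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

              // Snapshot all operator state (reader offsets, aggregates, pending
              // transactions) every 60s, with exactly-once guarantees for state.
              env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

              // Back the snapshots up to cheap, durable storage (bucket name assumed).
              env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");
          }
      }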
8. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • Time: event time and watermarks
   Stream of events, arriving between 9am and 11am:
     {id: 15, op: logout, time: "9:33"}
     {id: 299, op: add, time: "10:01"}
     {id: 2, op: logout, time: "10:29"}
     {id: 2, op: update, time: "9:48"}
     {id: 74, op: login, time: "10:36"}
     {id: 81, op: login, time: "11:15"}
   → Events are arriving out of order at the window operator (e.g. the "9:48" update arrives after the "10:29" logout)
9. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • Time: event time and watermarks (continued)
   • Window operator with event state: when has an hourly window seen all events?
   → Out-of-order events are buffered in the window operator until the window can close
10. Flink Application Development
    • Flink offers 3 underlying primitives: events, state, time
    • Time: event time and watermarks (continued)
    • When has an hourly window seen all events? → Watermarks
    • A watermark (e.g. {watermark, time: "10:11"}) flows through the stream of events and triggers window processing once it passes the end of a window
    • Watermark = virtual clock for event time (see the sketch below)
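    A sketch of how event time and watermarks are declared in the DataStream API; the Event POJO and the 5-minute out-of-orderness bound are assumptions:

      import java.time.Duration;
      import org.apache.flink.api.common.eventtime.WatermarkStrategy;

      public class WatermarkSketch {
          /** Hypothetical event POJO with an epoch-millis event-time field. */
          public static class Event {
              public long timestampMillis;
              public String op;
          }

          // Tolerate up to 5 minutes of out-of-order arrival: the emitted watermark
          // trails the highest timestamp seen so far by 5 minutes, and an hourly
          // window fires once the watermark passes the end of its hour.
          static final WatermarkStrategy<Event> STRATEGY =
              WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                  .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis);
      }

    A stream of such events would pick this up via stream.assignTimestampsAndWatermarks(STRATEGY) before the window operator.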
11. Application Development: APIs
    This slide is copied from "Change Data Capture with Flink SQL and Debezium", a presentation at DataEngBytes by Marta Paes: https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium
12. Flink SQL example
    Defining an input:

      CREATE TABLE kafka_example (
        `user_id` BIGINT,
        `item_id` BIGINT,
        `behavior` STRING,
        `timestamp` TIMESTAMP(3),
        -- declare `timestamp` as the event-time attribute used for windowing
        WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND
      ) WITH (
        'connector' = 'kafka',
        'topic' = 'user_behavior',
        'properties.bootstrap.servers' = 'host:9092',
        'properties.group.id' = 'testGroup',
        'format' = 'csv'
      );

    Running a tumbling-window streaming pipeline:

      SELECT window_start, window_end, COUNT(*)
      FROM TABLE(
        TUMBLE(TABLE kafka_example, DESCRIPTOR(`timestamp`), INTERVAL '1' SECOND))
      GROUP BY window_start, window_end;
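    The two statements above can be submitted from a Java program via the Table API. A minimal sketch, assuming a local streaming TableEnvironment and the Kafka connector plus CSV format on the classpath:

      import org.apache.flink.table.api.EnvironmentSettings;
      import org.apache.flink.table.api.TableEnvironment;

      public class SqlExampleJob {
          public static void main(String[] args) {
              // Local streaming TableEnvironment; no cluster needed to try this out.
              TableEnvironment tEnv = TableEnvironment.create(
                      EnvironmentSettings.newInstance().inStreamingMode().build());

              // Register the Kafka-backed input table (the DDL above).
              tEnv.executeSql(
                      "CREATE TABLE kafka_example ("
                      + "  `user_id` BIGINT,"
                      + "  `item_id` BIGINT,"
                      + "  `behavior` STRING,"
                      + "  `timestamp` TIMESTAMP(3),"
                      + "  WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND"
                      + ") WITH ("
                      + "  'connector' = 'kafka',"
                      + "  'topic' = 'user_behavior',"
                      + "  'properties.bootstrap.servers' = 'host:9092',"
                      + "  'properties.group.id' = 'testGroup',"
                      + "  'format' = 'csv')");

              // Run the tumbling-window aggregation; print() blocks and streams results.
              tEnv.executeSql(
                      "SELECT window_start, window_end, COUNT(*)"
                      + " FROM TABLE(TUMBLE(TABLE kafka_example, DESCRIPTOR(`timestamp`), INTERVAL '1' SECOND))"
                      + " GROUP BY window_start, window_end")
                  .print();
          }
      }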
13. Process Function

      import org.apache.flink.api.common.state.ValueState;
      import org.apache.flink.api.common.state.ValueStateDescriptor;
      import org.apache.flink.api.java.tuple.Tuple;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.configuration.Configuration;
      import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
      import org.apache.flink.util.Collector;

      public class MyFunction extends KeyedProcessFunction<Tuple, String, Tuple2<String, Long>> {

          /** Simple POJO holding the running count and the last-modified timestamp. */
          public static class CountWithTimestamp {
              public long count;
              public long lastModified;
          }

          /** The state that is maintained by this process function */
          private ValueState<CountWithTimestamp> state;

          @Override
          public void open(Configuration parameters) {
              // register the keyed state with the runtime
              state = getRuntimeContext().getState(
                      new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
          }

          @Override
          public void processElement(String value, Context ctx,
                  Collector<Tuple2<String, Long>> out) throws Exception {
              // retrieve the state for the current key; null for the first event
              CountWithTimestamp current = state.value();
              if (current == null) {
                  current = new CountWithTimestamp();
              }
              current.count++;
              // set the state's timestamp to the record's assigned event time timestamp
              current.lastModified = ctx.timestamp();
              // write the state back
              state.update(current);
              // schedule the next timer 60 seconds from the current event time
              ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
          }

          @Override
          public void onTimer(long timestamp, OnTimerContext ctx,
                  Collector<Tuple2<String, Long>> out) throws Exception {
              // do stuff with time
          }
      }
14. Next slide: Properties of Apache Flink
    [Same property diagram as slide 3; the next slide covers efficiency and performance]
  15. Efficiency & Performance • Highly-optimized engine, battle tested: Pinterest runs

    Flink at 300M messages per second (150TB/s) • Examples ◦ State and checkpointing ▪ Scale state beyond memory using build-in RocksDB statebackend ▪ Fast, incremental, asynchronous checkpoints ◦ Network stack (Netty) ▪ Native backpressure support, optimized for both latency and throughput ◦ SQL ▪ Optimized using Apache Calcite, micro-batched aggregations, skew handling, efficient internal data format Low Cost Low Latency High throughput Efficiency In real-time Source: https://www.slideshare.net/FlinkForward/flink-powered-stream-processing-platform-at-pinterest
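    A sketch of enabling the RocksDB state backend with incremental checkpoints, assuming the flink-statebackend-rocksdb dependency is on the classpath:

      import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class RocksDbSketch {
          public static void main(String[] args) {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              // Keep keyed state on local disk via RocksDB instead of the JVM heap, so
              // state can grow beyond memory; `true` enables incremental checkpoints,
              // uploading only the files that changed since the last checkpoint.
              env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
          }
      }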
16. Next slide: Properties of Apache Flink
    [Same property diagram as slide 3; the next slide covers integrations and connectors]
17. Integrations and Connectors
    [Diagram categories: deployment, formats, data connectors, observability]
    • Various Kubernetes deployment options:
      ◦ Operator
      ◦ Native integration
      ◦ DIY
    • DIY deployments: Bash scripts, library-style, MiniCluster, etc.
    • Hadoop YARN is still around
18. Operations
    • Autoscaling Flink via the Kubernetes Operator or Flink Reactive Mode
    • Persist in-flight state via savepoints, then upgrade the Flink version, upgrade the Flink application, or investigate/rewrite the state
    • Observability: latency tracking, RocksDB metrics, operator/task/JVM-level performance metrics, flame graph UI, backpressure monitoring
    • Local debugging/profiling: run the cluster code from your IDE or unit tests (see the sketch below)
    • High availability with ZooKeeper or Kubernetes etcd
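    A sketch of the local-debugging setup mentioned above: the job runs in an embedded mini cluster inside the current JVM, so breakpoints and profilers attach like in any other Java program (the web UI additionally requires the flink-runtime-web dependency):

      import org.apache.flink.configuration.Configuration;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class LocalDebugSketch {
          public static void main(String[] args) throws Exception {
              // Embedded mini cluster with the Flink web UI on http://localhost:8081.
              StreamExecutionEnvironment env =
                      StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

              env.fromElements(1, 2, 3).print(); // set breakpoints anywhere in the job code
              env.execute("local-debug-sketch");
          }
      }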
  19. Flink Q&A Low Cost Low Latency High throughput Efficiency In

    real-time Easy to build Applications Declarative APIs SQL Debuggability Observability Python Integrations and Connectors Kafka Kubernetes AWS Kinesis AWS S3 Pulsar Change data capture Joins Aggregations Avro JSON Parquet Easy to operate Community & Documentation Follow me on Twitter: @rmetzger_ Visit the decodable booth (G5) to discuss Flink and Stream Processing