
Introduction to Apache Flink (Current Meetup 2022)

Robert Metzger
February 29, 2024

Transcript

1. What is Apache Flink?
   • Data processing engine for low-latency, high-throughput stream processing
   • Open source at the Apache Software Foundation, one of the biggest projects there
   • Wide industry adoption, various hosted services available: https://flink.apache.org/poweredby.html
2. Common use-cases for Apache Flink
   Categories:
   • Real-time reporting / dashboards
   • Low-latency alerting, notifications, promotions, etc.
   • Materialized view maintenance, caches
   • Real-time cross-database sync, lookup joins, windowed joins, aggregations
   • Machine learning: model serving, feature engineering
   • Change data capture, data integration
   Examples:
   • Stripe: CDC use-cases such as Stripe Dashboard, Stripe Search, Financial Reporting
   • Uber: Ads on Uber Eats
   • Netflix: Real-time product metrics (viewing session duration, clickpaths, …)
   https://flink.apache.org/poweredby.html
3. Properties of Apache Flink
   [Overview diagram listing Flink's properties:] low cost, low latency, high throughput, efficient real-time processing; easy to build applications (declarative APIs, SQL, Python, joins, aggregations, debuggability, observability); integrations and connectors (Kafka, Kubernetes, AWS Kinesis, AWS S3, Pulsar, change data capture; Avro, JSON, Parquet formats); easy to operate; community & documentation
4. Focus of the upcoming slides: Properties of Apache Flink
   [Same property diagram as slide 3]
5. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • Events flow through a dataflow: Data Source → Data Aggregation (partition by key) → Data Sink
   • Flink takes care of efficient, parallel execution in a cluster
6. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   [Diagram: the same dataflow expanded into parallel subtasks — Src, Src → Agg, Agg (partitioned by key) → Sink; see the sketch below]
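   A minimal sketch of this dataflow in Flink's DataStream API, with a toy in-memory source and a stdout sink standing in for real connectors (all names and values here are illustrative, not from the talk):

      import org.apache.flink.api.common.typeinfo.Types;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class DataflowSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              env.setParallelism(2); // each operator runs as parallel subtasks: Src, Src -> Agg, Agg -> Sink

              env.fromElements("login", "logout", "login", "update")  // Data Source (toy stand-in)
                  .map(op -> Tuple2.of(op, 1L))
                  .returns(Types.TUPLE(Types.STRING, Types.LONG))     // lambdas need explicit type info
                  .keyBy(t -> t.f0)  // "partition by key": same key always reaches the same Agg subtask
                  .sum(1)            // Data Aggregation: running count per key
                  .print();          // Data Sink (stdout here; Kafka, S3, ... in practice)

              env.execute("dataflow-sketch");
          }
      }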
7. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • State, per operator in the dataflow:
     ◦ Data Source: current Kafka reader offsets
     ◦ Data Aggregation: current aggregates (e.g. count by key)
     ◦ Data Sink: pending Kafka transaction data
   • Flink guarantees that state is always available, by backing it up via checkpointing to cheap, durable storage (S3)
   • Flink guarantees exactly-once semantics for state (a configuration sketch follows below)
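   A minimal configuration sketch for the checkpointing described above; the 60-second interval and the S3 bucket path are assumptions:

      import org.apache.flink.streaming.api.CheckpointingMode;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class CheckpointingSketch {
          public static void main(String[] args) {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

              // Snapshot all operator state (reader offsets, aggregates, pending
              // transactions) every 60s, with exactly-once guarantees for state.
              env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

              // Back the snapshots up to cheap, durable storage (bucket name assumed).
              env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");
          }
      }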
8. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • Time: event time and watermarks
   Stream of events, arriving between 9am and 11am:
     {id: 15, op: logout, time: "9:33"}
     {id: 299, op: add, time: "10:01"}
     {id: 2, op: logout, time: "10:29"}
     {id: 2, op: update, time: "9:48"}
     {id: 74, op: login, time: "10:36"}
     {id: 81, op: login, time: "11:15"}
   → Events are arriving out of order at the window operator (e.g. the "9:48" update arrives after the "10:29" logout)
9. Flink Application Development
   • Flink offers 3 underlying primitives: events, state, time
   • Time: event time and watermarks (continued)
   • Window operator with event state: when has an hourly window seen all events?
   → Out-of-order events are buffered in the window operator until the window can close
10. Flink Application Development
    • Flink offers 3 underlying primitives: events, state, time
    • Time: event time and watermarks (continued)
    • When has an hourly window seen all events? → Watermarks
    • A watermark (e.g. {watermark, time: "10:11"}) flows through the stream of events and triggers window processing once it passes the end of a window
    • Watermark = virtual clock for event time (see the sketch below)
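    A sketch of how event time and watermarks are declared in the DataStream API; the Event POJO and the 5-minute out-of-orderness bound are assumptions:

      import java.time.Duration;
      import org.apache.flink.api.common.eventtime.WatermarkStrategy;

      public class WatermarkSketch {
          /** Hypothetical event POJO with an epoch-millis event-time field. */
          public static class Event {
              public long timestampMillis;
              public String op;
          }

          // Tolerate up to 5 minutes of out-of-order arrival: the emitted watermark
          // trails the highest timestamp seen so far by 5 minutes, and an hourly
          // window fires once the watermark passes the end of its hour.
          static final WatermarkStrategy<Event> STRATEGY =
              WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                  .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis);
      }

    A stream of such events would pick this up via stream.assignTimestampsAndWatermarks(STRATEGY) before the window operator.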
11. Application Development: APIs
    This slide is copied from "Change Data Capture with Flink SQL and Debezium", a presentation at DataEngBytes by Marta Paes: https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium
12. Flink SQL example
    Defining an input:

      CREATE TABLE kafka_example (
        `user_id` BIGINT,
        `item_id` BIGINT,
        `behavior` STRING,
        `timestamp` TIMESTAMP(3),
        -- declare `timestamp` as the event-time attribute used for windowing
        WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND
      ) WITH (
        'connector' = 'kafka',
        'topic' = 'user_behavior',
        'properties.bootstrap.servers' = 'host:9092',
        'properties.group.id' = 'testGroup',
        'format' = 'csv'
      );

    Running a tumbling-window streaming pipeline:

      SELECT window_start, window_end, COUNT(*)
      FROM TABLE(
        TUMBLE(TABLE kafka_example, DESCRIPTOR(`timestamp`), INTERVAL '1' SECOND))
      GROUP BY window_start, window_end;
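    The two statements above can be submitted from a Java program via the Table API. A minimal sketch, assuming a local streaming TableEnvironment and the Kafka connector plus CSV format on the classpath:

      import org.apache.flink.table.api.EnvironmentSettings;
      import org.apache.flink.table.api.TableEnvironment;

      public class SqlExampleJob {
          public static void main(String[] args) {
              // Local streaming TableEnvironment; no cluster needed to try this out.
              TableEnvironment tEnv = TableEnvironment.create(
                      EnvironmentSettings.newInstance().inStreamingMode().build());

              // Register the Kafka-backed input table (the DDL above).
              tEnv.executeSql(
                      "CREATE TABLE kafka_example ("
                      + "  `user_id` BIGINT,"
                      + "  `item_id` BIGINT,"
                      + "  `behavior` STRING,"
                      + "  `timestamp` TIMESTAMP(3),"
                      + "  WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND"
                      + ") WITH ("
                      + "  'connector' = 'kafka',"
                      + "  'topic' = 'user_behavior',"
                      + "  'properties.bootstrap.servers' = 'host:9092',"
                      + "  'properties.group.id' = 'testGroup',"
                      + "  'format' = 'csv')");

              // Run the tumbling-window aggregation; print() blocks and streams results.
              tEnv.executeSql(
                      "SELECT window_start, window_end, COUNT(*)"
                      + " FROM TABLE(TUMBLE(TABLE kafka_example, DESCRIPTOR(`timestamp`), INTERVAL '1' SECOND))"
                      + " GROUP BY window_start, window_end")
                  .print();
          }
      }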
13. Process Function

      import org.apache.flink.api.common.state.ValueState;
      import org.apache.flink.api.common.state.ValueStateDescriptor;
      import org.apache.flink.api.java.tuple.Tuple;
      import org.apache.flink.api.java.tuple.Tuple2;
      import org.apache.flink.configuration.Configuration;
      import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
      import org.apache.flink.util.Collector;

      public class MyFunction extends KeyedProcessFunction<Tuple, String, Tuple2<String, Long>> {

          /** Simple POJO holding the running count and the last-modified timestamp. */
          public static class CountWithTimestamp {
              public long count;
              public long lastModified;
          }

          /** The state that is maintained by this process function */
          private ValueState<CountWithTimestamp> state;

          @Override
          public void open(Configuration parameters) {
              // register the keyed state with the runtime
              state = getRuntimeContext().getState(
                      new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
          }

          @Override
          public void processElement(String value, Context ctx,
                  Collector<Tuple2<String, Long>> out) throws Exception {
              // retrieve the state for the current key; null for the first event
              CountWithTimestamp current = state.value();
              if (current == null) {
                  current = new CountWithTimestamp();
              }
              current.count++;
              // set the state's timestamp to the record's assigned event time timestamp
              current.lastModified = ctx.timestamp();
              // write the state back
              state.update(current);
              // schedule the next timer 60 seconds from the current event time
              ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
          }

          @Override
          public void onTimer(long timestamp, OnTimerContext ctx,
                  Collector<Tuple2<String, Long>> out) throws Exception {
              // do stuff with time
          }
      }
14. Next slide: Properties of Apache Flink
    [Same property diagram as slide 3; the next slide covers efficiency and performance]
  15. Efficiency & Performance • Highly-optimized engine, battle tested: Pinterest runs

    Flink at 300M messages per second (150TB/s) • Examples ◦ State and checkpointing ▪ Scale state beyond memory using build-in RocksDB statebackend ▪ Fast, incremental, asynchronous checkpoints ◦ Network stack (Netty) ▪ Native backpressure support, optimized for both latency and throughput ◦ SQL ▪ Optimized using Apache Calcite, micro-batched aggregations, skew handling, efficient internal data format Low Cost Low Latency High throughput Efficiency In real-time Source: https://www.slideshare.net/FlinkForward/flink-powered-stream-processing-platform-at-pinterest
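    A sketch of enabling the RocksDB state backend with incremental checkpoints, assuming the flink-statebackend-rocksdb dependency is on the classpath:

      import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class RocksDbSketch {
          public static void main(String[] args) {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              // Keep keyed state on local disk via RocksDB instead of the JVM heap, so
              // state can grow beyond memory; `true` enables incremental checkpoints,
              // uploading only the files that changed since the last checkpoint.
              env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
          }
      }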
16. Next slide: Properties of Apache Flink
    [Same property diagram as slide 3; the next slide covers integrations and connectors]
17. Integrations and Connectors
    [Diagram categories: deployment, formats, data connectors, observability]
    • Various Kubernetes deployment options:
      ◦ Operator
      ◦ Native integration
      ◦ DIY
    • DIY deployments: Bash scripts, library-style, MiniCluster, etc.
    • Hadoop YARN is still around
18. Operations
    • Autoscaling Flink via the Kubernetes Operator or Flink Reactive Mode
    • Persist in-flight state via savepoints, then upgrade the Flink version, upgrade the Flink application, or investigate/rewrite the state
    • Observability: latency tracking, RocksDB metrics, operator/task/JVM-level performance metrics, flame graph UI, backpressure monitoring
    • Local debugging/profiling: run the cluster code from your IDE or unit tests (see the sketch below)
    • High availability with ZooKeeper or Kubernetes etcd
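    A sketch of the local-debugging setup mentioned above: the job runs in an embedded mini cluster inside the current JVM, so breakpoints and profilers attach like in any other Java program (the web UI additionally requires the flink-runtime-web dependency):

      import org.apache.flink.configuration.Configuration;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      public class LocalDebugSketch {
          public static void main(String[] args) throws Exception {
              // Embedded mini cluster with the Flink web UI on http://localhost:8081.
              StreamExecutionEnvironment env =
                      StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

              env.fromElements(1, 2, 3).print(); // set breakpoints anywhere in the job code
              env.execute("local-debug-sketch");
          }
      }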
  19. Flink Q&A Low Cost Low Latency High throughput Efficiency In

    real-time Easy to build Applications Declarative APIs SQL Debuggability Observability Python Integrations and Connectors Kafka Kubernetes AWS Kinesis AWS S3 Pulsar Change data capture Joins Aggregations Avro JSON Parquet Easy to operate Community & Documentation Follow me on Twitter: @rmetzger_ Visit the decodable booth (G5) to discuss Flink and Stream Processing