Introduction to Apache Flink @ Current 2022

Robert Metzger

October 05, 2022

Transcript

  1. Introduction to
    Apache Flink
    Robert Metzger, Staff Engineer @ decodable
    Apache Flink Committer and PMC Chair

  2. What is Apache Flink?
    ● Data processing engine for low-latency, high-throughput stream processing
    ● Open source at the Apache Software Foundation, where it is one of the biggest projects
    ● Wide industry adoption, various hosted services available
    https://flink.apache.org/poweredby.html

  3. Common use cases for Apache Flink
    Categories:
    ● Real-time reporting / dashboards
    ● Low-latency alerting, notifications, promotions etc.
    ● Materialized view maintenance, caches
    ● Real-time cross-database sync, lookup joins, windowed joins, aggregations
    ● Machine learning: model serving, feature engineering
    ● Change data capture, data integration
    Examples:
    ● Stripe: CDC use-cases such as Stripe Dashboard, Stripe Search, Financial Reporting
    ● Uber: Ads on Uber Eats
    ● Netflix: Real-time product metrics (viewing session duration, clickpaths, …)
    https://flink.apache.org/poweredby.html

  4. Properties of Apache Flink
     (a diagram of four property groups)
     ● Efficient, real-time: low cost, low latency, high throughput
     ● Easy to build applications: declarative APIs, SQL, Python, joins, aggregations,
       community & documentation
     ● Integrations and connectors: Kafka, Kubernetes, AWS Kinesis, AWS S3, Pulsar,
       change data capture, Avro, JSON, Parquet
     ● Easy to operate: debuggability, observability

  5. Focus of the upcoming slides:
     Easy to build applications (declarative APIs, SQL, Python, joins, aggregations,
     community & documentation)

  6. Flink Application Development
     ● Flink offers 3 underlying primitives: events, state, time
     ● Events in a dataflow:
       Data Source → (partition by key) → Data Aggregation → Data Sink
     ● Flink takes care of efficient, parallel execution in a cluster

  7. Flink Application Development
     ● Flink offers 3 underlying primitives: events, state, time
     ● The same dataflow, executed with parallel operator instances:
       Src, Src → (partition by key) → Agg, Agg → Sink
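
     A minimal sketch of this dataflow in Flink's DataStream API (not from the talk:
     the hardcoded source, the word-count-style aggregation, and the stdout sink are
     illustrative placeholders):

         import org.apache.flink.api.common.typeinfo.Types;
         import org.apache.flink.api.java.tuple.Tuple2;
         import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

         public class DataflowSketch {
             public static void main(String[] args) throws Exception {
                 StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                 env.fromElements("login", "logout", "login")        // Data Source (hardcoded for illustration)
                    .map(op -> Tuple2.of(op, 1L))
                    .returns(Types.TUPLE(Types.STRING, Types.LONG))  // type hint needed due to Java erasure
                    .keyBy(t -> t.f0)                                // partition by key
                    .sum(1)                                          // Data Aggregation: running count per key
                    .print();                                        // Data Sink (stdout)

                 env.execute("dataflow-sketch");
             }
         }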

  8. Flink Application Development
     ● Flink offers 3 underlying primitives: events, state, time
     ● State: each operator in the dataflow keeps its own state:
       ○ Data Source: current Kafka reader offsets
       ○ Data Aggregation: current aggregates (e.g. count by key)
       ○ Data Sink: pending Kafka transaction data
     ● Flink guarantees that state is always available, by checkpointing it to cheap,
       durable storage (S3)
     ● Flink guarantees exactly-once semantics for state
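
     A sketch of enabling this checkpointing (the interval, mode, and S3 path are
     illustrative placeholders, not values from the talk):

         import org.apache.flink.streaming.api.CheckpointingMode;
         import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

         public class CheckpointingSketch {
             public static void main(String[] args) throws Exception {
                 StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                 // snapshot all operator state (offsets, aggregates, pending transactions) every 60s
                 env.enableCheckpointing(60_000);
                 // exactly-once is the default mode; set explicitly here for clarity
                 env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
                 // back state up to cheap, durable storage
                 env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
             }
         }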

  9. Flink Application Development
     ● Flink offers 3 underlying primitives: events, state, time
     ● Time: event time and watermarks
     ● Stream of events arriving at a window operator (hourly windows: 9am, 10am, 11am):
       {id: 15, op: logout, time: “9:33”}
       {id: 299, op: add, time: “10:01”}
       {id: 2, op: logout, time: “10:29”}
       {id: 2, op: update, time: “9:48”}
       {id: 74, op: login, time: “10:36”}
       {id: 81, op: login, time: “11:15”}
     → Events are arriving out of order

  10. Flink Application Development
      ● Flink offers 3 underlying primitives: events, state, time
      ● Time: event time and watermarks
      → Out-of-order events: buffer them in the window operator's state
      → But when has an hourly window seen all of its events?

  11. Flink Application Development
      ● Flink offers 3 underlying primitives: events, state, time
      ● Time: event time and watermarks
      ● Watermarks answer “when has an hourly window seen all events?”
        (watermark = virtual clock for event time):
        {id: 15, op: logout, time: “9:33”}
        {id: 299, op: add, time: “10:01”}
        {id: 2, op: logout, time: “10:29”}
        {id: 2, op: update, time: “9:48”}
        {watermark, time: “10:11”}  → triggers window processing
        {id: 74, op: login, time: “10:36”}
        {id: 81, op: login, time: “11:15”}
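
      A sketch of declaring event time and watermarks for a stream like the one above,
      then grouping events into hourly windows (the UserEvent POJO, the 10-minute
      out-of-orderness bound, and the keying/aggregation choices are illustrative):

          import java.time.Duration;
          import org.apache.flink.api.common.eventtime.WatermarkStrategy;
          import org.apache.flink.streaming.api.datastream.DataStream;
          import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
          import org.apache.flink.streaming.api.windowing.time.Time;

          public class EventTimeSketch {

              /** Illustrative event shape matching the slide: {id, op, time}. */
              public static class UserEvent {
                  public long id;
                  public String op;
                  public long timeMillis;
              }

              public static DataStream<UserEvent> hourlyWindows(DataStream<UserEvent> events) {
                  return events
                      .assignTimestampsAndWatermarks(
                          WatermarkStrategy
                              // the watermark trails the highest event time seen so far by
                              // 10 minutes, acting as the virtual clock for event time
                              .<UserEvent>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                              .withTimestampAssigner((e, recordTs) -> e.timeMillis))
                      .keyBy(e -> e.id)
                      // buffers out-of-order events; fires once the watermark passes the window end
                      .window(TumblingEventTimeWindows.of(Time.hours(1)))
                      .reduce((a, b) -> b); // placeholder aggregation: keep the latest event per window
              }
          }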

  12. Application Development: APIs
      This slide is copied from “Change Data Capture with Flink SQL and Debezium”,
      a presentation at DataEngBytes by Marta Paes:
      https://noti.st/morsapaes/liQzgs/change-data-capture-with-flink-sql-and-debezium

  13. Flink SQL example
      Defining an input:

          CREATE TABLE kafka_example (
            `user_id` BIGINT,
            `item_id` BIGINT,
            `behavior` STRING,
            `timestamp` TIMESTAMP(3),
            -- a watermark declaration makes `timestamp` usable as an event-time
            -- attribute for windowing (restored here with a plausible bound)
            WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND
          ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_behavior',
            'properties.bootstrap.servers' = 'host:9092',
            'properties.group.id' = 'testGroup',
            'format' = 'csv'
          );

      Running a tumbling-window streaming pipeline:

          SELECT window_start,
                 window_end,
                 COUNT(*)
          FROM TABLE(
            TUMBLE(TABLE kafka_example,
                   DESCRIPTOR(`timestamp`),
                   INTERVAL '1' SECOND))
          GROUP BY window_start, window_end;

  14. Process Function

      // Note: the generic type parameters and the state class were stripped from this
      // transcript; <String, String, String> (key, input, output) and MyState are a
      // plausible reconstruction.
      public class MyFunction extends KeyedProcessFunction<String, String, String> {

          /** POJO kept in keyed state. */
          public static class MyState { public long lastModified; }

          /** The state that is maintained by this process function */
          private ValueState<MyState> state;

          @Override
          public void open(Configuration parameters) {
              state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", MyState.class));
          }

          @Override
          public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
              MyState current = state.value();
              if (current == null) current = new MyState(); // first event for this key
              // set the state's timestamp to the record's assigned event time timestamp
              current.lastModified = ctx.timestamp();
              // write the state back
              state.update(current);
              // schedule the next timer 60 seconds from the current event time
              ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
          }

          @Override
          public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
              // do stuff with time
          }
      }
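
      A sketch of how this function could be wired into a job (the socket source and
      the identity key are illustrative; a KeyedProcessFunction must run on a keyed
      stream, since its ValueState and timers are scoped per key):

          import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

          public class ProcessFunctionUsage {
              public static void main(String[] args) throws Exception {
                  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                  env.socketTextStream("localhost", 9999)  // any String source works here
                     .keyBy(value -> value)                // state/timers in MyFunction are per key
                     .process(new MyFunction())
                     .print();

                  env.execute("process-function-sketch");
              }
          }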

  15. Next slide:
      Efficiency (low cost, low latency, high throughput, in real-time)

  16. Efficiency & Performance
      ● Highly optimized engine, battle-tested: Pinterest runs Flink at 300M messages
        per second (150TB/s)
      ● Examples
        ○ State and checkpointing
          ■ Scale state beyond memory using the built-in RocksDB state backend
          ■ Fast, incremental, asynchronous checkpoints
        ○ Network stack (Netty)
          ■ Native backpressure support, optimized for both latency and throughput
        ○ SQL
          ■ Optimized using Apache Calcite, micro-batched aggregations, skew handling,
            efficient internal data format
      Source: https://www.slideshare.net/FlinkForward/flink-powered-stream-processing-platform-at-pinterest
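
      A sketch of switching on the RocksDB state backend with incremental checkpoints
      (Flink 1.13+ API; the interval and checkpoint path are illustrative placeholders):

          import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
          import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

          public class RocksDBSketch {
              public static void main(String[] args) throws Exception {
                  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                  // keep working state on local disk in RocksDB so it can grow beyond memory;
                  // `true` enables incremental checkpoints (upload only changed SST files)
                  env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
                  env.enableCheckpointing(60_000);
                  env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
              }
          }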

  17. Next slide:
      Integrations and connectors (Kafka, Kubernetes, AWS Kinesis, AWS S3, Pulsar,
      change data capture, Avro, JSON, Parquet)

  18. Integrations and Connectors
      Categories: deployment, formats, data connectors, observability
      ● Various Kubernetes deployment options: operator, native integration, DIY
      ● DIY deployments: bash scripts, library-style, MiniCluster, etc.
      ● Hadoop YARN is still around

  19. Operations
      ● Autoscaling Flink via the Kubernetes Operator or Flink Reactive Mode
      ● Persist in-flight state via savepoints, e.g. to upgrade the Flink version or the
        application, or to investigate/rewrite state
      ● Observability: latency tracking, RocksDB metrics, operator/task/JVM-level
        performance metrics, Flame Graph UI, backpressure monitoring
      ● Local debugging/profiling: run the cluster code from your IDE or unit tests
      ● High availability with ZooKeeper or Kubernetes (etcd)

  20. Flink Q&A
      Follow me on Twitter: @rmetzger_
      Visit the decodable booth (G5) to discuss Flink and Stream Processing

  21. 2022
    Build real-time data apps &
    services. Fast.
    decodable.co
