Flink@Data Platform - Ingestion Pipeline Redesign and Auto-scaling


Atsutoshi Osuka (LINE / IU Data Connect Team / Software Engineer)
Hervé Froc (LINE / IU Data Connect Team / Data Engineer)

https://tech-verse.me/ja/sessions/34
https://tech-verse.me/en/sessions/34
https://tech-verse.me/ko/sessions/34

Tech-Verse2022

November 17, 2022

Transcript

  1. Agenda - First part • Brief introduction to Streaming Ingestion

    Pipeline in LINE’s Data Platform • Kafka-to-Elasticsearch pipeline redesign with Apache Flink - Second part • Auto Scaling implementation on Kubernetes
  2. Data Ingestion Pipeline Kafka Elasticsearch HDFS Internal Kafka Yet Another

    Kafka Kafka-to-Kafka job Kafka-to-Elasticsearch job Kafka-to-HDFS (Iceberg) job Kafka-to-HDFS (raw data) job Kafka-to-Kafka job
  3. Scale • Number of Kafka topics: 2,500+ • Peak total throughput: 19M+ records/s • Largest Kafka topic: 384 partitions
  4. Apache Flink Introduction Job graph (as Java archive) Source Process

    Sink Sink Submit • Flink provides a good abstraction for constructing a stream processing job • Flink takes care of assigning each task of the processing job to workers (task managers) • Flink does a lot of heavy lifting for stream processing Flink cluster Runs on: • Standard servers • Kubernetes • YARN
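
A minimal sketch of such a Source → Process → Sink job using Flink's Kafka connector; the broker address, topic name, and the trivial filter step are placeholders, and the real pipeline sinks to Elasticsearch rather than stdout:

```java
// Minimal Flink job sketch: Source (Kafka) -> Process -> Sink.
// Broker address, topic name, and job name are placeholders.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToSinkDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: consume records from a Kafka topic
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("logs")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                // Process: a trivial user-defined transformation
                .filter(record -> !record.isEmpty())
                // Sink: print to stdout here; the real job uses an Elasticsearch sink
                .print();

        env.execute("kafka-to-sink-demo");
    }
}
```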
  5. Issues of the Current Pipeline Issue 1 Scalability and efficiency

    Issue 2 Delivery guarantee Issue 3 Operational cost
  6. Issue 1: Scalability and efficiency Architectural overview of the current

    implementation Kafka Streams StreamThread Elasticsearch BulkProcessor Batching Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer
  7. Issue 1: Scalability and efficiency StreamThread BulkProcessor Batching Consume Process

    Buffer Elasticsearch Kafka Kafka Streams Application Buffer Record polling and processing in a single thread! Record polling can be a bottleneck!
  8. Issue 1: Scalability and efficiency VisualVM data: CPU sampling

    result of a StreamThread while the pipeline is lagging This thread is spending ~75% of its time waiting for new data Partially due to suboptimal configuration, but in any case not desirable w.r.t. resource efficiency
  9. Issue 1: Scalability and efficiency Kafka Kafka Streams Application We

    can have at most one stream thread per Kafka partition, so pipeline throughput is capped by the number of Kafka partitions StreamThread
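
For context, this is the knob in question on the Kafka Streams side; a minimal configuration sketch (application id and broker address are placeholders) where raising num.stream.threads beyond the partition count buys nothing:

```java
// Kafka Streams configuration sketch: num.stream.threads can be increased,
// but threads beyond the input topic's partition count simply stay idle,
// so the partition count effectively caps throughput.
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-elasticsearch"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");          // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8); // useful only up to #partitions
        return props;
    }
}
```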
  10. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer Kafka Streams can commit the log offset to the broker at this point
  11. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer The log might not be sent to Elasticsearch yet! It can break at-least-once! Kafka Streams can commit the log offset to the broker at this point
  12. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer The log might not be sent to Elasticsearch yet! It can break at-least-once! Kafka Streams can commit the log offset to the broker at this point We implemented the current pipeline for real-time log monitoring, and the occasional loss of logs was acceptable
  13. Issue 3: Operational cost Kafka Elasticsearch Flink Kafka Streams Flink

    Flink HDFS Internal Kafka Other pipeline components are using Flink Domain-specific knowledge required!
  14. Goals of this project Improve Scalability and efficiency Provide Better

    delivery guarantee Unify Base framework for pipeline implementation
  15. How does Flink help achieve these goals? Improve Scalability and efficiency

    Provide Better delivery guarantee Unify Base framework for pipeline implementation Buffering and back pressure mechanism! Checkpoint mechanism! We already use Flink elsewhere! AsyncSink abstraction! Per-task parallelism configuration
  16. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread Batch Buffer I/O (consuming) and processing are executed in separate threads Each thread has a buffer so that it doesn’t have to wait for the others
  17. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer Slow down!
  18. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer Slow down! Back pressure Slow down!
  19. Per-task parallelism configuration Kafka Consumer Thread Consume Elasticsearch Kafka Flink

    Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer The number of threads (subtasks) can be configured for each task!
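
A sketch of how per-task parallelism might look in job code, assuming a KafkaSource<String> built elsewhere and using print() as a stand-in for the Elasticsearch sink; the parallelism values are illustrative only:

```java
// Per-task (per-operator) parallelism sketch: the consumer, the processing step,
// and the sink each get their own number of subtasks (threads).
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerTaskParallelismSketch {
    public static void configure(StreamExecutionEnvironment env, KafkaSource<String> source) {
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .setParallelism(4)                   // 4 Kafka consumer subtasks
                .filter(record -> !record.isEmpty())
                .setParallelism(16)                  // 16 processing subtasks
                .print()                             // stand-in for the Elasticsearch sink
                .setParallelism(8);                  // 8 sink subtasks
    }
}
```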
  20. Checkpoint mechanism Checkpoint • Flink periodically saves a snapshot of

    the state to external storage (i.e. checkpointing) • For example, you can configure Flink to create a checkpoint every 30 seconds • User code can control what to snapshot with the Flink API Flink Task User code State External Storage When running normally Save state snapshot on checkpoint Kafka consumer offset, state for stateful computation, etc.
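
Enabling the 30-second checkpoint interval mentioned above is a one-liner on the execution environment; a sketch with a placeholder checkpoint storage path:

```java
// Checkpointing sketch: snapshot the job state every 30 seconds to external storage.
// The HDFS path is a placeholder.
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000L); // create a checkpoint every 30 seconds
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints"); // placeholder
        return env;
    }
}
```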
  21. • Flink can recover the task state from the external

    storage on recovery (e.g. from a crash) • Using the saved state, Flink can resume from the last checkpoint • The recovery mechanism is also used for restarting the stream processing job Checkpoint mechanism Restore Flink Task User code State External Storage On recovery from a crash Load saved state from the external storage
  22. AsyncSink abstraction Request AsyncSink AsyncSinkWriter ElementConverter State Buffer Storage Batch

    1. Serialize record 2. Store in buffer 3. Create request batch 4. Call async batch ingestion API
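
A rough sketch of the ElementConverter piece (step 1) against Flink's AsyncSink base classes; IndexRequestEntry and the index name are hypothetical types/names for illustration, not LINE's actual implementation:

```java
// ElementConverter sketch for an AsyncSink-based Elasticsearch sink.
// The converter serializes each record into a request entry (step 1); the
// AsyncSinkWriter buffers the entries (step 2), batches them (step 3), and
// calls the asynchronous bulk ingestion API (step 4).
import java.io.Serializable;
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.base.sink.writer.ElementConverter;

public class LogToIndexRequestConverter
        implements ElementConverter<String, LogToIndexRequestConverter.IndexRequestEntry> {

    /** Hypothetical request entry: the target index plus the serialized document. */
    public record IndexRequestEntry(String index, String document) implements Serializable {}

    @Override
    public IndexRequestEntry apply(String element, SinkWriter.Context context) {
        // Serialize the incoming record; the index name is a placeholder.
        return new IndexRequestEntry("logs-index", element);
    }
}
```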
  23. Our custom Elasticsearch sink On top of Flink AsyncSink CustomAsyncSink

    CustomSinkWriter implements AsyncSinkWriter CustomElementConverter State Buffer Elasticsearch Batch The PoC version was only 450 LoC in Kotlin!
  24. Efficient at-least-once Elasticsearch sink Why did we implement a custom connector

    instead of using the official one? - In at-least-once mode, the official connector limits the number of simultaneous requests to Elasticsearch to one per subtask Our AsyncSink-based implementation persists the request buffer in Flink state and doesn’t have this limitation!
  25. Experiment Can the new Flink based pipeline process production-level workload?

    Elasticsearch Flink Kafka • Test cluster • Hot-warm architecture • 45 hot nodes, 21 warm nodes • Production cluster • Use one of the topics used in production for the test • 375k records/s, ~85MB/s • 64 partitions • Test cluster on Kubernetes • 8 workers, each with 8 CPU cores and 8GB RAM
  26. Summary We introduced our recent work for improving the scalability, efficiency, delivery,

    and maintainability of our Kafka-to-Elasticsearch pipeline The experiment showed that the Flink-based implementation can process a production-level workload while providing a better delivery guarantee In the near future, we’d like to roll out the new version to production
  27. Scaling Flink Manual Scaling App State (S3, HDFS, …) Flink

    cluster Flink cluster Flink cluster 1. Flink cluster that needs re-scaling 2. Stop job with savepoint 3. Start job from savepoint Update cluster & job settings
  28. Scaling Flink Manual Scaling App State (S3, HDFS, …) Flink

    cluster Flink cluster Flink cluster 1. Flink cluster that needs re-scaling 2. Stop job with savepoint 3. Start job from savepoint Update cluster & job settings Advantages - Available in all Flink versions Disadvantages - Slow to restart (~2 minutes)
  29. Scaling Flink Reactive Scaling min=1 max=10 cpu=80% on=TaskManager Deployment HorizontalPodAutoscaler

    JM TM Flink Cluster TM Monitor metrics If CPU > 80% threshold, start a new TM Register TM & Offer Slot When new resources become available, the Job Manager will restart the job from the last checkpoint with a new parallelism.
  30. Scaling Flink Reactive Scaling min=1 max=10 cpu=80% on=TaskManager Deployment HorizontalPodAutoscaler

    JM TM Flink Cluster TM Monitor metrics If CPU > 80% threshold, start a new TM Register TM & Offer Slot Advantages - Leverage elastic infrastructure (Kubernetes Autoscalers, AWS ASG) - Faster restart Disadvantages - No partial failover
  31. Automating Scaling 1. Automate manual scaling via argoCD - Flow

    : - 1. Engineers raise a PR to change the scale of a Flink cluster - 2. argoCD picks up the changes and handles the manual scaling. - Benefits: - Standardize operations on Flink (not only for scaling but also for releasing changes). - Reduce operational cost and avoid human errors. - Compliant with our audit rules. Engineer Raise PR Github Enterprise Poll for changes argoCD On Sync Deploy changes ns: flink flink cluster JM CM TM
  32. Automating Scaling 2. Introduce an auto-scaler argoCD On Sync Deploy

    changes ns: flink flink cluster JM CM TM Prometheus auto-scaler Manager Webhook Worker register/unregister store job info (periodically) get the list of jobs that are ready to be scaled enqueue scaling task pick up task 1. scrape metrics 2. evaluate decision 3. update scaling CM 4. sync argoCD via REST API Implementation for clusters not supporting reactive scaling
  33. Automating Scaling 2. Introduce an auto-scaler argoCD On Sync Deploy

    changes ns: flink flink cluster JM TM Prometheus auto-scaler Manager Webhook Worker register/unregister store job info (periodically) get the list of jobs that are ready to be scaled enqueue scaling task pick up task 1. scrape metrics 2. evaluate decision 3. update deployment TM Implementation for clusters supporting reactive scaling
  34. Automating Scaling Essential steps of the auto-scaler: 1. Sample monitoring metrics 2. Evaluate scaling rules

    3. Estimate scale 4. Post-evaluation check - Metrics: JVM_CPU_LOAD, KAFKA_CONSUMER_LAG, KAFKA_RECORDS_IN, KAFKA_RECORDS_OUT - Rules: LAG above 5 mins, LAG increasing, CPU load above 80% - Predict: using a linear regression model, estimate the appropriate scale - Safeguard rules: scale < max scale, scale > min scale
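
An illustrative sketch of the four steps; the thresholds mirror the rules above, but the record-rate-based linear estimate and all helper names are assumptions for illustration, not the production model:

```java
// Auto-scaler decision sketch: 1. sample metrics, 2. evaluate rules,
// 3. estimate scale with a simple linear model, 4. post-evaluation safeguard.
public final class ScalingDecisionSketch {

    /** Sampled monitoring metrics for one Flink job (step 1). */
    public record Metrics(double cpuLoad, long consumerLagSeconds, boolean lagIncreasing,
                          double recordsInPerSec, double recordsOutPerSec) {}

    /** Step 2: evaluate scaling rules. */
    static boolean shouldScaleOut(Metrics m) {
        return m.consumerLagSeconds() > 5 * 60   // LAG above 5 mins
                || m.lagIncreasing()             // LAG increasing
                || m.cpuLoad() > 0.8;            // CPU load above 80%
    }

    /** Step 3: estimate the scale, assuming throughput grows roughly linearly
     *  with the number of task managers (a stand-in for the regression model). */
    static int estimateScale(Metrics m, double recordsPerSecPerTaskManager) {
        return (int) Math.ceil(m.recordsInPerSec() / recordsPerSecPerTaskManager);
    }

    /** Step 4: post-evaluation safeguard: keep the scale within [min, max]. */
    static int clampScale(int desired, int minScale, int maxScale) {
        return Math.max(minScale, Math.min(maxScale, desired));
    }
}
```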
  35. Automating Scaling Advantages & Disadvantages of the auto-scaler Advantages -

    Any Flink cluster can subscribe to the auto-scaler. - Easily configurable and extensible. - Ability to set up predictive rules. Disadvantages - Can require some tuning to get the best scaling performance.
  36. Summary - Our approach enabled auto-scaling on any Flink cluster

    : - 1. Automate Flink operations via the CD pipeline - 2. Introduce an auto-scaler that also integrates with the CD pipeline • Future work: • Improve the prediction model • Integrate the auto-scaler with other technologies (e.g. Spark Streaming)