Flink@Data Platform - Ingestion Pipeline Redesign and Auto-scaling

Atsutoshi Osuka (LINE / IU Data Connect Team / Software Engineer)
Hervé Froc (LINE / IU Data Connect Team / Data Engineer)

https://tech-verse.me/ja/sessions/34
https://tech-verse.me/en/sessions/34
https://tech-verse.me/ko/sessions/34

Tech-Verse2022

November 17, 2022

Transcript

  2. Agenda - First part • Brief introduction to Streaming Ingestion

    Pipeline in LINE’s Data Platform • Kafka-to-Elasticsearch pipeline redesign with Apache Flink - Second part • Auto Scaling implementation on Kubernetes
  3. Overview of Our Streaming Ingestion Pipeline

  4. Data Ingestion Pipeline Kafka Elasticsearch HDFS Internal Kafka Yet Another

    Kafka Kafka-to-Kafka job Kafka-to-Elasticsearch job Kafka-to-HDFS (Iceberg) job Kafka-to-HDFS (raw data) job Kafka-to-Kafka job
  5. Scale • Number of Kafka topics: 2500+ • Peak total throughput: 19M+ records/s • Largest Kafka topic: 384 partitions
  6. Flink in the Pipeline Kafka Elasticsearch Flink Kafka Streams Flink

    Flink HDFS Internal Kafka
  7. Apache Flink Introduction Job graph (as Java archive) Source Process Sink Sink Submit • Flink provides a good abstraction for constructing a stream processing job • Flink takes care of assigning each task of the processing job to workers (task managers) • Flink does a lot of the heavy lifting for stream processing Flink cluster Runs on: • Standard servers • Kubernetes • YARN
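
    A minimal, self-contained Kotlin sketch of the source → process → sink structure described above, written against Flink's Java DataStream API; it is an illustration only, not code from the talk, and the element values and job name are placeholders.

      import org.apache.flink.api.common.typeinfo.Types
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

      fun main() {
          val env = StreamExecutionEnvironment.getExecutionEnvironment()

          env.fromElements("log-1", "log-2", "log-3")   // Source task
              .map { record -> record.uppercase() }     // Process task (user code)
              .returns(Types.STRING)                    // helps type extraction with a Kotlin lambda
              .print()                                  // Sink task (stdout stands in for a real sink)

          // Submit the job graph; the Flink cluster assigns each task to task managers.
          env.execute("minimal-flink-job")
      }
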
  8. Kafka-to-Elasticsearch Pipeline Kafka Elasticsearch Flink Kafka Streams Flink Flink HDFS

    Internal Kafka In the first part
  9. Redesigning Kafka-to-Elasticsearch (ES) Pipeline with Apache Flink

  10. Issues of the Current Pipeline • Issue 1: Scalability and efficiency • Issue 2: Delivery guarantee • Issue 3: Operational cost
  11. Issue 1: Scalability and efficiency Architectural overview of the current

    implementation Kafka Streams StreamThread Elasticsearch BulkProcessor Batching Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer
  12. Issue 1: Scalability and efficiency StreamThread BulkProcessor Batching Consume Process

    Buffer Elasticsearch Kafka Kafka Streams Application Buffer Record polling and processing in a single thread! Record polling can be a bottleneck!
  13. Issue 1: Scalability and efficiency VisualVM data: CPU sampling result of a StreamThread while the pipeline is lagging. This thread is spending ~75% of its time waiting for new data. Partially due to bad configuration, but either way, not desirable w.r.t. resource efficiency
  14. Issue 1: Scalability and efficiency Kafka Kafka Streams Application We

    can have at most one stream thread for one Kafka partition Pipeline throughput is now dependent on the number of Kafka partitions StreamThread
  15. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer Kafka Streams can commit the log offset to broker at this point
  16. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer The log might not be sent to Elasticsearch yet! It can break at-least-once! Kafka Streams can commit the log offset to broker at this point
  17. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer The log might not be sent to Elasticsearch yet! It can break at-least-once! Kafka Streams can commit the log offset to broker at this point We implemented the current pipeline for realtime log monitoring, and occasional loss of logs was acceptable
  18. Issue 3: Operational cost Kafka Elasticsearch Flink Kafka Streams Flink

    Flink HDFS Internal Kafka Other pipeline components are using Flink Domain specific knowledge required!
  19. Goals of this project Improve Scalability and efficiency Provide Better

    delivery guarantee Unify Base framework for pipeline implementation
  20. How does Flink help achieve these goals? Improve Scalability and efficiency Provide Better delivery guarantee Unify Base framework for pipeline implementation Buffering and back pressure mechanism! Checkpoint mechanism! We already use Flink elsewhere! AsyncSink abstraction! Per-task parallelism configuration
  21. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread Batch Buffer IO (consume) and processing are executed in separate threads Each thread has a buffer so that it doesn't have to wait for the others
  22. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer Slow down!
  23. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer Slow down! Back pressure Slow down!
  24. Per-task parallelism configuration Kafka Consumer Thread Consume Elasticsearch Kafka Flink

    Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer The number of threads (subtasks) can be configured for each task!
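
    As a hedged illustration of per-task parallelism (the operator chain and subtask counts below are made up, not the pipeline's actual settings), each operator can be given its own parallelism:

      import org.apache.flink.api.common.typeinfo.Types
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

      fun main() {
          val env = StreamExecutionEnvironment.getExecutionEnvironment()

          env.fromElements("a", "b", "c")
              .setParallelism(1)          // source: in a real job, bounded by the Kafka partition count
              .map { it.uppercase() }
              .returns(Types.STRING)
              .setParallelism(8)          // processing task: can run more subtasks than the source
              .print()
              .setParallelism(2)          // sink task: tuned to what the downstream system can absorb
          env.execute("per-task-parallelism-sketch")
      }
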
  25. Checkpoint mechanism Checkpoint • Flink periodically saves a snapshot of the state to external storage (i.e. checkpointing) • For example, you can configure Flink to create a checkpoint every 30 seconds • User code can control what to snapshot with the Flink API Flink Task User code State External Storage When running normally Save state snapshot on checkpoint Kafka consumer offset, state for stateful computation, etc.
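
    For example, the 30-second checkpoint interval mentioned above could be configured roughly as follows; this is a sketch, and the interval, mode, and storage path are placeholders rather than production settings.

      import org.apache.flink.streaming.api.CheckpointingMode
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

      fun main() {
          val env = StreamExecutionEnvironment.getExecutionEnvironment()

          // Snapshot job state (Kafka consumer offsets, operator state, ...) every 30 seconds.
          env.enableCheckpointing(30_000L, CheckpointingMode.EXACTLY_ONCE)

          // Persist checkpoints to durable external storage; the path is only a placeholder.
          env.checkpointConfig.setCheckpointStorage("hdfs:///flink/checkpoints")

          // ... define sources, processing, and sinks, then call env.execute(...)
      }
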
  26. Checkpoint mechanism • Flink can recover the task state from the external storage on recovery (e.g. from a crash) • Using the saved state, Flink can resume from the last checkpoint • The recovery mechanism is also used for restarting the stream processing job Restore Flink Task User code State External Storage On recovery from a crash Load saved state from the external storage
  27. AsyncSink abstraction Request AsyncSink AsyncSinkWriter ElementConverter State Buffer Storage Batch 1. Serialize record 2. Store in buffer 3. Create request batch 4. Call async batch ingestion API
  28. AsyncSink abstraction Response AsyncSink AsyncSinkWriter ElementConverter State Buffer Storage Batch

    1. Await response 2. Update request buffer state
  29. Our custom Elasticsearch sink On top of Flink AsyncSink CustomAsyncSink

    CustomSinkWriter implements AsyncSinkWriter CustomElementConverter State Buffer Elasticsearch Batch The PoC version was only 450 LoC in Kotlin!
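
    To give a rough idea of what the AsyncSink-based sink looks like (this is not the 450-LoC PoC itself), here is a hedged Kotlin sketch of the ElementConverter piece, assuming Flink's connector-base AsyncSink API; LogRecord and IndexRequestEntry are hypothetical types, and the batching, retries, and buffer-state handling provided by AsyncSinkWriter are omitted.

      import java.io.Serializable
      import org.apache.flink.api.connector.sink2.SinkWriter
      import org.apache.flink.connector.base.sink.writer.ElementConverter

      // Hypothetical input record and request-entry types, used only for this sketch.
      data class LogRecord(val index: String, val id: String, val json: String)
      data class IndexRequestEntry(val index: String, val id: String, val source: String) : Serializable

      // Step 1 of the AsyncSink flow: serialize each incoming record into a request entry.
      // The AsyncSinkWriter then buffers these entries, builds batches, and calls the
      // asynchronous bulk API; the buffer is persisted in Flink state at checkpoints.
      class LogRecordElementConverter : ElementConverter<LogRecord, IndexRequestEntry> {
          override fun apply(element: LogRecord, context: SinkWriter.Context): IndexRequestEntry =
              IndexRequestEntry(element.index, element.id, element.json)
      }
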
  30. Efficient at-least-once Elasticsearch sink Why did we implement a custom connector instead of using the official connector? - In at-least-once mode, the official connector limits the number of simultaneous requests to Elasticsearch to one per subtask Our AsyncSink-based implementation persists the request buffer in Flink state and doesn't have this limitation!
  31. Experiment Can the new Flink-based pipeline process a production-level workload? Elasticsearch Flink Kafka • Test cluster • Hot-warm architecture • 45 hot nodes, 21 warm nodes • Production cluster • Use one of the topics used in production for the test • 375k records/s, ~85MB/s • 64 partitions • Test cluster on Kubernetes • 8 workers, each with 8 CPU cores and 8GB RAM
  32. Result Yellow line: incoming rate Green line: outgoing rate

  33. Summary We introduced our recent work on improving the scalability, efficiency, delivery guarantee, and maintainability of our Kafka-to-Elasticsearch pipeline. The experiment showed that the Flink-based implementation can process a production-level workload while providing a better delivery guarantee. In the near future, we'd like to roll out the new version to production.
  34. Auto-scaling Flink - Motivations - Scaling Flink - Automating Scaling @ Data Platform
  35. Motivations Optimize Resource Utilization Reduce Operation Cost Improve Latency

  36. Motivations Example Kafka-to-Kafka Flink job

  37. Scaling Flink Methods available for standalone deployments: Manual Scaling, Reactive Scaling
  38. Scaling Flink Manual Scaling App State (S3, HDFS, …) Flink cluster Flink cluster Flink cluster 1. Flink cluster that needs re-scaling 2. Stop job with savepoint 3. Start job from savepoint Update cluster & job settings
  39. Scaling Flink Manual Scaling App State (S3, HDFS, …) Flink cluster Flink cluster Flink cluster 1. Flink cluster that needs re-scaling 2. Stop job with savepoint 3. Start job from savepoint Update cluster & job settings Advantages - Available in all Flink versions Disadvantages - Slow to restart (~2 minutes)
  40. Scaling Flink Reactive Scaling min=1 max=10 cpu=80% on=TaskManager Deployment HorizontalPodAutoScaler JM TM Flink Cluster TM Monitor metrics If CPU > 80% threshold, start new TM Register TM & Offer Slot When new resources become available, the Job Manager restarts the job from the last checkpoint with a new parallelism.
  41. Scaling Flink Reactive Scaling min=1 max=10 cpu=80% on=TaskManager Deployment HorizontalPodAutoScaler

    JM TM Flink Cluster TM Monitor metrics If CPU > 80% threshold, start new TM Register TM & Offer Slot Advantages - Leverage elastic infrastructures (Kubernetes Autoscalers, AWS ASG) - Faster restart Disadvantages - No partial failover
  42. Automating Scaling 1. Automate manual scaling via argoCD - Flow: - 1. Engineers raise a PR to change the scale of a Flink cluster - 2. argoCD picks up the changes and handles the manual scaling. - Benefits: - Standardize operations on Flink (not only for scaling but also for releasing changes). - Reduce operation cost and avoid human errors. - Compliant with our audit rules. Engineer Raise PR Github Enterprise Poll for changes argoCD On Sync Deploy changes ns: flink flink cluster JM CM TM
  43. Automating Scaling 2. Introduce an auto-scaler: implementation for clusters not supporting reactive scaling argoCD On Sync Deploy changes ns: flink flink cluster JM CM TM Prometheus Worker register/unregister 4. Sync argoCD via REST API 3. Update scaling CM store job info (periodically) get the list of jobs ready to be scaled enqueue scaling task 2. evaluate decision 1. scrape metrics pick up task auto-scaler Manager Webhook
  44. Automating Scaling 2. Introduce an auto-scaler: implementation for clusters supporting reactive scaling argoCD On Sync Deploy changes ns: flink flink cluster JM TM Prometheus Worker 3. Update deployment TM store job info (periodically) get the list of jobs ready to be scaled enqueue scaling task 2. evaluate decision 1. scrape metrics pick up task auto-scaler Manager Webhook register/unregister
  45. Automating Scaling Essential steps of the auto-scaler: 1. Sample monitoring metrics: JVM_CPU_LOAD, KAFKA_CONSUMER_LAG, KAFKA_RECORDS_IN, KAFKA_RECORDS_OUT 2. Evaluate scaling rules: LAG above 5 mins, LAG increasing, CPU load above 80% 3. Estimate scale - Predict: using a linear regression model, estimate the appropriate scale 4. Post-evaluation check - Safeguard rules: scale < max scale, scale > min scale (see the sketch below)
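
    Since the auto-scaler is an in-house component, the following Kotlin sketch only illustrates the shape of steps 2–4; the metric fields, thresholds, and helper functions are hypothetical, and a simple least-squares slope stands in for the linear-regression estimate.

      // Hypothetical snapshot of the sampled metrics (step 1).
      data class Metrics(
          val cpuLoad: Double,            // JVM_CPU_LOAD, 0.0..1.0
          val lagSeconds: List<Double>,   // recent KAFKA_CONSUMER_LAG samples, in seconds
          val recordsInPerSec: Double,    // KAFKA_RECORDS_IN
          val recordsOutPerSec: Double,   // KAFKA_RECORDS_OUT
      )

      // Step 2: evaluate scaling rules (lag above 5 minutes, lag trending up, CPU above 80%).
      fun shouldScaleUp(m: Metrics): Boolean {
          val lagAboveThreshold = m.lagSeconds.last() > 5 * 60
          val lagIncreasing = slope(m.lagSeconds) > 0
          val cpuHigh = m.cpuLoad > 0.8
          return lagAboveThreshold || lagIncreasing || cpuHigh
      }

      // Step 3: estimate the trend with a least-squares slope over recent samples,
      // a stand-in for the linear regression model mentioned on the slide.
      fun slope(samples: List<Double>): Double {
          if (samples.size < 2) return 0.0
          val xs = samples.indices.map { it.toDouble() }
          val xMean = xs.average()
          val yMean = samples.average()
          val num = xs.zip(samples).sumOf { (x, y) -> (x - xMean) * (y - yMean) }
          val den = xs.sumOf { (it - xMean) * (it - xMean) }
          return if (den == 0.0) 0.0 else num / den
      }

      // Step 4: post-evaluation safeguard, clamping the proposed scale to the configured bounds.
      fun clampScale(proposed: Int, minScale: Int, maxScale: Int): Int =
          proposed.coerceIn(minScale, maxScale)
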
  46. Automating Scaling Auto-scaler in Action Workload oscillates between 50Mb/s and 300Mb/s Resource utilization down 40%
  47. Automating Scaling Auto-scaler in Action Workload oscillates between 50Mb/s and 300Mb/s Resource utilization down 40%
  48. Automating Scaling Auto-scaler in Action Workload oscillates between 0Mb/s and 400Mb/s Resource utilization down 50%
  49. Automating Scaling Advantages & Disadvantages of the auto-scaler Advantages - Any Flink cluster can subscribe to the auto-scaler. - Easily configurable and extendable. - Ability to set up predictive rules. Disadvantages - Can require some tuning to get the best scaling performance.
  50. Automating Scaling Example that required rule tuning: a linear regression based on too few data points can give a wrong trend
  51. Summary - Our approach enabled auto-scaling on any Flink cluster: - 1. Automate Flink operations via a CD pipeline - 2. Introduce an auto-scaler that also integrates with the CD pipeline • Future work: • Improve the prediction model • Integrate the auto-scaler with other technologies (e.g. Spark Streaming)
  52. Thank You