Slide 1


Slide 2

Agenda
- First part
  • Brief introduction to the Streaming Ingestion Pipeline in LINE’s Data Platform
  • Kafka-to-Elasticsearch pipeline redesign with Apache Flink
- Second part
  • Auto Scaling implementation on Kubernetes

Slide 3

Overview of Our Streaming Ingestion Pipeline

Slide 4

Data Ingestion Pipeline
Diagram: jobs move data from Kafka to Elasticsearch, HDFS, Internal Kafka, and Yet Another Kafka:
- Kafka-to-Kafka jobs (Internal Kafka, Yet Another Kafka)
- Kafka-to-Elasticsearch job
- Kafka-to-HDFS (Iceberg) job
- Kafka-to-HDFS (raw data) job

Slide 5

Scale
- Number of Kafka topics: 2500+
- Peak total throughput: 19M+ records/s
- Largest Kafka topic: 384 partitions

Slide 6

Flink in the Pipeline
Diagram: Kafka feeds Flink jobs writing to HDFS and Internal Kafka, and a Kafka Streams job writing to Elasticsearch.

Slide 7

Apache Flink Introduction
Diagram: a job graph (Source → Process → Sink, packaged as a Java archive) is submitted to a Flink cluster, which runs on standard servers, Kubernetes, or YARN.
• Flink provides a good abstraction for constructing a stream processing job
• Flink takes care of assigning each task of the processing job to workers (task managers)
• Flink does a lot of the heavy lifting for stream processing
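To make the job graph idea concrete, here is a minimal DataStream sketch of a Source → Process → Sink job. It is not taken from the talk; the Kafka address, topic, group id, and the trivial map step are placeholder assumptions.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: consume raw log lines from a (hypothetical) Kafka topic
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")               // placeholder address
                .setTopics("app-logs")                           // placeholder topic
                .setGroupId("flink-es-pipeline")                 // placeholder group id
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                // Process: a trivial transformation standing in for real parsing/enrichment
                .map(String::trim)
                // Sink: print() stands in for the Elasticsearch sink discussed later
                .print();

        // The resulting job graph is packaged as a JAR and submitted to the Flink cluster,
        // which assigns each task to the task managers.
        env.execute("minimal-kafka-job");
    }
}
```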

Slide 8

Kafka-to-Elasticsearch Pipeline
Diagram: the same pipeline overview, with the Kafka Streams job from Kafka to Elasticsearch highlighted; this is the focus of the first part.

Slide 9

Redesigning Kafka-to-Elasticsearch (ES) Pipeline with Apache Flink

Slide 10

Issues of the Current Pipeline
- Issue 1: Scalability and efficiency
- Issue 2: Delivery guarantee
- Issue 3: Operational cost

Slide 11

Issue 1: Scalability and efficiency
Architectural overview of the current implementation. Diagram: in the Kafka Streams application, a StreamThread consumes records from Kafka, processes them, and buffers them into an Elasticsearch BulkProcessor, which batches the buffered records and sends them to Elasticsearch.
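For context, batching to Elasticsearch with a BulkProcessor is typically configured along these lines. This is a generic sketch, not the pipeline's real code; the client wiring and thresholds are assumptions.

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class BulkProcessorFactory {
    /** Builds a BulkProcessor that buffers index requests and flushes them to Elasticsearch in batches. */
    public static BulkProcessor create(RestHighLevelClient client) {
        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override public void beforeBulk(long executionId, BulkRequest request) { }
            @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }
            @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) { }
        };
        return BulkProcessor.builder(
                        (request, bulkListener) ->
                                client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                        listener)
                .setBulkActions(1000)       // flush after 1000 buffered actions (illustrative threshold)
                .setConcurrentRequests(1)   // at most one in-flight bulk request
                .build();
    }
}
```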

Slide 12

Issue 1: Scalability and efficiency
Diagram: the same StreamThread / BulkProcessor architecture as before.
Record polling and processing in a single thread! Record polling can be a bottleneck!

Slide 13

Issue 1: Scalability and efficiency
VisualVM data: CPU sampling result of a StreamThread while the pipeline is lagging. This thread is spending ~75% of its time waiting for new data. Partially due to bad configuration, but in any case not desirable with respect to resource efficiency.

Slide 14

Issue 1: Scalability and efficiency
Diagram: each Kafka partition is handled by a StreamThread in the Kafka Streams application.
We can have at most one stream thread per Kafka partition, so pipeline throughput depends on the number of Kafka partitions.
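For illustration only (not the pipeline's actual configuration), the stream thread count is a single Kafka Streams property, and raising it beyond the partition count does not add throughput:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    public static Properties create() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-es");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder address
        // With, say, a 64-partition topic, anything above 64 threads across all instances
        // simply sits idle: each partition is consumed by at most one stream thread.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);
        return props;
    }
}
```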

Slide 15

Issue 2: Delivery guarantee
Diagram: the same StreamThread / BulkProcessor architecture. Kafka Streams can commit the log offset to the broker at this point (once the record has been handed to the BulkProcessor buffer).

Slide 16

Issue 2: Delivery guarantee
Diagram: same as the previous slide. Kafka Streams can commit the log offset to the broker at this point, but the log might not be sent to Elasticsearch yet! It can break at-least-once!

Slide 17

Issue 2: Delivery guarantee
Diagram: same architecture and callouts as the previous slide.
We implemented the current pipeline for realtime log monitoring, where occasional log loss was acceptable.
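To make the failure mode concrete, here is a hypothetical Kafka Streams processor (not the pipeline's actual code) that hands records to a BulkProcessor; the index name and types are assumptions.

```java
import java.util.Map;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.Record;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;

/** Hypothetical processor illustrating where at-least-once can break in the current design. */
public class EsForwardingProcessor implements Processor<String, String, Void, Void> {
    private final BulkProcessor bulkProcessor;

    public EsForwardingProcessor(BulkProcessor bulkProcessor) {
        this.bulkProcessor = bulkProcessor;
    }

    @Override
    public void process(Record<String, String> record) {
        // The record is only added to the BulkProcessor's in-memory buffer here.
        bulkProcessor.add(new IndexRequest("app-logs").source(Map.of("message", record.value())));
        // Kafka Streams treats the record as processed once this method returns, so the consumer
        // offset can be committed (on the commit interval) before the buffered bulk request has
        // actually reached Elasticsearch. A crash in between loses those records.
    }
}
```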

Slide 18

Issue 3: Operational cost
Diagram: the pipeline overview. The other pipeline components are using Flink, so the Kafka Streams based Kafka-to-Elasticsearch job requires its own domain-specific knowledge to operate!

Slide 19

Goals of this project
- Improve scalability and efficiency
- Provide a better delivery guarantee
- Unify the base framework for pipeline implementation

Slide 20

How does Flink help achieve these goals?
- Improve scalability and efficiency: buffering and back pressure mechanism, per-task parallelism configuration
- Provide a better delivery guarantee: checkpoint mechanism
- Unify the base framework for pipeline implementation: we already use Flink elsewhere!
- The AsyncSink abstraction (covered below) also helps

Slide 21

Buffering and back pressure mechanism
Diagram: in the Flink application, a Kafka consumer thread, a processing thread (user code), and an Elasticsearch sink thread each have their own buffer.
IO (consume) and processing are executed in separate threads. Each thread has a buffer so that it doesn't have to wait for the others.

Slide 22

Buffering and back pressure mechanism
Diagram: the same consumer, processing, and sink threads; a "Slow down!" signal appears at the Elasticsearch sink thread when it cannot keep up.

Slide 23

Buffering and back pressure mechanism
Diagram: the "Slow down!" signal propagates upstream as back pressure, from the sink thread to the processing thread and on to the consumer thread.

Slide 24

Per-task parallelism configuration
Diagram: the same consumer, processing, and sink threads. The number of threads (subtasks) can be configured for each task!
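A minimal sketch of per-task parallelism in the DataStream API; the source, operators, and numbers below are illustrative assumptions, not the pipeline's actual configuration.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // default parallelism for tasks that don't set their own

        DataStream<String> lines = env.socketTextStream("localhost", 9999); // stand-in source

        lines.map(String::toUpperCase)
                .setParallelism(8)      // processing task: 8 subtasks
                .print()
                .setParallelism(2);     // sink task: 2 subtasks

        env.execute("per-task-parallelism");
    }
}
```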

Slide 25

Checkpoint mechanism
• Flink periodically saves a snapshot of the job state to external storage (checkpointing)
• For example, you can configure Flink to create a checkpoint every 30 seconds
• User code can control what to snapshot with the Flink API
Diagram: while running normally, a Flink task saves a state snapshot (Kafka consumer offsets, state for stateful computation, etc.) to external storage on each checkpoint.
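For illustration, enabling the 30-second checkpoint interval mentioned above looks roughly like this; the checkpoint storage path and mode are assumptions.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 30 seconds; exactly-once is the default checkpointing mode.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Where snapshots go (hypothetical HDFS path); S3 or another DFS works the same way.
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        // ... define sources, processing, and sinks here, then:
        // env.execute("checkpointed-job");
    }
}
```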

Slide 26

Checkpoint mechanism
• Flink can recover the task state from the external storage on recovery (e.g. from a crash)
• Using the saved state, Flink can resume from the last checkpoint
• The recovery mechanism is also used for restarting the stream processing job
Diagram: on recovery from a crash, the Flink task loads the saved state from the external storage and restores it.

Slide 27

AsyncSink abstraction: request path
Diagram: AsyncSink / AsyncSinkWriter with an ElementConverter, a buffer kept in state, and the destination storage.
1. Serialize the record (ElementConverter)
2. Store it in the buffer
3. Create a request batch
4. Call the async batch ingestion API
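As a rough sketch of step 1, an ElementConverter maps each incoming record to the request entry that the writer buffers; the entry type, field names, and index name below are hypothetical, not Flink's built-in or LINE's classes.

```java
import java.io.Serializable;
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.base.sink.writer.ElementConverter;

/** Hypothetical request entry that the AsyncSinkWriter buffers and later batches. */
class IndexRequestEntry implements Serializable {
    final String index;
    final String jsonDocument;

    IndexRequestEntry(String index, String jsonDocument) {
        this.index = index;
        this.jsonDocument = jsonDocument;
    }
}

/** Step 1 on the slide: turn an incoming record into a request entry for the buffer. */
class LogElementConverter implements ElementConverter<String, IndexRequestEntry> {
    @Override
    public IndexRequestEntry apply(String element, SinkWriter.Context context) {
        return new IndexRequestEntry("app-logs", element); // index name is a placeholder
    }
}
```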

Slide 28

AsyncSink abstraction: response path
Diagram: the same components.
1. Await the response from the storage
2. Update the request buffer state

Slide 29

Our custom Elasticsearch sink
Built on top of Flink AsyncSink. Diagram: CustomAsyncSink contains a CustomSinkWriter (implements AsyncSinkWriter) and a CustomElementConverter, with a request buffer kept in state and batches sent to Elasticsearch.
The PoC version was only 450 LoC in Kotlin!

Slide 30

Efficient at-least-once Elasticsearch sink
Why did we implement a custom connector instead of using the official one?
- In at-least-once mode, the official connector limits the number of simultaneous requests to Elasticsearch to one per subtask
- Our AsyncSink based implementation persists the request buffer in Flink state and doesn't have this limitation!

Slide 31

Experiment
Can the new Flink based pipeline process a production-level workload?
- Kafka: production cluster; one of the topics used in production was used for the test (375k records/s, ~85MB/s, 64 partitions)
- Elasticsearch: test cluster with a hot-warm architecture (45 hot nodes, 21 warm nodes)
- Flink: test cluster on Kubernetes (8 workers, each with 8 CPU cores and 8GB RAM)

Slide 32

Result
Chart: incoming rate (yellow line) vs. outgoing rate (green line).

Slide 33

Summary
- We introduced our recent work on improving the scalability, efficiency, delivery guarantee, and maintainability of our Kafka-to-Elasticsearch pipeline
- The experiment showed that the Flink based implementation can process a production-level workload while providing a better delivery guarantee
- In the near future, we'd like to roll out the new version to production

Slide 34

Auto-scaling Flink
- Motivations
- Scaling Flink
- Automating Scaling @ Data Platform

Slide 35

Motivations
- Optimize Resource Utilization
- Reduce Operation Cost
- Improve Latency

Slide 36

Motivations
Example: a Kafka-to-Kafka Flink job.

Slide 37

Scaling Flink
Methods available for standalone deployments:
- Manual Scaling
- Reactive Scaling

Slide 38

Scaling Flink: Manual Scaling
Diagram: a Flink cluster with its app state in external storage (S3, HDFS, …).
1. A Flink cluster needs rescaling
2. Stop the job with a savepoint
3. Update the cluster & job settings and start the job from the savepoint

Slide 39

Scaling Flink: Manual Scaling
Diagram: the same stop-with-savepoint / restart-from-savepoint procedure as the previous slide.
Advantages
- Available in all Flink versions
Disadvantages
- Slow to restart (~2 minutes)

Slide 40

Scaling Flink: Reactive Scaling
Diagram: a Kubernetes HorizontalPodAutoscaler (min=1, max=10, cpu=80%, on=TaskManager Deployment) monitors metrics for a Flink cluster (JM and TMs). If CPU exceeds the 80% threshold, it starts a new TM, which registers with the JM and offers its slots. When new resources become available, the Job Manager restarts the job from the last checkpoint with a new parallelism.

Slide 41

Scaling Flink: Reactive Scaling
Diagram: same as the previous slide.
Advantages
- Leverage elastic infrastructures (Kubernetes Autoscalers, AWS ASG)
- Faster restart
Disadvantages
- No partial failover

Slide 42

Automating Scaling
1. Automate manual scaling via argoCD
Flow:
- 1. Engineers raise a PR to change the scale of a Flink cluster
- 2. argoCD picks up the changes and handles the manual scaling
Benefits:
- Standardize operations on Flink (not only for scaling but also to release changes)
- Reduce operation cost and avoid human errors
- Compliant with our audit rules
Diagram: an engineer raises a PR on Github Enterprise; argoCD polls for changes and, on sync, deploys them to the Flink cluster (JM, CM, TM) in the flink namespace.

Slide 43

Automating Scaling
2. Introduce an auto-scaler: implementation for clusters not supporting reactive scaling
Diagram: the auto-scaler consists of a Manager, Workers, and a Webhook for register/unregister. The Manager stores job info periodically, gets the list of jobs that are ready to be scaled, and enqueues scaling tasks; a Worker picks up a task and then:
1. Scrapes metrics from Prometheus
2. Evaluates the scaling decision
3. Updates the scaling CM (ConfigMap)
4. Syncs argoCD via its REST API, which on sync deploys the changes to the Flink cluster (JM, CM, TM) in the flink namespace

Slide 44

Automating Scaling
2. Introduce an auto-scaler: implementation for clusters supporting reactive scaling
Diagram: the same Manager / Worker / Webhook auto-scaler and argoCD deployment flow as the previous slide, but the Worker updates the cluster directly:
1. Scrapes metrics from Prometheus
2. Evaluates the scaling decision
3. Updates the TaskManager deployment

Slide 45

Automating Scaling
Essential steps of the auto-scaler:
1. Sample monitoring metrics. Metrics: JVM_CPU_LOAD, KAFKA_CONSUMER_LAG, KAFKA_RECORDS_IN, KAFKA_RECORDS_OUT
2. Evaluate scaling rules. Rules: lag above 5 minutes, lag increasing, CPU load above 80%
3. Estimate scale. Predict: using a linear regression model, estimate the appropriate scale
4. Post-evaluation check. Safeguard rules: scale < max scale, scale > min scale
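A heavily simplified, hypothetical sketch of the four steps as a decision function; the class, thresholds, and the throughput-ratio estimate (standing in for the talk's linear regression model) are assumptions, not LINE's actual auto-scaler.

```java
import java.util.List;

/** Hypothetical sketch of the auto-scaler's per-job evaluation, following the four steps above. */
public class ScalingDecision {

    static final double CPU_THRESHOLD = 0.80;
    static final double LAG_THRESHOLD_SECONDS = 300; // "lag above 5 minutes"

    /**
     * Step 3: estimate the appropriate scale. A simple throughput ratio stands in for the
     * linear regression model; as noted later, a fit based on too few data points can mislead.
     */
    static int estimateScale(double recordsInPerSec, double recordsOutPerTaskPerSec) {
        return (int) Math.ceil(recordsInPerSec / recordsOutPerTaskPerSec);
    }

    static int decide(double cpuLoad, double lagSeconds, List<Double> lagHistory,
                      double recordsInPerSec, double recordsOutPerTaskPerSec,
                      int currentScale, int minScale, int maxScale) {
        // Step 2: evaluate scaling rules on the sampled metrics (step 1, sampling, happens upstream).
        boolean lagTooHigh = lagSeconds > LAG_THRESHOLD_SECONDS;
        boolean lagIncreasing = lagHistory.size() >= 2
                && lagHistory.get(lagHistory.size() - 1) > lagHistory.get(0);
        boolean cpuTooHigh = cpuLoad > CPU_THRESHOLD;
        if (!(lagTooHigh || lagIncreasing || cpuTooHigh)) {
            return currentScale; // no rule fired, keep the current scale
        }

        // Step 3: estimate the appropriate scale.
        int estimated = estimateScale(recordsInPerSec, recordsOutPerTaskPerSec);

        // Step 4: post-evaluation check, clamp to the safeguard rules (min scale <= scale <= max scale).
        return Math.max(minScale, Math.min(maxScale, estimated));
    }
}
```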

Slide 46

Automating Scaling
Auto-scaler in action. Chart: the workload oscillates between 50Mb/s and 300Mb/s; resource utilization is down 40%.

Slide 47

Automating Scaling
Auto-scaler in action (continued). Chart: the workload oscillates between 50Mb/s and 300Mb/s; resource utilization is down 40%.

Slide 48

Automating Scaling
Auto-scaler in action. Chart: the workload oscillates between 0Mb/s and 400Mb/s; resource utilization is down 50%.

Slide 49

Automating Scaling
Advantages & disadvantages of the auto-scaler
Advantages
- Any Flink cluster can subscribe to the auto-scaler
- Easily configurable and extendable
- Ability to set up predictive rules
Disadvantages
- Can require some tuning to get the best scaling performance

Slide 50

Automating Scaling
Example that required rule tuning: a linear regression based on too few data points can give a wrong trend.

Slide 51

Summary
- Our approach enabled auto-scaling on any Flink cluster:
  1. Automate Flink operations via a CD pipeline
  2. Introduce an auto-scaler that also integrates with the CD pipeline
• Future work:
  • Improve the prediction model
  • Integrate the auto-scaler with other technologies (e.g. Spark Streaming)

Slide 52

Thank You