Flink@Data Platform - Ingestion Pipeline Redesign and Auto-scaling


Atsutoshi Osuka (LINE / IU Data Connect Team / Software Engineer)
Hervé Froc (LINE / IU Data Connect Team / Data Engineer)

https://tech-verse.me/ja/sessions/34
https://tech-verse.me/en/sessions/34
https://tech-verse.me/ko/sessions/34

Tech-Verse2022

November 17, 2022

Transcript

  1. Agenda - First part • Brief introduction to Streaming Ingestion

    Pipeline in LINE’s Data Platform • Kafka-to-Elasticsearch pipeline redesign with Apache Flink - Second part • Auto Scaling implementation on Kubernetes
  2. Data Ingestion Pipeline Kafka Elasticsearch HDFS Internal Kafka Yet Another

    Kafka Kafka-to-Kafka job Kafka-to-Elasticsearch job Kafka-to-HDFS (Iceberg) job Kafka-to-HDFS (raw data) job Kafka-to-Kafka job
  3. Scale • Number of Kafka topics: 2,500+ • Peak total throughput: 19M+ records/s • Largest Kafka topic: 384 partitions
  4. Apache Flink Introduction Job graph (as Java archive) Source Process

    Sink Sink Submit • Flink provides a good abstraction for constructing a stream processing job • Flink takes care of assigning each task of the processing job to workers (task managers) • Flink does a lot of heavy lifting for stream processing Flink cluster Runs on: • Standard servers • Kubernetes • YARN
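
A minimal sketch of such a Source → Process → Sink job using Flink's Kafka connector; the broker address, topic name, and the trivial filter step are placeholders, and the real pipeline sinks to Elasticsearch rather than stdout:

```java
// Minimal Flink job sketch: Source (Kafka) -> Process -> Sink.
// Broker address, topic name, and job name are placeholders.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToSinkDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: consume records from a Kafka topic
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("logs")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                // Process: a trivial user-defined transformation
                .filter(record -> !record.isEmpty())
                // Sink: print to stdout here; the real job uses an Elasticsearch sink
                .print();

        env.execute("kafka-to-sink-demo");
    }
}
```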
  5. Issues of the Current Pipeline Issue 1 Scalability and efficiency

    Issue 2 Delivery guarantee Issue 3 Operational cost
  6. Issue 1: Scalability and efficiency Architectural overview of the current

    implementation Kafka Streams StreamThread Elasticsearch BulkProcessor Batching Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer
  7. Issue 1: Scalability and efficiency StreamThread BulkProcessor Batching Consume Process

    Buffer Elasticsearch Kafka Kafka Streams Application Buffer Record polling and processing in a single thread! Record polling can be a bottleneck!
  8. Issue 1: Scalability and efficiency VisualVM data: CPU sampling

    result of a StreamThread while the pipeline is lagging This thread is spending ~75% of its time waiting for new data Partially due to suboptimal configuration, but in any case not desirable w.r.t. resource efficiency
  9. Issue 1: Scalability and efficiency Kafka Kafka Streams Application We

    can have at most one stream thread per Kafka partition, so pipeline throughput is capped by the number of Kafka partitions StreamThread
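
For context, this is the knob in question on the Kafka Streams side; a minimal configuration sketch (application id and broker address are placeholders) where raising num.stream.threads beyond the partition count buys nothing:

```java
// Kafka Streams configuration sketch: num.stream.threads can be increased,
// but threads beyond the input topic's partition count simply stay idle,
// so the partition count effectively caps throughput.
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-elasticsearch"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");          // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8); // useful only up to #partitions
        return props;
    }
}
```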
  10. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer Kafka Streams can commit the log offset to the broker at this point
  11. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer The log might not be sent to Elasticsearch yet! It can break at-least-once! Kafka Streams can commit the log offset to the broker at this point
  12. Issue 2: Delivery guarantee Kafka Streams StreamThread Elasticsearch BulkProcessor Batching

    Consume Process Buffer Elasticsearch Kafka Kafka Streams Application Buffer The log might not be sent to Elasticsearch yet! It can break at-least-once! Kafka Streams can commit the log offset to the broker at this point We implemented the current pipeline for real-time log monitoring, and the occasional loss of logs was acceptable
  13. Issue 3: Operational cost Kafka Elasticsearch Flink Kafka Streams Flink

    Flink HDFS Internal Kafka Other pipeline components are using Flink Domain-specific knowledge required!
  14. Goals of this project Improve Scalability and efficiency Provide Better

    delivery guarantee Unify Base framework for pipeline implementation
  15. How does Flink help achieve these goals? Improve Scalability and efficiency

    Provide Better delivery guarantee Unify Base framework for pipeline implementation Buffering and back pressure mechanism! Checkpoint mechanism! We already use Flink elsewhere! AsyncSink abstraction! Per-task parallelism configuration
  16. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread Batch Buffer I/O (consuming) and processing are executed in separate threads Each thread has a buffer so that it doesn’t have to wait for the others
  17. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer Slow down!
  18. Buffering and back pressure mechanism Kafka Consumer Thread Consume Elasticsearch

    Kafka Flink Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer Slow down! Back pressure Slow down!
  19. Per-task parallelism configuration Kafka Consumer Thread Consume Elasticsearch Kafka Flink

    Application Buffer Processing Thread User code Buffer Elasticsearch Sink Thread User code (Chain 3) Buffer The number of threads (subtasks) can be configured for each task!
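
A sketch of how per-task parallelism might look in job code, assuming a KafkaSource<String> built elsewhere and using print() as a stand-in for the Elasticsearch sink; the parallelism values are illustrative only:

```java
// Per-task (per-operator) parallelism sketch: the consumer, the processing step,
// and the sink each get their own number of subtasks (threads).
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerTaskParallelismSketch {
    public static void configure(StreamExecutionEnvironment env, KafkaSource<String> source) {
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .setParallelism(4)                   // 4 Kafka consumer subtasks
                .filter(record -> !record.isEmpty())
                .setParallelism(16)                  // 16 processing subtasks
                .print()                             // stand-in for the Elasticsearch sink
                .setParallelism(8);                  // 8 sink subtasks
    }
}
```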
  20. Checkpoint mechanism Checkpoint • Flink periodically saves a snapshot of

    the state to external storage (i.e. checkpointing) • For example, you can configure Flink to create a checkpoint every 30 seconds • User code can control what to snapshot with the Flink API Flink Task User code State External Storage When running normally Save state snapshot on checkpoint Kafka consumer offset, state for stateful computation, etc.
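
Enabling the 30-second checkpoint interval mentioned above is a one-liner on the execution environment; a sketch with a placeholder checkpoint storage path:

```java
// Checkpointing sketch: snapshot the job state every 30 seconds to external storage.
// The HDFS path is a placeholder.
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000L); // create a checkpoint every 30 seconds
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints"); // placeholder
        return env;
    }
}
```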
  21. • Flink can recover the task state from the external

    storage on recovery (e.g. from a crash) • Using the saved state, Flink can resume from the last checkpoint • The recovery mechanism is also used for restarting the stream processing job Checkpoint mechanism Restore Flink Task User code State External Storage On recovery from a crash Load saved state from the external storage
  22. AsyncSink abstraction Request AsyncSink AsyncSinkWriter ElementConverter State Buffer Storage Batch

    1. Serialize record 2. Store in buffer 3. Create request batch 4. Call async batch ingestion API
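
A rough sketch of the ElementConverter piece (step 1) against Flink's AsyncSink base classes; IndexRequestEntry and the index name are hypothetical types/names for illustration, not LINE's actual implementation:

```java
// ElementConverter sketch for an AsyncSink-based Elasticsearch sink.
// The converter serializes each record into a request entry (step 1); the
// AsyncSinkWriter buffers the entries (step 2), batches them (step 3), and
// calls the asynchronous bulk ingestion API (step 4).
import java.io.Serializable;
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.base.sink.writer.ElementConverter;

public class LogToIndexRequestConverter
        implements ElementConverter<String, LogToIndexRequestConverter.IndexRequestEntry> {

    /** Hypothetical request entry: the target index plus the serialized document. */
    public record IndexRequestEntry(String index, String document) implements Serializable {}

    @Override
    public IndexRequestEntry apply(String element, SinkWriter.Context context) {
        // Serialize the incoming record; the index name is a placeholder.
        return new IndexRequestEntry("logs-index", element);
    }
}
```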
  23. Our custom Elasticsearch sink On top of Flink AsyncSink CustomAsyncSink

    CustomSinkWriter implements AsyncSinkWriter CustomElementConverter State Buffer Elasticsearch Batch The PoC version was only 450 LoC in Kotlin!
  24. Efficient at-least-once Elasticsearch sink Why did we implement a custom connector

    instead of using the official one? - In at-least-once mode, the official connector limits the number of simultaneous requests to Elasticsearch to one per subtask Our AsyncSink-based implementation persists the request buffer in Flink state and doesn’t have this limitation!
  25. Experiment Can the new Flink based pipeline process production-level workload?

    Elasticsearch Flink Kafka • Test cluster • Hot-warm architecture • 45 hot nodes, 21 warm nodes • Production cluster • Use one of the topics used in production for the test • 375k records/s, ~85MB/s • 64 partitions • Test cluster on Kubernetes • 8 workers, each with 8 CPU cores and 8GB RAM
  26. Summary We introduced our recent work for improving the scalability, efficiency, delivery,

    and maintainability of our Kafka-to-Elasticsearch pipeline The experiment showed that the Flink-based implementation can process a production-level workload while providing a better delivery guarantee In the near future, we’d like to roll out the new version to production
  27. Scaling Flink Manual Scaling App State (S3, HDFS, …) Flink

    cluster Flink cluster Flink cluster 1. Flink cluster that needs re-scaling 2. Stop job with savepoint 3. Start job from savepoint Update cluster & job settings
  28. Scaling Flink Manual Scaling App State (S3, HDFS, …) Flink

    cluster Flink cluster Flink cluster 1. Flink cluster that needs re-scaling 2. Stop job with savepoint 3. Start job from savepoint Update cluster & job settings Advantages - Available in all Flink versions Disadvantages - Slow to restart (~2 minutes)
  29. Scaling Flink Reactive Scaling min=1 max=10 cpu=80% on=TaskManager Deployment HorizontalPodAutoscaler

    JM TM Flink Cluster TM Monitor metrics If CPU > 80% threshold, start a new TM Register TM & Offer Slot When new resources become available, the Job Manager will restart the job from the last checkpoint with a new parallelism.
  30. Scaling Flink Reactive Scaling min=1 max=10 cpu=80% on=TaskManager Deployment HorizontalPodAutoscaler

    JM TM Flink Cluster TM Monitor metrics If CPU > 80% threshold, start a new TM Register TM & Offer Slot Advantages - Leverage elastic infrastructure (Kubernetes Autoscalers, AWS ASG) - Faster restart Disadvantages - No partial failover
  31. Automating Scaling 1. Automate manual scaling via argoCD - Flow

    : - 1. Engineers raise a PR to change the scale of a Flink cluster - 2. argoCD picks up the changes and handles the manual scaling. - Benefits: - Standardize operations on Flink (not only for scaling but also for releasing changes). - Reduce operational cost and avoid human errors. - Compliant with our audit rules. Engineer Raise PR Github Enterprise Poll for changes argoCD On Sync Deploy changes ns: flink flink cluster JM CM TM
  32. Automating Scaling 2. Introduce an auto-scaler argoCD On Sync Deploy

    changes ns: flink flink cluster JM CM TM Prometheus auto-scaler Manager Webhook Worker register/unregister store job info (periodically) get the list of jobs that are ready to be scaled enqueue scaling task pick up task 1. scrape metrics 2. evaluate decision 3. update scaling CM 4. sync argoCD via REST API Implementation for clusters not supporting reactive scaling
  33. Automating Scaling 2. Introduce an auto-scaler argoCD On Sync Deploy

    changes ns: flink flink cluster JM TM Prometheus auto-scaler Manager Webhook Worker register/unregister store job info (periodically) get the list of jobs that are ready to be scaled enqueue scaling task pick up task 1. scrape metrics 2. evaluate decision 3. update deployment TM Implementation for clusters supporting reactive scaling
  34. Automating Scaling Essential steps of the auto-scaler: 1. Sample monitoring metrics 2. Evaluate scaling rules

    3. Estimate scale 4. Post-evaluation check - Metrics: JVM_CPU_LOAD, KAFKA_CONSUMER_LAG, KAFKA_RECORDS_IN, KAFKA_RECORDS_OUT - Rules: LAG above 5 mins, LAG increasing, CPU load above 80% - Predict: using a linear regression model, estimate the appropriate scale - Safeguard rules: scale < max scale, scale > min scale
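
An illustrative sketch of the four steps; the thresholds mirror the rules above, but the record-rate-based linear estimate and all helper names are assumptions for illustration, not the production model:

```java
// Auto-scaler decision sketch: 1. sample metrics, 2. evaluate rules,
// 3. estimate scale with a simple linear model, 4. post-evaluation safeguard.
public final class ScalingDecisionSketch {

    /** Sampled monitoring metrics for one Flink job (step 1). */
    public record Metrics(double cpuLoad, long consumerLagSeconds, boolean lagIncreasing,
                          double recordsInPerSec, double recordsOutPerSec) {}

    /** Step 2: evaluate scaling rules. */
    static boolean shouldScaleOut(Metrics m) {
        return m.consumerLagSeconds() > 5 * 60   // LAG above 5 mins
                || m.lagIncreasing()             // LAG increasing
                || m.cpuLoad() > 0.8;            // CPU load above 80%
    }

    /** Step 3: estimate the scale, assuming throughput grows roughly linearly
     *  with the number of task managers (a stand-in for the regression model). */
    static int estimateScale(Metrics m, double recordsPerSecPerTaskManager) {
        return (int) Math.ceil(m.recordsInPerSec() / recordsPerSecPerTaskManager);
    }

    /** Step 4: post-evaluation safeguard: keep the scale within [min, max]. */
    static int clampScale(int desired, int minScale, int maxScale) {
        return Math.max(minScale, Math.min(maxScale, desired));
    }
}
```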
  35. Automating Scaling Advantages & Disadvantages of the auto-scaler Advantages -

    Any Flink cluster can subscribe to the auto-scaler. - Easily configurable and extensible. - Ability to set up predictive rules. Disadvantages - Can require some tuning to get the best scaling performance.
  36. Summary - Our approach enabled auto-scaling on any Flink cluster

    : - 1. Automate Flink operations via the CD pipeline - 2. Introduce an auto-scaler that also integrates with the CD pipeline • Future work: • Improve the prediction model • Integrate the auto-scaler with other technologies (e.g. Spark Streaming)