The top 3 challenges running multi-tenant Flink at scale

Sharon Xie, Founding Engineer at Decodable

Flink at Decodable
• Runs all the data processing jobs
• Mixed deployment modes for different use cases

Challenge 1: Abstract Flink away
✅ Connect to the data sources and destinations
✅ Declare data processing logic
🚫 Distributed checkpointing systems
🚫 State migrations
🚫 Job/Task configurations

Our Model

Connections
• Handle connectivity with external systems
• Connector-specific configurations
• Let records with a valid schema pass through
• Small Flink states
Pipelines
• Handle data processing
• Same sources and sinks
• Records are already in valid forms
• Flink states can vary a lot based on the query
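A minimal sketch of the "valid schema" gate that a connection applies, so that pipelines can assume records are already in valid form. This is an illustration only, not Decodable's implementation: the `Schema` type, `is_valid`, and `pass_valid` are hypothetical names, and the schema is assumed to be a simple mapping from field name to Python type.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical schema representation: field name -> expected Python type.
Schema = Dict[str, type]


def is_valid(record: Dict[str, Any], schema: Schema) -> bool:
    """Return True when the record has exactly the schema's fields with matching types."""
    if record.keys() != schema.keys():
        return False
    return all(isinstance(record[name], expected) for name, expected in schema.items())


def pass_valid(records: Iterable[Dict[str, Any]], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Yield only the records that match the schema; invalid records are dropped."""
    for record in records:
        if is_valid(record, schema):
            yield record


if __name__ == "__main__":
    schema = {"id": int, "name": str}
    incoming = [
        {"id": 1, "name": "a"},    # valid
        {"id": "x", "name": "b"},  # wrong type for "id"
        {"id": 2},                 # missing "name"
    ]
    for record in pass_valid(incoming, schema):
        print(record)
```

Keeping this check in the connection is what lets the pipeline side skip validation entirely; what happens to rejected records (drop, dead-letter, alert) is left open here because the slides do not say.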

Separate Connectivity VS Processing
• Clear ownership, responsibility and access control
• Reusability
• Well defined contract for workflow automation
• Optimize & scale separately

Challenge 2: Managing Isolation
• Reducing the cost with resource sharing VS Increasing the risk of
  ◦ Noisy neighbors
  ◦ Blast radius
  ◦ Security

Three layers
• Infra (CPU, Disk, Memory, Network)
• Data processing (Flink)
• Configuration storage

Infra - cost-effective deployment
[Diagram: one VPC (Subnets, SGs, ...) whose resources are shared by Cell 1 and Cell 2]
• K8S Cluster (EKS: cluster + nodes + ...), with K8s Namespaces decodable-cell-1-control, decodable-cell-1-data, decodable-cell-2-control, decodable-cell-2-data
• DB Cluster (RDS Aurora cluster + R/W instances)
• Kafka Cluster (MSK, AWS Managed Kafka), holding each cell's topics

Infra - max isolation deployment
[Diagram labels: Cell, VPC, K8S Cluster, DB Cluster, Kafka Cluster]

Job Isolation - Flink
• Mini-clusters for preview jobs
• Application mode clusters for connections and pipelines
[Diagram labels: K8S Deployment: Application mode cluster (JobManager, TaskManager, TaskManager, TaskManager); K8S Job: Preview cluster pool (Mini-cluster 1, Mini-cluster 2, Mini-cluster N)]

Isolate sensitive info
[Diagram labels: Control Plane, Data Plane, API Service, Secrets Manager, Flink Cluster, K8S Secret; steps: Create connection, Store Secrets, Activate connection, Retrieve Secrets, Launch; access via IRSA*]
* IRSA: AWS IAM Roles for Service Accounts

Isolate sensitive info
• Principle of least privilege
• Minimize the # of services with access to the sensitive info
• Audit everything

Challenge 3: Observability
Internally
• Things should just work
• We commit to SLAs with customers
• We evaluate ourselves against our SLOs
Users
• Know if their jobs are healthy
• Understand performance metrics
• Get notified with actionable messages if something goes wrong

Internal key metrics
Primary signals:
• Each running job has a healthy Flink deployment
• # successful checkpoints per job is increasing
Secondary signals:
• Error rate / # job restarts
• Consumer lag
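The two primary signals can be read as a check over two consecutive observations of a job. The sketch below is one hedged way to express that; `JobSnapshot` and `primary_signals_ok` are hypothetical names, and how the snapshot values are collected (for example from Kubernetes and Flink metrics) is assumed rather than shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSnapshot:
    """One observation of a running job (hypothetical shape, for illustration)."""
    deployment_healthy: bool      # primary signal 1: the Flink deployment is healthy
    successful_checkpoints: int   # cumulative count of completed checkpoints


def primary_signals_ok(previous: JobSnapshot, current: JobSnapshot) -> bool:
    """True when the deployment is healthy and the checkpoint count has increased."""
    checkpoints_increasing = current.successful_checkpoints > previous.successful_checkpoints
    return current.deployment_healthy and checkpoints_increasing


if __name__ == "__main__":
    earlier = JobSnapshot(deployment_healthy=True, successful_checkpoints=10)
    later = JobSnapshot(deployment_healthy=True, successful_checkpoints=12)
    stalled = JobSnapshot(deployment_healthy=True, successful_checkpoints=12)
    print(primary_signals_ok(earlier, later))   # checkpoints advanced
    print(primary_signals_ok(later, stalled))   # healthy pods, but no new checkpoint
```

A job whose pods look healthy but whose checkpoint count has stopped moving fails this check, which is why the checkpoint signal is useful alongside deployment health. The interval between snapshots would need to be longer than the job's checkpoint interval to avoid false alarms.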

User facing
Error classifier:
• Unexpected server error (Page)
• Temporary server error (Monitor)
• User configuration error (Notify)
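A rough sketch of how such a classifier could route errors to the three responses on this slide. The three classes and their actions come from the slide; everything else is an assumption for illustration, in particular the mapping from Python exception types to classes, and the names `ErrorClass`, `Action`, `classify`, and `action_for`.

```python
from enum import Enum


class ErrorClass(Enum):
    UNEXPECTED_SERVER = "unexpected server error"
    TEMPORARY_SERVER = "temporary server error"
    USER_CONFIGURATION = "user configuration error"


class Action(Enum):
    PAGE = "page on-call"
    MONITOR = "monitor"
    NOTIFY = "notify user"


# Class -> response, as listed on the slide.
ACTIONS = {
    ErrorClass.UNEXPECTED_SERVER: Action.PAGE,
    ErrorClass.TEMPORARY_SERVER: Action.MONITOR,
    ErrorClass.USER_CONFIGURATION: Action.NOTIFY,
}


def classify(error: Exception) -> ErrorClass:
    """Map an exception to an error class.

    Assumption: the exception types below are stand-ins chosen for this
    sketch; a real classifier would inspect connector and Flink errors.
    """
    if isinstance(error, (ValueError, KeyError)):
        return ErrorClass.USER_CONFIGURATION
    if isinstance(error, (TimeoutError, ConnectionError)):
        return ErrorClass.TEMPORARY_SERVER
    # Anything unrecognized is treated as unexpected, so it pages someone.
    return ErrorClass.UNEXPECTED_SERVER


def action_for(error: Exception) -> Action:
    """Return the response for an error: page, monitor, or notify."""
    return ACTIONS[classify(error)]


if __name__ == "__main__":
    for err in (ValueError("bad hostname"), TimeoutError("broker slow"), RuntimeError("boom")):
        print(type(err).__name__, "->", action_for(err).value)
```

Defaulting unknown errors to the paging class is a deliberate choice in this sketch: a misclassified user error costs one unnecessary page, while a misclassified server error could go unnoticed.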

Parting thoughts
The paradox of choice
• Massive Flink configurations
• Different APIs, deployment modes
Isolation is very important (and hard).
For us, scalability is really about increasing the # of users who can do real-time data processing.

2022 Build real-time data apps & services. Fast. decodable.co
