The top 3 challenges running multi-tenant Flink at scale

Slide 1

Slide 1 text

The top 3 challenges running multi-tenant Flink at scale
Sharon Xie ([email protected]), Founding Engineer at Decodable

Slide 2

Slide 2 text

Flink at Decodable
● Runs all the data processing jobs
● Mixed deployment modes for different use cases

Slide 3

Slide 3 text

Challenge 1: Abstract Flink away
✅ Connect to the data sources and destinations
✅ Declare data processing logic
🚫 Distributed checkpointing systems
🚫 State migrations
🚫 Job/Task configurations

Slide 4

Slide 4 text

Our Model

Slide 5

Slide 5 text

Our Model

Slide 6

Slide 6 text

Our Model
Connections
● Handle connectivity with external systems
● Connector-specific configurations
● Let only records with a valid schema pass through
● Small Flink states
Pipelines
● Handle data processing
● Same sources and sinks
● Records are already in valid form
● Flink states can vary a lot based on the query
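As a rough illustration of a connection letting only schema-valid records through, here is a minimal sketch. The names `conforms` and `pass_valid`, and the idea of a schema as a field-to-type mapping, are assumptions for illustration; the slides do not describe how Decodable validates records.

```python
from typing import Any


def conforms(record: dict[str, Any], schema: dict[str, type]) -> bool:
    """Return True if the record has exactly the schema's fields with matching types."""
    if record.keys() != schema.keys():
        return False
    return all(isinstance(record[name], expected) for name, expected in schema.items())


def pass_valid(records: list[dict[str, Any]], schema: dict[str, type]) -> list[dict[str, Any]]:
    """Keep only records that conform; a pipeline downstream can then assume valid input."""
    return [record for record in records if conforms(record, schema)]


# Hypothetical schema and records, for illustration only.
schema = {"id": int, "name": str}
records = [
    {"id": 1, "name": "a"},    # valid
    {"id": "x", "name": "b"},  # wrong type for "id"
    {"id": 2},                 # missing "name"
]
valid = pass_valid(records, schema)
```

Because validation happens at the connection, the pipeline side can treat every record it receives as already well-formed.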

Slide 7

Slide 7 text

Separate Connectivity VS Processing
● Clear ownership, responsibility and access control
● Reusability
● Well-defined contract for workflow automation
● Optimize & scale separately

Slide 8

Slide 8 text

Challenge 2: Managing Isolation
● Reducing the cost with resource sharing VS increasing the risk of
○ Noisy neighbors
○ Blast radius
○ Security

Slide 9

Slide 9 text

Three layers
● Infra (CPU, Disk, Memory, Network)
● Data processing (Flink)
● Configuration storage

Slide 10

Slide 10 text

Infra: cost-effective deployment
Diagram: one VPC (Subnets, SGs, ...) containing a K8S cluster, a DB cluster, and a Kafka cluster, with Cell 1 and Cell 2 placed on them.
● K8S Cluster (EKS: cluster + nodes + ...)
○ Cell 1 K8s Namespaces: decodable-cell-1-control, decodable-cell-1-data
○ Cell 2 K8s Namespaces: decodable-cell-2-control, decodable-cell-2-data
● DB Cluster (RDS Aurora cluster + R/W instances): Cell 1, Cell 2
● Kafka Cluster (MSK, AWS Managed Kafka): topics, topics
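The namespace labels on this slide follow a per-cell pattern (`decodable-cell-<n>-control` and `decodable-cell-<n>-data`). A minimal sketch of deriving them is below; the function name `cell_namespaces` is hypothetical, and the pattern is inferred only from the four names shown in the diagram.

```python
def cell_namespaces(cell_id: int) -> list[str]:
    """Return the control and data namespace names for one cell.

    The naming pattern is an assumption inferred from the slide's labels.
    """
    return [f"decodable-cell-{cell_id}-{plane}" for plane in ("control", "data")]


# Namespaces for the two cells shown in the diagram.
namespaces = [name for cell in (1, 2) for name in cell_namespaces(cell)]
```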

Slide 11

Slide 11 text

Infra: max isolation deployment
Diagram: a single Cell with a VPC, DB Cluster, Kafka Cluster, and K8S Cluster.

Slide 12

Slide 12 text

Job Isolation - Flink
● Mini-clusters for preview jobs
● Application mode clusters for connections and pipelines
Diagram:
● K8S Deployment: Application mode cluster (JobManager, TaskManager, TaskManager, TaskManager)
● K8S Job: Preview cluster pool (Mini-cluster 1, Mini-cluster 2, ..., Mini-cluster N)

Slide 13

Slide 13 text

Isolate sensitive info
Diagram (spans the Control Plane and the Data Plane; access is annotated IRSA*):
● Create connection: API Service → Store Secrets → Secrets Manager
● Activate connection: Retrieve Secrets → Launch Flink Cluster with a K8S Secret
IRSA*: AWS IAM Roles for Service Accounts

Slide 14

Slide 14 text

Isolate sensitive info
● Principle of least privilege
● Minimize the # of services with access to the sensitive info
● Audit everything

Slide 15

Slide 15 text

Challenge 3: Observability
Internally
● Things should just work
● We commit to SLAs with customers
● We evaluate ourselves against our SLOs
Users
● Know if their jobs are healthy
● Understand performance metrics
● Get notified with actionable messages if something goes wrong

Slide 16

Slide 16 text

Internal key metrics
Primary signals:
● Each running job has a healthy Flink deployment
● # successful checkpoints per job is increasing
Secondary signals:
● Error rate / # job restarts
● Consumer lag
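The primary and secondary signals above could be evaluated by comparing two consecutive samples of a job's metrics. The sketch below is only an illustration: the `JobSample` fields, the `evaluate` function, and the threshold values are all assumptions, not Decodable's actual monitoring code.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """One hypothetical scrape of a job's metrics."""
    deployment_healthy: bool
    completed_checkpoints: int
    restarts: int
    consumer_lag: int


def evaluate(prev: JobSample, curr: JobSample,
             max_new_restarts: int = 3, max_lag: int = 10_000) -> dict[str, bool]:
    """Compare two samples; thresholds are illustrative defaults."""
    # Primary: the deployment is healthy and the checkpoint count keeps increasing.
    primary_ok = (curr.deployment_healthy
                  and curr.completed_checkpoints > prev.completed_checkpoints)
    # Secondary: few new restarts since the last sample, and bounded consumer lag.
    new_restarts = curr.restarts - prev.restarts
    secondary_ok = new_restarts <= max_new_restarts and curr.consumer_lag <= max_lag
    return {"primary_ok": primary_ok, "secondary_ok": secondary_ok}


# A job whose checkpoints stalled between samples fails the primary signal.
stalled = evaluate(JobSample(True, 40, 0, 100), JobSample(True, 40, 0, 120))
healthy = evaluate(JobSample(True, 40, 0, 100), JobSample(True, 42, 1, 120))
```

Checking that the checkpoint count is increasing, rather than checking its absolute value, catches jobs that are running but no longer making durable progress.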

Slide 17

Slide 17 text

User facing
Error classifier:
● Unexpected server error (Page)
● Temporary server error (Monitor)
● User configuration error (Notify)
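The three error classes and their responses can be sketched as a small mapping. The exception types and the `classify` function here are hypothetical; the slides do not say how errors are actually detected or categorized.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page"        # unexpected server error: page the on-call engineer
    MONITOR = "monitor"  # temporary server error: watch and expect recovery
    NOTIFY = "notify"    # user configuration error: tell the user what to fix


# Hypothetical exception types standing in for real job failure causes.
class UserConfigError(Exception):
    pass


class TransientServerError(Exception):
    pass


def classify(error: Exception) -> Action:
    """Map a failure to the response listed on the slide."""
    if isinstance(error, UserConfigError):
        return Action.NOTIFY
    if isinstance(error, TransientServerError):
        return Action.MONITOR
    # Anything unrecognized is treated as an unexpected server error.
    return Action.PAGE


action = classify(UserConfigError("unknown topic name"))
```

Defaulting unknown failures to `PAGE` is a design choice in this sketch: it errs toward alerting the operator rather than blaming the user for an error the classifier does not understand.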

Slide 18

Slide 18 text

Parting thoughts
The paradox of choice
● Massive Flink configurations
● Different APIs, deployment modes
Isolation is very important (and hard).
For us, scalability is really about increasing the # of users who can do real-time data processing.

Slide 19

Slide 19 text

2022 Build real-time data apps & services. Fast. decodable.co