Slide 1

The Top 5 Mistakes Deploying Apache Flink (Webinar)

Robert Metzger, Decodable (rmetzger@decodable.co, @rmetzger_)
Eric Sammer, Decodable (esammer@decodable.co, @esammer)

Slide 2

Today’s Webinar

- The Top 5 Mistakes Deploying Apache Flink
- Common Stream Processing Patterns using SQL
- Q&A

Slide 3

Common Flink Mistakes

Robert Metzger, Staff Engineer @ Decodable, Committer and PMC Chair @ Apache Flink

Slide 4

#1 Mistake: Serialization is expensive

- Mistake: people use Java Maps, Sets, etc. to store state or to transfer data over the network
- Serialization happens when:
  - transferring data over the network (between TaskManagers, or from/to sources and sinks)
  - accessing state in RocksDB (even when it resides in memory)
  - sending data between non-chained tasks, even locally
- Serialization costs a lot of CPU cycles

Slide 5

#1 Mistake: Serialization is expensive

Example:

  package co.decodable.talks.flink.performance;

  private static class Location {
      int lon;
      int lat;
  }

  DataStream<Map<String, Location>> s1 = ...

A map with two entries, "start" -> (lon: 11, lat: 22) and "end" -> (lon: 88, lat: 99), serializes to roughly 120 bytes:

- Map size: 4 bytes
- 1st entry key ("start"): 5 bytes
- 1st entry value type (the class name co.decodable.talks.flink.performance.Location): 46 bytes
- 1st entry value fields (lon: 11, lat: 22): 8 bytes
- 2nd entry key ("end"): 3 bytes
- 2nd entry value type: 46 bytes
- 2nd entry value fields (lon: 88, lat: 99): 8 bytes

Slide 6

#1 Mistake: Serialization is expensive

Example:

  public record OptimizedLocation(int startLon, int startLat, int endLon, int endLat) {}

  DataStream<OptimizedLocation> s2 = ...

The same data (11, 22, 88, 99) now serializes to 16 bytes, a 7.5x reduction in data. Fewer object allocations = fewer CPU cycles.

Disclaimer: the actual binary representation used by Kryo might differ; this is for demonstration purposes only.

Further reading: “Flink Serialization Tuning Vol. 1: Choosing your Serializer — if you can”, https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html
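If you want to catch an accidental fallback to Kryo early, Flink’s ExecutionConfig can be told to fail instead of silently using the generic serializer. A minimal sketch (not from the slides; assumes a recent Flink version that handles records like POJOs, and reuses the OptimizedLocation record from above):

  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class SerializationCheck {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          // Fail at job-graph build time if any type would fall back to Kryo:
          env.getConfig().disableGenericTypes();

          // Records/POJOs keep working with Flink's efficient built-in serializers:
          DataStream<OptimizedLocation> s2 =
              env.fromElements(new OptimizedLocation(11, 22, 88, 99));
          s2.print();
          env.execute("serialization check");
      }
  }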

Slide 7

#2 Mistake: Flink doesn’t always need to be distributed

- Flink’s MiniCluster allows you to spin up a full-fledged Flink cluster in a single JVM, with everything known from distributed clusters (RocksDB, checkpointing, the web UI, SQL, …):

  var clusterConfig = new MiniClusterConfiguration.Builder()
      .setNumTaskManagers(1)
      .setNumSlotsPerTaskManager(1)
      .build();
  var cluster = new MiniCluster(clusterConfig);
  cluster.start();

  var clusterAddress = cluster.getRestAddress().get();
  var env = new RemoteStreamEnvironment(clusterAddress.getHost(), clusterAddress.getPort());

Slide 8

#2 Mistake: Flink doesn’t always need to be distributed

- Use cases:
  - Local debugging and performance profiling: step through the code as it executes, sample the most frequently executed code paths
  - Testing: make sure your Flink jobs work in end-to-end tests (together with Kafka’s MiniCluster, or MinIO as an S3 replacement); see the test sketch below, and check out https://www.testcontainers.org/
  - Processing small streams efficiently
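A minimal end-to-end test sketch along those lines, using Flink’s flink-test-utils module (class and builder names as in Flink’s testing documentation; the pipeline itself is a hypothetical placeholder):

  import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.test.util.MiniClusterWithClientResource;
  import org.junit.ClassRule;
  import org.junit.Test;

  public class MyJobEndToEndTest {

      // One embedded Flink cluster, shared by all tests in this class
      @ClassRule
      public static final MiniClusterWithClientResource FLINK =
          new MiniClusterWithClientResource(
              new MiniClusterResourceConfiguration.Builder()
                  .setNumberTaskManagers(1)
                  .setNumberSlotsPerTaskManager(2)
                  .build());

      @Test
      public void pipelineProducesOutput() throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.setParallelism(2);
          // Placeholder pipeline; swap in your real sources, operators, and sinks.
          env.fromElements(1, 2, 3).map(x -> x * 2).print();
          env.execute();
      }
  }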

Slide 9

#3 Advice: Deploy one job per cluster, use standalone mode… unless you have a good reason to do something else.

- Flink’s deployment options might seem confusing. Here’s a simple framework to think about them:
- Flink has 3 execution modes:
  - Session mode
  - Per-job mode
  - Application mode (preferred)
- Flink has 2 deployment models:
  - Integrated (active): native Kubernetes, YARN, (Mesos); Flink requests resources from the resource manager as needed
  - Standalone (passive): well suited for Kubernetes, bare metal, local deployments, and DIY setups; resources are provided to Flink from the outside world (see the launch sketch below)
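Following that advice, launching a job in application mode on a standalone cluster boils down to two scripts from the Flink distribution (a sketch: the class name is a hypothetical placeholder, and the job jar must be on the classpath, e.g. in the lib/ folder):

  # JobManager that runs exactly one job (application mode):
  ./bin/standalone-job.sh start --job-classname com.example.MyStreamingJob

  # One or more TaskManagers that register with it:
  ./bin/taskmanager.sh start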

Slide 10

#3 Execution Modes

- Session mode: multiple jobs share one JobManager
- Per-job mode: one job per JobManager, planned outside the JobManager
- Application mode: one job per JobManager, planned on the JobManager; recommended as the default

Slide 11

#3 Deployment Options

- Passive deployment (“standalone mode”): Flink’s resources are managed externally; essentially “a bunch of JVMs”. Deployed on bare metal, Docker, or Kubernetes.
  - Pros: Reactive Mode (“autoscaling”), DIY scenarios, fast deployments
  - Cons: restarts of failed resources must be handled from the outside
- Active deployment: Flink actively manages its resources by talking to a resource manager. Implementations: native Kubernetes, YARN.
  - Pros: automatically restarts failed resources, allocates only the required resources
  - Cons: requires a lot of Kubernetes permissions
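Reactive Mode, listed as a plus of the passive model above, is a one-line setting for standalone application-mode clusters (available since Flink 1.13; shown here as a flink-conf.yaml sketch):

  # flink-conf.yaml: rescale the job automatically whenever TaskManagers join or leave
  scheduler-mode: reactive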

Slide 12

#4 Mistake: Inappropriate cluster sizing

- Mistake: under- or over-provisioning a cluster for a given workload
- Understand the amount of data you have incoming and outgoing:
  - How much network bandwidth do you have? How much throughput does your Kafka have?
- Understand the amount of state you’ll need in Flink:
  - Which state backend do you use?
  - How much memory / disk space do you have available (per instance, and in your cluster)?
  - How fast is your connection to your state backup (e.g. S3)? This gives you a baseline for checkpointing times.

Slide 13

Solution: Proper cluster sizing

- Do a back-of-the-napkin calculation of your use case in your environment…
- … assuming normal operation (“baseline”), and include a buffer for spiky loads (failure recovery, …)

Slide 14

Example: Proper cluster sizing

- Data:
  - Message size: 2 KB
  - Throughput: 1,000,000 msg/sec
  - Distinct keys: 500,000,000 (aggregation in window: 4 longs per key)
  - Checkpoint every minute
- Pipeline: Kafka Source → keyBy userId → Sliding Window (5m size, 1m slide) → Kafka Sink, with state in RocksDB
- Hardware: 5 machines, each running a TaskManager

Slide 15

Example: A machine’s perspective

Pipeline on TaskManager n: Kafka Source → keyBy → window → Kafka Sink

- Kafka in: 2 KB * 1,000,000 msg/sec = 2 GB/s; 2 GB/s / 5 machines = 400 MB/s per machine
- Shuffle out: the keyBy spreads 400 MB/s evenly across 5 receivers = 80 MB/s each; 1 receiver is local, 4 are remote, so 4 * 80 MB/s = 320 MB/s leave the machine
- Shuffle in: symmetrically, 320 MB/s arrive from the other 4 machines
- Kafka out: the window emits 40 bytes per key every minute: 100,000,000 keys * 40 bytes / 60 s ≈ 67 MB/s

Slide 16

Excursion: State & Checkpointing

How much state are we checkpointing?
- Per machine: 40 bytes * 5 windows * 100,000,000 keys = 20 GB (each key sits in 5 overlapping windows; 500,000,000 keys / 5 machines = 100,000,000 per machine)
- We checkpoint every minute, so: 20 GB / 60 seconds = 333 MB/s

How does the window operator access state on disk?
- For each key-value access, we retrieve 40 bytes from disk, update the aggregates, and put 40 bytes back
- Per machine: 40 bytes * 5 windows * 200,000 msg/sec = 40 MB/s

Slide 17

Example: A machine’s perspective

Pipeline on TaskManager n: Kafka Source → keyBy → window → Kafka Sink

- Total in: Kafka 400 MB/s + shuffle 320 MB/s = 720 MB/s
- Total out: shuffle 320 MB/s + Kafka 67 MB/s + checkpoints 333 MB/s = 720 MB/s
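The whole back-of-the-napkin calculation, condensed into a runnable sketch (plain arithmetic, no Flink API; the constants mirror the example above, and “MB” is decimal, as on the slides):

  public class NapkinSizing {
      public static void main(String[] args) {
          double msgPerSec = 1_000_000;     // total messages per second
          double msgBytes = 2_000;          // ~2 KB per message
          double machines = 5;              // TaskManagers
          double keys = 500_000_000;        // distinct keys
          double bytesPerKey = 40;          // key + 4 longs of window aggregates
          double windows = 5;               // 5m size, 1m slide -> 5 open windows per key
          double checkpointSecs = 60;       // checkpoint every minute

          double kafkaIn = msgPerSec * msgBytes / machines;          // 400 MB/s per machine
          double shuffleOut = kafkaIn / machines * (machines - 1);   // 320 MB/s: 4 of 5 receivers are remote
          double shuffleIn = shuffleOut;                             // symmetric in a balanced cluster
          double kafkaOut = keys / machines * bytesPerKey / 60;      // ~67 MB/s: window emission every minute
          double state = bytesPerKey * windows * keys / machines;    // 20 GB of state per machine
          double checkpoint = state / checkpointSecs;                // ~333 MB/s to the checkpoint store

          System.out.printf("total in:  %.0f MB/s%n", (kafkaIn + shuffleIn) / 1e6);
          System.out.printf("total out: %.0f MB/s%n", (shuffleOut + kafkaOut + checkpoint) / 1e6);
      }
  }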

Slide 18

Cluster sizing: Conclusion

- This was just a back-of-the-napkin approximation! Real-world results will differ!
- Network factors were ignored:
  - Protocol overheads (Ethernet, IP, TCP, …)
  - RPC (Flink’s own RPC, Kafka, checkpoint store)
  - Checkpointing causes network bursts
  - A window emission causes bursts
  - Other systems using the network
- CPU, memory, and disk access speed have not been considered

Slide 19

#5 Advice: Ask for Help!

- Most problems have already been solved somewhere online
- Official, old-school way: the users@flink.apache.org mailing list
  - Indexed by Google, searchable through https://lists.apache.org/
- Stack Overflow: the apache-flink tag has 6,300 questions!
- The Apache Flink Slack instance
- Global meetup communities, and Flink Forward (with training)

Slide 20

Any Flink deployment & ops-related questions?

Slide 21

Get Started with Decodable

- Visit http://decodable.co
- Start free: http://app.decodable.co
- Read the docs: http://docs.decodable.co
- Watch demos on our YouTube channel
- Join our community Slack channel
- Join us for future Demo Days and webinars!

Slide 22

Thank you.

Slide 23

decodable.co, 2022. Build real-time data apps & services. Fast.