
The Top 5 Mistakes Deploying Apache Flink

Learn about the 5 most common mistakes deploying Apache Flink, and how you can avoid them from Flink co-creator and PMC member Robert Metzger.

Robert Metzger

June 25, 2022


  1. #1 Mistake: Serialization is expensive
     - Mistake: People use Java Maps, Sets etc. to store state or do network transfers
     - Serialization happens when:
       - transferring data over the network (between TaskManagers or from/to Sources/Sinks)
       - accessing state in RocksDB (even in-memory)
       - sending data between non-chained tasks locally
     - Serialization costs a lot of CPU cycles
  2. #1 Mistake: Serialization is expensive
     Example:
     package co.decodable.talks.flink.performance;
     private static class Location { int lon; int lat; }
     DataStream<HashMap<String, Location>> s1 = ...
     Serialized form of one map ("start" → lon:11, lat:22; "end" → lon:88, lat:99), ~120 bytes in total:
     - Map size: 4 bytes
     - 1st entry key ("start"): 5 bytes
     - 1st entry value type (co.decodable.talks.flink.performance.Location): 46 bytes
     - 1st entry value fields: 8 bytes
     - 2nd entry key ("end"): 3 bytes
     - 2nd entry value type: 46 bytes
     - 2nd entry value fields: 8 bytes
  3. #1 Mistake: Serialization is expensive
     Example:
     public record OptimizedLocation(int startLon, int startLat, int endLon, int endLat) {}
     DataStream<OptimizedLocation> s2 = ...
     The same data (11, 22, 88, 99) serializes to 16 bytes → 7.5x reduction in data
     Fewer object allocations = fewer CPU cycles
     Disclaimer: The actual binary representation used by Kryo might differ; this is for demonstration purposes only.
     Further reading: “Flink Serialization Tuning Vol. 1: Choosing your Serializer — if you can”
     https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html
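To make the overhead concrete, here is a self-contained sketch comparing the two representations. It uses plain Java serialization (java.io) rather than Flink's serializers or Kryo, so the exact byte counts differ from the slides; the relative gap between the map form and the flat record is the point. Class and field names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

public class SerializationSizeDemo {

    // Nested-object shape from the slide: one Location per map entry.
    public record Location(int lon, int lat) implements Serializable {}

    // Flat shape from the slide's OptimizedLocation record.
    public record OptimizedLocation(int startLon, int startLat,
                                    int endLon, int endLat) implements Serializable {}

    // Serializes the object with plain java.io and returns the byte count.
    public static int serializedSize(Object o) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static void main(String[] args) {
        HashMap<String, Location> map = new HashMap<>();
        map.put("start", new Location(11, 22));
        map.put("end", new Location(88, 99));

        OptimizedLocation flat = new OptimizedLocation(11, 22, 88, 99);

        // The map form drags along class names, per-entry headers and map
        // metadata; the flat record is mostly just its four int fields.
        System.out.println("HashMap form: " + serializedSize(map) + " bytes");
        System.out.println("Record form:  " + serializedSize(flat) + " bytes");
    }
}
```

Whatever the serializer, the structural cost is the same in kind: the map representation must encode entry count, keys, and value type information, while the record encodes only its fields.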
  4. #2 Mistake: Flink doesn’t always need to be distributed
     - Flink’s MiniCluster allows you to spin up a full-fledged Flink cluster with everything known from distributed clusters (RocksDB, checkpointing, the web UI, SQL, …)
     var clusterConfig = new MiniClusterConfiguration.Builder()
         .setNumTaskManagers(1)
         .setNumSlotsPerTaskManager(1)
         .build();
     var cluster = new MiniCluster(clusterConfig);
     cluster.start();
     var clusterAddress = cluster.getRestAddress().get();
     var env = new RemoteStreamEnvironment(clusterAddress.getHost(), clusterAddress.getPort());
  5. #2 Mistake: Flink doesn’t always need to be distributed
     Use-cases:
     - Local debugging and performance profiling: step through the code as it executes, sample the most frequently used code paths
     - Testing: make sure your Flink jobs work in end-to-end tests (together with Kafka’s MiniCluster, MinIO as an S3 replacement). Check out https://www.testcontainers.org/
     - Processing small streams efficiently
  6. #3 Advice: Deploy one job per cluster, use standalone mode … unless you have a good reason to do something else.
     - Flink’s deployment options might seem confusing. Here’s a simple framework to think about it:
     - Flink has 3 execution modes:
       - Session mode
       - Per-job mode
       - Application mode (preferred)
     - Flink has 2 deployment models:
       - Integrated (active): Native K8s, YARN, (Mesos). Flink requests resources from the resource manager as needed.
       - Standalone (passive): well suited for K8s, bare metal, local deployment, DIY. Resources are provided to Flink from the outside world.
  7. #3 Execution Modes
     - Session Mode: multiple jobs share a JobManager
     - Application Mode: one job per JobManager, planned on the JobManager (recommended as default)
     - Per-Job Mode: one job per JobManager, planned outside the JobManager
  8. #3 Deployment Options
     Passive deployment (“Standalone mode”): Flink resources managed externally → “a bunch of JVMs”
     - Deployed on bare metal, Docker, Kubernetes
     - Pros / cons:
       + Reactive Mode (“autoscaling”)
       + DIY scenarios
       + Fast deployments
       - Restart
     Active deployment: Flink actively manages resources → Flink talks to a resource manager
     - Implementations: Native Kubernetes, YARN
     - Pros / cons:
       + Automatically restarts failed resources
       + Allocates only required resources
       - Requires a lot of K8s permissions
  9. #4 Mistake: Inappropriate cluster sizing
     - Mistake: under- or over-provisioning of clusters for a given workload
     - Understand the amount of data you have incoming and outgoing:
       - How much network bandwidth do you have? How much throughput does your Kafka have?
     - Understand the amount of state you’ll need in Flink:
       - Which state backend do you use?
       - How much memory / disk space do you have available (per instance, in your cluster)?
       - How fast is your connection to your state backup (e.g. S3)? This will give you a baseline for the checkpointing times.
  10. Solution: Proper cluster sizing
      - Do a back-of-the-napkin calculation of your use case in your environment …
      - … assuming normal operation (“baseline”). Include a buffer for spiky loads (failure recovery, …)
  11. Example: Proper cluster sizing
      - Data:
        - Message size: 2 KB
        - Throughput: 1,000,000 msg/sec
        - Distinct keys: 500,000,000 (aggregation in window: 4 longs per key)
        - Checkpoint every minute
      - Pipeline: Kafka Source → keyBy userId → Sliding Window (5m size, 1m slide, state in RocksDB) → Kafka Sink
      - Hardware: 5 machines, each running a TaskManager
  12. Example: A machine’s perspective (TaskManager n, running Kafka Source → keyBy → window → Kafka Sink)
      - Kafka in: 2 KB * 1,000,000 msg/sec = 2 GB/s; 2 GB/s / 5 machines = 400 MB/s per machine
      - Shuffle out: 400 MB/s / 5 receivers = 80 MB/s per receiver; 1 receiver is local, 4 remote: 4 * 80 = 320 MB/s out
      - Shuffle in: 320 MB/s (the symmetric flow from the other 4 machines)
      - Kafka out (sink): 67 MB/s
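The per-machine network figures above are plain arithmetic, sketched here in self-contained Java (decimal units, 1 MB = 10^6 bytes, matching the slides; class and method names are illustrative):

```java
// Per-machine network math from the slide.
public class NetworkNapkin {

    static final long MSG_BYTES = 2_000;        // 2 KB per message
    static final long MSG_PER_SEC = 1_000_000;  // cluster-wide throughput
    static final int MACHINES = 5;

    // Kafka ingest per machine: 2 GB/s spread over 5 TaskManagers.
    public static long kafkaInMBs() {
        return MSG_BYTES * MSG_PER_SEC / MACHINES / 1_000_000;  // 400 MB/s
    }

    // keyBy repartitions each machine's 400 MB/s evenly across 5 receivers.
    public static long shufflePerReceiverMBs() {
        return kafkaInMBs() / MACHINES;  // 80 MB/s
    }

    // One of the 5 receivers is local, so only 4 shares cross the network.
    public static long shuffleOutMBs() {
        return shufflePerReceiverMBs() * (MACHINES - 1);  // 320 MB/s
    }

    public static void main(String[] args) {
        System.out.println("Kafka in:    " + kafkaInMBs() + " MB/s");
        System.out.println("Shuffle out: " + shuffleOutMBs() + " MB/s");
    }
}
```

The same 320 MB/s shows up as shuffle input, since each machine receives the shares the other four send it.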
  13. Excursion: State & Checkpointing
      - How much state are we checkpointing?
        Per machine: 40 bytes * 5 windows * 100,000,000 keys = 20 GB
      - We checkpoint every minute, so: 20 GB / 60 seconds = 333 MB/s
      - How is the window operator accessing state on disk?
        For each key-value access, we need to retrieve 40 bytes from disk, update the aggregates, and put 40 bytes back.
        Per machine: 40 bytes * 5 windows * 200,000 msg/sec = 40 MB/s
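The state and checkpoint figures follow the same arithmetic; a self-contained Java sketch (constants taken from the slides, names illustrative):

```java
// Per-machine state and checkpoint math from the slide.
public class StateNapkin {

    static final long VALUE_BYTES = 40;                   // 40-byte aggregate per key, as on the slide
    static final long WINDOWS = 5;                        // 5m size / 1m slide -> 5 live panes
    static final long KEYS_PER_MACHINE = 100_000_000;     // 500M distinct keys / 5 machines
    static final long MSG_PER_SEC_PER_MACHINE = 200_000;  // 1M msg/sec / 5 machines

    // Total keyed state held per machine: 20 GB.
    public static long stateGB() {
        return VALUE_BYTES * WINDOWS * KEYS_PER_MACHINE / 1_000_000_000;
    }

    // Checkpointing every 60 seconds means sustaining ~333 MB/s to the backup store.
    public static long checkpointMBs() {
        return VALUE_BYTES * WINDOWS * KEYS_PER_MACHINE / 60 / 1_000_000;
    }

    // Each message reads and rewrites its 40-byte aggregate in every pane: 40 MB/s.
    public static long diskAccessMBs() {
        return VALUE_BYTES * WINDOWS * MSG_PER_SEC_PER_MACHINE / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("State:       " + stateGB() + " GB");
        System.out.println("Checkpoints: " + checkpointMBs() + " MB/s");
        System.out.println("Disk access: " + diskAccessMBs() + " MB/s");
    }
}
```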
  14. Example: A machine’s perspective (TaskManager n)
      - In: Kafka 400 MB/s + shuffle 320 MB/s → Total in: 720 MB/s
      - Out: shuffle 320 MB/s + Kafka 67 MB/s + checkpoints 333 MB/s → Total out: 720 MB/s
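Summing the per-machine flows confirms the totals. The deck does not derive the 67 MB/s sink rate; the sketch below reconstructs it as each machine emitting its 100M 40-byte aggregates once per 1-minute window slide, which is our reading, not a statement from the slides.

```java
// Per-machine totals: summing the flows from the previous slides.
public class TotalsNapkin {

    // Hypothetical derivation of the slide's 67 MB/s sink rate (see lead-in):
    // 100M keys * 40 bytes, emitted once per 60-second slide.
    public static long sinkMBs() {
        return Math.round(40.0 * 100_000_000 / 60 / 1_000_000);  // ~67 MB/s
    }

    public static long totalInMBs() {
        return 400 + 320;  // Kafka in + shuffle in
    }

    public static long totalOutMBs() {
        return 320 + sinkMBs() + 333;  // shuffle out + Kafka sink + checkpoints
    }

    public static void main(String[] args) {
        System.out.println("Total in:  " + totalInMBs() + " MB/s");
        System.out.println("Total out: " + totalOutMBs() + " MB/s");
    }
}
```

That the in and out totals both land at 720 MB/s is a coincidence of this workload, not a general rule.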
  15. Cluster sizing: Conclusion
      - This was just a “back-of-the-napkin” approximation! Real-world results will differ!
      - Ignored network factors:
        - Protocol overheads (Ethernet, IP, TCP, …)
        - RPC (Flink’s own RPC, Kafka, checkpoint store)
        - Checkpointing causes network bursts
        - A window emission causes bursts
        - Other systems using the network
      - CPU, memory, and disk access speed have not been considered
  16. #5 Advice: Ask for Help!
      - Most problems have been solved already online
      - Official, old-school way: the [email protected] mailing list (indexed by Google, searchable through https://lists.apache.org/)
      - Stack Overflow: the apache-flink tag has 6,300 questions!
      - Apache Flink Slack instance
      - Global meetup communities, Flink Forward (with training)
  17. Get Started with Decodable
      - Visit http://decodable.co
      - Start free: http://app.decodable.co
      - Read the docs: http://docs.decodable.co
      - Watch demos on our YouTube channel
      - Join our community Slack channel
      - Join us for future Demo Days and webinars!