Maps, Sets etc. to store state or do network transfers
- Serialization happens when
  - transferring data over the network (between TaskManagers or from/to Sources/Sinks)
  - accessing state in RocksDB (even in-memory)
  - sending data between non-chained tasks locally
- Serialization costs a lot of CPU cycles
startLon, int startLat, int endLon, int endLat) {}
DataStream<OptimizedLocation> s2 = ...
16 bytes → 7.5x reduction in data
Fewer object allocations = fewer CPU cycles
Disclaimer: The actual binary representation used by Kryo might differ; this is for demonstration purposes only.
Further reading: “Flink Serialization Tuning Vol. 1: Choosing your Serializer — if you can” https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html
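To make the 16-byte figure concrete, here is a minimal sketch in plain Java (no Flink dependency) that packs four int fields into a byte array. The `CompactEncodingSketch` class and its `encode` helper are hypothetical names introduced for illustration, and the hand-rolled fixed-width layout only shows the size of the payload. It is not how Flink's serializers or Kryo actually lay out bytes.

```java
import java.nio.ByteBuffer;

class CompactEncodingSketch {
    // Same shape as the optimized type on the slide: four primitive ints.
    record OptimizedLocation(int startLon, int startLat, int endLon, int endLat) {}

    // Hypothetical helper: writes each field as a fixed-width 4-byte int.
    static byte[] encode(OptimizedLocation loc) {
        return ByteBuffer.allocate(4 * Integer.BYTES)
                .putInt(loc.startLon())
                .putInt(loc.startLat())
                .putInt(loc.endLon())
                .putInt(loc.endLat())
                .array();
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new OptimizedLocation(1, 2, 8, 9));
        // 4 fields * 4 bytes per int
        System.out.println(encoded.length + " bytes");
    }
}
```

A flat type made of primitives has a fixed, small payload, whereas nested objects and generic collections add per-object overhead on top of the raw values.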
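Flink’s MiniCluster allows you to spin up a full-fledged Flink cluster with everything known from distributed clusters (RocksDB, checkpointing, the web UI, SQL, …)

import org.apache.flink.runtime.minicluster.MiniCluster;
import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;
import org.apache.flink.streaming.api.environment.RemoteStreamEnvironment;

// One TaskManager with a single slot is enough for local debugging
var clusterConfig = new MiniClusterConfiguration.Builder()
    .setNumTaskManagers(1)
    .setNumSlotsPerTaskManager(1)
    .build();
var cluster = new MiniCluster(clusterConfig);
cluster.start();
// getRestAddress() returns a future; get() blocks until the REST endpoint is up
var clusterAddress = cluster.getRestAddress().get();
var env = new RemoteStreamEnvironment(clusterAddress.getHost(), clusterAddress.getPort());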
Use-cases
- Local debugging and performance profiling: Step through the code as it executes, sample the most frequently used code paths
- Testing: make sure your Flink jobs work in end-to-end tests (together with Kafka’s MiniCluster, MinIO as an S3 replacement). Check out https://www.testcontainers.org/
- Processing small streams efficiently
else.
- Flink’s deployment options might seem confusing. Here’s a simple framework to think about it:
  - Flink has 3 execution modes
    - Session Mode
    - Per-Job Mode
    - Application Mode (preferred)
  - Flink has 2 deployment models
    - Integrated (active): Native K8s, YARN, (Mesos)
      - Flink requests resources from the resource manager as needed
    - Standalone (passive): well suited for K8s, bare metal, local deployment, DIY
      - Resources are provided to Flink from the outside world
#3 Advice: Deploy one job per cluster, use standalone mode
Jobs share a JobManager
Application Mode: One job per JobManager, planned on the JobManager (recommended as default)
Per-Job Mode: One job per JobManager, planned outside the JobManager
of clusters for a given workload
- Understand the amount of data you have incoming and outgoing
  - How much network bandwidth do you have? How much throughput does your Kafka cluster have?
- Understand the amount of state you’ll need in Flink
  - Which state backend do you use?
  - How much memory / disk space do you have available (per instance, in your cluster)?
- How fast is your connection to your state backup (e.g. S3)? This will give you a baseline for the checkpointing times
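The checkpointing baseline mentioned in the last point can be sketched as state size divided by the bandwidth to the backup store. The class name, the 20 GB state size and the 10 Gbit/s link below are assumptions chosen for illustration, not measurements. The result is a lower bound, since protocol overhead, compression, incremental checkpoints and competing traffic are all ignored.

```java
class CheckpointBaseline {
    // Lower bound on the time needed to upload a full snapshot of the state.
    static double uploadSeconds(long stateBytes, double bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bandwidth must be positive");
        }
        return stateBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long stateBytes = 20_000_000_000L;    // assumed: 20 GB of state on one machine
        double linkBytesPerSecond = 10e9 / 8; // assumed: 10 Gbit/s link to the backup store
        System.out.printf("baseline: %.1f s%n", uploadSeconds(stateBytes, linkBytesPerSecond));
    }
}
```

If the baseline is already close to the planned checkpoint interval, the interval, the state size or the link needs to change before any tuning makes sense.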
napkin calculation of your use-case in your environment
- … assuming normal operation (“baseline”). Include a buffer for spiky loads (failure recovery, …)
per machine: 40 bytes * 5 windows * 100,000,000 keys = 20 GB
We checkpoint every minute, so: 20 GB / 60 seconds ≈ 333 MB/s
How is the Window operator accessing state on disk?
For each key-value access, we need to retrieve 40 bytes from disk, update the aggregates and put 40 bytes back
per machine: 40 bytes * 5 windows * 200,000 msg/sec = 40 MB/s
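The arithmetic above can be reproduced with a few lines of plain Java, which makes it easy to plug in your own numbers. The class and method names are hypothetical. The inputs (40 bytes per entry, 5 windows, 100,000,000 keys, 200,000 msg/sec, a 60-second checkpoint interval) are the per-machine figures from the calculation above.

```java
class NapkinMath {
    // Bytes of window state held on one machine.
    static long stateBytes(long bytesPerEntry, long windows, long keys) {
        return bytesPerEntry * windows * keys;
    }

    // Average bytes/second needed to ship the full state once per checkpoint interval.
    static double checkpointBytesPerSecond(long stateBytes, long intervalSeconds) {
        return (double) stateBytes / intervalSeconds;
    }

    // Bytes/second fetched from state when every message touches every window.
    static long stateAccessBytesPerSecond(long bytesPerEntry, long windows, long messagesPerSecond) {
        return bytesPerEntry * windows * messagesPerSecond;
    }

    public static void main(String[] args) {
        long state = stateBytes(40, 5, 100_000_000L);
        System.out.printf("state per machine: %d GB%n", state / 1_000_000_000L);
        System.out.printf("checkpoint traffic: %.0f MB/s%n",
                checkpointBytesPerSecond(state, 60) / 1e6);
        System.out.printf("state access: %d MB/s%n",
                stateAccessBytesPerSecond(40, 5, 200_000L) / 1_000_000L);
    }
}
```

Using `long` throughout matters here: 40 * 5 * 100,000,000 overflows a 32-bit `int`.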
the napkin” approximation! Real-world results will differ!
- Ignored network factors
  - Protocol overheads (Ethernet, IP, TCP, …)
  - RPC (Flink’s own RPC, Kafka, checkpoint store)
  - Checkpointing causes network bursts
  - A window emission causes bursts
  - Other systems using the network
- CPU, memory, disk access speed have not been considered
http://app.decodable.co
• Read the docs http://docs.decodable.co
• Watch demos on our YouTube Channel
• Join our community Slack channel
• Join us for future Demo Days and Webinars!