Tiny Flink: Minimizing the Memory Footprint of Apache Flink

Apache Flink was designed for, and is mostly used with, large-scale real-time data processing use cases. Companies report processing terabytes of data per second, or keeping terabytes of state in huge clusters.

But what if you need to process low-throughput streams? Running a full, distributed Flink cluster might be overkill, as distributed coordination adds quite a bit of overhead.

In this talk, we’ll explore options to reduce your resource footprint. We’ll dive deeper into Flink’s MiniCluster, which allows you to run Flink in-JVM for integration tests, as a microservice, or as just a small processor for your data in Kubernetes. We will also discuss lessons learned from running MiniCluster in production for a service offering Flink SQL in the cloud.

Robert Metzger

June 20, 2023

  2. What is Apache Flink? Stateful Computations over Data Streams

    • Highly scalable
    • Exactly-once processing semantics
    • Event time semantics and watermarks
    • Layered APIs: Streaming SQL (easy to use) ↔ DataStream (expressive)
  3. Why Minimize Flink?

    Motivation 1: Minimum Flink cluster size (with the K8s operator and Flink config defaults).
      650 MB + 1024 MB = 1.67 GB minimum memory of a Flink cluster
    Motivation 2: Deploy like other JVM-based microservices or apps. No need to use a cluster manager.
      → Unified deployment, monitoring and operations for all services
      → Not all use cases process millions of events per second
    Motivation 3: Local development and testing. Flink in your IDE, in integration tests and on your CI system. No datacenter required for local development.
  4. Deployment Options for Apache Flink

    • Kubernetes
      ◦ Operator
      ◦ Native Kubernetes: Flink manages resources on K8s
      ◦ Standalone Kubernetes: resources are fixed/managed externally
    • Docker
    • Do it yourself (Cluster)
    • MiniCluster
  5. Enter Flink MiniCluster

    • Similar utilities
      ◦ KafkaServer: launch a Kafka broker in your JVM
      ◦ MiniDFSCluster: launch Hadoop Distributed File System in your JVM

    public static void main(String[] args) throws Exception {
        MiniClusterConfiguration clusterConfig = new MiniClusterConfiguration.Builder()
            .setNumTaskManagers(1)
            .setNumSlotsPerTaskManager(1)
            .build();
        // MiniCluster is AutoCloseable, so try-with-resources shuts it down
        try (var cluster = new MiniCluster(clusterConfig)) {
            cluster.start();
            cluster.submitJob(/*TODO*/);
        }
    }
  6. Flink MiniCluster: What’s the Size of my JVM?

    → The JVM doesn’t start with -Xmx60m, but it does with -Xmx65m.

    Exception in thread "main" java.lang.OutOfMemoryError: Could not allocate enough memory segments for NetworkBufferPool (required (MB): 64, allocated (MB): 61, missing (MB): 3). Cause: Direct buffer memory. The direct out-of-memory error has occurred. This can mean two things: either job(s) require(s) a larger size of JVM direct memory or there is a direct memory leak. The direct memory can be allocated by user code or some of its dependencies. In this case 'taskmanager.memory.task.off-heap.size' configuration option should be increased. Flink framework and its dependencies also consume the direct memory, mostly for network communication. The most of network memory is managed by Flink and should not result in out-of-memory error. In certain special cases, in particular for jobs with high parallelism, the framework may require more direct memory which is not managed by Flink. In this case 'taskmanager.memory.framework.off-heap.size' configuration option should be increased. If the error persists then there is probably a direct memory leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
        at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.<init>(NetworkBufferPool.java:149)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:173)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:128)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:97)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:78)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:57)
        at org.apache.flink.runtime.taskexecutor.TaskManagerServices.createShuffleEnvironment(TaskManagerServices.java:446)
        at org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:304)
        at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:631)
        at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManager(MiniCluster.java:755)
        at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManagers(MiniCluster.java:736)
        at org.apache.flink.runtime.minicluster.MiniCluster.start(MiniCluster.java:457)
        at co.decodable.PassthroughKafka.initMiniClusterWithEnv(PassthroughKafka.java:62)
        at co.decodable.PassthroughKafka.main(PassthroughKafka.java:44)
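The error above is about direct (off-heap) buffers, yet it reacts to the -Xmx flag. A minimal sketch of why, assuming a HotSpot JVM where the direct-memory limit defaults to roughly the maximum heap size when -XX:MaxDirectMemorySize is not set, and assuming a 32 KB segment size to mirror Flink's default memory segment size. The class and method names are hypothetical and not part of Flink:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical probe (not Flink code): how many direct segments fit under the JVM's limits? */
class DirectMemoryProbe {

    // Assumption: 32 KB mirrors Flink's default memory segment size.
    static final int SEGMENT_BYTES = 32 * 1024;

    /** Number of segments needed to fill a buffer pool of the given size. */
    static long segmentsNeeded(long poolBytes) {
        return poolBytes / SEGMENT_BYTES;
    }

    /** Tries to allocate `wanted` direct segments and returns how many succeeded. */
    static int allocateSegments(long wanted) {
        List<ByteBuffer> held = new ArrayList<>();
        try {
            for (long i = 0; i < wanted; i++) {
                held.add(ByteBuffer.allocateDirect(SEGMENT_BYTES));
            }
        } catch (OutOfMemoryError e) {
            // "Direct buffer memory": the JVM's direct-memory limit was reached.
        }
        return held.size();
    }

    public static void main(String[] args) {
        long poolBytes = 64L * 1024 * 1024; // the 64 MB pool from the error message
        long wanted = segmentsNeeded(poolBytes);
        int allocated = allocateSegments(wanted);
        System.out.printf("max heap: %d MB, segments wanted: %d, allocated: %d%n",
                Runtime.getRuntime().maxMemory() / (1024 * 1024), wanted, allocated);
    }
}
```

Running it once with -Xmx60m and once with -Xmx65m should, on such a JVM, show the first run stopping short of the wanted segment count, which matches the behaviour described on the slide.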
  7. Flink MiniCluster: Going even smaller

    → Optimizing configuration: Reducing the network buffer size from 64 MB to 8 MB allows setting -Xmx20m.

    public static void main(String[] args) throws Exception {
        var flinkConfig = new Configuration();
        // Pin the network buffer pool to 8 MB by setting min and max to the same value
        flinkConfig.set(TaskManagerOptions.NETWORK_MEMORY_MIN, MemorySize.parse("8m"));
        flinkConfig.set(TaskManagerOptions.NETWORK_MEMORY_MAX, MemorySize.parse("8m"));
        MiniClusterConfiguration clusterConfig = new MiniClusterConfiguration.Builder()
            .setNumTaskManagers(1)
            .setNumSlotsPerTaskManager(1)
            .setConfiguration(flinkConfig)
            .build();
  8. MiniCluster: What have we learned so far?

    → 20 MB of heap space is sufficient to run an empty Flink cluster. Final process size is ~80 MB.

    Open items:
    • Heap size, process size with a job running
    • Throughput
    • Heap and process size discrepancy
  9. Flink MiniCluster: Making it real

    → Run a small Flink job reading from Kafka, filtering 1% of the data, writing to Kafka

                            Empty MiniCluster    MiniCluster with Kafka source and sink
    Configured heap         -Xmx20m              -Xmx35m
    Approx. process size    80 MB                190 MB

    ⚠ Caveats
    • Checkpointing is not enabled
    • We are using a Flink job with minimal state, no RocksDB
  10. Benchmarking the Throughput: Setup

    Hardware: 10-core Apple M1 Max, 32 GB memory, 1 TB SSD
    Versions: Flink 1.17.0, Kafka 3.2.3
  11. Benchmarking the Throughput: Results

    Heap limit    Throughput                                 Real memory
    35 MB         Out of memory under light load (3 MB/s)
    50 MB         25 MB/s                                    200 MB
    100 MB        97 MB/s                                    260 MB
  12. Heap vs Native Memory

    Heap limit    Throughput                                 Real memory
    35 MB         Out of memory under light load (3 MB/s)
    50 MB         25 MB/s                                    200 MB
    100 MB        97 MB/s                                    260 MB

    Enabling JVM native memory tracking and getting a report [1]:
    • JVM argument: -XX:NativeMemoryTracking=summary
    • jcmd <pid> VM.native_memory baseline
    • jcmd <pid> VM.native_memory summary.diff

    [1] https://docs.oracle.com/en/java/javase/17/vm/native-memory-tracking.html
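For a quick first look from inside the process, the standard java.lang.management beans expose heap, non-heap and direct buffer usage. The following is a rough sketch with a hypothetical class name. It is not a replacement for native memory tracking: these beans report memory pools such as metaspace and the code cache, but not thread stacks or GC-internal structures, which is where the jcmd report comes in.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: prints a coarse heap / non-heap / direct buffer overview. */
class MemoryOverview {

    static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    /** Formats one line of the overview from a MemoryUsage snapshot. */
    static String describe(String label, MemoryUsage usage) {
        return String.format("%-9s used=%d MB committed=%d MB",
                label, toMb(usage.getUsed()), toMb(usage.getCommitted()));
    }

    public static void main(String[] args) {
        var memory = ManagementFactory.getMemoryMXBean();
        System.out.println(describe("heap", memory.getHeapMemoryUsage()));
        System.out.println(describe("non-heap", memory.getNonHeapMemoryUsage()));

        // Direct and mapped buffer pools, e.g. buffers allocated via ByteBuffer.allocateDirect
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffers   %s: %d MB%n", pool.getName(), toMb(pool.getMemoryUsed()));
        }
    }
}
```

Comparing these numbers with the process size reported by the operating system gives a first hint of how much memory sits outside the heap before reaching for the full NMT report.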
  13. Heap vs Native Memory

    • Reduce thread stack size from the default 1 MB to 256 KB: -Xss256k
      ◦ Higher risk of stack overflow errors
      ◦ We have ~100 threads
    • GC is using 56 MB. Using the more lightweight Serial GC (-XX:+UseSerialGC) reduces this to 390 bytes (GC structures and fewer threads)
      ◦ This reduces the throughput of Flink

    Memory breakdown (chart): Heap (106 MB), Thread (27 MB), Metaspace (56 MB)

    Further reading: https://shipilev.net/jvm/anatomy-quarks/12-native-memory-tracking/
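The effect of -Xss can be estimated with simple arithmetic: thread count times stack size. The sketch below uses a hypothetical class name and treats the result as an upper bound on reserved stack space. Committed memory is usually lower, and JVM-internal threads may use different stack sizes, so the NMT report remains the reference.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper: upper-bound estimate of memory reserved for thread stacks. */
class ThreadStackEstimate {

    static final long KB = 1024;
    static final long MB = 1024 * KB;

    /** Reserved stack bytes if every thread used the given stack size. */
    static long reservedStackBytes(int threads, long stackBytesPerThread) {
        return threads * stackBytesPerThread;
    }

    public static void main(String[] args) {
        // Live Java threads in this JVM; the talk reports ~100 for the MiniCluster process
        int live = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.printf("live threads: %d%n", live);
        System.out.printf("at 1 MB stacks:   up to %d MB%n", reservedStackBytes(live, MB) / MB);
        System.out.printf("at 256 KB stacks: up to %d MB%n", reservedStackBytes(live, 256 * KB) / MB);
    }
}
```

For the ~100 threads mentioned above, this gives up to 100 MB with 1 MB stacks and up to 25 MB with 256 KB stacks.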
  14. Conclusions

    • 🔬 We can scale Flink down to a process size of ~250 MB.
    • 🚀 One MiniCluster is still able to process ~100 MB/s
    • MiniCluster runs the same code as a distributed cluster
      ◦ Migrate from MiniCluster to a distributed cluster (restoring the latest checkpoint or from a savepoint)
      ◦ Supports HA (e.g. you can kill the process, and it will continue where it left off)
      ◦ Supports metrics and logging integrations
      ◦ Supports the Flink Web UI (and REST API)
    • All code examples: https://github.com/rmetzger/tiny-flink-talk

    Switch icons created by Gregor Cresnar - Flaticon
  16. 100 Threads?!

    Heavy hitters:
    • Flink rest server worker (20)
    • Akka (10)
    • Common pool (10)
    • Io threads (4)
    • Restcluster client (4)
    •
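A breakdown like the one above can be approximated from inside any JVM by grouping live threads by name. This is only a sketch: the class name is hypothetical, and the prefix rule (strip a trailing number and its separator) is an assumption about how thread pools name their workers, so the groups will not match the slide's labels exactly.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper: counts live threads per name prefix to find the heavy hitters. */
class ThreadGroups {

    /** Strips a trailing counter such as "-1" or " #3" from a thread name. */
    static String prefixOf(String threadName) {
        return threadName.replaceAll("[-_ #]*\\d+$", "");
    }

    /** Counts thread names per prefix, sorted alphabetically by prefix. */
    static Map<String, Long> countByPrefix(Collection<String> threadNames) {
        return threadNames.stream()
                .collect(Collectors.groupingBy(ThreadGroups::prefixOf, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> names = Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .collect(Collectors.toList());

        // Print the largest groups first
        countByPrefix(names).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

To inspect a MiniCluster, the same grouping would have to run inside the MiniCluster's JVM, for example from the main method after the cluster has started.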