Tiny Flink: Minimizing the Memory Footprint of Apache Flink

Robert Metzger, Staff Engineer @ Decodable, Apache Flink Committer and PMC Chair

What is Apache Flink?
Stateful Computations over Data Streams
• Highly Scalable
• Exactly-once processing semantics
• Event time semantics and watermarks
• Layered APIs: Streaming SQL (easy to use) ↔ DataStream (expressive)

Why Minimize Flink?
Motivation 1: Minimum Flink cluster size (with the K8s operator and Flink config defaults).
  650 MB + 1024 MB = 1.67 GB minimum memory of a Flink cluster
Motivation 2: Deploy like other JVM-based microservices or apps. No need to use a cluster manager.
  → Unified deployment, monitoring and operations for all services
  → Not all use cases process millions of events per second
Motivation 3: Local development & testing. Flink in your IDE, in integration tests and on your CI system. No datacenter required for local development.

Deployment Options for Apache Flink
• Kubernetes
  ◦ Operator
  ◦ Native Kubernetes: Flink manages resources on K8s
  ◦ Standalone Kubernetes: resources are fixed/managed externally
• Docker
• Do it yourself (Cluster)
• MiniCluster

Enter Flink MiniCluster
• Similar utilities
  ◦ KafkaServer: launch a Kafka broker in your JVM
  ◦ MiniDFSCluster: launch Hadoop Distributed File System in your JVM

    public static void main(String[] args) throws Exception {
        // One TaskManager with a single slot: the smallest possible cluster
        MiniClusterConfiguration clusterConfig = new MiniClusterConfiguration.Builder()
                .setNumTaskManagers(1)
                .setNumSlotsPerTaskManager(1)
                .build();
        // try-with-resources shuts the cluster down when the block exits
        try (var cluster = new MiniCluster(clusterConfig)) {
            cluster.start();
            cluster.submitJob(/*TODO*/);
        }
    }

Flink MiniCluster: What’s the Size of my JVM?
→ JVM doesn’t start with -Xmx60m, but it does with -Xmx65m.

    Exception in thread "main" java.lang.OutOfMemoryError: Could not allocate enough memory segments for NetworkBufferPool (required (MB): 64, allocated (MB): 61, missing (MB): 3). Cause: Direct buffer memory. The direct out-of-memory error has occurred. This can mean two things: either job(s) require(s) a larger size of JVM direct memory or there is a direct memory leak. The direct memory can be allocated by user code or some of its dependencies. In this case 'taskmanager.memory.task.off-heap.size' configuration option should be increased. Flink framework and its dependencies also consume the direct memory, mostly for network communication. The most of network memory is managed by Flink and should not result in out-of-memory error. In certain special cases, in particular for jobs with high parallelism, the framework may require more direct memory which is not managed by Flink. In this case 'taskmanager.memory.framework.off-heap.size' configuration option should be increased. If the error persists then there is probably a direct memory leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
        at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.<init>(NetworkBufferPool.java:149)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:173)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:128)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:97)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:78)
        at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:57)
        at org.apache.flink.runtime.taskexecutor.TaskManagerServices.createShuffleEnvironment(TaskManagerServices.java:446)
        at org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:304)
        at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:631)
        at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManager(MiniCluster.java:755)
        at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManagers(MiniCluster.java:736)
        at org.apache.flink.runtime.minicluster.MiniCluster.start(MiniCluster.java:457)
        at co.decodable.PassthroughKafka.initMiniClusterWithEnv(PassthroughKafka.java:62)
        at co.decodable.PassthroughKafka.main(PassthroughKafka.java:44)
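A likely reading of this error: the network buffer pool lives in direct (off-heap) memory, and when -XX:MaxDirectMemorySize is not set, HotSpot caps direct memory at roughly the maximum heap size, so a 64 MB pool cannot fit under -Xmx60m. The sketch below is not Flink's code. It uses only the JDK to allocate direct memory in fixed-size segments and report how far it got. The class name is hypothetical, and the 32 KB segment size is an assumption chosen to match Flink's default for 'taskmanager.memory.segment-size'.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectSegmentDemo {

    /** Number of whole segments of segmentSize bytes that fit into totalBytes. */
    public static long segmentCount(long totalBytes, int segmentSize) {
        return totalBytes / segmentSize;
    }

    /**
     * Tries to allocate totalBytes of direct memory in segmentSize chunks.
     * Returns the segments that could be allocated and stops early if the JVM
     * refuses with an OutOfMemoryError ("Direct buffer memory").
     */
    public static List<ByteBuffer> allocateSegments(long totalBytes, int segmentSize) {
        List<ByteBuffer> segments = new ArrayList<>();
        long wanted = segmentCount(totalBytes, segmentSize);
        try {
            for (long i = 0; i < wanted; i++) {
                segments.add(ByteBuffer.allocateDirect(segmentSize));
            }
        } catch (OutOfMemoryError e) {
            // Direct memory limit reached: return what we got instead of crashing.
        }
        return segments;
    }

    public static void main(String[] args) {
        int segmentSize = 32 * 1024;         // assumption: Flink's default segment size
        long required = 64L * 1024 * 1024;   // pool size named in the error message
        List<ByteBuffer> segments = allocateSegments(required, segmentSize);
        long allocatedMb = (long) segments.size() * segmentSize / (1024 * 1024);
        System.out.println("max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
        System.out.println("required (MB): 64, allocated (MB): " + allocatedMb);
    }
}
```

Running it as `java -Xmx60m DirectSegmentDemo.java` should report a shortfall similar to the one in the error message, while `-Xmx65m` should allocate the full 64 MB. The exact numbers depend on the JVM version and collector.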

Flink MiniCluster: Going even smaller
→ Optimizing configuration: reducing the network buffer size from 64 MB to 8 MB allows setting -Xmx20m.

    public static void main(String[] args) throws Exception {
        // Pin the network buffer pool to 8 MB by setting min and max to the same value
        var flinkConfig = new Configuration();
        flinkConfig.set(TaskManagerOptions.NETWORK_MEMORY_MIN, MemorySize.parse("8m"));
        flinkConfig.set(TaskManagerOptions.NETWORK_MEMORY_MAX, MemorySize.parse("8m"));
        MiniClusterConfiguration clusterConfig = new MiniClusterConfiguration.Builder()
                .setNumTaskManagers(1)
                .setNumSlotsPerTaskManager(1)
                .setConfiguration(flinkConfig)
                .build();
        // (the rest of the method is not shown on the slide)
    }

MiniCluster: What have we learned so far?
→ 20 MB of heap space is sufficient to run an empty Flink cluster. The final process size is ~80 MB.
Open items:
• Heap size and process size with a job running
• Throughput
• Heap and process size discrepancy

Flink MiniCluster: Making it real
→ Run a small Flink job reading from Kafka, filtering 1% of the data, writing to Kafka

                       Empty MiniCluster    MiniCluster with Kafka source and sink
Configured Heap        -Xmx20m              -Xmx35m
Approx. Process Size   80 MB                190 MB

⚠ Caveats
• Checkpointing is not enabled
• We are using a Flink job with minimal state, no RocksDB

Benchmarking the Throughput: Setup
Hardware: 10-core Apple M1 Max, 32 GB memory, 1 TB SSD
Versions: Flink 1.17.0, Kafka 3.2.3

Benchmarking the Throughput: Results

Heap limit    Throughput                                  Real memory
35 MB         Out of memory under light load (3 MB/s)     -
50 MB         25 MB/s                                     200 MB
100 MB        97 MB/s                                     260 MB

Benchmarks in Detail: -Xmx50m Heap Limit
[Figure: throughput chart with marked levels of 6 MB/s, 12 MB/s, 25 MB/s and 36 MB/s]

Benchmarks in Detail: -Xmx100m Heap Limit
[Figure: throughput chart, ~100 MB/s]

Heap vs Native Memory

Heap limit    Throughput                                  Real memory
35 MB         Out of memory under light load (3 MB/s)     -
50 MB         25 MB/s                                     200 MB
100 MB        97 MB/s                                     260 MB

Enabling JVM native memory tracking and getting a report [1]:
• JVM argument: -XX:NativeMemoryTracking=summary
• jcmd <pid> VM.native_memory baseline
• jcmd <pid> VM.native_memory summary.diff
[1] https://docs.oracle.com/en/java/javase/17/vm/native-memory-tracking.html
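Part of the gap between heap limit and real memory can also be observed from inside the process, without jcmd. The following is a minimal sketch using the standard java.lang.management beans; the class name and the map labels are made up for illustration. It reports heap, non-heap (for example Metaspace and code cache) and NIO buffer pools, but unlike native memory tracking it does not cover thread stacks or GC structures.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

public class MemorySnapshot {

    /** Collects a few JVM-visible memory figures in bytes, keyed by a readable label. */
    public static Map<String, Long> snapshot() {
        Map<String, Long> result = new LinkedHashMap<>();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        result.put("heap.used", heap.getUsed());
        result.put("heap.committed", heap.getCommitted());
        result.put("nonheap.committed", nonHeap.getCommitted());
        // NIO buffer pools, e.g. "direct" (where network buffers would show up) and "mapped"
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            result.put("bufferpool." + pool.getName() + ".used", pool.getMemoryUsed());
        }
        return result;
    }

    public static void main(String[] args) {
        snapshot().forEach((label, bytes) ->
                System.out.printf("%-40s %8.1f MB%n", label, bytes / (1024.0 * 1024.0)));
    }
}
```

Calling `MemorySnapshot.snapshot()` periodically from the job's main class would give a rough trend, while the jcmd report remains the more complete view.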

Heap vs Native Memory
• Heap (106 MB)
• Thread (217 MB)
• Metaspace (58 MB)
• GC (56 MB)

Heap vs Native Memory
• Reduce thread stack size from default 1 MB to 256 KB: -Xss256k
  ◦ Higher risk of stack overflow exceptions
  ◦ We have ~100 threads
• GC is using 56 MB. Using the more lightweight Serial GC: -XX:+UseSerialGC reduces this to 390 bytes (GC structures and fewer threads)
  ◦ This reduces the throughput of Flink
Further reading: https://shipilev.net/jvm/anatomy-quarks/12-native-memory-tracking/
After tuning: Heap (106 MB), Thread (27 MB), Metaspace (56 MB)
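When experimenting with flags such as -Xss256k and -XX:+UseSerialGC, it can help to confirm from inside the JVM that they actually took effect. A small sketch with the standard management beans follows; the class and method names are hypothetical, and the prefix filter is only a convenience for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class JvmTuningCheck {

    /** Names of the garbage collectors active in this JVM. */
    public static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    /** Keeps only arguments that look like memory or GC tuning flags (-Xmx, -Xss, -XX:...). */
    public static List<String> tuningFlags(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-Xmx") || arg.startsWith("-Xss") || arg.startsWith("-XX:"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to main()
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Collectors:   " + collectorNames());
        System.out.println("Tuning flags: " + tuningFlags(jvmArgs));
    }
}
```

With -XX:+UseSerialGC on HotSpot the collector names are typically "Copy" and "MarkSweepCompact"; with the default G1 collector they start with "G1".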

Conclusions
• 🔬 We can scale Flink down to a process size of ~250 MB.
• 🚀 One MiniCluster is still able to process ~100 MB/s
• MiniCluster runs the same code as a distributed cluster
  ◦ Migrate from MiniCluster to distributed cluster (restoring latest checkpoint or from a savepoint)
  ◦ Supports HA (e.g. you can kill the process, it will continue where it left off)
  ◦ Supports metrics and logging integrations
  ◦ Supports the Flink Web UI (and REST API)
• All code examples: https://github.com/rmetzger/tiny-flink-talk
Switch icons created by Gregor Cresnar - Flaticon

Q&A
Follow me on @rmetzger_

100 Threads ?!
Heavy hitters:
• Flink REST server worker (20)
• Akka (10)
• Common pool (10)
• IO threads (4)
• REST cluster client (4)
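A breakdown like the one above can be approximated from inside a running JVM by grouping live threads by name. The sketch below is one way to do that with the plain JDK. The class name and the suffix-stripping rule are assumptions made for this example and not how the numbers on the slide were produced.

```java
import java.util.Map;
import java.util.TreeMap;

public class ThreadCensus {

    /** Strips a trailing numeric suffix such as "-12" or "#3" so pooled threads group together. */
    public static String groupName(String threadName) {
        return threadName.replaceAll("[-#_ ]*\\d+$", "");
    }

    /** Counts the live threads of this JVM per name group. */
    public static Map<String, Integer> census() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupName(thread.getName()), 1, Integer::sum);
        }
        return counts;
    }

    /** Rough upper bound on stack reservation if every thread reserves a full stack. */
    public static long estimatedStackBytes(int threadCount, long stackSizeBytes) {
        return threadCount * stackSizeBytes;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = census();
        counts.forEach((group, count) -> System.out.println(count + "\t" + group));
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("total threads: " + total);
        System.out.println("stack estimate at 256 KB each (MB): "
                + estimatedStackBytes(total, 256 * 1024) / (1024 * 1024));
    }
}
```

As a sanity check on the arithmetic, 100 threads at 256 KB each come to 25 MB, which is in the same range as the 27 MB thread figure reported after setting -Xss256k.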
