Tiny Flink: Minimizing the Memory Footprint of Apache Flink

Apache Flink was designed for, and is mostly used in, large-scale real-time data processing use-cases. Companies report processing TBs of data per second, or keeping TBs of state in huge clusters.

But what if you need to process low-throughput streams? Running a full, distributed Flink cluster might be overkill, as there’s quite a bit of overhead for distributed coordination.

In this talk, we’ll explore options to reduce your resource footprint. We’ll dive deeper into Flink’s MiniCluster, which lets you run Flink in-JVM for integration tests, as a microservice, or as a small processor for your data in Kubernetes. We will also discuss lessons learned from running MiniCluster in production for a service offering Flink SQL in the cloud.

Robert Metzger

June 20, 2023

Transcript

  1. Tiny Flink: Minimizing the Memory Footprint of Apache Flink
    Robert Metzger, Staff Engineer @ Decodable
    Apache Flink Committer and PMC Chair

  2. What is Apache Flink?
    Stateful Computations over Data Streams
    ● Highly Scalable
    ● Exactly-once processing semantics
    ● Event time semantics and watermarks
    ● Layered APIs: Streaming SQL (easy to use) ↔ DataStream (expressive)

  3. Why Minimize Flink?
    Motivation 1: Minimum Flink cluster size (with the K8s operator and Flink config defaults).
    650mb + 1024mb = 1.67gb minimum memory of a Flink cluster
    Motivation 2: Deploy like other JVM-based microservices or apps. No need to use a cluster manager.
    → Unified deployment, monitoring and operations for all services
    → Not all use-cases process millions of events per second
    Motivation 3: Local Development & Testing.
    Flink in your IDE, integration tests and on your CI system.
    No datacenter required for local development.

  4. Deployment Options for Apache Flink
    ● Kubernetes
    ○ Operator
    ○ Native Kubernetes: Flink manages resources on K8s
    ○ Standalone Kubernetes: Resources are fixed/managed externally
    ● Docker
    ● Do it yourself (Cluster)
    ● MiniCluster

  5. Enter Flink MiniCluster
    ● Similar utilities
    ○ KafkaServer: launch a Kafka broker in your JVM
    ○ MiniDFSCluster: Launch Hadoop Distributed File System in your JVM
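    import org.apache.flink.runtime.minicluster.MiniCluster;
    import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;
    public static void main(String[] args) throws Exception {
        // One TaskManager with a single slot is the smallest possible setup.
        MiniClusterConfiguration clusterConfig =
            new MiniClusterConfiguration.Builder()
                .setNumTaskManagers(1)
                .setNumSlotsPerTaskManager(1)
                .build();
        // try-with-resources shuts the MiniCluster down when the block exits.
        try (var cluster = new MiniCluster(clusterConfig)) {
            cluster.start();
            cluster.submitJob(/*TODO*/);
        }
    }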

  6. Flink MiniCluster: What’s the Size of my JVM?
    → JVM doesn’t start with -Xmx60m, but it does with -Xmx65m.
    Exception in thread "main" java.lang.OutOfMemoryError: Could not allocate enough memory segments for NetworkBufferPool (required (MB): 64,
    allocated (MB): 61, missing (MB): 3). Cause: Direct buffer memory. The direct out-of-memory error has occurred. This can mean two things:
    either job(s) require(s) a larger size of JVM direct memory or there is a direct memory leak. The direct memory can be allocated by user
    code or some of its dependencies. In this case 'taskmanager.memory.task.off-heap.size' configuration option should be increased. Flink
    framework and its dependencies also consume the direct memory, mostly for network communication. The most of network memory is managed by
    Flink and should not result in out-of-memory error. In certain special cases, in particular for jobs with high parallelism, the framework
    may require more direct memory which is not managed by Flink. In this case 'taskmanager.memory.framework.off-heap.size' configuration
    option should be increased. If the error persists then there is probably a direct memory leak in user code or some of its dependencies
    which has to be investigated and fixed. The task executor has to be shutdown...
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.<init>(NetworkBufferPool.java:149)
    at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:173)
    at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:128)
    at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createNettyShuffleEnvironment(NettyShuffleServiceFactory.java:97)
    at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:78)
    at org.apache.flink.runtime.io.network.NettyShuffleServiceFactory.createShuffleEnvironment(NettyShuffleServiceFactory.java:57)
    at org.apache.flink.runtime.taskexecutor.TaskManagerServices.createShuffleEnvironment(TaskManagerServices.java:446)
    at org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:304)
    at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:631)
    at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManager(MiniCluster.java:755)
    at org.apache.flink.runtime.minicluster.MiniCluster.startTaskManagers(MiniCluster.java:736)
    at org.apache.flink.runtime.minicluster.MiniCluster.start(MiniCluster.java:457)
    at co.decodable.PassthroughKafka.initMiniClusterWithEnv(PassthroughKafka.java:62)
    at co.decodable.PassthroughKafka.main(PassthroughKafka.java:44)
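
The error above is about direct (off-heap) buffer memory rather than the heap itself. On HotSpot JVMs, when -XX:MaxDirectMemorySize is not set, the direct buffer limit defaults to roughly the maximum heap size, which is why -Xmx decides whether the 64mb NetworkBufferPool fits. The following is a minimal sketch in plain Java, with no Flink dependency, that mimics this allocation pattern. The class and method names are hypothetical, and the 32 KB segment size is an assumption based on Flink's default memory segment size; none of this is code from the talk.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class DirectBufferProbe {
    // Assumption: 32 KB per segment, matching Flink's default memory segment size.
    static final int SEGMENT_SIZE = 32 * 1024;

    /** Number of fixed-size segments needed to fill a pool of the given size. */
    static long segmentsFor(long poolBytes, int segmentSize) {
        return poolBytes / segmentSize;
    }

    /**
     * Tries to allocate the whole pool as direct buffers.
     * Returns how many segments could be allocated before the JVM refused.
     */
    static long allocate(long poolBytes, int segmentSize) {
        List<ByteBuffer> segments = new ArrayList<>();
        long wanted = segmentsFor(poolBytes, segmentSize);
        try {
            for (long i = 0; i < wanted; i++) {
                segments.add(ByteBuffer.allocateDirect(segmentSize));
            }
        } catch (OutOfMemoryError e) {
            // The direct memory limit was reached before the pool was complete.
        }
        return segments.size();
    }

    public static void main(String[] args) {
        long pool = 64L * 1024 * 1024; // the 64mb pool size from the error message
        long wanted = segmentsFor(pool, SEGMENT_SIZE);
        long got = allocate(pool, SEGMENT_SIZE);
        System.out.println("max heap (bytes):   " + Runtime.getRuntime().maxMemory());
        System.out.println("segments wanted:    " + wanted);
        System.out.println("segments allocated: " + got);
    }
}
```

Running it once with `java -Xmx60m DirectBufferProbe.java` and once with a larger limit should show the allocated segment count falling short in the first case; the exact numbers depend on the JVM.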

  7. Flink MiniCluster: Going even smaller
    → Optimizing configuration: reducing the network buffer size from 64mb to 8mb
    allows setting -Xmx20m.
    public static void main(String[] args) throws Exception {
        // Pin the network buffer pool to 8mb by setting min and max to the same value.
        var flinkConfig = new Configuration();
        flinkConfig.set(TaskManagerOptions.NETWORK_MEMORY_MIN, MemorySize.parse("8m"));
        flinkConfig.set(TaskManagerOptions.NETWORK_MEMORY_MAX, MemorySize.parse("8m"));
        MiniClusterConfiguration clusterConfig =
            new MiniClusterConfiguration.Builder()
                .setNumTaskManagers(1)
                .setNumSlotsPerTaskManager(1)
                .setConfiguration(flinkConfig)
                .build();
        // Start the MiniCluster with clusterConfig as on slide 5.
    }

  8. MiniCluster: What have we learned so far?
    → 20mb of heap space is sufficient to run an empty Flink cluster.
    Final process size is ~80mb.
    Open items:
    ● Heap size, process size with a job running
    ● Throughput
    ● Heap and process size discrepancy

  9. Flink MiniCluster: Making it real
    → Run a small Flink job reading from Kafka, filtering 1% of the data, writing to Kafka
                          Empty MiniCluster   MiniCluster with Kafka source and sink
    Configured Heap       -Xmx20m             -Xmx35m
    Approx Process Size   80mb                190mb
    ⚠ Caveats
    ● Checkpointing is not enabled
    ● We are using a Flink job with minimal state, no RocksDB

  10. Benchmarking the Throughput: Setup
    Hardware: 10-core Apple M1 Max, 32GB memory, 1TB SSD
    Versions: Flink 1.17.0, Kafka 3.2.3

  11. Benchmarking the Throughput: Results
    Heap limit   Throughput                               Real memory
    35 mb        Out of Memory under light load (3mb/s)
    50 mb        25 mb/s                                  200 mb
    100 mb       97 mb/s                                  260 mb

  12. Benchmarks in Detail: -Xmx50m Heap Limit
    [Chart annotations: 6mb/s, 12mb/s, 25 mb/s, 36 mb/s]

  13. Benchmarks in Detail: -Xmx100m Heap Limit
    ~100 mb/s

  14. Heap vs Native Memory
    Heap limit   Throughput                               Real memory
    35 mb        Out of Memory under light load (3mb/s)
    50 mb        25 mb/s                                  200 mb
    100 mb       97 mb/s                                  260 mb
    Enabling JVM native memory tracking and getting a report [1]:
    ● JVM argument: -XX:NativeMemoryTracking=summary
    ● jcmd <pid> VM.native_memory baseline
    ● jcmd <pid> VM.native_memory summary.diff
    [1] https://docs.oracle.com/en/java/javase/17/vm/native-memory-tracking.html
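
jcmd needs the process id of the target JVM. As a small sketch, the snippet below assembles the two native memory tracking commands from this slide for a given pid and falls back to the current JVM's pid via ProcessHandle. The class and method names are hypothetical. The snippet only prints the commands; the target JVM must have been started with -XX:NativeMemoryTracking=summary for them to return a report.

```java
import java.util.List;

final class NmtCommands {
    /** Command that records the baseline to compare against later. */
    static List<String> baseline(long pid) {
        return List.of("jcmd", Long.toString(pid), "VM.native_memory", "baseline");
    }

    /** Command that reports the change in native memory since the baseline. */
    static List<String> summaryDiff(long pid) {
        return List.of("jcmd", Long.toString(pid), "VM.native_memory", "summary.diff");
    }

    public static void main(String[] args) {
        // Use the pid passed on the command line, otherwise this JVM's own pid.
        long pid = args.length > 0
                ? Long.parseLong(args[0])
                : ProcessHandle.current().pid();
        System.out.println(String.join(" ", baseline(pid)));
        System.out.println(String.join(" ", summaryDiff(pid)));
    }
}
```

Take the baseline first, let the job run under load for a while, then request the diff.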

  15. Heap vs Native Memory
    Heap (106mb)
    Thread (217mb)
    Metaspace (58mb)
    GC (56mb)
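
Part of this breakdown is also visible from inside the JVM through the standard java.lang.management API, without native memory tracking. The sketch below lists used bytes per memory pool, split into heap and non-heap. It is a different and coarser technique than the NMT report in the talk: it covers pools such as Metaspace and the code cache, but not thread stacks or GC structures. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.LinkedHashMap;
import java.util.Map;

final class MemoryPools {
    /** Used bytes of every valid memory pool of the given type, keyed by pool name. */
    static Map<String, Long> usedBytesByPool(MemoryType type) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == type && pool.isValid()) {
                result.put(pool.getName(), pool.getUsage().getUsed());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Heap pools:");
        usedBytesByPool(MemoryType.HEAP)
                .forEach((name, used) -> System.out.println("  " + name + ": " + used + " bytes"));
        System.out.println("Non-heap pools:");
        usedBytesByPool(MemoryType.NON_HEAP)
                .forEach((name, used) -> System.out.println("  " + name + ": " + used + " bytes"));
    }
}
```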

  16. Heap vs Native Memory
    ● Reduce thread stack size from default 1mb to 256kb: -Xss256k
    ○ Higher risk of stack overflow exceptions
    ○ We have ~100 threads
    ● GC is using 56mb. Switching to the more lightweight Serial GC (-XX:+UseSerialGC)
    reduces this to 390 bytes (GC structures and fewer threads)
    ○ This reduces the throughput of Flink
    Further reading: https://shipilev.net/jvm/anatomy-quarks/12-native-memory-tracking/
    Heap (106mb)
    Thread (27mb)
    Metaspace (56mb)
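
A back-of-the-envelope estimate shows why the stack size matters with ~100 threads: the reserved stack memory is roughly the thread count times the per-thread stack size. The sketch below computes that product for the two stack sizes on this slide and also shows how a single thread can request a smaller stack through the standard Thread constructor. The class and method names are hypothetical, and the figures on the slide are measurements that need not match this simple product.

```java
final class ThreadStackEstimate {
    /** Rough stack reservation: thread count times per-thread stack size. */
    static long reservedStackBytes(int threads, long stackBytesPerThread) {
        return threads * stackBytesPerThread;
    }

    /**
     * Creates a thread that requests a specific stack size.
     * The JVM treats the requested size as a hint and may round or ignore it.
     */
    static Thread smallStackThread(Runnable task, long stackBytes) {
        return new Thread(null, task, "small-stack-worker", stackBytes);
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 100;
        long mb = 1024 * 1024;
        System.out.println("1mb stacks:   " + reservedStackBytes(threads, mb) / mb + "mb");
        System.out.println("256kb stacks: " + reservedStackBytes(threads, 256 * 1024) / mb + "mb");

        Thread t = smallStackThread(
                () -> System.out.println("running with a requested 256kb stack"),
                256 * 1024);
        t.start();
        t.join();
    }
}
```

The per-thread constructor is useful when only a few application threads are under your control; -Xss256k changes the default for all threads, including those started by Flink and its dependencies.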

  17. Conclusions
    ● 🔬 We can scale Flink down to a process size of ~250mb.
    ● 🚀 One MiniCluster is still able to process ~100mb/s
    ● MiniCluster runs the same code as a distributed cluster
    ○ Migrate from MiniCluster to distributed cluster (restoring latest checkpoint or from a savepoint)
    ○ Supports HA (e.g. you can kill the process, it will continue where it left off)
    ○ Supports metrics and logging integrations
    ○ Supports the Flink Web UI (and REST API)
    ● All code examples: https://github.com/rmetzger/tiny-flink-talk
    Switch icons created by Gregor Cresnar - Flaticon

  18. Tiny Flink: Minimizing the Memory Footprint of Apache Flink
    Robert Metzger, Staff Engineer @ Decodable
    Apache Flink Committer and PMC Chair
    Q&A
    Follow me on @rmetzger_

  20. 100 Threads ?!
    Heavy hitters:
    ● Flink REST server worker (20)
    ● Akka (10)
    ● Common pool (10)
    ● IO threads (4)
    ● RestClusterClient (4)
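
A thread census like the one above can be taken from inside any JVM by grouping live threads by name. The sketch below strips a trailing numeric suffix from each thread name and counts the threads per group. The class name, method names and grouping rule are assumptions for illustration; this is not how the numbers on the slide were produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreadCensus {
    /** Drops a trailing counter such as "-3" or "#12" so pool members share one group. */
    static String groupOf(String threadName) {
        return threadName.replaceAll("[-_ #]*\\d+$", "");
    }

    /** Counts thread names per group, sorted by group name. */
    static Map<String, Integer> countByGroup(Iterable<String> threadNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            counts.merge(groupOf(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM.
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        countByGroup(names).forEach(
                (group, count) -> System.out.println(count + "  " + group));
    }
}
```

Calling the same grouping from within a job's main method, after the MiniCluster has started, would show which thread pools dominate.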
