Slide 1

3 Flink Mistakes We Made So You Won't Have To
Robert Metzger, Staff Engineer @ Decodable, Apache Flink Committer and PMC Chair
Sharon Xie, Founding Engineer @ Decodable

Slide 2

What we'll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka
#2 Inefficient Memory Configuration
#3 Inefficient Checkpointing Config

Slide 3

#1 Data Loss with Flink Exactly-Once Delivery to Kafka

Slide 4

Two-Phase Commit for EO: Happy Path

Slide 5

Two-Phase Commit for EO: Phase 1 Failure

Slide 6

Two-Phase Commit for EO: Phase 2 Failure

Slide 7

Life is doomed when… Phase 2 can’t be successful 💣🔥

Slide 8

Important Kafka Broker Configurations
transaction.max.timeout.ms
● Default: 900000 (15 minutes)
transactional.id.expiration.ms
● Default: 604800000 (7 days)

Slide 9

Timeout Causes Data Loss

Slide 10

Excessive Memory Usage
● The Flink Kafka producer creates a new transactional id for each checkpoint, per task
● transactional.id.expiration.ms = 604800000 (7 days)
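To see why this adds up on the broker, here is a back-of-the-envelope calculation. The checkpoint interval and sink parallelism are assumed numbers; the 7-day expiration is the Kafka default from the previous slide:

```python
# Rough estimate of how many transactional ids a Kafka broker must track
# when Flink opens a fresh id per checkpoint per task, and ids only
# expire after transactional.id.expiration.ms.
ID_EXPIRATION_MS = 604_800_000    # 7 days, Kafka default
CHECKPOINT_INTERVAL_MS = 10_000   # assumed: checkpoint every 10 s
SINK_TASKS = 100                  # assumed sink parallelism

ids_per_task = ID_EXPIRATION_MS // CHECKPOINT_INTERVAL_MS
total_ids = ids_per_task * SINK_TASKS
print(ids_per_task, total_ids)    # 60480 6048000
```

Over 6 million live transactional ids for one modest job is broker state that never does useful work, which is what the next slide's configuration change addresses.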

Slide 11

Better Kafka Transaction Configuration
● transaction.max.timeout.ms = 604800000 (7 days)
○ From default: 15 min
● transactional.id.expiration.ms = 3600000 (1 hour)
○ From default: 7 days
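On the broker side, the two settings above would be applied in the broker configuration (a sketch of a `server.properties` fragment, values taken from the slide):

```properties
# Kafka broker configuration (server.properties)
# Allow long-running Flink transactions to survive extended downtime...
transaction.max.timeout.ms=604800000
# ...and expire unused transactional ids quickly to bound broker memory.
transactional.id.expiration.ms=3600000
```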

Slide 12

InvalidPidMappingException
When the checkpoint/savepoint you restore from is more than 1 hour (the new transactional.id.expiration.ms) old:
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.

Slide 13

Fix InvalidPidMappingException
Short-term: Ignore InvalidPidMappingException 😇
● ONLY when transaction.timeout.ms (Kafka client configuration in Flink) > transactional.id.expiration.ms
Long-term: 🤝
● KIP-939: Support Participation in 2PC
● FLIP-319: Integrate with Kafka's Support for Proper 2PC Participation
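The safety condition in the short-term fix can be stated as a one-line check (a sketch; the function name is ours, not a Flink or Kafka API):

```python
def safe_to_ignore_invalid_pid_mapping(transaction_timeout_ms: int,
                                       id_expiration_ms: int) -> bool:
    """Ignoring InvalidPidMappingException is only safe when the Flink
    producer's transaction.timeout.ms exceeds the broker's
    transactional.id.expiration.ms: then an id that expired can no
    longer belong to a transaction that might still need committing."""
    return transaction_timeout_ms > id_expiration_ms

# With the tuned values from the previous slide (7 days > 1 hour):
print(safe_to_ignore_invalid_pid_mapping(604_800_000, 3_600_000))  # True
# A short client timeout against the default 7-day expiration is NOT safe:
print(safe_to_ignore_invalid_pid_mapping(3_600_000, 604_800_000))  # False
```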

Slide 14

What we'll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka ✅
#2 Inefficient Memory Configuration
#3 Inefficient Checkpointing Config

Slide 15

#2 Inefficient Memory Configuration

Slide 16

How to Tune TaskManager Memory
● Flink automatically computes memory budgets: just provide the total process size.
● Main memory consumers:
○ Framework + Task heap
○ RocksDB state backend (off-heap)
○ Network stack (off-heap)
○ JVM internal structures [metaspace, thread stacks] (off-heap)

Slide 17

How to Tune TaskManager Memory
● Example: taskmanager.memory.process.size: 8gb
[Diagram: the 8 GB process is split among Framework + Task heap, the RocksDB state backend (off-heap), the network stack (off-heap), and JVM internal structures such as metaspace and thread stacks (off-heap)]

Slide 18

How to Tune TaskManager Memory
● Let's tune for this particular job
[Diagram: 150 MB + 700 MB + 2300 MB = 3150 MB of unused memory across the pools]

Slide 19

How to Tune TaskManager Memory
● Give as much memory as possible to Managed Memory (= RocksDB):
taskmanager.memory.task.heap.size: 1 gb
taskmanager.memory.managed.size: 5800 mb
taskmanager.memory.network.min: 32 mb
taskmanager.memory.network.max: 32 mb
taskmanager.memory.jvm-metaspace.size: 120 mb
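A quick sanity check that the explicitly configured pools above fit into the 8 GB process (a sketch; Flink additionally reserves framework heap/off-heap and JVM overhead, which is what the remaining headroom is left for):

```python
# Explicitly configured TaskManager pools, in MB (values from the slide).
task_heap = 1024          # taskmanager.memory.task.heap.size: 1 gb
managed = 5800            # taskmanager.memory.managed.size (RocksDB)
network = 32              # taskmanager.memory.network.min/max
metaspace = 120           # taskmanager.memory.jvm-metaspace.size

process_size = 8 * 1024   # taskmanager.memory.process.size: 8gb
explicit = task_heap + managed + network + metaspace
headroom = process_size - explicit  # left for framework memory + JVM overhead
print(explicit, headroom)           # 6976 1216
```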

Slide 20

Memory Configuration Wrap Up
● Stateful workloads with RocksDB benefit most from as much memory as possible
→ Check out the full documentation: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup/

Slide 21

What we'll be talking about today
#1 Data Loss with Flink Exactly-Once Delivery to Kafka ✅
#2 Inefficient Memory Configuration ✅
#3 Inefficient Checkpointing Config

Slide 22

#3 Reliable, Fast Checkpointing
execution.checkpointing.interval: 10s
execution.checkpointing.min-pause: 10s
Make sure your job is not spending all its time checkpointing.
Image source: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#tuning-checkpointing
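The min-pause setting bounds how much of the job's wall-clock time can go to checkpointing. A sketch of the arithmetic (the checkpoint duration is an assumed number):

```python
def max_checkpointing_fraction(checkpoint_duration_s: float,
                               min_pause_s: float) -> float:
    """Upper bound on the fraction of time spent checkpointing when each
    checkpoint starts as soon as the min-pause after the previous one ends."""
    return checkpoint_duration_s / (checkpoint_duration_s + min_pause_s)

# Even if every checkpoint took a full 10 s, a 10 s min-pause caps the
# job at 50% of its time spent checkpointing.
print(max_checkpointing_fraction(10, 10))   # 0.5
```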

Slide 23

Reliable, Fast Checkpointing
state.backend: rocksdb
state.backend.incremental: true
Only upload the diff to the last checkpoint.
[Diagram: checkpoint #33 full, #34 incremental, #35 incremental]
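Why incremental checkpoints matter, with assumed numbers (about 10 GB of state, roughly 2% of it changing between checkpoints):

```python
state_mb = 10_000   # assumed total RocksDB state size (~10 GB)
churn_pct = 2       # assumed share of state changed per checkpoint interval

full_upload_mb = state_mb                           # full checkpoint: everything
incremental_upload_mb = state_mb * churn_pct // 100  # only the diff
print(full_upload_mb, incremental_upload_mb)         # 10000 200
```

Under these assumptions an incremental checkpoint uploads 50x less data, at the cost of restores that may need to read several checkpoint increments.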

Slide 24

Reliable, Fast Checkpointing
state.backend.local-recovery: true
Local recovery: only re-download the state on failed machines.
Without local recovery: after a failure, all TaskManagers download the state.
With local recovery: most machines use local disks; only the failed one needs to download.
[Diagram: TM1–TM4; (1) TM4 fails, (2) recovery, shown with and without local recovery]

Slide 25

Fast Checkpointing and State
Put your RocksDB state on the fastest available disk, typically a local SSD.
[Diagram: a TaskManager writing to a remote EBS volume vs. a TaskManager writing to a local SSD]

Slide 26

The End – Q&A
Robert Metzger, Staff Engineer @ Decodable, Apache Flink Committer and PMC Chair
Sharon Xie, Founding Engineer @ Decodable
Get your free decodable.co account today if you want us to handle the issues discussed in this talk.
Visit the Decodable booth (201) for any Flink-related questions.

Slide 27

Fast Checkpointing and State
● RocksDB stores your state in the /tmp directory
● On AWS Kubernetes, that's by default an EBS volume

Type                    | Size   | IOPS (max)                  | Throughput | Price per Month
io1                     | 950 GB | 64000                       |            | $4278
io2 block express       | 950 GB | 256000                      |            | $9769
gp3                     | 950 GB | 16000                       | 1000 mb/s  | $176
m6gd.4xlarge (16c, 64g) | 950 GB | Read: 93000 / Write: 222000 |            | $78 per instance (local NVMe SSD)

→ Using an instance type with a local SSD gives you by far the best performance per $.
We just mount the entire Docker working directory on the local SSD.
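The "performance per $" claim can be checked directly from the table (a sketch; the monthly prices and IOPS are the slide's figures, using read IOPS for the instance type):

```python
# Monthly price per 1000 IOPS, computed from the table above.
options = {
    "io1":               (64_000,  4278),
    "io2 block express": (256_000, 9769),
    "gp3":               (16_000,  176),
    "m6gd.4xlarge":      (93_000,  78),   # read IOPS, local NVMe SSD
}
cost_per_kiops = {name: price / (iops / 1000)
                  for name, (iops, price) in options.items()}
for name, cost in sorted(cost_per_kiops.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:.2f} per 1000 IOPS/month")
```

The local NVMe SSD comes out around $0.84 per 1000 IOPS/month, more than an order of magnitude cheaper than any of the EBS options.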

Slide 28

Recap
● Flink EO with Kafka can still cause data loss
● Transaction timeout is the key
● Flink's EO implementation can cause excessive memory usage in Kafka
● A better approach for Flink + Kafka is under way