Slide 1

Exactly-Once Semantics Revisited: Distributed Transactions across Flink and Kafka
Tzu-Li (Gordon) Tai, Staff Software Engineer, Confluent
Alexander Sorokoumov, Staff Software Engineer, Confluent

Slide 2

Why “Revisit” EOS?

Slide 3

Agenda
01 Primer: How Flink achieves EOS
02 Flink’s KafkaSink: Current state and issues
03 Enter KIP-939: Kafka’s support for 2PC
04 Putting things together with FLIP-319

Slide 4

Primer: End-to-End EOS with Flink

Slide 5

End-to-End EOS with Apache Flink
(diagram: data sources → data pipeline → data sinks)

Slide 6

End-to-End EOS with Apache Flink
(diagram: data sources → data pipeline → data sinks, plus checkpoints in blob storage, e.g. S3)
Checkpoints store:
● internal compute state
● external transaction identifiers

Slide 7

End-to-End EOS with Apache Flink
(same diagram)
Checkpoints store:
● internal compute state
● external transaction identifiers
Distributed transaction across all data sinks and Flink internal state!

Slide 8

End-to-End EOS with Apache Flink
(same diagram)
Checkpoints store:
● internal compute state
● external transaction identifiers
Distributed transaction across all data sinks and Flink internal state!
… and Flink is the transaction coordinator

Slide 9

Distributed Transactions via 2PC
(diagram: the Transaction Coordinator, with writes flowing to participants A, B, and C)

Slide 10

Distributed Transactions via 2PC
Phase #1: Prepare / Voting
(diagram: the coordinator sends “prepare” to participants A, B, and C)

Slide 11

Distributed Transactions via 2PC
Phase #1: Prepare / Voting
(each participant FLUSHes its pending writes upon receiving “prepare”)

Slide 12

Distributed Transactions via 2PC
Phase #1: Prepare / Voting
(a participant votes YES once its flush completes)

Slide 13

Distributed Transactions via 2PC
Phase #1: Prepare / Voting
(all participants vote YES; the coordinator persists the phase-1 decision: COMMIT)

Slide 14

Distributed Transactions via 2PC
Phase #1: Prepare / Voting
(one participant votes NO; the coordinator persists the phase-1 decision: ABORT)

Slide 15

Distributed Transactions via 2PC
Phase #2: Commit / Abort
(the coordinator sends COMMIT / ABORT to all participants)
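The two phases above can be sketched as a toy coordinator. This is a minimal Python sketch with invented names (`Participant`, `run_2pc`), not Flink or Kafka code:

```python
# Minimal sketch of two-phase commit; all names are illustrative.
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.flushed = False
        self.outcome = None

    def prepare(self):
        # Phase #1: flush buffered writes, then cast a YES / NO vote.
        self.flushed = True
        return self.will_vote_yes

    def finish(self, commit):
        # Phase #2: apply the coordinator's decision.
        self.outcome = "COMMIT" if commit else "ABORT"


def run_2pc(participants, decision_log):
    votes = [p.prepare() for p in participants]
    decision = all(votes)  # a single NO is enough to abort
    # Persist the phase-1 decision before telling anyone about it.
    decision_log.append("COMMIT" if decision else "ABORT")
    for p in participants:
        p.finish(decision)
    return decision
```

Running it with three YES voters commits everywhere; swapping one participant to `will_vote_yes=False` aborts everywhere, mirroring slides 13 and 14.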

Slide 16

Driving 2PC with Asynchronous Barrier Snapshotting
● Flink generates checkpoints periodically using asynchronous barrier snapshotting
● Each checkpoint attempt can be seen as a 2PC attempt
(diagram: data sources → … → data sinks, with Flink as the txn coordinator; checkpoints hold internal compute state and external transaction identifiers)

Slide 17

(diagram: Kafka Sources → Windows 0…N → Kafka Sinks 0…N, coordinated by the JobManager (txn coordinator), with a CheckpointMetastore (Zookeeper / etcd) and checkpoints in blob storage, e.g. S3; the legend distinguishes records of stream partitions 0…N, uncommitted vs. committed records, and current progress)

Slide 18

(same diagram; the sources, windows, and sinks are highlighted as the 2PC PARTICIPANTS)

Slide 19

Checkpoint In-Progress (Phase #1: Voting)
(the JobManager starts the checkpoint)

Slide 20

Checkpoint In-Progress (Phase #1: Voting)
(the JobManager injects checkpoint barriers at the sources)

Slide 21

Checkpoint In-Progress (Phase #1: Voting)
(the sources 1. ASYNC WRITE their offsets to the checkpoint, then 2. ACK (“YES”))

Slide 22

Checkpoint In-Progress (Phase #1: Voting)
(the barriers flow downstream; the source offsets are now in the checkpoint)

Slide 23

Checkpoint In-Progress (Phase #1: Voting)
(the window operators snapshot their state to the checkpoint)

Slide 24

Checkpoint In-Progress (Phase #1: Voting)
(the Kafka sinks FLUSH their pending records)

Slide 25

Checkpoint In-Progress (Phase #1: Voting)
(the sink transactions are now PREPARED)

Slide 26

Checkpoint In-Progress (Phase #1: Voting)
(the sinks write their transaction IDs (TIDs) to the checkpoint)

Slide 27

Checkpoint In-Progress (Phase #1: Voting)
(the checkpoint now holds offsets, state, and TIDs: a consistent view of the world at checkpoint N)

Slide 28

Voting Decision Made
(the JobManager REGISTERs the CHECKPOINT in the CheckpointMetastore)

Slide 29

Checkpoint Success (Phase #2: Commit)
(the JobManager tells the Kafka sinks to COMMIT their transactions)

Slide 30

What happens in case of a failure? (post-checkpoint)
(a failure hits while the COMMIT messages are in flight)

Slide 31

1. Restart job

Slide 32

2. Restore last checkpoint
(READ offsets, state, and TIDs back from the checkpoint)

Slide 33

2. Restore last checkpoint
(the sinks READ their TIDs, then RESUME & COMMIT! the prepared transactions)
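The restore-and-commit path can be sketched as follows. This is an illustrative Python sketch with invented names (`FakeSink`, `restore_and_commit`); real Flink persists the TIDs inside its checkpoint state and drives the sinks through the sink-writer API:

```python
class FakeSink:
    """Invented stand-in for a transactional sink; not a real Flink API."""
    def __init__(self):
        self.resumed = None
        self.committed = []

    def resume(self, tid):
        self.resumed = tid          # reattach to the prepared transaction

    def commit(self, tid):
        assert self.resumed == tid, "must resume before committing"
        self.committed.append(tid)


def restore_and_commit(checkpoint, sinks):
    # checkpoint maps each sink name to the transaction ID (TID) that was
    # prepared before the failure; recovery rolls that decision forward.
    committed = []
    for name, tid in checkpoint["tids"].items():
        sinks[name].resume(tid)
        sinks[name].commit(tid)
        committed.append(tid)
    return committed
```

The key property: the checkpoint, not the sink, is the source of truth for which transactions must be committed after a restart.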

Slide 34

KafkaSink: Current Issues with EOS

Slide 35

Problem #1: In-doubt transactions can be aborted by Kafka, outside of Flink’s control

Slide 36

transaction.timeout.ms Kafka config
● Timeout period after the first write to an open transaction, before it gets automatically aborted
● Default value: 15 minutes
● Provides the means to prevent the LSO (last stable offset) from getting stuck due to permanently failed producers
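The broker-side expiry rule amounts to a simple clock check. This is a toy Python model, not broker code; `is_expired` is an invented name:

```python
# Toy model of broker-side transaction expiry (transaction.timeout.ms).
DEFAULT_TRANSACTION_TIMEOUT_MS = 15 * 60 * 1000  # Kafka's default: 15 minutes

def is_expired(first_write_ms, now_ms, timeout_ms=DEFAULT_TRANSACTION_TIMEOUT_MS):
    # The clock starts at the first write to the open transaction; once it
    # exceeds the timeout, the broker aborts the txn so the LSO can advance.
    return now_ms - first_write_ms > timeout_ms
```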

Slide 37

(recap diagram) Voting Decision Made
(the JobManager REGISTERs the CHECKPOINT in the CheckpointMetastore)

Slide 38

Checkpoint Success (Phase #2: Commit)
(the JobManager sends COMMIT, but the sink TXNS have ALREADY TIMED OUT!)

Slide 39

Suggested mitigations (so far)
● Set transaction.timeout.ms to be as large as possible (capped by broker-side config)
● No matter how large you set it, there’s always some possibility of inconsistency

Slide 40

Problem #2: In-doubt transactions cannot be recovered

Slide 41

(recap diagram) What happens in case of a failure? (post-checkpoint)
(a failure hits while the COMMIT messages are in flight)

Slide 42

InitProducerId request always aborts previous txns
● When a producer client instance restarts, it is expected to always issue InitProducerId to obtain its producer ID and epoch
● The protocol always assumed only local transactions by a single producer
○ If the producer fails mid-transaction, roll back the transaction

Slide 43

Bypassing the protocol with Java Reflection (YUCK!)
● Flink persists {transaction id, producer ID, epoch} as part of its checkpoints
○ Obtained via reflection
● Upon restore from checkpoint and KafkaSink restart:
○ Inject the producer ID and epoch into the Kafka producer client (again, via reflection)
○ Commit the transaction

Slide 44

KIP-939: Support Participation in 2PC

Slide 45

Example Application Scenario
(diagram: an App, a Kafka cluster that contains event logs, and a database that contains app state; how should writes to the two be kept consistent?)

Slide 46

Scenario 1: App->DB->Kafka
(diagram: the App writes to the DB, which contains app state; changes propagate to Kafka, which contains event logs, via CDC)

Slide 47

Scenario 2: App->Kafka->DB
(diagram: the App writes to Kafka, which contains event logs; the events are then written on to the DB, which contains app state)

Slide 48

Scenario 3: Dual Write
(diagram: the App writes to Kafka and the DB independently)

Slide 49

Better Solution: Coordinated Dual Write
(diagram: the App writes to Kafka and the DB under an ATOMIC COMMIT)

Slide 50

Why can’t we do external 2PC with Kafka right now?
Kafka brokers automatically abort a transaction, regardless of its status, if:
1. A producer (re)starts with the same transactional.id
2. A transaction runs longer than transaction.timeout.ms
KafkaProducer#commitTransaction combines the VOTING and COMMIT phases:
1. KafkaProducer flushes data for all registered partitions. A successful flush is an implicit YES vote in the 2PC VOTING phase.
2. Right after that, the Kafka brokers automatically commit the transaction.

Slide 51

KIP-939: Support Participation in 2PC
KafkaProducer changes:
● class PreparedTxnState, describing the state of a prepared transaction
● KafkaProducer#initTransactions(boolean keepPreparedTxn), which allows resuming txns
● KafkaProducer#prepareTransaction, which returns a PreparedTxnState
● KafkaProducer#completeTransaction(PreparedTxnState), which commits or aborts the txn
AdminClient changes:
● ListTransactionsOptions#runningLongerThanMs(long runningLongerThanMs)
● ListTransactionsOptions#runningLongerThanMs()
● Admin#forceTerminateTransaction(String transactionalId)
ACL changes:
● New AclOperation: TWO_PHASE_COMMIT
Client/Broker configuration:
● transaction.two.phase.commit.enable: false
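The semantics of the new producer calls can be modeled in a few lines. This is a toy in-memory Python model (the real API is on the Java KafkaProducer); `ToyProducer` and its state machine are invented for illustration:

```python
# Toy model of the KIP-939 producer surface; not a real Kafka client.
class PreparedTxnState:
    def __init__(self, producer_id, epoch):
        self.producer_id = producer_id
        self.epoch = epoch

    def __eq__(self, other):
        return (isinstance(other, PreparedTxnState)
                and (self.producer_id, self.epoch)
                == (other.producer_id, other.epoch))


class ToyProducer:
    def __init__(self):
        self.state = "idle"       # idle -> open -> prepared -> done
        self.prepared = None

    def init_transactions(self, keep_prepared_txn=False):
        if not keep_prepared_txn:
            self.prepared = None  # classic behavior: abort any previous txn
            self.state = "idle"
        # keep_prepared_txn=True leaves a prepared txn resumable

    def begin(self):
        self.state = "open"

    def prepare_transaction(self):
        # Voting phase only: flush, then stay in-doubt until completed.
        self.state = "prepared"
        self.prepared = PreparedTxnState(producer_id=42, epoch=7)
        return self.prepared

    def complete_transaction(self, recorded_state):
        # Commit if the externally recorded state matches, else abort.
        outcome = "commit" if recorded_state == self.prepared else "abort"
        self.state = "done"
        return outcome
```

The point of the split: between `prepare_transaction` and `complete_transaction`, an external coordinator gets to durably record its decision.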

Slide 52

Solution: App atomically commits Kafka and DB txns
Coordinated dual-write to Kafka and DB:
1. Start new Kafka and DB txns, write application data
2. 2PC voting phase:
a. KafkaProducer#prepareTransaction, get PreparedTxnState
b. Write PreparedTxnState to the database
3. Commit database txn
4. Commit Kafka txn
(diagram: the App, Kafka containing event logs, and the DB containing 2PC state and app state, annotated with steps 1-4)
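The four steps line up as a short sequence. This is an illustrative Python sketch; `FakeProducer` and `FakeDB` are invented stand-ins, not real client APIs, and the step comments match the numbering above:

```python
# Sketch of the coordinated dual write; all names are invented fakes.
class FakeProducer:
    def __init__(self, log): self.log = log
    def begin(self): self.log.append("kafka:begin")
    def send(self, r): self.log.append(f"kafka:send:{r}")
    def prepare_transaction(self):
        self.log.append("kafka:prepare")          # 2a. voting phase
        return {"producer_id": 42, "epoch": 7}
    def complete_transaction(self, state):
        self.log.append("kafka:commit")           # 4. roll forward


class FakeDB:
    def __init__(self, log): self.log = log
    def begin(self): self.log.append("db:begin")
    def write(self, r): self.log.append(f"db:write:{r}")
    def write_prepared_state(self, s): self.log.append("db:prepared_state")  # 2b
    def commit(self): self.log.append("db:commit")  # 3. the decision point


def coordinated_dual_write(producer, db, records):
    producer.begin(); db.begin()
    for r in records:                      # 1. write app data into both txns
        producer.send(r); db.write(r)
    state = producer.prepare_transaction()  # 2a
    db.write_prepared_state(state)          # 2b
    db.commit()                             # 3
    producer.complete_transaction(state)    # 4
```

Note that step 3, the DB commit, is the atomic decision point: once the PreparedTxnState is committed in the DB, recovery must roll the Kafka txn forward.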

Slide 53

Solution: App atomically commits Kafka and DB txns
Coordinated dual-write to Kafka and DB:
1. Start new Kafka and DB txns, write application data
2. 2PC voting phase:
a. KafkaProducer#prepareTransaction, get PreparedTxnState
b. Write PreparedTxnState to the database
3. Commit database txn
4. Commit Kafka txn
Recovery:
1. Retrieve the Kafka txn state from the DB, if any (it represents the latest recorded 2PC decision)
2. KafkaProducer#initTransactions(true) to keep the previous txn if there is prepared state; otherwise finish recovery
3. KafkaProducer#completeTransaction to roll forward the previous Kafka txn(s) if the retrieved state matches what is in the Kafka cluster(s); otherwise roll back
(diagram annotated with dual-write steps 1-4 and recovery steps r1-r3)

Slide 54

Failure modes and recovery
FAILURE!
● Kafka transaction was not yet prepared
● DB transaction did not commit
Recovery: roll back both transactions
(dual-write steps 1-4 and recovery steps 1-3 as on the previous slide)

Slide 55

Failure modes and recovery
FAILURE!
● Kafka transaction was prepared
● DB transaction did not commit
Recovery: roll back the prepared Kafka transaction
(dual-write steps 1-4 and recovery steps 1-3 as above)

Slide 56

Failure modes and recovery
FAILURE!
● Kafka transaction was prepared
● DB transaction did not commit the PreparedTxnState
Recovery: roll back the prepared Kafka transaction
(dual-write steps 1-4 and recovery steps 1-3 as above)

Slide 57

Failure modes and recovery
FAILURE!
● Kafka transaction was prepared
● DB transaction was committed; the new 2PC decision was recorded
Recovery: commit the prepared Kafka transaction
(dual-write steps 1-4 and recovery steps 1-3 as above)

Slide 58

Failure modes and recovery
● All changes are committed, nothing to do!
Recovery: no-op!
(dual-write steps 1-4 and recovery steps 1-3 as above)
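The four failure modes reduce to one decision rule. This is an illustrative Python sketch; `recovery_action` and the state representation are invented, and the DB is assumed to roll back its own uncommitted txn on restart:

```python
def recovery_action(prepared_txn, recorded_state):
    # prepared_txn: the in-doubt Kafka txn found after initTransactions(true),
    #               or None if nothing was left prepared.
    # recorded_state: the PreparedTxnState the DB committed, or None.
    if prepared_txn is None:
        # Nothing in doubt: either the txn was never prepared (the DB rolls
        # back its own uncommitted txn), or everything already committed.
        return "no-op"
    if prepared_txn == recorded_state:
        return "commit"  # the DB recorded this decision: roll forward
    return "abort"       # decision was never committed (or is stale): roll back
```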

Slide 59

Enable external coordination for 2PC
● Client AND Broker configuration: transaction.two.phase.commit.enable: true
● ACL operations on the Transactional ID resource: TWO_PHASE_COMMIT and WRITE

Slide 60

Putting things together with FLIP-319

Slide 61

FLIP-319: Integrate with Kafka's Support for Proper 2PC Participation
1. Start a new Kafka txn, process incoming rows
2. 2PC voting phase:
a. KafkaProducer#prepareTransaction, get PreparedTxnState
b. Write PreparedTxnState to the checkpoint
3. Persist the checkpoint
4. Commit Kafka txn
Recovery:
1. Retrieve the Kafka txn state from the last checkpoint, if any (it represents the latest recorded 2PC decision)
2. KafkaProducer#initTransactions(true) to keep the previous txn if there is prepared state; otherwise finish recovery
3. KafkaProducer#completeTransaction to roll forward the previous Kafka txn(s) if the retrieved state matches what is in the Kafka cluster(s); otherwise roll back
(diagram: data sources → data pipeline → data sinks, with checkpoints, annotated with steps 1-4 and recovery steps r1-r3)

Slide 62

FLIP-319: Upgrade path
1. Set transaction.two.phase.commit.enable: true on the broker.
2. Upgrade the Kafka cluster to a minimum version that supports KIP-939.
3. Enable the TWO_PHASE_COMMIT ACL on the Transactional ID resource for the respective users if authentication is enabled.
4. Stop the Flink job while taking a savepoint.
5. Upgrade the job application code to use the new KafkaSink version.
a. No code changes are required from the user
b. Simply upgrade the flink-connector-kafka dependency and recompile the job jar.
6. Submit the upgraded job jar, configured to restore from the savepoint taken in step 4.

Slide 63

FLIP-319: Summary
● No more consistency violations under Exactly-Once!
● Using public APIs → no reflection → happy maintainers and easier upgrades
● Stabilizes production usage

Slide 64

Conclusion

Slide 65

Conclusion
● KIP-939 enables external 2PC transaction coordination.
● With FLIP-319, Apache Flink is the first application that makes use of that capability.
● KIP-939 and FLIP-319 are in discussion on the corresponding mailing lists.
KIP-939:
● Proposal: https://cwiki.apache.org/confluence/display/KAFKA/KIP-939%3A+Support+Participation+in+2PC
● Discussion thread: https://lists.apache.org/thread/wbs9sqs3z1tdm7ptw5j4o9osmx9s41nf
FLIP-319:
● Proposal: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255071710
● Discussion thread: https://lists.apache.org/thread/p0z40w60qgyrmwjttbxx7qncjdohqtrc