universal transaction manager
– Provides database-agnostic ACID transactions
– The architecture is inspired by Deuteronomy [CIDR’09,11]
https://github.com/scalar-labs/scalardb
abstraction
• Transaction manager for non-transactional databases (NoSQLs)
• Transaction manager for heterogeneous databases
[Figure: three panels, each with an App on top of Scalar DB]
– Database abstraction (MySQL, Amazon DynamoDB): enables database migration without modifying the App
– Transaction manager for NoSQLs (Apache Cassandra): adds transaction capability to non-transactional databases
– Transaction manager for heterogeneous databases (PostgreSQL, Azure Cosmos DB): achieves transactions over multiple different databases
Approach
• Universal
– Can work on various database systems
• Non-invasive
– No modifications to the underlying databases are required
• Flexible Scalability
– Transaction layer and storage layer can be scaled independently
• Slower than Distributed SQLs
– More abstraction layers and a storage-oblivious transaction manager
• Hard to optimize
– The transaction manager has little information about the storage
• No SQL support
– A transaction has to be written procedurally in a programming language
– (Now working on SQL I/F)
• Scalar DB can be used in two ways:
– Library: the application program embeds the Scalar DB transaction library (transaction logic) and accesses the databases directly
– Server: the application program uses a Scalar DB Client to talk to a Scalar DB Server (transaction logic), which accesses the databases
[Figure: both deployments; the connections are labeled "Command execution / HTTP" and "Database-specific protocol"]
put, get, scan (partition-level), delete
• Begin and commit semantics
– An arbitrary number of operations can be handled

    DistributedTransactionManager manager = …;
    DistributedTransaction transaction = manager.start();
    Get get = createGet();
    Optional<Result> result = transaction.get(get);
    Put put = createPut(result);
    transaction.put(put);
    transaction.commit();
on Cherry Garcia [ICDE’15]
– Two phase commit with linearizable operations (for Atomicity)
  – Protocol correction is our extended work
– Distributed WAL records (for Atomicity and Durability)
– Single version optimistic concurrency control (for Isolation)
  – Serializability support is our extended work
• Requirements in underlying databases/storages
– Linearizable read and linearizable conditional/CAS write
– An ability to store metadata for each record
Two phase commit protocol (2PC) with linearizable operations
– Similar to Paxos Commit [TODS’06]
– Two phase commit on distributed records
• The protocol
– Prepare phase: prepare records
– Commit phase 1: commit status record
  – This is where a transaction is regarded as committed or aborted
– Commit phase 2: commit records
• Lazy recovery
– Uncommitted records will be rolled forward or rolled back based on the status of a transaction when the records are read
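The three protocol steps above can be sketched with in-memory maps standing in for a user table and the coordinator (status) table. This is a minimal illustration under stated assumptions, not Scalar DB's implementation: the names `CommitFlowSketch`, `Rec`, `prepare`, `writeStatus`, `commitRecord` and `commit` are hypothetical, and the atomic operations of `ConcurrentHashMap` play the role of the linearizable conditional writes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of the commit flow; not the Scalar DB API.
class CommitFlowSketch {
    // A user record: application value plus transaction metadata.
    static final class Rec {
        final int value; final String status; final int version; final String txId;
        Rec(int value, String status, int version, String txId) {
            this.value = value; this.status = status; this.version = version; this.txId = txId;
        }
    }

    final Map<String, Rec> userTable = new ConcurrentHashMap<>();
    // Coordinator table: TxID -> "C" (committed) or "A" (aborted).
    final Map<String, String> coordinator = new ConcurrentHashMap<>();

    // Prepare phase: write the new value with status P, but only if the record
    // still has the version and TxID that the transaction read.
    boolean prepare(String key, Rec read, int newValue, String txId) {
        boolean[] written = {false};
        userTable.computeIfPresent(key, (k, cur) -> {
            if (cur.version != read.version || !cur.txId.equals(read.txId)) return cur;
            written[0] = true;
            return new Rec(newValue, "P", read.version + 1, txId);
        });
        return written[0];
    }

    // Commit phase 1: create the status record only if none exists for this TxID.
    // This single write decides whether the transaction is committed or aborted.
    boolean writeStatus(String txId, String status) {
        return coordinator.putIfAbsent(txId, status) == null;
    }

    // Commit phase 2: mark a record committed if it is still prepared by this TxID.
    void commitRecord(String key, String txId) {
        userTable.computeIfPresent(key, (k, cur) ->
            cur.status.equals("P") && cur.txId.equals(txId)
                ? new Rec(cur.value, "C", cur.version, cur.txId) : cur);
    }

    // Runs the three steps; assumes readSet holds an entry for every written key.
    boolean commit(String txId, Map<String, Rec> readSet, Map<String, Integer> writes) {
        for (Map.Entry<String, Integer> w : writes.entrySet()) {
            if (!prepare(w.getKey(), readSet.get(w.getKey()), w.getValue(), txId)) {
                writeStatus(txId, "A"); // prepared records are left for lazy recovery
                return false;
            }
        }
        if (!writeStatus(txId, "C")) return false; // a status was already written for this TxID
        writes.keySet().forEach(k -> commitRecord(k, txId));
        return true;
    }
}
```

In this sketch a crash between `writeStatus` and `commitRecord` leaves records in status P with a committed status record, which is exactly the case that lazy recovery rolls forward.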
• WAL (Write-Ahead Logging) is distributed into records
[Figure: record layouts]
– Record in user tables (User/Application)
  – After image: Application data (managed by users), Transaction metadata (managed by Scalar DB): TxID, Version, Status
  – Before image: Application data (before), Transaction metadata (before): TxID (before), Version (before), Status (before)
– Status record in coordinator table: TxID, Status, Other metadata
version OCC
– Simple implementation of Snapshot Isolation
– Conflicts are detected by linearizable conditional write
– No clock dependency, no use of HLC (Hybrid Logical Clock)
• Supported isolation level
– Read-committed Snapshot Isolation (RCSI)
  – Read-skew, write-skew, read-only, phantom anomalies could happen
– Serializable
  – No anomalies (Strict Serializability)
  – RCSI-based but non-serializable schedules are aborted
[Figure, built up over several slides: prepare phase example]
• Tx1 reads the records into its memory space

UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

• Tx1: Transfer 20 from 1 to 2 (prepared in Tx1’s memory space)

UserID  Balance  Status  Version  TxID
1       80       P       6        Tx1
2       120      P       5        Tx1

• Linearizable conditional write: update only if the versions and the TxIDs are the same as the ones it read
– The database now holds Tx1’s prepared records (balances 80 and 120, status P, versions 6 and 5)
• Tx2: Transfer 10 from 1 to 2, after reading the same committed records (versions 5 and 4) into Tx2’s memory space

UserID  Balance  Status  Version  TxID
1       90       P       6        Tx2
2       110      P       5        Tx2

– Tx2’s conditional write fails due to the condition mismatch
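The failed conditional write in the example above can be reproduced with a small in-memory sketch. It assumes a hypothetical `Row` class and `conditionalPut` helper (neither is part of Scalar DB) and uses `ConcurrentHashMap.computeIfPresent` as a stand-in for the database's linearizable conditional write.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the prepare-phase condition; not the Scalar DB API.
class PrepareConflictSketch {
    static final class Row {
        final int balance; final char status; final int version; final String txId;
        Row(int balance, char status, int version, String txId) {
            this.balance = balance; this.status = status; this.version = version; this.txId = txId;
        }
    }

    static final Map<Integer, Row> db = new ConcurrentHashMap<>();

    // Writes `prepared` only if the stored row still has the version and TxID of `read`.
    static boolean conditionalPut(int userId, Row read, Row prepared) {
        boolean[] ok = {false};
        db.computeIfPresent(userId, (id, cur) -> {
            ok[0] = cur.version == read.version && cur.txId.equals(read.txId);
            return ok[0] ? prepared : cur;
        });
        return ok[0];
    }

    public static void main(String[] args) {
        db.clear();
        db.put(1, new Row(100, 'C', 5, "XXX"));
        db.put(2, new Row(100, 'C', 4, "YYY"));

        // Both transactions read the same committed row into their own memory.
        Row tx1Read = db.get(1);
        Row tx2Read = db.get(1);

        // Tx1 prepares first: the stored version and TxID still match what it read.
        boolean tx1 = conditionalPut(1, tx1Read, new Row(80, 'P', 6, "Tx1"));
        // Tx2 prepares second: the row is now version 6 / Tx1, so the condition fails.
        boolean tx2 = conditionalPut(1, tx2Read, new Row(90, 'P', 6, "Tx2"));

        System.out.println("Tx1 prepared: " + tx1); // true
        System.out.println("Tx2 prepared: " + tx2); // false
    }
}
```

Because the check and the write happen in one atomic step, no clock or lock is needed to detect the conflict; the loser simply observes a mismatch and aborts.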
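[Figure: commit phase 1]
• Tx1 writes its status record to the coordinator table with a linearizable conditional write: update if the Tx1 record does not exist

User table (records prepared by Tx1):
UserID  Balance  Status  Version  TxID
1       80       P       6        Tx1
2       120      P       5        Tx1

Coordinator table:
TxID  Status
XXX   C
YYY   C
ZZZ   A
Tx1   C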
[Figure: commit phase 2]
• Tx1 commits its records with a linearizable conditional write: update status if the record is prepared by the Tx1

User table:
UserID  Balance  Status  Version  TxID
1       80       C       6        Tx1
2       120      C       5        Tx1

Coordinator table:
TxID  Status
XXX   C
YYY   C
ZZZ   A
Tx1   C
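• Recovery is lazily done when a record is read
[Figure: TX1 crashing at different points of the protocol, with the recovery process for each]
– Crash before the prepare phase: nothing is needed (local memory space needs to be cleared)
– Crash after records are prepared but before the status record is committed: rolled back lazily by another TX using the before image
– Crash after the status record is committed: rolled forward lazily by another TX, updating status to C
– Crash after commit phase 2: no need for recovery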
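A minimal sketch of lazy recovery on read, assuming records keep their before image next to the after image and the coordinator table maps a TxID to its status. The names `LazyRecoverySketch`, `Rec` and `read` are hypothetical and not part of Scalar DB. The sketch deliberately leaves undecided transactions untouched; it does not model timeouts or aborting a transaction on behalf of a crashed client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of lazy recovery on read; not the Scalar DB API.
class LazyRecoverySketch {
    // After image plus the before image kept in the same record.
    static final class Rec {
        final int value, version, beforeValue, beforeVersion;
        final String status, txId, beforeTxId;
        Rec(int value, String status, int version, String txId,
            int beforeValue, int beforeVersion, String beforeTxId) {
            this.value = value; this.status = status; this.version = version; this.txId = txId;
            this.beforeValue = beforeValue; this.beforeVersion = beforeVersion;
            this.beforeTxId = beforeTxId;
        }
    }

    final Map<String, Rec> userTable = new ConcurrentHashMap<>();
    final Map<String, String> coordinator = new ConcurrentHashMap<>(); // TxID -> "C" or "A"

    // Reads a record, first resolving it if another transaction left it prepared.
    // Returns null if the record does not exist.
    Rec read(String key) {
        return userTable.computeIfPresent(key, (k, cur) -> {
            if (!cur.status.equals("P")) return cur; // already committed
            String decided = coordinator.get(cur.txId);
            if ("C".equals(decided)) {
                // Roll forward: the transaction committed, so only the status changes.
                return new Rec(cur.value, "C", cur.version, cur.txId,
                               cur.beforeValue, cur.beforeVersion, cur.beforeTxId);
            }
            if ("A".equals(decided)) {
                // Roll back: restore the before image as the committed state.
                return new Rec(cur.beforeValue, "C", cur.beforeVersion, cur.beforeTxId,
                               cur.beforeValue, cur.beforeVersion, cur.beforeTxId);
            }
            return cur; // undecided: left as is in this sketch
        });
    }
}
```

The reader pays the recovery cost only for records it actually touches, so no background recovery process is needed in this model.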
anomalies
– Read-skew, write-skew, read-only, and phantom anomalies
• Basic strategy to make RCSI serializable
– Avoid anti/rw-dependency dangerous structure [TODS’05]
– No use of SSI [SIGMOD’08] or its variant [EuroSys’12]
  – They require many linearizable operations for managing in/outConflicts, or a correct clock
– Two implementations: Extra-write and Extra-read
• Extra-write
– Converts a read into a write. Extra care is taken if a record doesn’t exist.
– [Figure: the schedule R(X) W(Y) P(Y) C C(Y) gains P(X) and C(X): write the same record]
• Extra-read
– Checks the read set after prepare to see that it has not been updated by other transactions
– [Figure: the schedule R(X) W(Y) P(Y) C C(Y) gains V(X): re-read (validate) the record and abort if it has changed]
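The Extra-read check can be sketched as a comparison of the version and TxID recorded at read time against the current state. This is an illustration only: `ExtraReadSketch`, `Version` and `validate` are hypothetical names, a plain `Map` stands in for the database, and records that did not exist when they were read are not modeled.

```java
import java.util.Map;

// Hypothetical sketch of extra-read validation; not the Scalar DB API.
class ExtraReadSketch {
    // The metadata a transaction remembers for each record it read.
    static final class Version {
        final int version; final String txId;
        Version(int version, String txId) { this.version = version; this.txId = txId; }

        boolean sameAs(Version other) {
            return other != null && version == other.version && txId.equals(other.txId);
        }
    }

    // Re-reads every record in the read set and reports whether all are unchanged.
    // A false result means another transaction updated a record: the caller aborts.
    static boolean validate(Map<String, Version> readSet, Map<String, Version> database) {
        for (Map.Entry<String, Version> e : readSet.entrySet()) {
            if (!e.getValue().sameAs(database.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

In the schedule notation used above, `validate` corresponds to V(X): it would run after the prepare phase and before the status record is committed.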
DB achieves ACID transactions spanning multiple different databases
• Two types of interfaces:
– One-phase and two-phase
[Figure: One-phase: a single Application uses Scalar DB over MySQL and Cassandra. Two-phase: Microservice1 and Microservice2 each use Scalar DB over MySQL and Cassandra and share a TxID.]
• Scalar DB has been heavily tested with Jepsen and Elle [VLDB’21]
– Jepsen tests are created and conducted by Scalar
– See https://github.com/scalar-labs/scalar-jepsen for more detail
• Transaction commit protocol is verified with TLA+
– See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
[Badges: Jepsen Passed, TLA+ Passed]