Slide 1

Slide 1 text

Scalar DB: Universal Transaction Manager
20 Jan, 2022 at Big Data System class in Keio University
Hiroyuki Yamada, CTO & CEO at Scalar, Inc.
© 2020 Scalar, inc.

Slide 2

Slide 2 text

What is Scalar DB
• A universal transaction manager
  – Provides database-agnostic ACID transactions
  – The architecture is inspired by Deuteronomy [CIDR’09,11]
https://github.com/scalar-labs/scalardb

Slide 3

Slide 3 text

Motivation / Use Cases
• Database abstraction
  – Enables database migration without modifying the App
  – [Figure: App on Scalar DB on MySQL, and App on Scalar DB on Amazon DynamoDB]
• Transaction manager for non-transactional databases (NoSQLs)
  – Adds transaction capability to non-transactional databases
  – [Figure: App on Scalar DB on Apache Cassandra]
• Transaction manager for heterogeneous databases
  – Achieves transactions over multiple different databases
  – [Figure: App on Scalar DB on PostgreSQL and Azure Cosmos DB]

Slide 4

Slide 4 text

Pros and Cons of Scalar DB Approach
Pros
• Universal
  – Can work on various database systems
• Non-invasive
  – No modifications to the underlying databases are required
• Flexible scalability
  – Transaction layer and storage layer can be scaled independently
Cons
• Slower than distributed SQL databases
  – More abstraction layers and a storage-oblivious transaction manager
• Hard to optimize
  – The transaction manager has little information about the storage
• No SQL support
  – A transaction has to be written procedurally in a programming language
  – (Now working on a SQL I/F)

Slide 5

Slide 5 text

System Architecture
• Scalar DB can be used in two ways:
  – As a library: Application program with the Scalar DB transaction library (transaction logic) -> Databases, over a database-specific protocol (command execution / HTTP)
  – As a server: Application program with the Scalar DB Client -> Scalar DB Server (transaction logic), over gRPC (HTTP/2) -> Databases, over a database-specific protocol (command execution / HTTP)

Slide 6

Slide 6 text

Programming Interface
• CRUD interface
  – put, get, scan (partition-level), delete
• Begin and commit semantics
  – An arbitrary number of operations can be handled

    DistributedTransactionManager manager = …;
    DistributedTransaction transaction = manager.start();
    Get get = createGet();
    Optional<Result> result = transaction.get(get);
    Put put = createPut(result);
    transaction.put(put);
    transaction.commit();

Slide 7

Slide 7 text

Data Model
• Multi-dimensional map [OSDI’06]
  – (partition-key, clustering-key, value-name) -> value-content
  – Assumed to be hash partitioned
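The mapping above can be pictured as a nested map. The following is a minimal in-memory sketch, not Scalar DB code: the class name, the String-typed keys and values, and the choice to keep clustering keys sorted within a partition are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Illustrative only: (partition-key, clustering-key, value-name) -> value-content as a nested map. */
class MultiDimensionalMap {
    // partition key -> clustering key (kept sorted, an assumption) -> value name -> value content
    private final Map<String, TreeMap<String, Map<String, String>>> partitions = new HashMap<>();

    void put(String partitionKey, String clusteringKey, String valueName, String valueContent) {
        partitions
            .computeIfAbsent(partitionKey, k -> new TreeMap<>())
            .computeIfAbsent(clusteringKey, k -> new HashMap<>())
            .put(valueName, valueContent);
    }

    Optional<String> get(String partitionKey, String clusteringKey, String valueName) {
        TreeMap<String, Map<String, String>> partition = partitions.get(partitionKey);
        if (partition == null) {
            return Optional.empty();
        }
        Map<String, String> row = partition.get(clusteringKey);
        if (row == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(row.get(valueName));
    }

    /** Partition-level scan: every row of one partition, in clustering-key order. */
    Map<String, Map<String, String>> scan(String partitionKey) {
        TreeMap<String, Map<String, String>> partition = partitions.get(partitionKey);
        return partition == null ? new TreeMap<>() : new TreeMap<>(partition);
    }

    public static void main(String[] args) {
        MultiDimensionalMap table = new MultiDimensionalMap();
        table.put("user-1", "2022-01-20", "balance", "100");
        table.put("user-1", "2022-01-19", "balance", "120");
        System.out.println(table.get("user-1", "2022-01-20", "balance"));
        System.out.println(table.scan("user-1").keySet());
    }
}
```

Hash partitioning would then decide which node owns each top-level partition key; the sketch keeps everything in one process.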

Slide 8

Slide 8 text

Transaction Management - Overview
• Based on Cherry Garcia [ICDE’15]
  – Two-phase commit with linearizable operations (for Atomicity)
    – Protocol correction is our extended work
  – Distributed WAL records (for Atomicity and Durability)
  – Single-version optimistic concurrency control (for Isolation)
    – Serializability support is our extended work
• Requirements for the underlying databases/storages
  – Linearizable read and linearizable conditional/CAS write
  – An ability to store metadata for each record

Slide 9

Slide 9 text

Transaction Commit Protocol (for Atomicity)
• Two-phase commit protocol (2PC) with linearizable operations
  – Similar to Paxos Commit [TODS’06]
  – Two-phase commit on distributed records
• The protocol
  – Prepare phase: prepare records
  – Commit phase 1: commit status record
    – This is where a transaction is regarded as committed or aborted
  – Commit phase 2: commit records
• Lazy recovery
  – Uncommitted records are rolled forward or rolled back, based on the status of the transaction, when the records are read

Slide 10

Slide 10 text

Distributed WAL (for Atomicity and Durability)
• WAL (Write-Ahead Logging) is distributed into records
• Record in user tables
  – After image: Application data (managed by users) | Transaction metadata (managed by Scalar DB): Status, Version, TxID
  – Before image: Application data (before) | Transaction metadata (before): Status (before), Version (before), TxID (before)
• Status record in coordinator table
  – TxID, Status, Other metadata

Slide 11

Slide 11 text

Concurrency Control (for Isolation)
• Single-version OCC
  – Simple implementation of Snapshot Isolation
  – Conflicts are detected by linearizable conditional writes
  – No clock dependency, no use of HLC (Hybrid Logical Clock)
• Supported isolation levels
  – Read-committed Snapshot Isolation (RCSI)
    – Read-skew, write-skew, read-only, and phantom anomalies could happen
  – Serializable
    – No anomalies (Strict Serializability)
    – RCSI-based, but non-serializable schedules are aborted

Slide 12

Slide 12 text

Transaction with Example – Before Prepare
Tx1’s memory space: (empty)
Database:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY

Slide 13

Slide 13 text

Transaction with Example – Before Prepare
Tx1 reads from the database.
Tx1’s memory space: (empty)
Database:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY

Slide 14

Slide 14 text

Transaction with Example – Before Prepare
Tx1 reads from the database.
Tx1’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
Database:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY

Slide 15

Slide 15 text

Transaction with Example – Before Prepare
Tx1: Transfer 20 from 1 to 2
Tx1’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Database:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY

Slide 16

Slide 16 text

Transaction with Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2
Linearizable conditional write: update only if the versions and the TxIDs are the same as the ones it read
Tx1’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Database:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY

Slide 17

Slide 17 text

Transaction with Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2
Linearizable conditional write: update only if the versions and the TxIDs are the same as the ones it read
Tx1’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Database (after Tx1’s prepare; status, version, and TxID are overwritten):
  UserID  Balance  Status  Version  TxID
  1       80       C -> P  5 -> 6   XXX -> Tx1
  2       120      C -> P  4 -> 5   YYY -> Tx1

Slide 18

Slide 18 text

Transaction with Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2
Tx2: Transfer 10 from 1 to 2
Linearizable conditional write: update only if the versions and the TxIDs are the same as the ones it read
Tx1’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Tx2’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       90       P       6        Tx2
  2       110      P       5        Tx2
Database (after Tx1’s prepare; status, version, and TxID are overwritten):
  UserID  Balance  Status  Version  TxID
  1       80       C -> P  5 -> 6   XXX -> Tx1
  2       120      C -> P  4 -> 5   YYY -> Tx1

Slide 19

Slide 19 text

Transaction with Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2
Tx2: Transfer 10 from 1 to 2
Linearizable conditional write: update only if the versions and the TxIDs are the same as the ones it read
Tx2’s conditional write fails due to the condition mismatch.
Tx1’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Tx2’s memory space:
  UserID  Balance  Status  Version  TxID
  1       100      C       5        XXX
  2       100      C       4        YYY
  1       90       P       6        Tx2
  2       110      P       5        Tx2
Database (after Tx1’s prepare; status, version, and TxID are overwritten):
  UserID  Balance  Status  Version  TxID
  1       80       C -> P  5 -> 6   XXX -> Tx1
  2       120      C -> P  4 -> 5   YYY -> Tx1
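The Tx1/Tx2 conflict in the prepare phase can be replayed with a small sketch. It assumes an in-memory ConcurrentHashMap as a stand-in for a database that offers linearizable conditional writes; the class, the Row type, and the method names are hypothetical and are not the Scalar DB API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: an in-memory map stands in for a database with linearizable conditional writes. */
class PrepareSketch {
    /** One user-table row together with its transaction metadata. */
    static final class Row {
        final int balance;
        final char status;   // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Row(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    private final Map<Integer, Row> table = new ConcurrentHashMap<>();

    void load(int userId, Row row) {
        table.put(userId, row);
    }

    Row read(int userId) {
        return table.get(userId);
    }

    /** Writes the prepared row only if version and TxID still equal the ones that were read. */
    boolean prepare(int userId, Row readRow, int newBalance, String txId) {
        boolean[] applied = {false};
        // computeIfPresent runs atomically per key, which stands in for the database's CAS.
        table.computeIfPresent(userId, (id, current) -> {
            if (current.version != readRow.version || !current.txId.equals(readRow.txId)) {
                return current; // condition mismatch: leave the row untouched
            }
            applied[0] = true;
            return new Row(newBalance, 'P', readRow.version + 1, txId);
        });
        return applied[0];
    }

    public static void main(String[] args) {
        PrepareSketch db = new PrepareSketch();
        db.load(1, new Row(100, 'C', 5, "XXX"));
        db.load(2, new Row(100, 'C', 4, "YYY"));

        // Both transactions read the committed rows before either one prepares.
        Row tx1Row1 = db.read(1);
        Row tx1Row2 = db.read(2);
        Row tx2Row1 = db.read(1);

        System.out.println("Tx1 prepares user 1: " + db.prepare(1, tx1Row1, 80, "Tx1"));
        System.out.println("Tx1 prepares user 2: " + db.prepare(2, tx1Row2, 120, "Tx1"));
        System.out.println("Tx2 prepares user 1: " + db.prepare(1, tx2Row1, 90, "Tx2"));
    }
}
```

Tx2's write is rejected because the row it read (version 5, TxID XXX) no longer matches the stored row (version 6, TxID Tx1). No locks or clocks are involved; the conflict is detected purely by the condition.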

Slide 20

Slide 20 text

Transaction with Example – Commit Phase 1
Database, user table:
  UserID  Balance  Status  Version  TxID
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Database, coordinator table:
  Status  TxID
  C       XXX
  C       YYY
  A       ZZZ

Slide 21

Slide 21 text

Transaction with Example – Commit Phase 1
Linearizable conditional write: update if Tx1 does not exist
Database, user table:
  UserID  Balance  Status  Version  TxID
  1       80       P       6        Tx1
  2       120      P       5        Tx1
Database, coordinator table:
  Status  TxID
  C       XXX
  C       YYY
  A       ZZZ
  C       Tx1
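The "update if Tx1 does not exist" write is what fixes the outcome of a transaction exactly once. A minimal sketch, assuming an in-memory map in place of the coordinator table; the class, enum, and method names are hypothetical, and the competing abort attempt is an illustrative scenario rather than something shown on the slide.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: the coordinator table as a map from TxID to the decided status. */
class CoordinatorSketch {
    enum TxStatus { COMMITTED, ABORTED }

    private final Map<String, TxStatus> statusTable = new ConcurrentHashMap<>();

    /**
     * Records the proposed status only if no status record exists for the TxID.
     * Returns the status that is in effect afterwards, which may differ from the proposal.
     */
    TxStatus decide(String txId, TxStatus proposed) {
        // putIfAbsent is atomic: exactly one writer can create the record.
        TxStatus existing = statusTable.putIfAbsent(txId, proposed);
        return existing == null ? proposed : existing;
    }

    public static void main(String[] args) {
        CoordinatorSketch coordinator = new CoordinatorSketch();
        // Tx1 commits its status record first.
        System.out.println(coordinator.decide("Tx1", TxStatus.COMMITTED));
        // A later, competing attempt to record an abort for Tx1 loses and sees the commit.
        System.out.println(coordinator.decide("Tx1", TxStatus.ABORTED));
    }
}
```

Because only the first write for a TxID can succeed, every observer agrees on whether the transaction committed or aborted, regardless of the state of the individual user records.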

Slide 22

Slide 22 text

Transaction with Example – Commit Phase 2
Linearizable conditional write: update the status if the record is prepared by Tx1
Database, user table:
  UserID  Balance  Status  Version  TxID
  1       80       C       6        Tx1
  2       120      C       5        Tx1
Database, coordinator table:
  Status  TxID
  C       XXX
  C       YYY
  A       ZZZ
  C       Tx1

Slide 23

Slide 23 text

Recovery
• Recovery is lazily done when a record is read
• Recovery process, by the point at which TX1 crashes:
  – Before the prepare phase: nothing is needed (local memory space needs to be cleared)
  – During the prepare phase, before commit phase 1: rolled back by another TX lazily, using the before image
  – After commit phase 1, before commit phase 2 completes: rolled forward by another TX lazily, updating the status to C
  – After commit phase 2: no need for recovery
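A reader that encounters a prepared record can resolve it from the writer's status record. The sketch below is a simplified, single-threaded illustration with hypothetical names. It assumes the writer is already known to have crashed, so a missing status record is decided as aborted on the spot; a real system would first need some way to tell a crashed writer from one that is still running.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: lazy roll-forward / rollback performed by whoever reads a prepared record. */
class LazyRecoverySketch {
    /** A user-table row: the after image plus the before image stored alongside it. */
    static final class Row {
        int balance;
        char status;          // 'C' = committed, 'P' = prepared
        int version;
        String txId;
        final int beforeBalance;
        final int beforeVersion;
        final String beforeTxId;

        Row(int balance, char status, int version, String txId,
            int beforeBalance, int beforeVersion, String beforeTxId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
            this.beforeBalance = beforeBalance;
            this.beforeVersion = beforeVersion;
            this.beforeTxId = beforeTxId;
        }
    }

    final Map<String, Character> coordinator = new HashMap<>(); // TxID -> 'C' or 'A'
    final Map<Integer, Row> userTable = new HashMap<>();

    /** Reads a row, recovering it first if it was left in the prepared state. */
    Row read(int userId) {
        Row row = userTable.get(userId);
        if (row.status != 'P') {
            return row; // committed record: no need for recovery
        }
        // Assumption: the writer crashed, so a missing status record is decided as aborted.
        Character existing = coordinator.putIfAbsent(row.txId, 'A');
        if (existing != null && existing == 'C') {
            row.status = 'C'; // roll forward: the writer committed in commit phase 1
        } else {
            // Roll back: restore the before image.
            row.balance = row.beforeBalance;
            row.version = row.beforeVersion;
            row.txId = row.beforeTxId;
            row.status = 'C';
        }
        return row;
    }

    public static void main(String[] args) {
        LazyRecoverySketch db = new LazyRecoverySketch();
        // Tx1 crashed after commit phase 1: its status record says committed.
        db.coordinator.put("Tx1", 'C');
        db.userTable.put(1, new Row(80, 'P', 6, "Tx1", 100, 5, "XXX"));
        // A hypothetical Tx9 crashed during the prepare phase: no status record exists.
        db.userTable.put(2, new Row(120, 'P', 5, "Tx9", 100, 4, "YYY"));

        System.out.println("user 1 balance: " + db.read(1).balance);
        System.out.println("user 2 balance: " + db.read(2).balance);
    }
}
```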

Slide 24

Slide 24 text

Performance Optimization – Parallel Commit and Deferred Commit
• [Figure: three timelines of W(X) W(Y), P(X) P(Y), C, C(X) C(Y) across the prepare phase, commit phase 1, and commit phase 2: a baseline, Parallel Commit, and Deferred Commit]
• Parallel Commit
  – Parallelize prepare-records and commit-records
• Deferred Commit
  – Return to the caller without committing records
  – C(X) and C(Y) are executed after the TX returns
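Parallel Commit can be pictured as issuing the per-record operations concurrently and waiting for all of them. A minimal sketch using the standard CompletableFuture API; the class and method names are hypothetical, and each Supplier stands in for one conditional write such as P(X) or P(Y).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Sketch only: run the prepare operations of all records concurrently. */
class ParallelPrepareSketch {
    /** Returns true only if every prepare operation succeeded. */
    static boolean prepareAll(List<Supplier<Boolean>> prepares) {
        List<CompletableFuture<Boolean>> futures = new ArrayList<>();
        for (Supplier<Boolean> prepare : prepares) {
            // Each prepare runs on the common pool instead of one after another.
            futures.add(CompletableFuture.supplyAsync(prepare));
        }
        boolean allPrepared = true;
        for (CompletableFuture<Boolean> future : futures) {
            // Wait for every record, even after a failure, so nothing is left in flight.
            allPrepared &= future.join();
        }
        return allPrepared;
    }

    public static void main(String[] args) {
        Supplier<Boolean> prepareX = () -> true; // stands in for P(X)
        Supplier<Boolean> prepareY = () -> true; // stands in for P(Y)
        System.out.println(prepareAll(Arrays.asList(prepareX, prepareY)));
    }
}
```

The same pattern would apply to C(X) and C(Y) in commit phase 2. Deferred Commit goes one step further by not waiting for those futures before returning to the caller.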

Slide 25

Slide 25 text

Serializable Strategy
• RCSI causes some anomalies
  – Read-skew, write-skew, read-only, and phantom anomalies
• Basic strategy to make RCSI serializable
  – Avoid the anti/rw-dependency dangerous structure [TODS’05]
  – No use of SSI [SIGMOD’08] or its variant [EuroSys’12]
    – These would require many linearizable operations for managing in/outConflicts, or a correct clock
  – Two implementations: Extra-write and Extra-read

Slide 26

Slide 26 text

Serializable Strategy – Extra-write and Extra-read
• Extra-write
  – Convert reads into writes. Extra care is taken if a record doesn’t exist.
  – Timeline: R(X) W(Y) P(Y) C C(Y) becomes R(X) W(Y) P(X) P(Y) C C(X) C(Y) (write the same record)
• Extra-read
  – Check the read-set after prepare to see whether it has been updated by other transactions
  – Timeline: R(X) W(Y) P(Y) C C(Y) becomes R(X) W(Y) P(Y) V(X) C C(Y) (re-read (validate) the record and abort if it is changed)
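The Extra-read check, V(X), amounts to re-reading the read-set and comparing it with what was seen earlier. A minimal sketch with hypothetical names; comparing only the version number is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: validate a read-set by comparing the versions seen at read time with current ones. */
class ExtraReadSketch {
    /**
     * Returns true if every record in the read-set still has the version it had when it was read.
     * A record that has disappeared or changed makes validation fail, and the TX should abort.
     */
    static boolean validate(Map<String, Integer> readSetVersions, Map<String, Integer> currentVersions) {
        for (Map.Entry<String, Integer> read : readSetVersions.entrySet()) {
            Integer current = currentVersions.get(read.getKey());
            if (!read.getValue().equals(current)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> readSet = new HashMap<>();
        readSet.put("X", 5); // R(X) saw version 5

        Map<String, Integer> database = new HashMap<>();
        database.put("X", 5);
        System.out.println("unchanged: " + validate(readSet, database));

        database.put("X", 6); // another transaction updated X in the meantime
        System.out.println("changed:   " + validate(readSet, database));
    }
}
```

Extra-write reaches the same goal differently: by preparing X as well, the conditional write itself fails if X has changed, at the cost of an additional write.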

Slide 27

Slide 27 text

Transactions on Heterogeneous Databases
• Scalar DB achieves ACID transactions spanning multiple different databases
• Two types of interfaces:
  – One-phase: [Figure: Application on Scalar DB on MySQL and Cassandra]
  – Two-phase: [Figure: Microservice1 and Microservice2, each with its own Scalar DB, on MySQL and Cassandra, linked by a TxID]

Slide 28

Slide 28 text

Benchmark Results with Scalar DB on Cassandra
• [Figure: two panels, Workload1 (Payment) and Workload2 (Evidence)]
• Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3
• Achieved 90% scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)

Slide 29

Slide 29 text

Verification Results for Scalar DB
• Scalar DB has been heavily tested with Jepsen and Elle [VLDB’21]
  – Jepsen tests are created and conducted by Scalar
  – See https://github.com/scalar-labs/scalar-jepsen for more detail
• Transaction commit protocol is verified with TLA+
  – See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
• Results: Jepsen: Passed, TLA+: Passed

Slide 30

Slide 30 text

Summary
• Scalar DB is a universal transaction manager
  – Provides database-agnostic transactions on various databases
    – Cassandra, HBase, Amazon DynamoDB, Azure Cosmos DB, MySQL, PostgreSQL, Oracle Database, SQL Server, Amazon RDS, Amazon Aurora, ScyllaDB
  – Achieves transactions spanning heterogeneous databases
  – Enhanced to guarantee strict serializability
  – Transaction consistency and scalability are verified extensively
• Future work
  – GraphQL I/F, SQL I/F, more adaptors (MongoDB, Kafka…)