Scalar DB: Universal Transaction Manager

Scalar DB is a universal transaction manager that provides database-agnostic transactions on various heterogeneous databases.

- https://github.com/scalar-labs/scalardb

Scalar, Inc.

January 21, 2022

Transcript

  1. © 2020 Scalar, inc. Scalar DB: Universal Transaction Manager
    20 Jan, 2022 at Big Data System class in Keio University
    Hiroyuki Yamada, CTO & CEO at Scalar, Inc.
  2. What is Scalar DB
    • A universal transaction manager
      – Provides database-agnostic ACID transactions
      – The architecture is inspired by Deuteronomy [CIDR’09,11]
    https://github.com/scalar-labs/scalardb
  3. Motivation / Use Cases
    • Database abstraction
      – Enables database migration (e.g., between MySQL and Amazon DynamoDB) without modifying the app
    • Transaction manager for non-transactional databases (NoSQLs)
      – Adds transaction capability to non-transactional databases such as Apache Cassandra
    • Transaction manager for heterogeneous databases
      – Achieves transactions over multiple different databases (e.g., PostgreSQL and Azure Cosmos DB)
  4. Pros and Cons of the Scalar DB Approach
    Pros
    • Universal
      – Can work on various database systems
    • Non-invasive
      – No modifications to the underlying databases are required
    • Flexible scalability
      – The transaction layer and the storage layer can be scaled independently
    Cons
    • Slower than distributed SQL databases
      – More abstraction layers and a storage-oblivious transaction manager
    • Hard to optimize
      – The transaction manager has little information about the storage
    • No SQL support
      – A transaction has to be written procedurally in a programming language
      – (A SQL interface is in progress)
  5. System Architecture
    • Scalar DB can be used in two ways:
      – As a library: the application program embeds the Scalar DB transaction library (transaction logic), which accesses the databases through database-specific protocols
      – As a server: the application program uses the Scalar DB Client, which talks to the Scalar DB Server (transaction logic) over gRPC (HTTP/2); the server accesses the databases through database-specific protocols
  6. Programming Interface
    • CRUD interface
      – put, get, scan (partition-level), delete
    • Begin and commit semantics
      – An arbitrary number of operations can be handled

      // createGet() and createPut() are helpers not shown on the slide
      DistributedTransactionManager manager = …;
      DistributedTransaction transaction = manager.start();  // begin
      Get get = createGet();
      Optional<Result> result = transaction.get(get);
      Put put = createPut(result);
      transaction.put(put);
      transaction.commit();  // all operations above take effect together
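The begin-and-commit semantics can be illustrated with a minimal in-memory sketch. `ToyTransactionManager` and `ToyTransaction` are hypothetical classes, not the Scalar DB API: they only assume that a transaction buffers its writes, reads its own writes, and publishes everything at `commit()`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical toy classes for illustration only; this is NOT the Scalar DB API.
// The manager owns a shared key-value store (key -> balance).
class ToyTransactionManager {
    final Map<String, Integer> store = new HashMap<>();

    // Begin: hand out a new transaction with an empty write buffer.
    ToyTransaction start() {
        return new ToyTransaction(this);
    }
}

// A toy transaction buffers its writes until commit(); it has no concurrency
// control or crash atomicity, which the rest of the deck is about.
class ToyTransaction {
    private final ToyTransactionManager manager;
    private final Map<String, Integer> writeSet = new HashMap<>();

    ToyTransaction(ToyTransactionManager manager) {
        this.manager = manager;
    }

    // Reads see this transaction's own buffered writes first.
    Optional<Integer> get(String key) {
        if (writeSet.containsKey(key)) {
            return Optional.of(writeSet.get(key));
        }
        return Optional.ofNullable(manager.store.get(key));
    }

    // Writes stay invisible to the shared store until commit().
    void put(String key, int value) {
        writeSet.put(key, value);
    }

    // Commit: apply every buffered write, however many there are.
    void commit() {
        manager.store.putAll(writeSet);
        writeSet.clear();
    }
}
```

For example, transferring 20 from user 1 to user 2 is two `get` calls, two `put` calls, and one `commit()`; until the commit, the shared store still shows the old balances.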
  7. Data Model
    • Multi-dimensional map [OSDI’06]
      – (partition-key, clustering-key, value-name) -> value-content
      – Assumed to be hash partitioned
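A sketch of the multi-dimensional map as nested Java maps, assuming string keys throughout. `MultiDimensionalMap` and `nodeFor` are hypothetical; the hash-modulo placement only illustrates what "hash partitioned" implies, namely that rows are ordered by clustering key inside a partition, so a scan can only walk one partition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: (partition-key, clustering-key, value-name) -> value-content.
class MultiDimensionalMap {
    // partition key -> (sorted clustering key -> (value name -> value content))
    private final Map<String, NavigableMap<String, Map<String, Object>>> data = new HashMap<>();

    void put(String partitionKey, String clusteringKey, String valueName, Object content) {
        data.computeIfAbsent(partitionKey, k -> new TreeMap<>())
            .computeIfAbsent(clusteringKey, k -> new HashMap<>())
            .put(valueName, content);
    }

    Object get(String partitionKey, String clusteringKey, String valueName) {
        NavigableMap<String, Map<String, Object>> partition = data.get(partitionKey);
        if (partition == null) {
            return null;
        }
        Map<String, Object> row = partition.get(clusteringKey);
        return row == null ? null : row.get(valueName);
    }

    // Partition-level scan: rows of one partition, ordered by clustering key.
    NavigableMap<String, Map<String, Object>> scan(String partitionKey, String from, String to) {
        NavigableMap<String, Map<String, Object>> partition =
            data.getOrDefault(partitionKey, new TreeMap<>());
        return partition.subMap(from, true, to, true);
    }

    // Assumed placement scheme: hash of the partition key modulo the node count.
    static int nodeFor(String partitionKey, int numNodes) {
        return Math.floorMod(partitionKey.hashCode(), numNodes);
    }
}
```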
  8. Transaction Management – Overview
    • Based on Cherry Garcia [ICDE’15]
      – Two-phase commit with linearizable operations (for Atomicity)
        Protocol correction is our extended work
      – Distributed WAL records (for Atomicity and Durability)
      – Single-version optimistic concurrency control (for Isolation)
        Serializability support is our extended work
    • Requirements for the underlying databases/storages
      – Linearizable read and linearizable conditional/CAS write
      – An ability to store metadata for each record
  9. Transaction Commit Protocol (for Atomicity)
    • Two-phase commit protocol (2PC) with linearizable operations
      – Similar to Paxos Commit [TODS’06]
      – Two-phase commit on distributed records
    • The protocol
      – Prepare phase: prepare records
      – Commit phase 1: commit the status record
        This is where a transaction is regarded as committed or aborted
      – Commit phase 2: commit records
    • Lazy recovery
      – Uncommitted records are rolled forward or rolled back, based on the status of their transaction, when the records are read
  10. Distributed WAL (for Atomicity and Durability)
    • WAL (Write-Ahead Logging) is distributed into records
      – Record in user tables (application data is managed by users; transaction metadata is managed by Scalar DB)
        After image: application data + transaction metadata (Status, Version, TxID)
        Before image: application data (before) + transaction metadata (Status (before), Version (before), TxID (before))
      – Status record in the coordinator table: TxID, Status, other metadata
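The record layout can be pictured as a value object whose prepare step copies the current committed state into the before image and whose rollback restores it. `WalRecord` and its fields are hypothetical names that mirror the slide labels, not Scalar DB's actual column names.

```java
// Illustrative layout of a user-table record that carries its own WAL entry.
// Field names are hypothetical and simply mirror the slide's labels.
class WalRecord {
    // After image: application data + transaction metadata
    final int balance;
    final char status;            // 'C' = committed, 'P' = prepared
    final int version;
    final String txId;
    // Before image: the previous committed state (null when there is none)
    final Integer beforeBalance;
    final Character beforeStatus;
    final Integer beforeVersion;
    final String beforeTxId;

    WalRecord(int balance, char status, int version, String txId,
              Integer beforeBalance, Character beforeStatus,
              Integer beforeVersion, String beforeTxId) {
        this.balance = balance;
        this.status = status;
        this.version = version;
        this.txId = txId;
        this.beforeBalance = beforeBalance;
        this.beforeStatus = beforeStatus;
        this.beforeVersion = beforeVersion;
        this.beforeTxId = beforeTxId;
    }

    static WalRecord committed(int balance, int version, String txId) {
        return new WalRecord(balance, 'C', version, txId, null, null, null, null);
    }

    // Prepare: the new value becomes the after image, and the current
    // committed state is copied into the before image.
    WalRecord prepare(int newBalance, String newTxId) {
        return new WalRecord(newBalance, 'P', version + 1, newTxId,
                             balance, status, version, txId);
    }

    // Rollback: restore the before image of a prepared record.
    WalRecord rollback() {
        if (status != 'P' || beforeBalance == null) {
            throw new IllegalStateException("only a prepared record with a before image can be rolled back");
        }
        return new WalRecord(beforeBalance, beforeStatus, beforeVersion, beforeTxId,
                             null, null, null, null);
    }
}
```

Because the before image travels with the record, any later reader can undo a crashed writer's change without a central log.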
  11. Concurrency Control (for Isolation)
    • Single-version OCC
      – Simple implementation of Snapshot Isolation
      – Conflicts are detected by linearizable conditional writes
      – No clock dependency, no use of HLC (Hybrid Logical Clock)
    • Supported isolation levels
      – Read-committed Snapshot Isolation (RCSI)
        Read-skew, write-skew, read-only, and phantom anomalies could happen
      – Serializable
        No anomalies (Strict Serializability)
        RCSI-based, but non-serializable schedules are aborted
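To show why RCSI admits write skew, here is a toy single-version OCC store in which the only conflict check is a conditional write on the version that was read. All class names are hypothetical, and `ConcurrentHashMap.replace(key, expected, updated)` stands in for the storage's linearizable conditional write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// A value plus its version; equality on both makes it usable as a CAS condition.
final class Cell {
    final int value;
    final int version;

    Cell(int value, int version) {
        this.value = value;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Cell)) return false;
        Cell c = (Cell) o;
        return value == c.value && version == c.version;
    }

    @Override
    public int hashCode() {
        return Objects.hash(value, version);
    }
}

// Hypothetical toy single-version store.
class OccStore {
    final ConcurrentHashMap<String, Cell> cells = new ConcurrentHashMap<>();

    Tx begin() {
        return new Tx();
    }

    class Tx {
        final Map<String, Cell> readSet = new HashMap<>();
        final Map<String, Integer> writeSet = new HashMap<>();

        int read(String key) {
            Cell c = cells.get(key);
            readSet.put(key, c);
            return c.value;
        }

        // Simplification: a key must be read before it is written.
        void write(String key, int value) {
            writeSet.put(key, value);
        }

        // Only write-write conflicts are detected: each write succeeds only if the
        // record still holds the version that was read. Records that were only
        // read are never checked, which is what lets write skew through.
        // (Earlier writes are not undone on failure; atomicity is out of scope here.)
        boolean commit() {
            for (Map.Entry<String, Integer> w : writeSet.entrySet()) {
                Cell expected = readSet.get(w.getKey());
                Cell updated = new Cell(w.getValue(), expected.version + 1);
                if (!cells.replace(w.getKey(), expected, updated)) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

For example, if X = Y = 1 and each of two transactions reads both records and then sets a different one to 0, both commits succeed because their write sets are disjoint, leaving X + Y = 0. Two transactions writing the same record, by contrast, cannot both commit.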
  12–15. Transaction with Example – Before Prepare
    Tx1: Transfer 20 from 1 to 2
    (Status: C = committed, P = prepared, A = aborted)
    • Tx1 reads both records from the database into Tx1's memory space
    Database (and Tx1's copy after the read)
      UserID  Balance  Status  Version  TxID
      1       100      C       5        XXX
      2       100      C       4        YYY
    • Tx1 updates its local copies; nothing is written to the database yet
    Tx1's memory space (after the local update)
      UserID  Balance  Status  Version  TxID
      1       80       P       6        Tx1
      2       120      P       5        Tx1
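The local step before prepare can be sketched as follows, reproducing the example's numbers: Tx1 copies what it read and derives new records with status P, version + 1, and its own TxID. `Rec` and `BeforePrepare` are hypothetical names for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// A record as in the example table: Balance, Status, Version, TxID.
final class Rec {
    final int balance;
    final char status;
    final int version;
    final String txId;

    Rec(int balance, char status, int version, String txId) {
        this.balance = balance;
        this.status = status;
        this.version = version;
        this.txId = txId;
    }

    @Override
    public String toString() {
        return balance + " " + status + " " + version + " " + txId;
    }
}

// Hypothetical sketch of the "before prepare" step: the database is only read.
class BeforePrepare {
    // Copy the requested records into the transaction's memory space.
    static Map<Integer, Rec> read(Map<Integer, Rec> database, int... userIds) {
        Map<Integer, Rec> memory = new HashMap<>();
        for (int id : userIds) {
            memory.put(id, database.get(id));
        }
        return memory;
    }

    // Compute the transfer locally: new balances, status P, version + 1, own TxID.
    static Map<Integer, Rec> transfer(Map<Integer, Rec> readSet, String txId,
                                      int from, int to, int amount) {
        Rec f = readSet.get(from);
        Rec t = readSet.get(to);
        Map<Integer, Rec> writeSet = new HashMap<>();
        writeSet.put(from, new Rec(f.balance - amount, 'P', f.version + 1, txId));
        writeSet.put(to, new Rec(t.balance + amount, 'P', t.version + 1, txId));
        return writeSet;
    }
}
```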
  16–19. Transaction with Example – Prepare Phase
    • Tx1 writes its prepared records with a linearizable conditional write: update only if the versions and the TxIDs are the same as the ones it read
    Database after Tx1's prepare
      UserID  Balance  Status  Version  TxID
      1       80       P       6        Tx1
      2       120      P       5        Tx1
    • Concurrently, Tx2 (Transfer 10 from 1 to 2) had read the same committed records (1: 100 C 5 XXX, 2: 100 C 4 YYY) and updated them in Tx2's memory space:
      UserID  Balance  Status  Version  TxID
      1       90       P       6        Tx2
      2       110      P       5        Tx2
    • Tx2's conditional write fails due to the condition mismatch (the records now carry Tx1's versions and TxIDs)
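The prepare step can be sketched with `ConcurrentHashMap.computeIfPresent`, which runs atomically per key and therefore acts as a per-record linearizable conditional write in this toy. The names are hypothetical; the condition compares only Version and TxID, as in the example, and a real implementation would also roll back any records it had already prepared before aborting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A record as in the example table: Balance, Status, Version, TxID.
final class Rec {
    final int balance;
    final char status;
    final int version;
    final String txId;

    Rec(int balance, char status, int version, String txId) {
        this.balance = balance;
        this.status = status;
        this.version = version;
        this.txId = txId;
    }
}

// Hypothetical sketch of the prepare phase.
class PreparePhase {
    // For each record to write: replace it only if its Version and TxID still equal
    // the ones this transaction read. Returns false on the first mismatch.
    static boolean prepare(ConcurrentHashMap<Integer, Rec> db,
                           Map<Integer, Rec> readSet,
                           Map<Integer, Rec> writeSet) {
        for (Map.Entry<Integer, Rec> w : writeSet.entrySet()) {
            Rec expected = readSet.get(w.getKey());
            boolean[] applied = {false};
            db.computeIfPresent(w.getKey(), (key, current) -> {
                if (current.version == expected.version && current.txId.equals(expected.txId)) {
                    applied[0] = true;
                    return w.getValue();
                }
                return current; // condition mismatch: leave the record as is
            });
            if (!applied[0]) {
                return false;
            }
        }
        return true;
    }
}
```

With the example's data, Tx1's prepare succeeds and Tx2's fails on whichever record it tries first, because that record now carries Tx1's Version and TxID.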
  20–21. Transaction with Example – Commit Phase 1
    Database: user records (prepared by Tx1)
      UserID  Balance  Status  Version  TxID
      1       80       P       6        Tx1
      2       120      P       5        Tx1
    Database: status records
      TxID  Status
      XXX   C
      YYY   C
      ZZZ   A
      Tx1   C     (added by Tx1)
    • Tx1 adds its status record C with a linearizable conditional write: insert only if no status record for Tx1 exists
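Commit phase 1 boils down to an insert-if-absent on the status record. In this hypothetical sketch, `ConcurrentHashMap.putIfAbsent` models the linearizable conditional write, and whichever status lands first becomes the transaction's final outcome.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical toy coordinator table: TxID -> status ('C' committed, 'A' aborted).
class Coordinator {
    final ConcurrentHashMap<String, Character> statusTable = new ConcurrentHashMap<>();

    // Linearizable conditional write: insert only if no status exists for txId yet.
    // Returns the status that actually won, which is the transaction's outcome.
    char decide(String txId, char proposed) {
        Character existing = statusTable.putIfAbsent(txId, proposed);
        return existing == null ? proposed : existing;
    }
}
```

If a concurrent reader tried to abort Tx1 after Tx1's C status was written, `decide("Tx1", 'A')` would return 'C', so both parties agree on the outcome.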
  22. Transaction with Example – Commit Phase 2
    • Tx1 updates the status of its records to C with a linearizable conditional write: update the status only if the record is prepared by Tx1
    Database: user records
      UserID  Balance  Status  Version  TxID
      1       80       C       6        Tx1
      2       120      C       5        Tx1
    Database: status records
      TxID  Status
      XXX   C
      YYY   C
      ZZZ   A
      Tx1   C
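Commit phase 2 can be sketched as a conditional status flip from P to C that applies only to records still prepared by the committing transaction. Names are hypothetical, and `computeIfPresent` (atomic per key in `ConcurrentHashMap`) models the per-record conditional write; re-running the call is harmless because already-committed records are skipped.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// A record as in the example table: Balance, Status, Version, TxID.
final class Rec {
    final int balance;
    final char status;
    final int version;
    final String txId;

    Rec(int balance, char status, int version, String txId) {
        this.balance = balance;
        this.status = status;
        this.version = version;
        this.txId = txId;
    }
}

// Hypothetical sketch of commit phase 2.
class CommitPhase2 {
    // Flip each record's status from P to C, but only if the record is still the
    // one prepared by txId. Returns how many records this call committed.
    static int commitRecords(ConcurrentHashMap<Integer, Rec> db, List<Integer> keys, String txId) {
        int[] committed = {0};
        for (Integer key : keys) {
            db.computeIfPresent(key, (k, r) -> {
                if (r.status == 'P' && r.txId.equals(txId)) {
                    committed[0]++;
                    return new Rec(r.balance, 'C', r.version, r.txId);
                }
                return r; // already committed or prepared by another transaction
            });
        }
        return committed[0];
    }
}
```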
  23. Recovery
    • Recovery is lazily done when a record is read
    • What is needed depends on where TX1 crashes:
      – Before the prepare phase: nothing is needed (the local memory space needs to be cleared)
      – In the prepare phase: the records are rolled back lazily by another TX using the before image
      – After commit phase 1: the records are rolled forward lazily by another TX, updating their status to C
      – After commit phase 2: no need for recovery
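Lazy recovery on read can be sketched as follows. The class names are hypothetical, and one policy is an assumption stated in the code: when a prepared record's transaction has no status record, the reader tries to insert A for it, so that a racing commit and abort are settled by the same insert-if-absent used in commit phase 1. A real implementation would first wait for that transaction to expire.

```java
import java.util.concurrent.ConcurrentHashMap;

// A record with an optional before image (null once committed or if newly inserted).
final class Rec {
    final int balance;
    final char status;     // 'C' committed, 'P' prepared
    final int version;
    final String txId;
    final Rec before;      // before image used for rollback

    Rec(int balance, char status, int version, String txId, Rec before) {
        this.balance = balance;
        this.status = status;
        this.version = version;
        this.txId = txId;
        this.before = before;
    }
}

// Hypothetical sketch of lazy recovery performed by whoever reads a record.
class LazyRecovery {
    final ConcurrentHashMap<Integer, Rec> db = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, Character> coordinator = new ConcurrentHashMap<>();

    Rec read(int key) {
        Rec r = db.get(key);
        if (r == null || r.status != 'P') {
            return r; // committed records need no recovery
        }
        // ASSUMPTION: a missing status record is resolved by trying to insert 'A'.
        // A real implementation would first wait for the writer to expire.
        Character existing = coordinator.putIfAbsent(r.txId, 'A');
        char outcome = (existing == null) ? 'A' : existing;
        // Each repair is conditional on the record still being the object we read.
        if (outcome == 'C') {
            db.replace(key, r, new Rec(r.balance, 'C', r.version, r.txId, null)); // roll forward
        } else if (r.before != null) {
            db.replace(key, r, r.before); // roll back using the before image
        } else {
            db.remove(key, r); // roll back an insert that has no before image
        }
        return db.get(key);
    }
}
```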
  24. Performance Optimization – Parallel Commit and Deferred Commit
    Baseline: W(X) W(Y) → prepare phase P(X) P(Y) → commit phase 1 C → commit phase 2 C(X) C(Y)
    (W = write, P = prepare record, C = commit status record, C(X) = commit record X)
    • Parallel Commit
      – Parallelize prepare-records (P(X), P(Y)) and commit-records (C(X), C(Y))
    • Deferred Commit
      – Return to a caller without committing records; C(X) and C(Y) are executed after the TX returns
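The two optimizations can be sketched with `CompletableFuture`: record writes within a phase run concurrently, and commit phase 2 is handed back to the caller as a future instead of being awaited. Everything here is a hypothetical toy; conflict checks and failure handling are omitted.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical toy: record states ("P", "C") and status records live in maps,
// and each map write stands in for one storage call.
class ParallelDeferredCommit {
    final ConcurrentHashMap<String, String> records = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, String> status = new ConcurrentHashMap<>();

    // Parallel Commit: issue one write per record concurrently, then wait for all.
    private void writeAll(List<String> keys, String state) {
        CompletableFuture<?>[] writes = keys.stream()
                .map(k -> CompletableFuture.runAsync(() -> records.put(k, state)))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(writes).join();
    }

    // Returns after commit phase 1; the returned future completes when phase 2 is done.
    CompletableFuture<Void> commit(String txId, List<String> writeKeys) {
        writeAll(writeKeys, "P");         // prepare phase: P(X), P(Y) in parallel
        status.putIfAbsent(txId, "C");    // commit phase 1: the commit point
        // Deferred Commit: C(X), C(Y) run after control returns to the caller.
        return CompletableFuture.runAsync(() -> writeAll(writeKeys, "C"));
    }
}
```

A caller can treat the transaction as committed as soon as `commit` returns, since the status record is already C; joining the returned future only waits for the records themselves to be flipped to C, and a reader arriving earlier would roll them forward lazily.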
  25. Serializable Strategy
    • RCSI causes some anomalies
      – Read-skew, write-skew, read-only, and phantom anomalies
    • Basic strategy to make RCSI serializable
      – Avoid the anti-dependency (rw-dependency) dangerous structure [TODS’05]
      – No use of SSI [SIGMOD’08] or its variant [EuroSys’12]
        They would require many linearizable operations for managing inConflict/outConflict information, or a correct clock
      – Two implementations: Extra-write and Extra-read
  26. Serializable Strategy – Extra-write and Extra-read
    Baseline (RCSI): R(X) W(Y) → P(Y) → C → C(Y)
    • Extra-write
      – Convert a read into a write: write the same record, so X is also prepared and committed: R(X) W(Y) → P(X) P(Y) → C → C(X) C(Y)
      – Extra care is taken if a record doesn't exist
    • Extra-read
      – Check the read set after preparing to see that it has not been updated by other transactions: R(X) W(Y) → P(Y) → V(X) → C → C(Y)
      – Re-read (validate) the record and abort if it has changed
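Extra-read can be sketched as a toy store in which, after its writes are applied, a transaction re-reads every record it only read and aborts, undoing its own writes, if any of them changed. Names are hypothetical, the prepare and commit phases are collapsed into one step, and object identity stands in for "unchanged" because every write installs a new `Cell`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A value plus its version; every write installs a new Cell object.
final class Cell {
    final int value;
    final int version;

    Cell(int value, int version) {
        this.value = value;
        this.version = version;
    }
}

// Hypothetical toy store illustrating Extra-read validation.
class ExtraReadStore {
    final ConcurrentHashMap<String, Cell> cells = new ConcurrentHashMap<>();

    Tx begin() {
        return new Tx();
    }

    class Tx {
        final Map<String, Cell> readSet = new HashMap<>();
        final Map<String, Integer> writeSet = new HashMap<>();

        int read(String key) {
            Cell c = cells.get(key);
            readSet.put(key, c);
            return c.value;
        }

        // Simplification: a key must be read before it is written.
        void write(String key, int value) {
            writeSet.put(key, value);
        }

        boolean commit() {
            Map<String, Cell[]> applied = new HashMap<>(); // key -> {before, after}
            // Conditional writes: succeed only if the record is still the object we read.
            for (Map.Entry<String, Integer> w : writeSet.entrySet()) {
                Cell before = readSet.get(w.getKey());
                Cell after = new Cell(w.getValue(), before.version + 1);
                if (!cells.replace(w.getKey(), before, after)) {
                    undo(applied);
                    return false;
                }
                applied.put(w.getKey(), new Cell[] {before, after});
            }
            // Extra-read: re-read (validate) every record that was only read.
            for (Map.Entry<String, Cell> r : readSet.entrySet()) {
                if (!writeSet.containsKey(r.getKey()) && cells.get(r.getKey()) != r.getValue()) {
                    undo(applied);
                    return false;
                }
            }
            return true;
        }

        // Restore the before images of the writes this transaction already applied.
        private void undo(Map<String, Cell[]> applied) {
            for (Map.Entry<String, Cell[]> e : applied.entrySet()) {
                cells.replace(e.getKey(), e.getValue()[1], e.getValue()[0]);
            }
        }
    }
}
```

Replaying the write-skew scenario (X = Y = 1, each transaction reads both and clears a different one), the first transaction commits and the second aborts during validation because X changed after it was read, so X + Y stays 1.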
  27. Transactions on Heterogeneous Databases
    • Scalar DB achieves ACID transactions spanning multiple different databases (e.g., MySQL and Cassandra)
    • Two types of interfaces:
      – One-phase: a single application runs transactions over both databases through Scalar DB
      – Two-phase: multiple microservices (Microservice1, Microservice2), each with its own Scalar DB, run one transaction by sharing its TxID
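The two-phase interface can be pictured as follows: one service starts the transaction, the TxID travels with the request to the other services, every participant prepares, and all of them commit only if every prepare succeeded. `Participant` and `TwoPhaseFlow` are hypothetical, leave out the coordinator status record, and do not mirror Scalar DB's actual two-phase commit API.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical participant: one microservice running its part of a shared transaction.
class Participant {
    final String name;
    final boolean prepareSucceeds; // simulates the outcome of this service's prepare
    String joinedTxId;
    String state = "ACTIVE";

    Participant(String name, boolean prepareSucceeds) {
        this.name = name;
        this.prepareSucceeds = prepareSucceeds;
    }

    boolean prepare(String txId) {
        joinedTxId = txId; // every service works under the same TxID
        state = prepareSucceeds ? "PREPARED" : "ABORTED";
        return prepareSucceeds;
    }

    void commit(String txId) {
        requireSameTx(txId);
        state = "COMMITTED";
    }

    void rollback(String txId) {
        requireSameTx(txId);
        state = "ROLLED_BACK";
    }

    private void requireSameTx(String txId) {
        if (!txId.equals(joinedTxId)) {
            throw new IllegalStateException(name + " did not join " + txId);
        }
    }
}

// Hypothetical driver: prepare everywhere, then commit everywhere or roll back everywhere.
class TwoPhaseFlow {
    static boolean run(List<Participant> participants) {
        String txId = UUID.randomUUID().toString(); // passed along with each request
        boolean allPrepared = true;
        for (Participant p : participants) {
            allPrepared &= p.prepare(txId); // non-short-circuit: every participant prepares
        }
        for (Participant p : participants) {
            if (allPrepared) {
                p.commit(txId);
            } else {
                p.rollback(txId);
            }
        }
        return allPrepared;
    }
}
```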
  28. Benchmark Results with Scalar DB on Cassandra
    [Charts: Workload1 (Payment) and Workload2 (Evidence)]
    Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 2 × 1900 GB NVMe SSD), RF: 3
    • Achieved 90% scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)
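The scalability figure can be read as follows, assuming the ideal TPS grows linearly with the node count from the 3-node baseline. The deck gives no absolute TPS numbers, so only the ratio is shown.

```latex
\mathrm{TPS}_{\mathrm{ideal}}(n) = \frac{n}{3}\,\mathrm{TPS}(3), \qquad
\mathrm{scalability}(n) = \frac{\mathrm{TPS}(n)}{\mathrm{TPS}_{\mathrm{ideal}}(n)}

\mathrm{scalability}(100) = 0.9
\;\Rightarrow\;
\mathrm{TPS}(100) = 0.9 \cdot \frac{100}{3}\,\mathrm{TPS}(3) = 30\,\mathrm{TPS}(3)
```

In other words, under this reading the 100-node cluster delivered about 30 times the throughput of the 3-node cluster.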
  29. Verification Results for Scalar DB
    • Scalar DB has been heavily tested with Jepsen and Elle [VLDB’21]
      – Jepsen tests are created and conducted by Scalar
      – See https://github.com/scalar-labs/scalar-jepsen for more detail
    • The transaction commit protocol is verified with TLA+
      – See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
    Results: Jepsen: Passed; TLA+: Passed
  30. Summary
    • Scalar DB is a universal transaction manager
      – Provides database-agnostic transactions on various databases
        Cassandra, HBase, Amazon DynamoDB, Azure Cosmos DB, MySQL, PostgreSQL, Oracle Database, SQL Server, Amazon RDS, Amazon Aurora, ScyllaDB
      – Achieves transactions spanning heterogeneous databases
      – Enhanced to guarantee strict serializability
      – Transaction consistency and scalability are verified extensively
    • Future work
      – GraphQL I/F, SQL I/F, more adaptors (MongoDB, Kafka…)