Slide 1

Slide 1 text

Scalar DL: Scalable and Practical Byzantine Fault Detection for Transactional Database Systems
Hiroyuki Yamada, Jun Nemoto
Scalar, Inc.

Slide 2

Slide 2 text

Towards a reliable database system
● We live in a data-driven / data-centric world.
  ○ Data needs to be reliable and trustworthy.
  ○ Database systems need to be reliable and trustworthy.
● Dealing with Byzantine faults in a database system is one of the key factors.
  ○ Byzantine faults: software errors, data tampering, (internal) malicious attacks.
Our Goal: A database system that deals with Byzantine faults in a practical and scalable way.

Slide 3

Slide 3 text

Dealing with Byzantine faults
● Basic principle: find discrepancies between replicas.
● Byzantine fault tolerance (BFT).
  ○ N > 3f, N: # of replicas, f: # of faulty replicas.
  ○ SMR: PBFT [OSDI’99], BFT-SMaRt [DSN’14], HotStuff [PODC’19] …
  ○ Database: HRDB [SOSP’07], Byzantium [EuroSys’11], Hyperledger Fabric [EuroSys’18], Basil [SOSP’21]
● Byzantine fault detection (BFD).
  ○ N > f, N: # of replicas, f: # of faulty replicas.
  ○ SMR: PeerReview [SOSP’07]
Are existing solutions practical and scalable enough for a database system?
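The two replica bounds above can be turned into a small worked example. This is only an arithmetic sketch; the function names are hypothetical and do not come from any of the cited systems.

```python
def min_replicas_bft(f: int) -> int:
    """Smallest N that satisfies the BFT bound N > 3f."""
    return 3 * f + 1


def min_replicas_bfd(f: int) -> int:
    """Smallest N that satisfies the BFD bound N > f."""
    return f + 1


def satisfies_bft(n: int, f: int) -> bool:
    """True if n replicas are enough to tolerate f Byzantine faults."""
    return n > 3 * f


def satisfies_bfd(n: int, f: int) -> bool:
    """True if n replicas are enough to detect f Byzantine faults."""
    return n > f


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f"f={f}: BFT needs N>={min_replicas_bft(f)}, BFD needs N>={min_replicas_bfd(f)}")
```

For f = 1 this gives 4 replicas for BFT and 2 replicas for BFD.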

Slide 4

Slide 4 text

BFT is ideal, but may not be practical for database systems
● At least 4 administrative domains (ADs) are required for correctness.
  ○ Malicious attacks are likely to be dependent in an AD.
● BFT might not fit well with enterprise database systems.
  ○ Many enterprise database systems are managed by a single AD or a few ADs.
An AD is a collection of nodes and networks operated by a single organization or administrative authority.

Slide 6

Slide 6 text

BFT is ideal, but may not be practical for database systems
● At least 4 administrative domains (ADs) are required for correctness.
  ○ Malicious attacks are likely to be dependent in an AD.
● BFT might not fit well with enterprise database systems.
  ○ Many enterprise database systems are managed by a single AD or a few ADs.
[Figure: AD-1, AD-2, AD-3, AD-4]
An AD is a collection of nodes and networks operated by a single organization or administrative authority.

Slide 7

Slide 7 text

BFT is ideal, but may not be practical for database systems
● At least 4 administrative domains (ADs) are required for correctness.
  ○ Malicious attacks are likely to be dependent in an AD.
● BFT might not fit well with enterprise database systems.
  ○ Many enterprise database systems are managed by a single AD or a few ADs.
[Figure: AD-1, AD-2, AD-3, AD-4] At least 4 ADs are required to mask 1 fault.
An AD is a collection of nodes and networks operated by a single organization or administrative authority.

Slide 8

Slide 8 text

BFD is a promising approach for database systems
● Requires only 2 ADs for correctness.
  ○ 2 is the lower bound for the number of replicas in dealing with Byzantine faults.
● Many use cases require only BFD or tamper evidence.
  ○ Regulations on data protection and privacy (e.g., GDPR and CCPA), prior user rights for intellectual property (IP), and vehicle regulations on over-the-air (OTA) software updates in WP.29.
● Existing solutions are not designed for transactional database systems.
  ○ They cannot run transactions in parallel (i.e., they are not scalable).
[Figure: AD-1, AD-2] 1 faulty AD can be detected as long as there are 2 ADs.

Slide 9

Slide 9 text

Challenge: Scalable BFD for a database system deployed to a 2-AD environment
Design space: BFT vs. BFD, and SMR (run transactions sequentially) vs. DB (run transactions concurrently)
● BFT SMR: PBFT, BFT-SMaRt, HotStuff, Tendermint
● BFD SMR: PeerReview
● BFT DB: HRDB, Byzantium, Basil, Hyperledger Fabric
● BFD DB: No existing work

Slide 10

Slide 10 text

Challenge: Scalable BFD for a database system deployed to a 2-AD environment
Design space: BFT vs. BFD, and SMR (run transactions sequentially) vs. DB (run transactions concurrently)
● BFT SMR: PBFT, BFT-SMaRt, HotStuff, Tendermint
● BFD SMR: PeerReview
● BFT DB: HRDB, Byzantium, Basil, Hyperledger Fabric
● BFD DB: No existing work
Annotation: Not practical from an administrative perspective

Slide 11

Slide 11 text

Challenge: Scalable BFD for a database system deployed to a 2-AD environment
Design space: BFT vs. BFD, and SMR (run transactions sequentially) vs. DB (run transactions concurrently)
● BFT SMR: PBFT, BFT-SMaRt, HotStuff, Tendermint
● BFD SMR: PeerReview
● BFT DB: HRDB, Byzantium, Basil, Hyperledger Fabric
● BFD DB: No existing work
Annotations: Not practical from an administrative perspective. Not designed for database transactions.

Slide 14

Slide 14 text

BFT DB => BFD DB
● Can we realize BFD by splitting up replicas into 2 ADs?
  ○ No.
● 1 Byzantine-faulty replica will exceed the predefined threshold for correctness because Byzantine faults are dependent within an AD: the other replicas in the same AD must be assumed faulty as well.
  ○ Need to accept the fault, i.e., data will be tampered with.

Slide 16

Slide 16 text

BFT DB => BFD DB
● Can we realize BFD by splitting up replicas into 2 ADs?
  ○ No.
● 1 Byzantine-faulty replica will exceed the predefined threshold for correctness because Byzantine faults are dependent within an AD: the other replicas in the same AD must be assumed faulty as well.
  ○ Need to accept the fault, i.e., data will be tampered with.
[Figure: AD-1, AD-2]

Slide 19

Slide 19 text

BFT DB => BFD DB
● Can we realize BFD by splitting up replicas into 2 ADs?
  ○ No.
● 1 Byzantine-faulty replica will exceed the predefined threshold for correctness because Byzantine faults are dependent within an AD: the other replicas in the same AD must be assumed faulty as well.
  ○ Need to accept the fault, i.e., data will be tampered with.
[Figure: AD-1, AD-2] N=4, f=2 => N > 3f does not hold (4 is not greater than 6).
BFT DB cannot trivially be extended to realize BFD DB.

Slide 20

Slide 20 text

BFD SMR => BFD DB
● Can we make BFD SMR (PeerReview) run transactions concurrently?
  ○ Yes, but only partially.
  ○ We could apply concurrency control to the primary-side processing.
● The witness side must execute the hash-chained log sequentially for correctness (i.e., strict serializability), which limits the overall scalability.
  ○ Running transactions in parallel could cause time-travel anomalies.
[Figure: AD-1 and AD-2; Primary and Witness (Auditor); transactions T1 and T2; hash-chained log]
Witness-side execution has to be sequential for correctness.
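A minimal sketch of a hash-chained log helps show why the witness side is inherently sequential: every entry's hash depends on its predecessor, so the log can only be checked and replayed in order. This is an illustration only, not PeerReview's actual log format; `GENESIS`, `entry_hash`, `append`, and `verify` are hypothetical names, and SHA-256 is an assumed hash function.

```python
import hashlib

# Assumed starting value for the chain (hypothetical).
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: str) -> str:
    """Hash of one log entry, bound to the hash of the previous entry."""
    return hashlib.sha256((prev_hash + "|" + payload).encode("utf-8")).hexdigest()


def append(log: list, payload: str) -> None:
    """Append a payload; its hash chains to the last entry in the log."""
    prev = log[-1][1] if log else GENESIS
    log.append((payload, entry_hash(prev, payload)))


def verify(log: list) -> bool:
    """Walk the log strictly in order and recompute every hash."""
    prev = GENESIS
    for payload, recorded in log:
        if entry_hash(prev, payload) != recorded:
            return False
        prev = recorded
    return True


if __name__ == "__main__":
    log = []
    append(log, "T1: write x=1")
    append(log, "T2: write y=2")
    print(verify(log))
    log[0] = ("T1: write x=99", log[0][1])  # tamper with the first entry
    print(verify(log))
```

Tampering with or reordering any entry breaks every later hash, which gives tamper evidence but also forces in-order processing.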

Slide 21

Slide 21 text

Challenge: Scalable BFD for a database system deployed to a 2-AD environment
Design space: BFT vs. BFD, and SMR (run transactions sequentially) vs. DB (run transactions concurrently)
● BFT SMR: PBFT, BFT-SMaRt, HotStuff, Tendermint
● BFD SMR: PeerReview
● BFT DB: HRDB, Byzantium, Basil, Hyperledger Fabric
● BFD DB: NONE
Annotations: Not possible (as it is). Possible but not scalable.

Slide 22

Slide 22 text

Scalar DL: A scalable and practical BFD approach
● Scalable and practical BFD middleware for transactional database systems.
  ○ Manage two types of servers and databases in separate ADs internally.
  ○ Database-agnostic by depending only on common database operations.
● Execute non-conflicting transactions in parallel while guaranteeing correctness.
Correctness:
• Provide safety (strict serializability) and liveness if no fault.
• Provide safety (correct clients can detect a Byzantine fault) if one AD is faulty.
[Architecture diagram: Applications with Scalar DL Clients; Primary: Scalar DL Primary Servers and Primary Database in AD1; Secondary: Scalar DL Secondary Servers and Secondary Database in AD2; Database System]

Slide 23

Slide 23 text

The BFD protocol - Overview
● Key idea: Make an agreement on the partial ordering of transactions in a decentralized and concurrent way.
  ○ Neither the primary nor the secondary can selfishly order/commit transactions.
● 3-phase protocol: Ordering -> Commit -> Validation.
  ○ The protocol assumes a one-shot request model.
[Sequence diagram: Client, Secondary, Primary; phases Ordering, Commit, Validation]

Slide 24

Slide 24 text

The BFD protocol - Ordering phase
● Order transactions in a strict serializable manner with a variant of 2PL.
  ○ Simulate a transaction and identify the read/write sets of the transaction.
  ○ Acquire R/W locks using the underlying database’s linearizable operations.
  ○ Go to the commit phase once all the required locks are acquired.
● Why not use multi-version concurrency control (MVCC)?
  ○ A primary and a secondary could derive different serialization orders without sharing explicit order dependencies (e.g., conflict graph).
Lock entry: primary key, version, lock count, lock mode, lock holders (TxIDs), input dependencies.
The input dependencies indicate the partial order of transactions.
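The ordering phase can be sketched as an all-or-nothing lock acquisition over a transaction's read/write sets. This is a simplified illustration under several assumptions, not Scalar DL's implementation: `LockEntry`, `LockTable`, `try_acquire_all`, and `release_all` are hypothetical names, an in-memory dict stands in for the underlying database's linearizable operations, and recording the last writer's TxID as the input dependency is one assumed way to capture partial order.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LockEntry:
    # Field names follow the lock entry listed on the slide.
    key: str
    version: int = 0
    lock_count: int = 0
    lock_mode: Optional[str] = None  # "R", "W", or None when unlocked
    holders: set = field(default_factory=set)
    input_dependencies: list = field(default_factory=list)


class LockTable:
    def __init__(self):
        self.entries = {}

    def _entry(self, key):
        return self.entries.setdefault(key, LockEntry(key))

    def _compatible(self, entry, tx_id, mode):
        if entry.lock_count == 0 or entry.holders == {tx_id}:
            return True
        # Only shared (read) locks can coexist across transactions.
        return mode == "R" and entry.lock_mode == "R"

    def try_acquire_all(self, tx_id, read_set, write_set):
        """Acquire every required lock, or none of them."""
        wanted = {k: "R" for k in read_set}
        wanted.update({k: "W" for k in write_set})
        if not all(self._compatible(self._entry(k), tx_id, m) for k, m in wanted.items()):
            return False
        for k, m in wanted.items():
            e = self._entry(k)
            if tx_id not in e.holders:
                e.holders.add(tx_id)
                e.lock_count += 1
            if m == "W" or e.lock_mode is None:
                e.lock_mode = m
        return True

    def release_all(self, tx_id, write_set):
        """Release locks; written records get a new version and dependency."""
        for e in self.entries.values():
            if tx_id in e.holders:
                e.holders.discard(tx_id)
                e.lock_count -= 1
                if e.key in write_set:
                    e.version += 1
                    e.input_dependencies = [tx_id]  # assumption: last writer
                if e.lock_count == 0:
                    e.lock_mode = None
```

In this sketch, two transactions that only share a read on the same record can both proceed, while a conflicting writer has to wait until the holder releases its locks.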

Slide 25

Slide 25 text

The BFD protocol - Commit phase
● Execute transactions in an ACID way in an arbitrary order.
  ○ Also write a transaction status with a transaction ID as a key for recovery.
  ○ This is where a transaction is regarded as committed or aborted.
● Create proofs that indicate what records are read and written.
● The input dependencies indicate the partial order of transactions.
Proof entry: primary key, version, TxID, input dependencies, MAC.
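A proof entry with the fields listed above could be built roughly as follows. This is a sketch under assumptions: the slide does not specify the MAC algorithm or serialization, so HMAC-SHA256 over a JSON encoding is assumed here, and `make_proof`, `verify_proof`, and the secret key are hypothetical.

```python
import hashlib
import hmac
import json


def _message(body: dict) -> bytes:
    """Deterministic byte encoding of the proof fields (assumed format)."""
    return json.dumps(body, sort_keys=True).encode("utf-8")


def make_proof(secret: bytes, key: str, version: int, tx_id: str, input_deps: list) -> dict:
    """Build a proof entry and attach a MAC computed over its fields."""
    body = {
        "key": key,
        "version": version,
        "tx_id": tx_id,
        "input_dependencies": input_deps,
    }
    mac = hmac.new(secret, _message(body), hashlib.sha256).hexdigest()
    return {**body, "mac": mac}


def verify_proof(secret: bytes, proof: dict) -> bool:
    """Recompute the MAC and compare it in constant time."""
    body = {k: v for k, v in proof.items() if k != "mac"}
    expected = hmac.new(secret, _message(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, proof["mac"])


if __name__ == "__main__":
    secret = b"hypothetical-shared-secret"
    proof = make_proof(secret, "account:1", 3, "tx-42", ["tx-41"])
    print(verify_proof(secret, proof))
```

Any change to the version, TxID, or input dependencies invalidates the MAC, so a holder of the key can tell that the proof was altered.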

Slide 26

Slide 26 text

The BFD protocol - Validation phase
● Validate if the commit order is the same as the one the secondary expects.
  ○ Compare the lock entries and proofs.
● Execute transactions in the secondary once validated and create proofs.
● A client compares the results and proofs from the primary and the secondary to find discrepancies (i.e., Byzantine faults).
[Figure: the primary returns a result and proofs in the commit phase (2); the secondary performs pre-validation by comparing the lock table, then returns a result and proofs in the validation phase (3); the client compares the two]
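The client-side comparison in the last bullet can be sketched as follows. It assumes that proofs are dicts with `key`, `version`, `tx_id`, and `input_dependencies` fields and that MACs are excluded from the comparison because each side may use its own key; both are assumptions, and `detect_fault` is a hypothetical name.

```python
def _normalize(proofs: list) -> list:
    """Order-independent view of the proof fields that must match."""
    return sorted(
        (p["key"], p["version"], p["tx_id"], tuple(p["input_dependencies"]))
        for p in proofs
    )


def detect_fault(primary_result, primary_proofs, secondary_result, secondary_proofs) -> bool:
    """Return True if the primary and secondary responses disagree."""
    if primary_result != secondary_result:
        return True
    return _normalize(primary_proofs) != _normalize(secondary_proofs)


if __name__ == "__main__":
    proof = {"key": "account:1", "version": 3, "tx_id": "tx-42", "input_dependencies": ["tx-41"]}
    stale = dict(proof, version=2)
    print(detect_fault({"balance": 10}, [proof], {"balance": 10}, [proof]))
    print(detect_fault({"balance": 10}, [proof], {"balance": 10}, [stale]))
```

A mismatch in either the result or any proof field is reported as a discrepancy; the sketch cannot tell which AD is the faulty one.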

Slide 27

Slide 27 text

Evaluation - Benchmarked systems and workloads
● Benchmarked systems:
  ○ PeerReviewTx: an extended version of PeerReview, which runs transactions in parallel on the primary side.
  ○ Scalar DL: uses Scalar DB to execute transactions on non-transactional databases.
  ○ Both PeerReviewTx and Scalar DL servers are placed on the database instances.
  ○ PostgreSQL and Cassandra as backend database systems.
● Workloads:
  ○ YCSB: F and C. 100M records with 100-byte payloads and uniform distribution.
  ○ TPC-C: 50/50 ratio of NewOrder and Payment. 100 - 1000 warehouses.

Slide 28

Slide 28 text

Evaluation - Experimental setup
● Environment
  ○ AWS. c5d.4xlarge for each database instance (8 cores, 32GB DRAM, NVMe SSD). c5.9xlarge for a client.
  ○ 2 ADs in different VPCs.
[Figure: deployment diagrams for PostgreSQL with Scalar DL and for Cassandra (C*) with Scalar DL (DL), each with clients and 2 ADs]

Slide 29

Slide 29 text

Throughput on PostgreSQL
[Figure panels: YCSB-F, TPC-C (NP)]
Scalar DL scaled as the number of client threads increased, whereas PeerReviewTx didn’t scale as much. The benefit of Scalar DL comes from its concurrency control.

Slide 30

Slide 30 text

Throughput on Cassandra (3 nodes per AD, RF=3)
[Figure panels: YCSB-F, TPC-C (NP)]
The results were similar to those on PostgreSQL. The database-agnostic property was also verified.

Slide 31

Slide 31 text

Scalability (with TPC-C)
Scalar DL scaled near-linearly as the number of nodes increased in each AD.

Slide 32

Slide 32 text

Summary
● Scalar DL is scalable and practical BFD middleware for transactional database systems.
● Key contribution: a Byzantine fault detection protocol that executes non-conflicting transactions in parallel while guaranteeing correctness.
● Achieves up to 10 times speedup compared to the state-of-the-art BFD approach and near-linear (91%) node scalability.
● Scalar DL is a real product, not a research prototype.
  ○ See https://github.com/scalar-labs/scalardl