Scylla: NoSQL at Ludicrous Speed

Slide 1

Slide 1 text

Scylla: NoSQL at Ludicrous Speed
Duarte Nunes, @duarte_nunes

Slide 2

Slide 2 text

Agenda
o Introducing Scylla
o System Architecture
o Seastar
o Resource Management
o Autonomous Database
o Closing

Slide 3

Slide 3 text

Scylla: 10,000-foot view
+ Clustered NoSQL database compatible with Apache Cassandra
+ ~10X performance on same hardware
+ Low latency, especially higher percentiles
+ Autonomous
+ Mechanically sympathetic C++17

Slide 4

Slide 4 text

YCSB Benchmark Throughput
[Figure. Clusters compared: 3 Scylla, 30 Cassandra, 3 Cassandra, 9 Cassandra, 15 Cassandra]

Slide 5

Slide 5 text

YCSB Benchmark Latency
[Figure. Clusters compared: 3 Scylla, 30 Cassandra, 3 Cassandra]

Slide 6

Slide 6 text

Cassandra Compatibility
+ CQL language and protocol
+ Legacy Thrift protocol
+ SStable file format
+ Configuration file format
+ JMX management protocol
+ Management command line tool
Sharing the ecosystem
+ Spark
+ Presto
+ JanusGraph
+ KairosDB
+ Kong

Slide 7

Slide 7 text

Monitoring
[Figure. Labels: 3 Scylla, 30 Cassandra, 3 Cassandra]

Slide 8

Slide 8 text

System Architecture

Slide 9

Slide 9 text

Data Model
Partition Key        | Clustering Key               | Values
Vladimir Nabokov     | 1962, Pale Fire              | pages: 315, genre: ‘fiction’
Vladimir Nabokov     | 1972, Transparent Things     | pages: 144, genre: ‘fiction’
Haruki Murakami      | 1997, Wind-Up Bird Chronicle | pages: 607, genre: ‘fiction’, translator: ‘Jay Rubin’
David Foster Wallace | 1996, Infinite Jest          | pages: 1,079, genre: ‘fiction’
Fyodor Dostoevsky    | 1869, The Eternal Husband    | pages: 130, genre: ‘fiction’, translator: ‘Constance Garnett’
Fyodor Dostoevsky    | 1869, The Idiot              | pages: 720, genre: ‘fiction’, translator: ‘Pevear and Volokhonsky’
Fyodor Dostoevsky    | 1880, The Brothers Karamazov | pages: 825, genre: ‘fiction’, translator: ‘Pevear and Volokhonsky’
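
The data model on this slide can be sketched with nested sorted maps: the partition key selects a partition, and rows inside it stay ordered by clustering key. This is only an illustration in standard C++; the type names (`clustering_key`, `book_values`, `partition`, `table`) and the helper `make_sample_table` are hypothetical and are not Scylla's internal types.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Clustering key from the slide: (publication year, title).
using clustering_key = std::pair<int, std::string>;

// Values from the slide; an empty translator stands for "column not set".
struct book_values {
    int pages;
    std::string genre;
    std::string translator;
};

// Rows within one partition, kept sorted by clustering key.
using partition = std::map<clustering_key, book_values>;

// Partition key (author) -> partition. A real cluster hashes the key to
// place the partition on a node; here it is a plain in-memory map.
using table = std::map<std::string, partition>;

// Inserts a few rows from the slide, deliberately out of order.
table make_sample_table() {
    table t;
    t["Fyodor Dostoevsky"][{1880, "The Brothers Karamazov"}] =
        {825, "fiction", "Pevear and Volokhonsky"};
    t["Fyodor Dostoevsky"][{1869, "The Idiot"}] =
        {720, "fiction", "Pevear and Volokhonsky"};
    t["Fyodor Dostoevsky"][{1869, "The Eternal Husband"}] =
        {130, "fiction", "Constance Garnett"};
    t["David Foster Wallace"][{1996, "Infinite Jest"}] =
        {1079, "fiction", ""};
    return t;
}
```

Iterating one partition of the sample table returns its rows in clustering order (year first, then title), regardless of insertion order.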

Slide 10

Slide 10 text

Partitioning
Fyodor Dostoevsky -> -2729816749760468891
David Foster Wallace -> 519370469886941605
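
A minimal sketch of how a token like the ones above could be mapped to an owning node, assuming a token ring in which each node owns the range that ends at its own token. The class `token_ring`, the node names, and the node tokens used in the usage note are hypothetical; the hash function that turns a partition key into a token is out of scope here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical token ring: token -> node name, sorted by token.
class token_ring {
    std::map<std::int64_t, std::string> _nodes;
public:
    void add_node(std::int64_t token, std::string name) {
        _nodes[token] = std::move(name);
    }

    // The owner is the first node whose token is >= the key's token.
    // Past the highest token, ownership wraps around to the first node.
    const std::string& owner_of(std::int64_t key_token) const {
        assert(!_nodes.empty());
        auto it = _nodes.lower_bound(key_token);
        if (it == _nodes.end()) {
            it = _nodes.begin();
        }
        return it->second;
    }
};
```

With three nodes at tokens -6000000000000000000, 0, and 6000000000000000000, the Dostoevsky token from the slide falls to the node at 0 and the Wallace token to the node at 6000000000000000000.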

Slide 13

Slide 13 text

Replication
[Diagram: the client sends a CQL request to the coordinator, which forwards it over RPC to replicas 1, 2, and 3]

Slide 14

Slide 14 text

Consistency
Last write wins
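
"Last write wins" can be sketched as a reconcile function over timestamped cells. The `cell` struct and `reconcile` function are hypothetical names for illustration; in particular, the tie-break on equal timestamps (the greater value wins) is an assumption of this sketch, chosen so that the result does not depend on argument order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical cell: a value plus the timestamp of the write that produced it.
struct cell {
    std::int64_t write_timestamp;
    std::string value;
};

// Last write wins: the cell with the higher timestamp survives.
// Assumption: on equal timestamps the greater value wins, so that
// reconcile(a, b) and reconcile(b, a) always agree.
const cell& reconcile(const cell& a, const cell& b) {
    if (a.write_timestamp != b.write_timestamp) {
        return a.write_timestamp > b.write_timestamp ? a : b;
    }
    return a.value >= b.value ? a : b;
}
```

Because the function is commutative, replicas that see the same two writes in different orders converge on the same cell.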

Slide 15

Slide 15 text

Consistency
Consistency à la carte
[Diagram: two check marks]

Slide 16

Slide 16 text

Consistency
Consistency à la carte
[Diagram: three check marks]
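
One common reading of "consistency à la carte" is that each request chooses how many replicas must acknowledge it. Under that reading, the usual overlap rule is that a read observes the latest acknowledged write when the read and write replica counts together exceed the replication factor (RF). The function names below are hypothetical.

```cpp
#include <cassert>

// Majority of replicas for a given replication factor.
constexpr unsigned quorum(unsigned rf) {
    return rf / 2 + 1;
}

// True when every set of `reads` replicas intersects every set of `writes`
// replicas, i.e. a read is guaranteed to reach at least one replica that
// acknowledged the latest write.
constexpr bool read_overlaps_write(unsigned rf, unsigned writes, unsigned reads) {
    return reads + writes > rf;
}
```

For RF=3, quorum writes plus quorum reads (2 + 2 > 3) overlap, while writing and reading at a single replica (1 + 1) does not.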

Slide 17

Slide 17 text

Consistency: Anti-entropy
+ Reconciliation
  + Request path
+ Read repair
  + Request path, probabilistic
+ Repair
  + Background process

Slide 18

Slide 18 text

Eventual consistency
+ High availability
  + Masterless system
+ Latency
  + Requests get the lowest latency the system can provide at any given time

Slides 19-25

Slides 19-25 text

Storage: Log-Structured Merge Tree
[Diagram built up over slides 19 to 25: SStables 1 to 5 are written over time, and SStables 1, 2, and 3 are merged into a single SStable 1+2+3. Labels: Time, Foreground Job, Background Job]
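
A toy log-structured merge store, to make the SStable sequence concrete: writes land in an in-memory memtable, a full memtable is flushed as a new immutable sorted run, reads check the newest data first, and compaction merges the oldest runs into one. The class `lsm_store` and all of its members are hypothetical; everything lives in memory, and tombstones, the commit log, and real compaction strategies are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class lsm_store {
    using run = std::map<std::string, std::string>;  // sorted key -> value
    run _memtable;
    std::vector<run> _sstables;  // oldest first; never modified after flush
    std::size_t _memtable_limit;
public:
    explicit lsm_store(std::size_t memtable_limit)
        : _memtable_limit(memtable_limit) {}

    void put(const std::string& key, const std::string& value) {
        _memtable[key] = value;
        if (_memtable.size() >= _memtable_limit) {
            flush();
        }
    }

    // Turn the memtable into a new sorted run and start an empty one.
    void flush() {
        if (_memtable.empty()) {
            return;
        }
        _sstables.push_back(std::move(_memtable));
        _memtable.clear();
    }

    // Newest data wins: memtable first, then runs from newest to oldest.
    // Returns nullptr when the key is absent.
    const std::string* get(const std::string& key) const {
        auto m = _memtable.find(key);
        if (m != _memtable.end()) {
            return &m->second;
        }
        for (auto s = _sstables.rbegin(); s != _sstables.rend(); ++s) {
            auto it = s->find(key);
            if (it != s->end()) {
                return &it->second;
            }
        }
        return nullptr;
    }

    // Merge the n oldest runs into one; newer values overwrite older ones.
    void compact_oldest(std::size_t n) {
        n = std::min(n, _sstables.size());
        if (n < 2) {
            return;
        }
        run merged;
        for (std::size_t i = 0; i < n; ++i) {
            for (const auto& kv : _sstables[i]) {
                merged[kv.first] = kv.second;
            }
        }
        _sstables.erase(_sstables.begin(),
                        _sstables.begin() + static_cast<std::ptrdiff_t>(n));
        _sstables.insert(_sstables.begin(), std::move(merged));
    }

    std::size_t sstable_count() const { return _sstables.size(); }
};
```

With a memtable limit of 2, eight writes produce four runs; `compact_oldest(3)` then leaves two runs, mirroring the "SStable 1+2+3" plus "SStable 4" picture, and reads still return the newest value for each key.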

Slides 26-28

Slides 26-28 text

Request Path
[Diagram built up over slides 26 to 28. Labels: Reads, Memtable, SStables, Commit Log, Writes]

Slide 29

Slide 29 text

Implementation Goals
+ Efficiency
  + Make the most out of every cycle
+ Utilization
  + Squeeze every cycle from the machine
+ Control
  + Spend the cycles on what we want, when we want

Slide 30

Slide 30 text

Seastar
http://seastar.io

Slides 31-36

Slides 31-36 text

Seastar: C++ framework for high-performance servers
+ Thread-per-core design
  + No blocking. Ever.
  + A “shard” in Seastar lingo.
+ Asynchronous
  + Networking
  + File I/O
  + Multicore
+ Future/promise based APIs
+ Optional user-mode TCP/IP stack included

Slide 37

Slide 37 text

Seastar task scheduler
Traditional stack
+ Thread is a function pointer
+ Stack is a byte array from 64k to megabytes
+ Context switch cost is high. Large stacks pollute the caches
Seastar stack
+ Promise is a pointer to an eventually computed value
+ Task is a pointer to a lambda function
+ No sharing, millions of parallel events
[Diagram: in the traditional stack, each scheduler/CPU runs threads that each have their own stack; in the Seastar stack, each scheduler/CPU runs a queue of promise/task pairs]
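
The "task is a pointer to a lambda function" idea can be sketched as a single-threaded run queue that executes small tasks to completion, with no per-task stack and no context switch. This is plain standard C++, not Seastar code: `shard_scheduler` and `demo_order` are hypothetical names, and a real shard would also poll I/O and be pinned to a core.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-shard run queue: tasks are lambdas, run to completion
// in FIFO order on one thread, so no locking is needed.
class shard_scheduler {
    std::deque<std::function<void()>> _tasks;
public:
    void schedule(std::function<void()> task) {
        _tasks.push_back(std::move(task));
    }

    // Runs until the queue drains; returns how many tasks were executed.
    std::size_t run() {
        std::size_t executed = 0;
        while (!_tasks.empty()) {
            std::function<void()> task = std::move(_tasks.front());
            _tasks.pop_front();
            task();  // may schedule further tasks (continuations)
            ++executed;
        }
        return executed;
    }
};

// A task that schedules a continuation: the continuation runs after every
// task that was already queued, never in the middle of another task.
std::vector<int> demo_order() {
    shard_scheduler sched;
    std::vector<int> order;
    sched.schedule([&] {
        order.push_back(1);
        sched.schedule([&] { order.push_back(3); });
    });
    sched.schedule([&] { order.push_back(2); });
    sched.run();
    return order;
}
```

The continuation is just another queued lambda, which is why millions of pending events cost only a small heap object each rather than a thread stack.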

Slide 38

Slide 38 text

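Futures

    // Read a 4-byte id, look it up on the core that owns it, write the result back.
    future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
        int id = buf_to_id(buf);
        unsigned core = id % smp::count;
        return smp::submit_to(core, [id] {
            return lookup(id);              // runs on the owning core
        }).then([this] (sstring result) {
            return _conn->write(result);    // continues on the original core
        });
    });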

Slides 44-46

Slides 44-46 text

Memory allocator
+ Non-thread-safe!
  + Each core gets a private memory pool
+ Allocation back pressure
  + Allocator calls a callback when low on memory
  + Scylla evicts cache in response
+ Inter-core free() through message passing
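
Allocation back pressure can be sketched as a per-core pool that, when it runs short, calls a registered callback and asks it to give memory back (for example by evicting cache). The class `core_memory_pool` and its interface are hypothetical and only account for bytes; they are not Seastar's allocator API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical per-core pool. It is deliberately not thread-safe: in a
// thread-per-core design only the owning core ever touches it.
class core_memory_pool {
    std::size_t _capacity;
    std::size_t _used = 0;
    // Called with the number of bytes still needed; returns bytes freed.
    std::function<std::size_t(std::size_t)> _reclaimer;
public:
    explicit core_memory_pool(std::size_t capacity) : _capacity(capacity) {}

    void set_reclaimer(std::function<std::size_t(std::size_t)> reclaimer) {
        _reclaimer = std::move(reclaimer);
    }

    // Returns false if the request cannot be satisfied even after reclaim.
    bool allocate(std::size_t bytes) {
        std::size_t available = _capacity - _used;
        if (available < bytes && _reclaimer) {
            std::size_t freed = _reclaimer(bytes - available);
            _used -= std::min(freed, _used);
            available = _capacity - _used;
        }
        if (available < bytes) {
            return false;
        }
        _used += bytes;
        return true;
    }

    void release(std::size_t bytes) { _used -= std::min(bytes, _used); }
    std::size_t used() const { return _used; }
};
```

A cache would register a reclaimer that evicts entries and reports how many bytes it released, so ordinary allocations succeed as long as evictable cache remains.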

Slide 47

Slide 47 text

Resource Management

Slide 48

Slide 48 text

User-mode I/O Scheduler
[Diagram: Query, Commitlog, and Compaction each feed their own queue into the userspace I/O scheduler, which sits in front of the disk]

Slides 49-50

Slides 49-50 text

Measuring optimal concurrency
[Figure. Annotation: Max useful disk concurrency]
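
A minimal sketch of a userspace I/O scheduler that keeps one queue per request class and never has more requests in flight than a configured limit, standing in for the "max useful disk concurrency". The class `io_scheduler` is hypothetical, and plain round-robin between classes is an assumption of this sketch; the slides do not say how the real scheduler picks between queues.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical scheduler: requests are integer ids, one FIFO per class
// (for example query, commitlog, compaction).
class io_scheduler {
    std::vector<std::deque<int>> _queues;
    std::size_t _max_in_flight;
    std::size_t _in_flight = 0;
    std::size_t _next = 0;  // class to try next (round-robin cursor)
public:
    io_scheduler(std::size_t classes, std::size_t max_in_flight)
        : _queues(classes), _max_in_flight(max_in_flight) {}

    void enqueue(std::size_t io_class, int request_id) {
        _queues.at(io_class).push_back(request_id);
    }

    // Hands out requests round-robin across classes until the concurrency
    // limit is reached or every queue is empty.
    std::vector<int> dispatch() {
        std::vector<int> dispatched;
        std::size_t empty_streak = 0;
        while (_in_flight < _max_in_flight && empty_streak < _queues.size()) {
            std::deque<int>& q = _queues[_next];
            _next = (_next + 1) % _queues.size();
            if (q.empty()) {
                ++empty_streak;
                continue;
            }
            empty_streak = 0;
            dispatched.push_back(q.front());
            q.pop_front();
            ++_in_flight;
        }
        return dispatched;
    }

    // Called when the disk completes one request.
    void complete() {
        if (_in_flight > 0) {
            --_in_flight;
        }
    }

    std::size_t in_flight() const { return _in_flight; }
};
```

Requests beyond the limit wait in the scheduler's own queues, where they can still be reordered, rather than in the disk's queue, where they cannot.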

Slides 51-59

Slides 51-59 text

Linux page cache
[Diagram: SSTables are accessed through the Linux page cache]
+ 4k granularity
  + Parasitic rows: Page (4k) vs. your data (300b)
+ Thread-safe
+ Synchronous APIs
  + Page fault sequence across app thread, kernel, and SSD: page fault, suspend thread, initiate I/O, context switch, I/O completes, interrupt, context switch, map page, resume thread
+ General-purpose
+ Lack of control
+ ...on the other hand
  + Exists
  + Hundreds of man-years
  + Handles lots of edge cases
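
The "parasitic rows" figures (a 4k page holding 300b of wanted data) translate into simple arithmetic. The sketch below assumes "4k" means 4096 bytes and "300b" means 300 bytes; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Fraction of a cached page occupied by the data that was actually wanted.
double useful_fraction(double row_bytes, double page_bytes) {
    return row_bytes / page_bytes;
}

// How many bytes are read and cached per byte the application asked for.
double read_amplification(double row_bytes, double page_bytes) {
    return page_bytes / row_bytes;
}
```

Under those assumptions about 7% of each cached page is useful, so a cache that works at page granularity holds roughly 13.7 bytes for every byte of hot data.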

Slides 60-61

Slides 60-61 text

Unified cache
+ Cassandra: Linux page cache, SSTables, key cache, row cache, on-heap / off-heap; complex tuning
+ Scylla: unified cache, SSTables

Slide 62

Slide 62 text

Heat-Weighted Load Balancing
RF=3, CL=QUORUM
[Diagram: the client sends a CQL request to the coordinator, which sends RPCs to two of the three replicas. One node is marked RESTART]

Slide 63

Slide 63 text

Total Requests Served after Node Restart
[Figures. Panels: Before, After]

Slide 64

Slide 64 text

Cold cache warmup
A replica with a cold cache is sent fewer requests
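
One way to picture "a cold replica is sent fewer requests" is to split traffic in proportion to each replica's cache hit rate. This is an illustrative assumption, not the formula Scylla uses; the function `request_shares` and the `floor` parameter, which keeps a cold replica from being starved so that it can warm up, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Splits traffic across replicas in proportion to max(hit_rate, floor).
// The returned shares sum to 1 when the input is non-empty and floor > 0.
std::vector<double> request_shares(const std::vector<double>& hit_rates,
                                   double floor) {
    std::vector<double> shares;
    shares.reserve(hit_rates.size());
    double total = 0.0;
    for (double hit_rate : hit_rates) {
        double weight = std::max(hit_rate, floor);
        shares.push_back(weight);
        total += weight;
    }
    if (total > 0.0) {
        for (double& share : shares) {
            share /= total;
        }
    }
    return shares;
}
```

For hit rates of 0.9, 0.9, and 0.0 with a floor of 0.2, the two warm replicas each take 45% of requests and the restarted one takes 10%.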

Slide 65

Slide 65 text

Yet another allocator: issues with malloc/free
+ Memory gets fragmented over time
  + If the workload changes sizes of allocated objects
+ Allocating a large contiguous block requires evicting most of cache

Slides 66-89

Slides 66-89 text

Memory
[Figure sequence over slides 66 to 89, text not recoverable. Slides 82 to 88 are labelled "OOM :("]

Slide 90

Slide 90 text

Log-structured memory allocation
+ Bump-pointer allocation to current segment
+ Frees leave holes in segments
+ Compaction will try to solve this

Slide 91

Slide 91 text

Log-structured memory allocation
+ Teach allocator how to move objects around
  + Updating back-pointers
+ Garbage collect (Compact!)
  + Starting with the most sparse segments
  + Lock to pin objects
+ Used mostly for the cache and memtables
  + Large majority of memory allocated
  + Small subset of allocation sites
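
A toy version of log-structured allocation, assuming fixed-size integer objects: allocation appends to the current segment, frees leave holes, and compaction evacuates the sparsest older segment by moving its live objects to the head of the log and updating each owner's handle (the back-pointer). `handle` and `log_allocator` are hypothetical names; pinning objects with a lock, mentioned on the slide, is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// The owner's reference to its object. The allocator rewrites it on a move,
// so each handle must stay at a stable address while its object is live.
struct handle {
    std::size_t segment = 0;
    std::size_t slot = 0;
};

class log_allocator {
    struct slot_t { int value; handle* owner; };  // owner == nullptr: a hole
    struct segment_t { std::vector<slot_t> slots; std::size_t live = 0; };
    std::vector<segment_t> _segments;
    std::size_t _segment_capacity;
public:
    explicit log_allocator(std::size_t segment_capacity)
        : _segment_capacity(segment_capacity) {}

    // Bump-pointer allocation: append to the current (last) segment.
    void allocate(int value, handle& h) {
        if (_segments.empty() ||
            _segments.back().slots.size() == _segment_capacity) {
            _segments.emplace_back();
        }
        segment_t& seg = _segments.back();
        seg.slots.push_back(slot_t{value, &h});
        ++seg.live;
        h.segment = _segments.size() - 1;
        h.slot = seg.slots.size() - 1;
    }

    // Freeing only marks a hole; the space is reclaimed by compact().
    void release(const handle& h) {
        segment_t& seg = _segments[h.segment];
        seg.slots[h.slot].owner = nullptr;
        --seg.live;
    }

    int read(const handle& h) const {
        return _segments[h.segment].slots[h.slot].value;
    }

    // Evacuates the sparsest non-current segment that has holes.
    // Returns false when there is nothing to compact.
    bool compact() {
        const std::size_t none = _segments.size();
        std::size_t victim = none;
        for (std::size_t i = 0; i + 1 < _segments.size(); ++i) {
            const segment_t& s = _segments[i];
            bool has_holes = s.live < s.slots.size();
            if (has_holes &&
                (victim == none || s.live < _segments[victim].live)) {
                victim = i;
            }
        }
        if (victim == none) {
            return false;
        }
        std::vector<slot_t> old_slots = std::move(_segments[victim].slots);
        _segments[victim].slots.clear();
        _segments[victim].live = 0;
        for (const slot_t& s : old_slots) {
            if (s.owner != nullptr) {
                allocate(s.value, *s.owner);  // moves and fixes the handle
            }
        }
        return true;
    }

    // Slots occupied across all segments, holes included.
    std::size_t slots_in_use() const {
        std::size_t n = 0;
        for (const segment_t& s : _segments) {
            n += s.slots.size();
        }
        return n;
    }
};
```

Because every live object is reachable through a handle the allocator knows about, it can relocate objects freely, which is what lets it hand back whole segments instead of leaving scattered holes.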

Slide 92

Slide 92 text

Autonomous Database

Slides 93-97

Slides 93-97 text

Autonomous: control theory in practice
[Diagram built up over slides 93 to 97. Labels: Memtable, Seastar Scheduler, Compaction, Query, Repair, Commitlog, SSD, WAN, CPU. Two controllers are added in turn, a Compaction Backlog Controller and a Memory Controller, each with an "Adjust priority" arrow]
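
The feedback loop in this diagram can be sketched as a controller that turns a measured backlog into a scheduler priority. The sketch assumes a simple proportional controller with clamping; the slides do not state the actual control law, and `backlog_controller`, its fields, and the notion of "shares" are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical proportional controller: more backlog, more shares,
// bounded so compaction is never starved and never monopolises the node.
struct backlog_controller {
    double gain;        // shares granted per unit of backlog
    double min_shares;  // floor, applies even when the backlog is empty
    double max_shares;  // ceiling, protects foreground queries

    double shares_for(double backlog) const {
        return std::min(max_shares, std::max(min_shares, gain * backlog));
    }
};
```

Re-evaluating `shares_for` periodically with the current backlog lets compaction speed up when ingestion rises and back off again once it has caught up, without operator tuning.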

Slides 98-101

Slides 98-101 text

Compaction backlog controller in practice
[Figures, one per slide: Steady state; Increased ingestion rate; New steady state; Latencies]

Slide 102

Slide 102 text

Closing

Slide 103

Slide 103 text

Conclusions
+ Careful system design and control of the software stack can maximize throughput
  + Without sacrificing latency
  + Without requiring complex end-user tuning
  + While having a lot of fun

Slide 104

Slide 104 text

How to interact: a potpourri of links
+ Download: http://www.scylladb.com
+ Twitter: @ScyllaDB
+ Source: http://github.com/scylladb/scylla
+ Mailing lists: [email protected]
+ Slack: ScyllaDB-Users
+ Blog: http://www.scylladb.com/blog
+ Join: http://www.scylladb.com/company/careers

Slide 105

Slide 105 text

Thank You!
United States: 1900 Embarcadero Road, Palo Alto, CA 94303
Israel: 11 Galgalei Haplada, Herzelia, Israel
www.scylladb.com
@scylladb

Slide 106

Slide 106 text

Questions?
[email protected]
@duarte_nunes