Slide 1

Slide 1 text

SCYLLA: NoSQL at Ludicrous Speed Duarte Nunes @duarte_nunes

Slide 2

Slide 2 text

❏ Introducing ScyllaDB ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing AGENDA

Slide 3

Slide 3 text

ScyllaDB ● Clustered NoSQL database compatible with Apache Cassandra ● ~10X performance on the same hardware ● Low latency, especially at the higher percentiles ● Self-tuning ● Mechanically sympathetic C++14

Slide 4

Slide 4 text

YCSB Benchmark: 3-node Scylla cluster vs. 3, 9, 15, and 30 Cassandra machines

Slide 5

Slide 5 text

Scylla vs. Cassandra, CL=LOCAL_QUORUM, Outbrain case study: Scylla and Cassandra handling the full load (peak of ~12M RPM). Cassandra at ~200ms vs. Scylla at ~10ms: 20x lower latency.

Slide 6

Slide 6 text

Scylla benchmark by Samsung (op/s). Full report: http://tinyurl.com/msl-scylladb

Slide 7

Slide 7 text

Dynamo-based system

Slide 8

Slide 8 text

Data model: a partition key maps to many rows, each identified by its clustering key; rows are sorted by primary key.
CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id));
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');

Slide 9

Slide 9 text

Log-Structured Merge Tree SStable 1 Time

Slide 10

Slide 10 text

Log-Structured Merge Tree SStable 1 SStable 2 Time

Slide 11

Slide 11 text

Log-Structured Merge Tree SStable 1 SStable 2 SStable 3 Time

Slide 12

Slide 12 text

Log-Structured Merge Tree SStable 1 SStable 2 SStable 3 Time SStable 4

Slide 13

Slide 13 text

Log-Structured Merge Tree SStable 1 SStable 2 SStable 3 Time SStable 4 SStable 1+2+3

Slide 14

Slide 14 text

Log-Structured Merge Tree SStable 1 SStable 2 SStable 3 Time SStable 4 SStable 5 SStable 1+2+3

Slide 15

Slide 15 text

Log-Structured Merge Tree SStable 1 SStable 2 SStable 3 Time SStable 4 SStable 5 SStable 1+2+3 Foreground Job Background Job
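
The flush and compaction cycle from the diagrams above can be sketched in a few lines of C++. This is illustrative only; the types and names are invented and far simpler than Scylla's actual implementation:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy LSM tree: writes go to an in-memory memtable; a flush (foreground
// job) seals it into an immutable "sstable"; compaction (background job)
// merges sstables so reads touch fewer of them.
struct LsmTree {
    std::map<std::string, std::string> memtable;
    std::vector<std::map<std::string, std::string>> sstables;

    void write(std::string k, std::string v) {
        memtable[std::move(k)] = std::move(v);
    }

    void flush() {
        if (!memtable.empty()) {
            sstables.push_back(std::move(memtable));
            memtable.clear();
        }
    }

    // Merge all sstables into one; iterating oldest-to-newest lets newer
    // values overwrite older ones.
    void compact() {
        std::map<std::string, std::string> merged;
        for (auto& sst : sstables)
            for (auto& kv : sst)
                merged[kv.first] = kv.second;
        sstables.clear();
        sstables.push_back(std::move(merged));
    }

    // Reads check the memtable first, then sstables from newest to oldest.
    std::optional<std::string> read(const std::string& k) const {
        auto it = memtable.find(k);
        if (it != memtable.end()) return it->second;
        for (auto s = sstables.rbegin(); s != sstables.rend(); ++s) {
            auto f = s->find(k);
            if (f != s->end()) return f->second;
        }
        return std::nullopt;
    }
};
```

After compaction a read returns the same answer, but only one sstable has to be consulted, which is why compaction is worth its background cost.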

Slide 16

Slide 16 text

Request path SSTable Memtable

Slide 17

Slide 17 text

Request path SSTable Memtable Reads

Slide 18

Slide 18 text

Request path SSTable Memtable Reads Commit Log Writes
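
The write path above (commit log first, then memtable) can be sketched with hypothetical types; this is a simplification for illustration, not Scylla's code:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Writes are appended to the commit log for durability, then applied to
// the memtable, which serves reads; after a crash the memtable is rebuilt
// by replaying the log.
struct WritePath {
    std::vector<std::pair<std::string, std::string>> commitlog;
    std::map<std::string, std::string> memtable;

    void write(const std::string& k, const std::string& v) {
        commitlog.emplace_back(k, v); // durable, append-only
        memtable[k] = v;              // immediately visible to reads
    }

    // Crash recovery: replay the commit log in order.
    static WritePath recover(const WritePath& crashed) {
        WritePath fresh;
        for (auto& e : crashed.commitlog)
            fresh.write(e.first, e.second);
        return fresh;
    }
};
```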

Slide 19

Slide 19 text

Implementation Goals ● Efficiency: ○ Make the most out of every cycle ● Utilization: ○ Squeeze every cycle from the machine ● Control: ○ Spend the cycles on what we want, when we want

Slide 20

Slide 20 text

❏ Introducing ScyllaDB ❏ System Architecture ❏ Node Architecture ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing AGENDA

Slide 21

Slide 21 text

● Thread-per-core design (shard) ○ No blocking. Ever. Enter Seastar www.seastar-project.org

Slide 22

Slide 22 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever. ● Asynchronous networking, file I/O, multicore

Slide 23

Slide 23 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever. ● Asynchronous networking, file I/O, multicore ● Future/promise based APIs

Slide 24

Slide 24 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever. ● Asynchronous networking, file I/O, multicore ● Future/promise based APIs ● Usermode TCP/IP stack included in the box

Slide 25

Slide 25 text

Seastar task scheduler: traditional stack vs. Seastar stack.
Traditional stack: a scheduler per CPU runs threads; a thread is a function pointer plus a stack (a byte array from 64k to megabytes). Context-switch cost is high, and large stacks pollute the caches.
Seastar stack: each CPU runs chains of promises and tasks; a promise is a pointer to an eventually computed value, and a task is a pointer to a lambda function. No sharing between cores; millions of parallel events.
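
A toy version of the per-CPU task queue described above, in plain C++: a task is just a callable, so "scheduling" means popping lambdas off a queue, with no stacks to switch (this is not Seastar's actual reactor):

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// One of these would run pinned to each CPU. Tasks run to completion and
// may enqueue continuations; there is no preemption and no context switch.
struct ToyReactor {
    std::deque<std::function<void()>> tasks;

    void add_task(std::function<void()> t) {
        tasks.push_back(std::move(t));
    }

    void run() {
        while (!tasks.empty()) {
            auto t = std::move(tasks.front());
            tasks.pop_front();
            t(); // may call add_task() to schedule more work
        }
    }
};
```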

Slide 26

Slide 26 text

Seastar memcached

Slide 27

Slide 27 text

Pedis https://github.com/fastio/pedis

Slide 28

Slide 28 text

Futures
future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);
    unsigned core = id % smp::count;
    return smp::submit_to(core, [id] { return lookup(id); })
        .then([this] (sstring result) { return _conn->write(result); });
});

Slide 34

Slide 34 text

No escaping the monad
future<> f = …;
f.get(); // not allowed

Slide 35

Slide 35 text

Unless...
future<> f = seastar::async([&] {
    future<> f = …;
    f.get();
});

Slide 37

Slide 37 text

Seastar memory allocator ● Not thread-safe! ○ Each core gets a private memory pool

Slide 38

Slide 38 text

Seastar memory allocator ● Not thread-safe! ○ Each core gets a private memory pool ● Allocation back pressure ○ Allocator calls a callback when low on memory ○ Scylla evicts cache in response

Slide 39

Slide 39 text

Seastar memory allocator ● Not thread-safe! ○ Each core gets a private memory pool ● Allocation back pressure ○ Allocator calls a callback when low on memory ○ Scylla evicts cache in response ● Inter-core free() through message passing
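
The inter-core free() idea can be sketched as follows, with the cross-shard machinery reduced to a single-threaded simulation (names are invented; the real allocator uses cross-shard message queues):

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Each shard owns a private pool. Memory owned by shard A but freed while
// running on shard B is not returned directly: B posts the pointer to A's
// queue, and A reclaims it the next time its event loop polls.
struct Shard {
    std::vector<void*> pending_frees;
    std::size_t reclaimed = 0;

    void poll() { // called from the shard's own event loop
        for (void* p : pending_frees) {
            ::operator delete(p);
            ++reclaimed;
        }
        pending_frees.clear();
    }
};

// Route a pointer back to its owning shard instead of freeing in place.
inline void cross_shard_free(Shard& owner, void* p) {
    owner.pending_frees.push_back(p);
}
```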

Slide 40

Slide 40 text

❏ Introducing ScyllaDB ❏ System Architecture ❏ Node Architecture ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing AGENDA

Slide 41

Slide 41 text

Usermode I/O scheduler Storage Block Layer Filesystem

Slide 42

Slide 42 text

Usermode I/O scheduler Storage Block Layer Filesystem Disk I/O Scheduler Class A Class B

Slide 43

Slide 43 text

Usermode I/O scheduler Query Commitlog Compaction Queue Queue Queue Userspace I/O Scheduler Disk
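
The scheduler above can be approximated by fair queuing over per-class queues. A minimal sketch, with invented names and share values; Scylla's real scheduler accounts for device bandwidth and IOPS rather than an abstract cost:

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Each I/O class (query, commitlog, compaction...) has a queue and a
// share count; we always dispatch from the backlogged class that has
// consumed the least work relative to its shares.
struct IoClass {
    std::string name;
    unsigned shares;        // relative priority
    std::deque<int> queue;  // pending request costs
    double consumed = 0;    // normalized work done so far
};

inline IoClass* dispatch_one(std::vector<IoClass>& classes) {
    IoClass* best = nullptr;
    for (auto& c : classes)
        if (!c.queue.empty() && (!best || c.consumed < best->consumed))
            best = &c;
    if (best) {
        best->consumed += double(best->queue.front()) / best->shares;
        best->queue.pop_front();
    }
    return best;
}
```

With shares of 100 vs. 10 and equally costly requests, the first class receives roughly ten dispatches for every one of the second; this share value is the priority knob the workload-conditioning slides later adjust.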

Slide 44

Slide 44 text

Figuring out the optimal disk concurrency (max useful disk concurrency)

Slide 45

Slide 45 text

Cassandra cache: Linux page cache over SSTables ● 4k granularity ● Thread-safe ● Synchronous APIs ● General-purpose ● Lack of control ● ...on the other hand: ○ Exists ○ Hundreds of man-years ○ Handles lots of edge cases

Slide 46

Slide 46 text

Cassandra cache: Linux page cache over SSTables ● Parasitic rows: a whole SSTable page (4k) is cached to hold your data (300b)

Slide 47

Slide 47 text

Cassandra cache: Linux page cache over SSTables ● Page faults (crossing the app thread, the kernel, and the SSD): page fault → suspend thread → initiate I/O → context switch → I/O completes → interrupt → context switch → map page → resume thread

Slide 48

Slide 48 text

Cassandra cache ● Complex tuning: key cache, row cache, on-heap / off-heap, Linux page cache, SSTables

Slide 49

Slide 49 text

Scylla cache Unified cache SSTables

Slide 50

Slide 50 text

Probabilistic Cache Warmup ● A replica with a cold cache should be sent fewer requests
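
One simple way to express this idea (a sketch; the formula and the floor value are invented for illustration, not Scylla's actual heuristic) is to weight each replica by its observed cache hit rate, with a floor so a cold replica still receives enough traffic to warm up:

```cpp
#include <algorithm>
#include <cassert>

// Probability weight for routing a request to a replica: proportional to
// its observed cache hit rate, clamped to [floor, 1] so the cold replica
// keeps receiving some requests and can warm its cache.
inline double replica_weight(double hit_rate, double floor = 0.1) {
    return std::clamp(hit_rate, floor, 1.0);
}
```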

Slide 51

Slide 51 text

Yet another allocator (problems with malloc/free) ● Memory gets fragmented over time ○ If the workload changes the sizes of allocated objects ○ Allocating a large contiguous block requires evicting most of the cache

Slide 52

Slide 52 text

Memory

Slide 68

Slide 68 text

Memory OOM :(

Slide 76

Slide 76 text

Log-structured memory allocation ● Bump-pointer allocation to current segment ● Frees leave holes in segments ● Compaction will try to solve this
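
A sketch of one such segment (sizes and counters invented for illustration): allocation bumps a pointer, free() only grows the hole count, and occupancy tells the compactor which segments are worth evacuating:

```cpp
#include <cassert>
#include <cstddef>

// A fixed-size log-structured segment. alloc() bump-allocates; freeing
// leaves a hole that only compaction can reclaim, by moving the surviving
// objects to another segment.
struct Segment {
    static constexpr std::size_t capacity = 1024;
    unsigned char data[capacity];
    std::size_t pos = 0;   // bump pointer
    std::size_t live = 0;  // bytes still in use; pos - live is garbage

    void* alloc(std::size_t n) {
        if (pos + n > capacity) return nullptr; // full: open a new segment
        void* p = data + pos;
        pos += n;
        live += n;
        return p;
    }

    void free_bytes(std::size_t n) { live -= n; } // leaves a hole

    // Sparse segments (low occupancy) are compacted first.
    double occupancy() const { return pos ? double(live) / pos : 1.0; }
};
```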

Slide 77

Slide 77 text

Compacting LSA ● Teach the allocator how to move objects around ○ Updating references ● Garbage collect (compact!) ○ Starting with the most sparse segments ○ Lock to pin objects ● Used mostly for the cache ○ The large majority of memory allocated ○ A small subset of allocation sites

Slide 78

Slide 78 text

❏ Introducing ScyllaDB ❏ System Architecture ❏ Node Architecture ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing AGENDA

Slide 79

Slide 79 text

● Internal feedback loops to balance competing loads ○ Consume what you export Workload Conditioning

Slide 80

Slide 80 text

Memtable Seastar Scheduler Compaction Query Repair Commitlog SSD WAN CPU Workload Conditioning

Slide 81

Slide 81 text

Memtable Seastar Scheduler Compaction Query Repair Commitlog SSD Compaction Backlog Monitor WAN CPU Workload Conditioning

Slide 82

Slide 82 text

Memtable Seastar Scheduler Compaction Query Repair Commitlog SSD Compaction Backlog Monitor WAN CPU Workload Conditioning Adjust priority
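
The backlog-to-priority feedback can be sketched as a simple proportional controller (all constants invented; the real controller is more sophisticated): the further compaction falls behind, the more scheduler shares it receives.

```cpp
#include <algorithm>
#include <cassert>

// Map the compaction backlog to scheduler shares: an empty backlog keeps
// compaction at a low priority; a full backlog drives it to the maximum.
inline unsigned compaction_shares(double backlog_bytes, double max_backlog) {
    constexpr unsigned min_shares = 50, max_shares = 1000;
    double ratio = std::clamp(backlog_bytes / max_backlog, 0.0, 1.0);
    return min_shares + static_cast<unsigned>(ratio * (max_shares - min_shares));
}
```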

Slide 83

Slide 83 text

Memtable Seastar Scheduler Compaction Query Repair Commitlog SSD Memory Monitor WAN CPU Workload Conditioning

Slide 84

Slide 84 text

Memtable Seastar Scheduler Compaction Query Repair Commitlog SSD Memory Monitor Adjust priority WAN CPU Workload Conditioning

Slide 85

Slide 85 text

❏ Introducing ScyllaDB ❏ System Architecture ❏ Node Architecture ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing AGENDA

Slide 86

Slide 86 text

● Careful system design and control of the software stack can maximize throughput ● Without sacrificing latency ● Without requiring complex end-user tuning ● While having a lot of fun Conclusions

Slide 87

Slide 87 text

● Download: http://www.scylladb.com ● Twitter: @ScyllaDB ● Source: http://github.com/scylladb/scylla ● Mailing lists: scylladb-user @ groups.google.com ● Slack: ScyllaDB-Users ● Blog: http://www.scylladb.com/blog ● Join: http://www.scylladb.com/company/careers ● Me: duarte@scylladb.com How to interact

Slide 88

Slide 88 text

SCYLLA, NoSQL at Ludicrous Speed Thank you. @duarte_nunes