Slide 1

Slide 1 text

SCYLLA: Faster than a Speeding Byte! Duarte Nunes @duarte_nunes

Slide 2

Slide 2 text

AGENDA ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing

Slide 3

Slide 3 text

Scylla ● Clustered NoSQL database compatible with Apache Cassandra ● ~10× the performance on the same hardware ● Low latency, especially at the higher percentiles ● Self-tuning ● Mechanically sympathetic C++14

Slide 4

Slide 4 text

YCSB Benchmark: 3-node Scylla cluster vs. 3-, 9-, 15-, and 30-node Cassandra clusters (chart: 3 Scylla nodes vs. 3 and 30 Cassandra nodes)

Slide 5

Slide 5 text

Scylla vs. Cassandra at CL=LOCAL_QUORUM, Outbrain Case Study: Scylla and Cassandra handling the full load (peak of ~12M RPM), 200 ms vs. 10 ms, 20× lower latency

Slide 6

Slide 6 text

Scylla benchmark by Samsung (chart: op/s). Full report: http://tinyurl.com/msl-scylladb

Slide 7

Slide 7 text

Cassandra Compatibility ● CQL language and protocol ● Legacy Thrift protocol

Slide 8

Slide 8 text

Cassandra Compatibility ● CQL language and protocol ● Legacy Thrift protocol ● SSTable file format

Slide 9

Slide 9 text

Cassandra Compatibility ● CQL language and protocol ● Legacy Thrift protocol ● SSTable file format ● Configuration file format ● JMX management protocol ● Management command line tool

Slide 10

Slide 10 text

Sharing the Ecosystem ● Spark ● Presto ● JanusGraph ● KairosDB

Slide 11

Slide 11 text

Monitoring

Slide 12

Slide 12 text

AGENDA ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing

Slide 13

Slide 13 text

Dynamo-based system

Slide 14

Slide 14 text

Dynamo-based system ● Masterless

Slide 15

Slide 15 text

Dynamo-based system ● Masterless ● Data is replicated across a set of replicas

Slide 16

Slide 16 text

Dynamo-based system ● Masterless ● Data is replicated across a set of replicas ● Data is partitioned across all nodes

Slide 17

Slide 17 text

Dynamo-based system ● Masterless ● Data is replicated across a set of replicas ● Data is partitioned across all nodes ● An operation can specify a Consistency Level
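The partitioning described above can be sketched with a toy consistent-hash ring. This is illustrative only, not Scylla's actual token ring; the names `HashRing` and `replicas` are invented for this sketch:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Minimal hash ring: each node owns the token range ending at its token.
// A key hashes to a token; the first node at or after that token is the
// primary replica, and the next rf-1 nodes complete the replica set.
class HashRing {
    std::map<uint64_t, std::string> ring_;  // token -> node
public:
    void add_node(const std::string& name, uint64_t token) {
        ring_.emplace(token, name);
    }
    std::vector<std::string> replicas(const std::string& key, unsigned rf) const {
        std::vector<std::string> out;
        uint64_t token = std::hash<std::string>{}(key);
        auto it = ring_.lower_bound(token);
        while (out.size() < rf && out.size() < ring_.size()) {
            if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
            out.push_back(it->second);
            ++it;
        }
        return out;
    }
};
```

With rf = 3 and three nodes, every key maps to all nodes; with more nodes, consecutive ring positions form the replica set.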

Slide 18

Slide 18 text

CAP Theorem ● Consistent under partitions ○ e.g., Spanner, ZooKeeper ○ Unavailable ○ Linearizability, single system image ○ Expensive due to coordination

Slide 19

Slide 19 text

CAP Theorem ● Available under partitions ○ e.g., Scylla, Cassandra, Dynamo ○ Local operations, asynchronous propagation ○ Anomalies ○ Requires repair ○ More difficult to program ○ Fast and highly available

Slide 20

Slide 20 text

Concurrent Updates (diagram: replicas R1 and R2 receive set(1, ‘a’) and set(1, ‘b’) in opposite orders and end up with different values)

Slide 21

Slide 21 text

Concurrent Updates (diagram: replicas R1 and R2 apply set(1, ‘a’) and set(1, ‘b’) concurrently) How to make concurrent updates commute?

Slide 22

Slide 22 text

Concurrent Updates (diagram: both replicas resolve set(1, ‘a’) vs. set(1, ‘b’) with max(ts), so R1 and R2 converge to the same value)
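The max(ts) rule, last-write-wins, can be sketched as below. This is a toy model; `Cell` and `lww_merge` are hypothetical names, not Scylla's API:

```cpp
#include <cstdint>
#include <string>
#include <tuple>

// A write carries the timestamp it was issued with; merging two versions
// keeps the one with the higher timestamp.
struct Cell {
    int64_t ts;
    std::string value;
};

inline Cell lww_merge(const Cell& a, const Cell& b) {
    // max(ts): higher timestamp wins; ties are broken by comparing values,
    // so merge(a, b) == merge(b, a) and every replica converges.
    if (std::tie(a.ts, a.value) >= std::tie(b.ts, b.value)) return a;
    return b;
}
```

Because the merge is commutative and associative, replicas can apply concurrent updates in any order and still agree.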

Slide 23

Slide 23 text

Diverging replicas & anti-entropy (diagram: replicas R1 and R2 diverge after set(1, ‘a’) and set(1, ‘b’); anti-entropy repair reconciles them)

Slide 24

Slide 24 text

Data model (diagram: rows sorted by primary key: partition key, then clustering keys)
CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id));
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');

Slide 25

Slide 25 text

Log-Structured Merge Tree SSTable 1 Time

Slide 26

Slide 26 text

Log-Structured Merge Tree SSTable 1 SSTable 2 Time

Slide 27

Slide 27 text

Log-Structured Merge Tree SSTable 1 SSTable 2 SSTable 3 Time

Slide 28

Slide 28 text

Log-Structured Merge Tree SSTable 1 SSTable 2 SSTable 3 Time SSTable 4

Slide 29

Slide 29 text

Log-Structured Merge Tree SSTable 1 SSTable 2 SSTable 3 Time SSTable 4 SSTable 1+2+3

Slide 30

Slide 30 text

Log-Structured Merge Tree SSTable 1 SSTable 2 SSTable 3 Time SSTable 4 SSTable 5 SSTable 1+2+3

Slide 31

Slide 31 text

Log-Structured Merge Tree SSTable 1 SSTable 2 SSTable 3 Time SSTable 4 SSTable 5 SSTable 1+2+3 Foreground Job Background Job
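The background compaction step, merging SSTables 1+2+3 into one, can be sketched as below. A toy model: real compaction streams sorted runs from disk; here each SSTable is just an in-memory sorted map:

```cpp
#include <map>
#include <string>
#include <vector>

// Each SSTable is sorted by key. Merging keeps, for every key, the entry
// from the newest table; tables are given oldest-first, so later tables
// overwrite earlier ones. Obsolete versions are dropped in the process.
using SSTable = std::map<std::string, std::string>;

inline SSTable compact(const std::vector<SSTable>& tables) {
    SSTable out;
    for (const auto& t : tables) {        // oldest to newest
        for (const auto& kv : t) {
            out[kv.first] = kv.second;    // newer value overwrites older
        }
    }
    return out;
}
```

The output is again one sorted run, so reads that had to consult three tables now consult one.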

Slide 32

Slide 32 text

Request path (diagram: memtable and SSTables)

Slide 33

Slide 33 text

Request path (diagram: reads are served from the memtable and SSTables)

Slide 34

Slide 34 text

Request path (diagram: reads are served from the memtable and SSTables; writes go to the commit log and the memtable)
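The full request path can be sketched in one place. Illustrative only: `Store` is an invented name, the "commit log" here is an in-memory vector, and flushing is triggered by entry count rather than bytes:

```cpp
#include <map>
#include <string>
#include <vector>

// Write path: append to the commit log for durability, then apply to the
// in-memory memtable; a full memtable is flushed to a new immutable
// SSTable and the log is discarded. Reads check memtable, then SSTables.
class Store {
    std::vector<std::string> commit_log_;          // durable, append-only
    std::map<std::string, std::string> memtable_;  // sorted, in-memory
    std::vector<std::map<std::string, std::string>> sstables_;
    size_t flush_threshold_;
public:
    explicit Store(size_t flush_threshold) : flush_threshold_(flush_threshold) {}

    void write(const std::string& key, const std::string& value) {
        commit_log_.push_back(key + "=" + value);  // 1. durability
        memtable_[key] = value;                    // 2. apply in memory
        if (memtable_.size() >= flush_threshold_) {
            sstables_.push_back(memtable_);        // 3. flush to "disk"
            memtable_.clear();
            commit_log_.clear();                   // log no longer needed
        }
    }

    // Newest data wins: memtable first, then SSTables newest-first.
    const std::string* read(const std::string& key) const {
        auto it = memtable_.find(key);
        if (it != memtable_.end()) return &it->second;
        for (auto t = sstables_.rbegin(); t != sstables_.rend(); ++t) {
            auto jt = t->find(key);
            if (jt != t->end()) return &jt->second;
        }
        return nullptr;
    }
};
```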

Slide 35

Slide 35 text

Implementation Goals ● Efficiency: ○ Make the most out of every cycle ● Utilization: ○ Squeeze every cycle from the machine ● Control: ○ Spend the cycles on what we want, when we want

Slide 36

Slide 36 text

AGENDA ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing

Slide 37

Slide 37 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever.

Slide 38

Slide 38 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever. ● Asynchronous networking, file I/O, multicore

Slide 40

Slide 40 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever. ● Asynchronous networking, file I/O, multicore ● Future/promise based APIs

Slide 41

Slide 41 text

Enter Seastar www.seastar-project.org ● Thread-per-core design (shard) ○ No blocking. Ever. ● Asynchronous networking, file I/O, multicore ● Future/promise based APIs ● Usermode TCP/IP stack included in the box

Slide 42

Slide 42 text

Seastar task scheduler (diagram: traditional stack vs. Seastar stack) ● Traditional stack: a thread is a function pointer and a stack is a byte array from 64k to megabytes; context switch cost is high, and large stacks pollute the caches ● Seastar stack: a promise is a pointer to an eventually computed value and a task is a pointer to a lambda function; no sharing, millions of parallel events
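The Seastar-stack idea can be sketched as a run loop over lambdas. A minimal sketch: `Reactor` is an invented name, not Seastar's actual reactor API, and there is no I/O here, only task scheduling:

```cpp
#include <deque>
#include <functional>

// One shard's run loop: tasks are plain lambdas in a queue, and running a
// task may enqueue more tasks. There are no thread stacks to switch; a
// "context switch" is just popping the next lambda.
class Reactor {
    std::deque<std::function<void()>> tasks_;
public:
    void schedule(std::function<void()> fn) { tasks_.push_back(std::move(fn)); }

    // Drain the queue; returns the number of tasks executed.
    int run() {
        int executed = 0;
        while (!tasks_.empty()) {
            auto fn = std::move(tasks_.front());
            tasks_.pop_front();
            fn();
            ++executed;
        }
        return executed;
    }
};
```

A continuation is simply another lambda scheduled by the current one, which is what future/promise chains compile down to.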

Slide 43

Slide 43 text

Seastar memcached

Slide 44

Slide 44 text

Pedis https://github.com/fastio/pedis

Slide 45

Slide 45 text

Futures
future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);
    unsigned core = id % smp::count;
    return smp::submit_to(core, [id] {
        return lookup(id);
    }).then([this] (sstring result) {
        return _conn->write(result);
    });
});

Slide 51

Slide 51 text

Seastar memory allocator ● Not thread-safe! ○ Each core gets a private memory pool

Slide 52

Slide 52 text

Seastar memory allocator ● Not thread-safe! ○ Each core gets a private memory pool ● Allocation back pressure ○ Allocator calls a callback when low on memory ○ Scylla evicts cache in response

Slide 53

Slide 53 text

Seastar memory allocator ● Not thread-safe! ○ Each core gets a private memory pool ● Allocation back pressure ○ Allocator calls a callback when low on memory ○ Scylla evicts cache in response ● Inter-core free() through message passing
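Inter-core free() through message passing can be sketched as below. A single-threaded toy: real Seastar posts the pointer on a lock-free queue between threads, while here the "message" is a plain deque the owner drains:

```cpp
#include <cstddef>
#include <deque>

// Each shard owns a non-thread-safe allocator. When shard B frees memory
// that shard A allocated, the pointer is not freed in place: it goes into
// A's inbox, and A frees it the next time its event loop polls.
struct Shard {
    std::deque<void*> inbox;  // pointers other shards want us to free
    size_t live = 0;          // allocations owned by this shard

    void* alloc(size_t n) { ++live; return ::operator new(n); }
    void free_local(void* p) { --live; ::operator delete(p); }

    // Called by this shard's event loop between tasks.
    void drain_inbox() {
        while (!inbox.empty()) {
            free_local(inbox.front());
            inbox.pop_front();
        }
    }
};

// Free from any shard: locally if we own it, else message the owner.
inline void shard_free(Shard& owner, Shard& caller, void* p) {
    if (&owner == &caller) owner.free_local(p);
    else owner.inbox.push_back(p);  // cross-core free becomes a message
}
```

This keeps the fast path completely lock-free: the pool itself is only ever touched by its owning core.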

Slide 54

Slide 54 text

AGENDA ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing

Slide 55

Slide 55 text

Usermode I/O scheduler (diagram: Query, Commitlog, and Compaction each feed their own queue into a userspace I/O scheduler in front of the disk)
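The dispatch decision such a scheduler makes can be sketched with share-based accounting. A toy model: the names and the one-unit cost charging are invented, and Seastar's actual fair queue is considerably more sophisticated:

```cpp
#include <deque>
#include <string>
#include <vector>

// Each I/O class (query, commitlog, compaction) has a queue and a share.
// We dispatch from the class whose consumed-work / share ratio is lowest,
// so over time bandwidth is divided in proportion to the shares.
struct IoClass {
    std::string name;
    unsigned shares;        // relative priority
    std::deque<int> queue;  // pending request ids
    double consumed = 0;    // work charged so far
};

inline IoClass* pick_next(std::vector<IoClass>& classes) {
    IoClass* best = nullptr;
    for (auto& c : classes) {
        if (c.queue.empty()) continue;
        if (!best || c.consumed / c.shares < best->consumed / best->shares) {
            best = &c;
        }
    }
    return best;  // nullptr if all queues are empty
}

inline int dispatch(IoClass& c) {
    int req = c.queue.front();
    c.queue.pop_front();
    c.consumed += 1.0;  // charge one unit of work
    return req;
}
```

With shares 4:1, five consecutive dispatches send four query requests for every compaction request.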

Slide 56

Slide 56 text

Figuring out optimal disk concurrency (chart: throughput levels off past the max useful disk concurrency)

Slide 57

Slide 57 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity

Slide 58

Slide 58 text

Linux page cache (diagram: page cache in front of SSTables) ● Parasitic rows: a 4k page holding just 300 bytes of your data

Slide 59

Slide 59 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity ● Thread-safe

Slide 60

Slide 60 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity ● Thread-safe ● Synchronous APIs

Slide 61

Slide 61 text

Linux page cache (diagram: page cache in front of SSTables) ● Page faults (app thread / kernel / SSD timeline: page fault → suspend thread → initiate I/O → context switch; I/O completes → interrupt → context switch → map page → resume thread)

Slide 62

Slide 62 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity ● Thread-safe ● Synchronous APIs ● General-purpose

Slide 63

Slide 63 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity ● Thread-safe ● Synchronous APIs ● General-purpose ● Lack of control

Slide 64

Slide 64 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity ● Thread-safe ● Synchronous APIs ● General-purpose ● Lack of control ● Lack of control

Slide 65

Slide 65 text

Linux page cache (diagram: page cache in front of SSTables) ● 4k granularity ● Thread-safe ● Synchronous APIs ● General-purpose ● Lack of control ● Lack of control ● ...on the other hand ○ Exists ○ Hundreds of man-years ○ Handling lots of edge cases

Slide 66

Slide 66 text

Cassandra cache (diagram: key cache and row cache, on-heap / off-heap, plus the Linux page cache in front of SSTables) ● Complex tuning

Slide 67

Slide 67 text

Scylla cache (diagram: a single unified cache in front of SSTables)

Slide 68

Slide 68 text

Probabilistic Cache Warmup ● A replica with a cold cache should be sent fewer requests

Slide 69

Slide 69 text

Yet another allocator (problems with malloc/free) ● Memory gets fragmented over time ○ If the workload changes the sizes of allocated objects ○ Allocating a large contiguous block requires evicting most of the cache

Slide 70

Slide 70 text

Memory

Slide 86

Slide 86 text

Memory OOM :(

Slide 94

Slide 94 text

Log-structured memory allocation ● Bump-pointer allocation into the current segment ● Frees leave holes in segments ● Compaction will try to solve this
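The segment bookkeeping behind those bullets can be sketched as below. A toy model: it tracks (segment, offset) pairs instead of returning real pointers, and `LsaAllocator` is an invented name:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Memory is carved into fixed-size segments. Allocation bumps a pointer in
// the current segment; free() only records that the segment lost live
// bytes (a hole). Occupancy tells compaction which segments to evacuate.
struct Segment {
    static const size_t size = 1024;
    size_t used = 0;   // bump-pointer offset
    size_t live = 0;   // bytes still alive (used minus holes)
};

class LsaAllocator {
    std::vector<Segment> segments_;
public:
    LsaAllocator() { segments_.emplace_back(); }

    // Returns (segment index, offset); opens a new segment when full.
    std::pair<size_t, size_t> alloc(size_t n) {
        if (segments_.back().used + n > Segment::size) {
            segments_.emplace_back();
        }
        Segment& s = segments_.back();
        size_t off = s.used;
        s.used += n;
        s.live += n;
        return {segments_.size() - 1, off};
    }

    // Freeing just leaves a hole; compaction would reclaim it later.
    void free(size_t seg, size_t n) { segments_[seg].live -= n; }

    double occupancy(size_t seg) const {
        return double(segments_[seg].live) / Segment::size;
    }
    size_t segment_count() const { return segments_.size(); }
};
```

Allocation is a pointer bump, so it is fast and never fragments within a segment; all the cleanup work is deferred to compaction.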

Slide 95

Slide 95 text

Compacting LSA ● Teach the allocator how to move objects around ○ Updating references ● Garbage collect ○ Starting with the sparsest segments ○ Lock to pin objects ● Used mostly for the cache ○ Large majority of memory allocated ○ Small subset of allocation sites

Slide 96

Slide 96 text

AGENDA ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing

Slide 97

Slide 97 text

Workload Conditioning ● Internal feedback loops to balance competing loads ○ Consume what you export

Slide 98

Slide 98 text

Workload Conditioning (diagram: Memtable, Compaction, Query, Repair, and Commitlog feed the Seastar Scheduler, which drives the CPU, SSD, and WAN)

Slide 99

Slide 99 text

Workload Conditioning (diagram: as before, with a Compaction Backlog Monitor watching the compaction backlog)

Slide 100

Slide 100 text

Workload Conditioning (diagram: the Compaction Backlog Monitor adjusts compaction priority in the Seastar Scheduler)
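The backlog-to-priority feedback can be sketched as a simple controller. An assumption for illustration: the mapping here is linear, and `BacklogController` is an invented name; Scylla's real controllers are more elaborate:

```cpp
#include <algorithm>

// Workload conditioning in miniature: a monitor reports the compaction
// backlog as a fraction in [0, 1], and the controller converts it into
// scheduler shares. Compaction gets more CPU/disk when it falls behind
// and gives the headroom back to queries when it catches up.
struct BacklogController {
    unsigned min_shares;
    unsigned max_shares;

    unsigned shares_for(double backlog) const {
        backlog = std::min(1.0, std::max(0.0, backlog));  // clamp
        return min_shares
             + static_cast<unsigned>(backlog * (max_shares - min_shares));
    }
};
```

The loop is "consume what you export": the monitor reads metrics the compaction subsystem already publishes and feeds the result straight back into the scheduler.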

Slide 101

Slide 101 text

Workload Conditioning (diagram: as before, with a Memory Monitor watching memtable memory)

Slide 102

Slide 102 text

Workload Conditioning (diagram: the Memory Monitor adjusts memtable flush priority in the Seastar Scheduler)

Slide 103

Slide 103 text

AGENDA ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource Management ❏ Workload Conditioning ❏ Closing

Slide 104

Slide 104 text

Conclusions ● Careful system design and control of the software stack can maximize throughput ● Without sacrificing latency ● Without requiring complex end-user tuning ● While having a lot of fun

Slide 105

Slide 105 text

How to interact ● Download: http://www.scylladb.com ● Twitter: @ScyllaDB ● Source: http://github.com/scylladb/scylla ● Mailing lists: scylladb-user @ groups.google.com ● Slack: ScyllaDB-Users ● Blog: http://www.scylladb.com/blog ● Join: http://www.scylladb.com/company/careers ● Me: duarte@scylladb.com

Slide 106

Slide 106 text

Questions? @duarte_nunes