
Scylla: Faster than a Speeding Byte

Scylla is a NoSQL database providing Apache Cassandra compatibility, distinguishing itself by supporting millions of operations per second, per node, with predictably low latency, on similar hardware. This talk will introduce Scylla, highlight some of its achievements, and lift the veil on its many design decisions, from the programming model down to the low-level, mechanically sympathetic details of its memory allocators.

Duarte Nunes

June 09, 2017

Transcript

  1. ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource

    Management ❏ Workload Conditioning ❏ Closing AGENDA
  2. Scylla • Clustered NoSQL database compatible with Apache Cassandra •

    ~10X performance on same hardware • Low latency, esp. higher percentiles • Self tuning • Mechanically sympathetic C++14
  3. YCSB benchmark: a 3-node Scylla cluster vs. 3-, 9-, 15- and 30-node Cassandra clusters.
  4. Scylla vs. Cassandra, CL:LOCAL_QUORUM (Outbrain case study): Scylla and Cassandra handling the full load (peak of ~12M RPM); ~200ms vs. ~10ms, i.e. 20x lower latency for Scylla.
  5. Cassandra Compatibility • CQL language and protocol • Legacy Thrift

    protocol • SStable file format • Configuration file format • JMX management protocol • Management command line tool
  6. ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource

    Management ❏ Workload Conditioning ❏ Closing AGENDA
  7. Dynamo-based system • Masterless • Data is replicated across a

    set of replicas • Data is partitioned across all nodes
  8. Dynamo-based system • Masterless • Data is replicated across a

    set of replicas • Data is partitioned across all nodes • An operation can specify a Consistency Level
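
    A minimal C++ sketch of the partitioning idea: hash the partition key onto a ring of tokens and walk clockwise to collect the replica set. This is purely illustrative; the Ring type, replicas_for and the use of std::hash are assumptions, not Scylla's Murmur3-based token ring.

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <functional>
        #include <map>
        #include <string>
        #include <vector>

        struct Ring {
            std::map<uint64_t, std::string> tokens;   // token -> node owning that range

            void add_node(const std::string& node, uint64_t token) {
                tokens.emplace(token, node);
            }

            // Start at the first token >= hash(key) and walk the ring clockwise,
            // collecting `rf` distinct nodes (the replica set).
            std::vector<std::string> replicas_for(const std::string& key, unsigned rf) const {
                std::vector<std::string> out;
                if (tokens.empty()) return out;
                auto it = tokens.lower_bound(std::hash<std::string>{}(key));
                for (std::size_t steps = 0; steps < tokens.size() && out.size() < rf; ++steps) {
                    if (it == tokens.end()) it = tokens.begin();   // wrap around the ring
                    if (std::find(out.begin(), out.end(), it->second) == out.end())
                        out.push_back(it->second);
                    ++it;
                }
                return out;
            }
        };

        // Usage: Ring ring; ring.add_node("node-a", 1ull << 62); ...
        //        auto replicas = ring.replicas_for("62", 3);      // RF = 3

    With the replica set chosen this way, any node can act as coordinator, and an operation's Consistency Level just says how many of those replicas must acknowledge before the coordinator replies.
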
  9. CAP Theorem • Consistent under partitions ◦ e.g., Spanner, ZooKeeper ◦ Unavailable ◦ Linearizability, single system image ◦ Expensive due to coordination
  10. CAP Theorem • Available under partitions ◦ e.g., Scylla, Cassandra, Dynamo ◦ Local operations, asynchronous propagation ◦ Anomalies ◦ Requires repair ◦ More difficult to program ◦ Fast and highly available
  11. Concurrent Updates: replicas R1 and R2 receive set(1, 'a') and set(1, 'b') in different orders and would end up with different values. How do we make concurrent updates commute?
  12. Concurrent Updates: each replica keeps the write with the highest timestamp, max(ts), so R1 and R2 converge to the same value for set(1, 'a') and set(1, 'b') regardless of arrival order.
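
    A minimal C++ sketch of the last-write-wins rule from the two slides above (illustrative only; `cell` and `merge` are made-up names, not Scylla internals): each write carries a timestamp, and a replica always keeps the cell with the larger timestamp, breaking ties deterministically, so the merge commutes and replicas converge whatever the delivery order.

        #include <cstdint>
        #include <string>

        struct cell {
            int64_t ts;             // write timestamp supplied with the mutation
            std::string value;
        };

        // Last-write-wins: max(ts), with a deterministic tie-break so that
        // merge(a, b) == merge(b, a) even for equal timestamps.
        cell merge(const cell& a, const cell& b) {
            if (a.ts != b.ts) return a.ts > b.ts ? a : b;
            return a.value >= b.value ? a : b;
        }
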
  13. Diverging replicas & anti-entropy: when R1 and R2 end up holding different values (one applied set(1, 'a'), the other set(1, 'b')), a background repair process reconciles them.
  14. Data model: a partition, identified by its partition key, holds rows sorted by clustering key. CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id)); INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
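
    An illustrative (non-Scylla) C++ picture of that model: a table maps each partition key to a partition whose rows are kept sorted by clustering key. The type names below are assumptions made for the sketch, not the storage engine's types.

        #include <cstdint>
        #include <map>
        #include <string>
        #include <unordered_map>

        struct row { std::string title; };

        using partition_key  = int32_t;                        // playlists.id
        using clustering_key = int32_t;                        // playlists.song_id
        using partition      = std::map<clustering_key, row>;  // rows sorted by clustering key
        using table          = std::unordered_map<partition_key, partition>;  // partitions located by key hash

        int main() {
            table playlists;
            playlists[62][209466] = row{"Ænima"};              // the INSERT from the slide
            return 0;
        }
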
  15. Log-Structured Merge Tree: writes produce new SSTables over time, SSTable 1 through SSTable 5 (foreground job); compaction merges existing SSTables, e.g. SSTables 1+2+3, into one (background job).
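
    A compact, illustrative sketch of that structure (not Scylla's SSTable code; all names are assumptions): writes land in a sorted in-memory memtable, a flush seals it into an immutable SSTable, and compaction merges several SSTables keeping the newest cell per key.

        #include <cstdint>
        #include <map>
        #include <string>
        #include <utility>
        #include <vector>

        struct cell { int64_t ts; std::string value; };

        using memtable = std::map<std::string, cell>;   // sorted by key, mutable
        using sstable  = std::map<std::string, cell>;   // sorted by key, immutable once written

        // Foreground job: seal the current memtable into a new SSTable.
        sstable flush(memtable&& m) { return sstable(std::move(m)); }

        // Background job: merge SSTables (e.g. 1+2+3) into one; the newest write wins.
        sstable compact(const std::vector<sstable>& inputs) {
            sstable out;
            for (const auto& s : inputs) {
                for (const auto& kv : s) {
                    auto res = out.insert(kv);
                    if (!res.second && kv.second.ts > res.first->second.ts) {
                        res.first->second = kv.second;  // keep the newer cell
                    }
                }
            }
            return out;
        }

    Reads must consult the memtable plus any SSTable that may hold the key, which is why keeping the SSTable count down through compaction matters.
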
  16. Implementation Goals • Efficiency: ◦ Make the most out of

    every cycle • Utilization: ◦ Squeeze every cycle from the machine • Control ◦ Spend the cycles on what we want, when we want
  17. ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource

    Management ❏ Workload Conditioning ❏ Closing AGENDA
  18. Enter Seastar www.seastar-project.org • Thread-per-core design (shard) ◦ No blocking.

    Ever. • Asynchronous networking, file I/O, multicore • Future/promise based APIs
  19. Enter Seastar www.seastar-project.org • Thread-per-core design (shard) ◦ No blocking.

    Ever. • Asynchronous networking, file I/O, multicore • Future/promise based APIs • Usermode TCP/IP stack included in the box
  20. Seastar task scheduler. Traditional stack: a thread is a function pointer and a stack is a byte array from 64k to megabytes; context switch cost is high and large stacks pollute the caches. Seastar stack: each CPU runs a scheduler over promises and tasks, where a promise is a pointer to an eventually computed value and a task is a pointer to a lambda function; no sharing, millions of parallel events.
  21. Futures:

    // Read a 4-byte id, look it up on the core that owns it, and write the result back.
    future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
        int id = buf_to_id(buf);
        unsigned core = id % smp::count;            // pick the shard that owns this id
        return smp::submit_to(core, [id] {          // hop to that core...
            return lookup(id);
        }).then([this] (sstring result) {           // ...and continue here with the result
            return _conn->write(result);
        });
    });
  27. Seastar memory allocator • Not thread-safe! ◦ Each core gets a private memory pool • Allocation back pressure ◦ Allocator calls a callback when low on memory ◦ Scylla evicts cache in response
  28. Seastar memory allocator • Not thread-safe! ◦ Each core gets a private memory pool • Allocation back pressure ◦ Allocator calls a callback when low on memory ◦ Scylla evicts cache in response • Inter-core free() through message passing
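
    An illustrative sketch of per-core pools with cross-core free() via message passing, in the spirit of the slide above; this is NOT Seastar's actual allocator. Seastar uses lock-free queues between cores; a mutex-guarded vector stands in here, and the header tag, shard variable and function names are assumptions.

        #include <cstdlib>
        #include <mutex>
        #include <vector>

        constexpr unsigned kShards = 4;                   // assumed core count
        thread_local unsigned this_shard = 0;             // assumed: set per reactor thread

        struct alloc_header { unsigned owner; };          // tag each allocation with its shard

        struct free_queue {
            std::mutex m;                                 // stand-in for a lock-free queue
            std::vector<void*> pending;
        };
        free_queue queues[kShards];

        void* shard_alloc(std::size_t n) {
            auto* h = static_cast<alloc_header*>(std::malloc(sizeof(alloc_header) + n));
            h->owner = this_shard;                        // remember the owning core
            return h + 1;
        }

        void local_free(void* p) {                        // only ever runs on the owning core
            std::free(static_cast<alloc_header*>(p) - 1);
        }

        void shard_free(void* p) {
            unsigned owner = (static_cast<alloc_header*>(p) - 1)->owner;
            if (owner == this_shard) {
                local_free(p);                            // fast path: our own pool, no sharing
            } else {
                std::lock_guard<std::mutex> g(queues[owner].m);
                queues[owner].pending.push_back(p);       // message-pass the pointer to its owner
            }
        }

        // Each shard's event loop periodically drains its queue and frees locally.
        void drain_frees() {
            std::vector<void*> batch;
            {
                std::lock_guard<std::mutex> g(queues[this_shard].m);
                batch.swap(queues[this_shard].pending);
            }
            for (void* p : batch) local_free(p);
        }
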
  29. ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource

    Management ❏ Workload Conditioning ❏ Closing AGENDA
  30. Linux page cache (SSTables are read through it) • Page faults: page fault → suspend thread → initiate I/O → context switch into the kernel → I/O completes on the SSD → interrupt → context switch → map page → resume app thread
  31. Linux page cache • 4k granularity • Thread-safe • Synchronous APIs • General-purpose
  32. Linux page cache • 4k granularity • Thread-safe • Synchronous APIs • General-purpose • Lack of control
  33. Linux page cache • 4k granularity • Thread-safe • Synchronous APIs • General-purpose • Lack of control • Lack of control
  34. Linux page cache • 4k granularity • Thread-safe • Synchronous APIs • General-purpose • Lack of control • Lack of control • ...on the other hand ◦ Exists ◦ Hundreds of man-years ◦ Handling lots of edge cases
  35. Cassandra cache: key cache and row cache (on-heap / off-heap) layered over the Linux page cache and SSTables • Complex tuning
  36. Yet another allocator (problems with malloc/free) • Memory gets fragmented over time ◦ If the workload changes the sizes of allocated objects ◦ Allocating a large contiguous block requires evicting most of the cache
  37. Log-structured memory allocation • Bump-pointer allocation into the current segment • Frees leave holes in segments • Compaction will try to solve this
  38. Compacting LSA • Teach the allocator how to move objects around ◦ Updating references • Garbage collect (compact!) ◦ Starting with the sparsest segments ◦ Lock to pin objects • Used mostly for the cache ◦ Large majority of memory allocated ◦ Small subset of allocation sites
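
    An illustrative sketch of log-structured (bump-pointer) allocation into fixed-size segments, as described in the two slides above; this is not Scylla's LSA, and `segment_size` and the type names are assumptions. Compaction is only hinted at in comments: it would copy live objects out of sparse segments and update references, which is why LSA objects must be movable.

        #include <cstddef>
        #include <memory>
        #include <vector>

        constexpr std::size_t segment_size = 128 * 1024;    // assumed segment size

        struct segment {
            std::unique_ptr<char[]> data{new char[segment_size]};
            std::size_t pos = 0;                             // bump pointer
            std::size_t live = 0;                            // bytes still in use
        };

        class lsa_region {
            std::vector<std::unique_ptr<segment>> _segments;
            segment* _current = nullptr;

        public:
            // Bump-pointer allocation (alignment and oversized allocations omitted).
            void* alloc(std::size_t n) {
                if (!_current || _current->pos + n > segment_size) {
                    _segments.emplace_back(new segment);     // start a fresh segment
                    _current = _segments.back().get();
                }
                void* p = _current->data.get() + _current->pos;
                _current->pos += n;                          // no free list, just bump
                _current->live += n;
                return p;
            }

            // free() only leaves a hole; a sketch shortcut passes the segment and size
            // explicitly, where the real allocator derives both from the pointer.
            // Compaction would later pick the sparsest segments (lowest live/segment_size),
            // move the surviving objects and reclaim the segments wholesale.
            void free(segment& s, std::size_t n) { s.live -= n; }
        };
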
  39. ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource

    Management ❏ Workload Conditioning ❏ Closing AGENDA
  40. Workload Conditioning: the Seastar scheduler divides CPU, SSD and WAN bandwidth between memtable flushes, compaction, queries, repair and the commitlog; a compaction backlog monitor feeds back into the scheduler to adjust priorities.
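
    A very rough sketch of the feedback idea above; this is not Scylla's actual controller, and the share range, backlog measure and function names are assumptions. The point is only that a measured backlog is mapped to the share of resources the scheduler grants background work, so foreground queries keep their latency while the backlog stays bounded.

        #include <algorithm>
        #include <cstddef>

        struct backlog_monitor {
            std::size_t pending_bytes = 0;   // assumed: SSTable bytes awaiting compaction
        };

        // Map backlog to scheduler shares (assumed range 10..1000), growing linearly
        // up to a target backlog; the event loop would re-evaluate this periodically.
        unsigned compaction_shares(const backlog_monitor& m, std::size_t target_backlog) {
            if (target_backlog == 0) return 1000;
            double pressure = static_cast<double>(m.pending_bytes) / target_backlog;
            double shares = 10.0 + pressure * (1000.0 - 10.0);
            return static_cast<unsigned>(std::min(shares, 1000.0));
        }

    The same loop can be applied to the other internal workloads (repair, flushes, commitlog), which is what lets the system stay self-tuning as the load changes.
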
  41. ❏ Introducing Scylla ❏ System Overview ❏ Seastar ❏ Resource

    Management ❏ Workload Conditioning ❏ Closing AGENDA
  42. Conclusions • Careful system design and control of the software stack can maximize throughput • Without sacrificing latency • Without requiring complex end-user tuning • While having a lot of fun
  43. How to interact • Download: http://www.scylladb.com • Twitter: @ScyllaDB • Source: http://github.com/scylladb/scylla • Mailing lists: scylladb-user @ groups.google.com • Slack: ScyllaDB-Users • Blog: http://www.scylladb.com/blog • Join: http://www.scylladb.com/company/careers • Me: [email protected]