ScyllaDB: NoSQL at Ludicrous Speed

ScyllaDB is a NoSQL database compatible with Apache Cassandra. It distinguishes itself by supporting millions of operations per second per node, with predictably low latency, on hardware similar to what a Cassandra deployment would use. This talk covers the design decisions and techniques that make this possible.

Duarte Nunes

May 19, 2017


  1. ScyllaDB
    • Clustered NoSQL database compatible with Apache Cassandra
    • ~10X performance on same hardware
    • Low latency, especially at higher percentiles
    • Self-tuning
    • Mechanically sympathetic C++14
  2. YCSB Benchmark: 3-node Scylla cluster vs 3, 9, 15, 30 Cassandra machines
    [Chart; series labels: 3 Scylla, 30 Cassandra, 3 Cassandra]
  3. Scylla vs Cassandra - CL:LOCAL_QUORUM, Outbrain Case Study
    • Scylla and Cassandra handling the full load (peak of ~12M RPM)
    [Chart; labels: 200ms, 10ms, 20x Lower Latency]
  4. Data model
    [Diagram: Partition Key1 with rows under Clustering Key1, Clustering Key2, ...; sorted by primary key]
        CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id));
        INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
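The layout on this slide can be sketched in plain C++ as a map of partitions, each holding its rows sorted by clustering key. This is only an illustration, not Scylla's storage format: the names `Table`, `Partition`, `insert_row` and `read_partition` are hypothetical, and an unordered map stands in for however partitions are actually located.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory model of the playlists table.
// Rows inside a partition stay sorted by the clustering key (song_id).
using Partition = std::map<int, std::string>;        // song_id -> title
using Table = std::unordered_map<int, Partition>;    // id (partition key) -> rows

// Mirrors the INSERT on the slide: writing an existing (id, song_id)
// pair overwrites the previous title.
void insert_row(Table& table, int id, int song_id, const std::string& title) {
    table[id][song_id] = title;
}

// Returns the titles of one partition in clustering-key order,
// or an empty vector when the partition does not exist.
std::vector<std::string> read_partition(const Table& table, int id) {
    std::vector<std::string> titles;
    auto it = table.find(id);
    if (it == table.end()) {
        return titles;
    }
    for (const auto& row : it->second) {
        titles.push_back(row.second);
    }
    return titles;
}
```

Reading partition 62 after several inserts returns its titles ordered by `song_id`, regardless of insertion order.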
  5. Log-Structured Merge Tree
    [Diagram: SSTable 1 through SSTable 5 along a time axis, plus a merged SSTable 1+2+3; labels: Foreground Job, Background Job]
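A minimal sketch of the idea in the diagram, assuming each SSTable can be modeled as an immutable sorted map: reads consult the newest run first, and compaction merges several runs into one so that newer values win. `SSTable`, `lookup` and `compact` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an on-disk SSTable: an immutable sorted run.
using SSTable = std::map<std::string, std::string>;

// Read path: runs are ordered oldest first, so search from the back.
// Returns nullptr when no run contains the key.
const std::string* lookup(const std::vector<SSTable>& runs, const std::string& key) {
    for (auto run = runs.rbegin(); run != runs.rend(); ++run) {
        auto it = run->find(key);
        if (it != run->end()) {
            return &it->second;
        }
    }
    return nullptr;
}

// Compaction: merge runs (oldest first) into a single sorted run.
// Later runs overwrite earlier ones, so the newest value survives.
SSTable compact(const std::vector<SSTable>& runs) {
    SSTable merged;
    for (const auto& run : runs) {
        for (const auto& entry : run) {
            merged[entry.first] = entry.second;
        }
    }
    return merged;
}
```

After compaction a read touches one run instead of three, which is the reason the merge is worth doing in the background.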
  6. Implementation Goals
    • Efficiency:
      ◦ Make the most out of every cycle
    • Utilization:
      ◦ Squeeze every cycle from the machine
    • Control:
      ◦ Spend the cycles on what we want, when we want
  7. Agenda
    ❏ Introducing ScyllaDB
    ❏ System Architecture
    ❏ Node Architecture
    ❏ Seastar
    ❏ Resource Management
    ❏ Workload Conditioning
    ❏ Closing
  9. Enter Seastar (www.seastar-project.org)
    • Thread-per-core design (shard)
      ◦ No blocking. Ever.
    • Asynchronous networking, file I/O, multicore
    • Future/promise based APIs
    • Usermode TCP/IP stack included in the box
  10. Seastar task scheduler
    Traditional stack:
    • Thread is a function pointer
    • Stack is a byte array from 64k to megabytes
    • Context switch cost is high. Large stacks pollute the caches
    Seastar stack:
    • Promise is a pointer to an eventually computed value
    • Task is a pointer to a lambda function
    • No sharing, millions of parallel events
    [Diagram: in the traditional stack, one scheduler per CPU runs many threads, each with its own stack; in the Seastar stack, each CPU runs a queue of promise/task pairs]
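  11. Futures
        // Read a 4-byte id, look it up on the core that owns it, write the result back.
        future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
            int id = buf_to_id(buf);
            unsigned core = id % smp::count;  // pick the owning shard
            return smp::submit_to(core, [id] {
                return lookup(id);  // runs on the owning core
            }).then([this] (sstring result) {
                return _conn->write(result);  // continues on the originating core
            });
        });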
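The routing rule on this slide (`id % smp::count`) and the run-to-completion task model can be illustrated without Seastar. The following is a single-threaded simulation under stated assumptions: `Shards`, `owner_of` and `run` are hypothetical names, each "core" is just a queue, and nothing here reflects how Seastar implements `smp::submit_to` internally.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded simulation: each "core" owns a task queue.
class Shards {
public:
    explicit Shards(unsigned count) : _queues(count) { assert(count > 0); }

    unsigned count() const { return static_cast<unsigned>(_queues.size()); }

    // Same routing rule as the slide: the id alone decides the owning core.
    unsigned owner_of(unsigned id) const { return id % count(); }

    // Enqueue the task on the target core instead of running it in place.
    void submit_to(unsigned core, std::function<void()> task) {
        _queues.at(core).push_back(std::move(task));
    }

    // Drain all queues. Each task runs to completion and may enqueue
    // follow-up tasks on any core. Returns the number of tasks executed.
    std::size_t run() {
        std::size_t executed = 0;
        bool progress = true;
        while (progress) {
            progress = false;
            for (auto& queue : _queues) {
                while (!queue.empty()) {
                    std::function<void()> task = std::move(queue.front());
                    queue.pop_front();
                    task();
                    ++executed;
                    progress = true;
                }
            }
        }
        return executed;
    }

private:
    std::vector<std::deque<std::function<void()>>> _queues;
};
```

A lookup submitted to `owner_of(id)` can enqueue its continuation back on the originating core, which mirrors the shape of the chained `.then()` calls without any shared mutable state between cores.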
  18. Seastar memory allocator
    • Not thread-safe!
      ◦ Each core gets a private memory pool
    • Allocation back pressure
      ◦ Allocator calls a callback when low on memory
      ◦ Scylla evicts cache in response
    • Inter-core free() through message passing
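The three ideas on this slide (a private pool per core, a low-memory callback, and frees from other cores delivered as messages) can be modeled with simple byte accounting. This is a hypothetical, single-threaded sketch: `CorePool`, `post_remote_free` and the callback signature are assumptions for illustration, and a real implementation would use a cross-core queue for the inbox.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-core pool that only tracks byte counts.
class CorePool {
public:
    // Asked to release at least `needed` bytes; returns the bytes released.
    using Reclaimer = std::function<std::size_t(std::size_t needed)>;

    CorePool(std::size_t capacity, Reclaimer reclaim)
        : _capacity(capacity), _reclaim(std::move(reclaim)) {}

    // Owning core only. Under pressure, calls the reclaimer (for example,
    // to evict cache) before deciding whether the request fits.
    bool allocate(std::size_t n) {
        drain_remote_frees();
        if (_used + n > _capacity && _reclaim) {
            std::size_t released = _reclaim(_used + n - _capacity);
            _used -= std::min(released, _used);
        }
        if (_used + n > _capacity) {
            return false;
        }
        _used += n;
        return true;
    }

    // Owning core only.
    void free_local(std::size_t n) {
        assert(n <= _used);
        _used -= n;
    }

    // On behalf of another core: record a message, do not touch the pool.
    void post_remote_free(std::size_t n) { _inbox.push_back(n); }

    // Owning core applies the queued frees at a time of its choosing.
    void drain_remote_frees() {
        for (std::size_t n : _inbox) {
            free_local(n);
        }
        _inbox.clear();
    }

    std::size_t used() const { return _used; }

private:
    std::size_t _capacity;
    std::size_t _used = 0;
    Reclaimer _reclaim;
    std::vector<std::size_t> _inbox;
};
```

Because only the owning core ever mutates `_used`, the pool itself needs no locks; the only shared structure is the inbox.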
  20. Cassandra cache
    [Diagram labels: Linux page cache, SSTables]
    • 4k granularity
    • Thread-safe
    • Synchronous APIs
    • General-purpose
    • Lack of control
    • ...on the other hand
      ◦ Exists
      ◦ Hundreds of man-years
      ◦ Handling lots of edge cases
  21. Cassandra cache
    [Diagram labels: Linux page cache, SSTables]
    • Page faults
    [Diagram: sequence across App thread, Kernel and SSD; labels: Page fault, Suspend thread, Initiate I/O, Context switch, I/O completes, Context switch, Interrupt, Map page, Resume thread]
  22. Cassandra cache
    [Diagram labels: Key cache, Row cache, On-heap / Off-heap, Linux page cache, SSTables]
    • Complex tuning
  23. Yet another allocator (Problems with malloc/free)
    • Memory gets fragmented over time
      ◦ If the workload changes sizes of allocated objects
      ◦ Allocating a large contiguous block requires evicting most of cache
  24. Log-structured memory allocation
    • Bump-pointer allocation to current segment
    • Frees leave holes in segments
    • Compaction will try to solve this
  25. Compacting LSA
    • Teach allocator how to move objects around
      ◦ Updating references
    • Compact! (in place of garbage collection)
      ◦ Starting with the most sparse segments
      ◦ Lock to pin objects
    • Used mostly for the cache
      ◦ Large majority of memory allocated
      ◦ Small subset of allocation sites
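The last two slides can be condensed into a toy allocator: objects are appended to the current segment with a bump pointer, frees leave holes, and compaction moves the survivors of the sparsest full segment elsewhere while updating references. Everything here is an assumption for illustration: the class name, the handle table used to update references, the tiny fixed segment size, and the use of `int` payloads. Pinning objects with a lock is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy log-structured allocator storing int payloads.
class LogStructuredAllocator {
public:
    using Handle = int;
    static constexpr std::size_t kSlotsPerSegment = 4;  // tiny, for illustration

    // Bump-pointer allocation into the current segment.
    Handle allocate(int value) {
        Handle h = _next_handle++;
        place(h, value);
        return h;
    }

    // Freeing only marks the slot dead, leaving a hole in its segment.
    void free_object(Handle h) {
        Location loc = _where.at(h);
        Segment& seg = _segments.at(loc.segment);
        seg.slots[loc.slot].live = false;
        --seg.live;
        _where.erase(h);
    }

    int get(Handle h) const {
        const Location& loc = _where.at(h);
        return _segments.at(loc.segment).slots[loc.slot].value;
    }

    // Compacts the sparsest full segment that has at least one hole.
    // Live objects are re-appended and their handles are repointed.
    // Returns false when no segment qualifies.
    bool compact_one() {
        int victim = -1;
        std::size_t fewest = kSlotsPerSegment;
        for (const auto& entry : _segments) {
            const Segment& seg = entry.second;
            bool full = seg.slots.size() == kSlotsPerSegment;
            if (full && seg.live < fewest) {
                victim = entry.first;
                fewest = seg.live;
            }
        }
        if (victim < 0) {
            return false;
        }
        Segment moved = std::move(_segments.at(victim));
        _segments.erase(victim);
        if (victim == _current) {
            _current = -1;
        }
        for (const Slot& s : moved.slots) {
            if (s.live) {
                place(s.owner, s.value);
            }
        }
        return true;
    }

    std::size_t segment_count() const { return _segments.size(); }

private:
    struct Slot { Handle owner; int value; bool live; };
    struct Segment { std::vector<Slot> slots; std::size_t live = 0; };
    struct Location { int segment; std::size_t slot; };

    void place(Handle h, int value) {
        if (_current < 0 || _segments.at(_current).slots.size() == kSlotsPerSegment) {
            _current = _next_segment++;
            _segments[_current];  // open a fresh, empty segment
        }
        Segment& seg = _segments.at(_current);
        seg.slots.push_back(Slot{h, value, true});
        ++seg.live;
        _where[h] = Location{_current, seg.slots.size() - 1};
    }

    std::map<int, Segment> _segments;
    std::map<Handle, Location> _where;  // the "references" compaction updates
    int _current = -1;
    int _next_segment = 0;
    Handle _next_handle = 0;
};
```

Callers hold handles rather than raw pointers, which is one way to let the allocator move objects; the slide's "updating references" may be realized differently in Scylla.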
  27. Workload Conditioning
    [Diagram labels: Memtable, Compaction, Query, Repair, Commitlog; Seastar Scheduler; SSD, WAN, CPU; Compaction Backlog Monitor; Adjust priority]
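The diagram shows a compaction backlog monitor feeding a priority adjustment back into the scheduler. One simple way to picture such a feedback loop is a proportional controller that maps backlog to scheduler shares. The sketch below is purely hypothetical: the struct name, the linear mapping, the share range and the backlog unit are assumptions, not a description of Scylla's controller.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical proportional controller: more backlog, more shares,
// clamped to a fixed range.
struct BacklogController {
    double min_shares;      // priority when there is no backlog
    double max_shares;      // priority once the backlog saturates
    double backlog_at_max;  // backlog (arbitrary unit) that earns max_shares

    double shares_for(double backlog) const {
        assert(backlog_at_max > 0.0 && min_shares <= max_shares);
        double fraction = std::min(1.0, std::max(0.0, backlog / backlog_at_max));
        return min_shares + fraction * (max_shares - min_shares);
    }
};
```

With a range of 100 to 1000 shares and saturation at a backlog of 10, a backlog of 5 maps to 550 shares, so compaction is given more of the machine only as it falls behind.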
  29. Conclusions
    • Careful system design and control of the software stack can maximize throughput
    • Without sacrificing latency
    • Without requiring complex end-user tuning
    • While having a lot of fun
  30. How to interact
    • Download: http://www.scylladb.com
    • Twitter: @ScyllaDB
    • Source: http://github.com/scylladb/scylla
    • Mailing lists: scylladb-user @ groups.google.com
    • Slack: ScyllaDB-Users
    • Blog: http://www.scylladb.com/blog
    • Join: http://www.scylladb.com/company/careers
    • Me: duarte@scylladb.com