Scylla: NoSQL at Ludicrous Speed

Scylla is a NoSQL database compatible with Apache Cassandra. It distinguishes itself by supporting millions of operations per second per node, with predictably low latency, on comparable hardware.
Achieving such speed requires a great deal of mechanical sympathy: ScyllaDB employs a fully asynchronous, shared-nothing programming model, relies on its own memory allocators, and meticulously schedules all its I/O requests.
In this talk we will go over some low-level details of the techniques involved and how they fully utilize the underlying hardware resources.

Duarte Nunes

June 05, 2018

Transcript

  1. Agenda
    o Introducing Scylla
    o System Architecture
    o Seastar
    o Resource Management
    o Autonomous Database
    o Closing
  2. Scylla 10,000-foot view
    + Clustered NoSQL database compatible with Apache Cassandra
    + ~10X performance on same hardware
    + Low latency, especially higher percentiles
    + Autonomous
    + Mechanically sympathetic C++17
  3. Cassandra Compatibility
    + CQL language and protocol
    + Legacy Thrift protocol
    + SSTable file format
    + Configuration file format
    + JMX management protocol
    + Management command line tool
    Sharing the ecosystem
    + Spark
    + Presto
    + JanusGraph
    + KairosDB
    + Kong
  4. Data Model
    Partition Key        | Clustering Key               | Values
    Vladimir Nabokov     | 1962, Pale Fire              | pages: 315, genre: ‘fiction’
    Vladimir Nabokov     | 1972, Transparent Things     | pages: 144, genre: ‘fiction’
    Haruki Murakami      | 1997, Wind-Up Bird Chronicle | pages: 607, genre: ‘fiction’, translator: ‘Jay Rubin’
    David Foster Wallace | 1996, Infinite Jest          | pages: 1,079, genre: ‘fiction’
    Fyodor Dostoevsky    | 1869, The Eternal Husband    | pages: 130, genre: ‘fiction’, translator: ‘Constance Garnett’
    Fyodor Dostoevsky    | 1869, The Idiot              | pages: 720, genre: ‘fiction’, translator: ‘Pevear and Volokhonsky’
    Fyodor Dostoevsky    | 1880, The Brothers Karamazov | pages: 825, genre: ‘fiction’, translator: ‘Pevear and Volokhonsky’
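One way to picture the data model on this slide is as a map of maps: the partition key selects a partition, and rows inside it stay sorted by clustering key. The sketch below is only an illustration under that assumption. The names `Table`, `Partition`, `Row`, `insert_row` and `titles_between` are hypothetical and are not Scylla or CQL APIs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not Scylla's actual classes.
using ClusteringKey = std::pair<int, std::string>;   // (year, title)
using Row = std::map<std::string, std::string>;      // column name -> value
using Partition = std::map<ClusteringKey, Row>;      // rows sorted by clustering key
using Table = std::map<std::string, Partition>;      // partition key (author) -> partition

void insert_row(Table& t, const std::string& author, int year,
                const std::string& title, Row values) {
    t[author][ClusteringKey{year, title}] = std::move(values);
}

// Titles of one author's books published in [from_year, to_year], returned in
// clustering order. Only a single partition is touched.
std::vector<std::string> titles_between(const Table& t, const std::string& author,
                                        int from_year, int to_year) {
    assert(from_year <= to_year);
    std::vector<std::string> out;
    auto p = t.find(author);
    if (p == t.end()) {
        return out;
    }
    // The empty string sorts first, so this finds the first row of from_year.
    auto it = p->second.lower_bound(ClusteringKey{from_year, std::string()});
    for (; it != p->second.end() && it->first.first <= to_year; ++it) {
        out.push_back(it->first.second);
    }
    return out;
}
```

A range query on the clustering key is cheap here for the same reason it is cheap in the table above: the rows of one partition are already stored in sorted order.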
  5. Consistency: Anti-entropy
    + Reconciliation
      + Request path
    + Read repair
      + Request path, probabilistic
    + Repair
      + Background process
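Reconciliation on the request path can be sketched as last-write-wins over the replies of several replicas, with read repair targeting the replicas that returned something older. This is a minimal sketch, not Scylla's code: `Cell`, `reconcile` and `stale_replicas` are hypothetical names, and breaking timestamp ties by comparing values is an assumption made here so that the result is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical cell: a value plus the timestamp of the write that produced it.
struct Cell {
    std::string value;
    std::int64_t timestamp;
};

// Last-write-wins: the reply with the highest timestamp is the winner.
// Assumption: ties are broken by value so every node picks the same winner.
Cell reconcile(const std::vector<Cell>& replies) {
    assert(!replies.empty());
    Cell winner = replies.front();
    for (const Cell& c : replies) {
        if (c.timestamp > winner.timestamp ||
            (c.timestamp == winner.timestamp && c.value > winner.value)) {
            winner = c;
        }
    }
    return winner;
}

// Read repair: indices of the replicas whose reply differs from the winner
// and that should therefore be sent the winning cell.
std::vector<std::size_t> stale_replicas(const std::vector<Cell>& replies,
                                        const Cell& winner) {
    std::vector<std::size_t> stale;
    for (std::size_t i = 0; i < replies.size(); ++i) {
        if (replies[i].timestamp != winner.timestamp ||
            replies[i].value != winner.value) {
            stale.push_back(i);
        }
    }
    return stale;
}
```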
  6. Eventual consistency
    + High availability
      + Masterless system
    + Latency
      + Guaranteed to get the lowest the system can provide at any time
  8. Storage: Log-Structured Merge Tree
    [Diagram: a timeline of SSTable 1 through SSTable 5, plus SSTable 1+2+3 produced from the first three. Labels: Foreground Job, Background Job.]
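The step in which SSTables 1, 2 and 3 become SSTable 1+2+3 can be sketched as a merge in which the newest value for each key wins. This is a simplified in-memory model, assuming an SSTable can be stood in for by a sorted map. `SSTable`, `compact` and `read` are hypothetical names, and details such as tombstones and timestamps are left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an SSTable: an immutable, sorted key -> value run.
using SSTable = std::map<std::string, std::string>;

// Merge several SSTables into one. Inputs are ordered oldest to newest, so a
// later table's value overwrites an earlier one for the same key.
SSTable compact(const std::vector<SSTable>& oldest_to_newest) {
    assert(!oldest_to_newest.empty());
    SSTable merged;
    for (const SSTable& table : oldest_to_newest) {
        for (const auto& entry : table) {
            merged[entry.first] = entry.second;
        }
    }
    return merged;
}

// Read path: consult the newest table first and stop at the first hit.
// Returns nullptr when no table holds the key.
const std::string* read(const std::vector<SSTable>& oldest_to_newest,
                        const std::string& key) {
    for (auto table = oldest_to_newest.rbegin();
         table != oldest_to_newest.rend(); ++table) {
        auto hit = table->find(key);
        if (hit != table->end()) {
            return &hit->second;
        }
    }
    return nullptr;
}
```

Before compaction a read may have to look into every table. Afterwards the same answer comes from a single one, which is why the merge is worth doing in the background.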
  9. Implementation Goals
    + Efficiency
      + Make the most out of every cycle
    + Utilization
      + Squeeze every cycle from the machine
    + Control
      + Spend the cycles on what we want, when we want
  15. Seastar: C++ framework for high-performance servers
    + Thread-per-core design
      + No blocking. Ever.
      + A “shard” in seastar lingo.
    + Asynchronous
      + Networking
      + File I/O
      + Multicore
    + Future/promise based APIs
    + Optional user-mode TCP/IP stack included
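The thread-per-core idea can be illustrated by giving every shard its own private slice of the data and routing each key to exactly one owner. The sketch below is single-threaded and uses only the standard library. `ShardedStore` and its methods are hypothetical names, and hashing the key modulo the shard count is an assumption made for the example, not a description of how Scylla assigns data to shards.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: each shard owns a disjoint slice of the data, so the
// core running it would never need a lock. Shards are plain vector entries.
class ShardedStore {
public:
    explicit ShardedStore(std::size_t shard_count) : _shards(shard_count) {
        assert(shard_count > 0);
    }

    // Every key maps to exactly one owning shard.
    std::size_t shard_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % _shards.size();
    }

    void put(const std::string& key, std::string value) {
        _shards[shard_of(key)][key] = std::move(value);
    }

    // Returns nullptr when the owning shard does not hold the key.
    const std::string* get(const std::string& key) const {
        const auto& shard = _shards[shard_of(key)];
        auto it = shard.find(key);
        return it == shard.end() ? nullptr : &it->second;
    }

    std::size_t shard_size(std::size_t shard) const {
        return _shards.at(shard).size();
    }

private:
    std::vector<std::unordered_map<std::string, std::string>> _shards;
};
```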
  16. Seastar task scheduler
    Traditional stack
    [Diagram: threads, each with its own stack, multiplexed onto CPUs by a scheduler.]
    + Thread is a function pointer
    + Stack is a byte array from 64k to megabytes
    + Context switch cost is high. Large stacks pollute the caches
    Seastar stack
    [Diagram: promises and tasks, with one scheduler per CPU.]
    + Promise is a pointer to eventually computed value
    + Task is a pointer to a lambda function
    + No sharing, millions of parallel events
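The "task is a pointer to a lambda function" idea can be sketched as a per-core queue of small closures, each run to completion. This is a toy model under stated assumptions: `Reactor`, `schedule` and `run` are hypothetical names, there is a single queue and no I/O polling, and the real Seastar scheduler is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical single-core reactor: a queue of tasks, each run to completion.
class Reactor {
public:
    // Queue a task. A task may itself schedule follow-up work (a continuation).
    void schedule(std::function<void()> task) {
        assert(task);
        _tasks.push_back(std::move(task));
    }

    // Run until no work is left. Returns the number of tasks executed.
    std::size_t run() {
        std::size_t executed = 0;
        while (!_tasks.empty()) {
            std::function<void()> task = std::move(_tasks.front());
            _tasks.pop_front();
            task();  // a task never blocks; to wait, it schedules a continuation
            ++executed;
        }
        return executed;
    }

private:
    std::deque<std::function<void()>> _tasks;
};
```

There is no per-task stack to save or restore, which is the contrast with the thread-and-stack model that the slide draws.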
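  17. Futures

        future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
            // Decode the 4-byte request and pick the core that owns this id.
            int id = buf_to_id(buf);
            unsigned core = id % smp::count;
            // Run the lookup on the owning core, then write the reply from this one.
            return smp::submit_to(core, [id] {
                return lookup(id);
            }).then([this] (sstring result) {
                return _conn->write(result);
            });
        });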
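The mechanics behind `.then()` can be sketched with a tiny promise/future pair that stores a continuation instead of blocking. This is not Seastar's implementation: `SharedState`, `Future` and `Promise` are hypothetical single-threaded stand-ins, and unlike Seastar's `then()` this one does not return a new future, so chaining is left out.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical state shared by a promise and its future.
template <typename T>
struct SharedState {
    bool fulfilled = false;
    std::unique_ptr<T> value;              // result waiting for a consumer
    std::function<void(T)> continuation;   // consumer waiting for a result
};

template <typename T>
class Future {
public:
    explicit Future(std::shared_ptr<SharedState<T>> s) : _state(std::move(s)) {}

    // Attach the code that consumes the value. Nothing blocks: if the value is
    // not there yet, the continuation is stored and run later by set_value().
    void then(std::function<void(T)> k) {
        assert(k && !_state->continuation);
        if (_state->value) {
            k(std::move(*_state->value));
            _state->value.reset();
        } else {
            _state->continuation = std::move(k);
        }
    }

private:
    std::shared_ptr<SharedState<T>> _state;
};

template <typename T>
class Promise {
public:
    Promise() : _state(std::make_shared<SharedState<T>>()) {}

    Future<T> get_future() { return Future<T>(_state); }

    // Fulfil the promise, running the continuation if one is already attached.
    void set_value(T v) {
        assert(!_state->fulfilled);
        _state->fulfilled = true;
        if (_state->continuation) {
            _state->continuation(std::move(v));
        } else {
            _state->value = std::make_unique<T>(std::move(v));
        }
    }

private:
    std::shared_ptr<SharedState<T>> _state;
};
```

Whichever side arrives second does the work: `then()` runs the continuation at once if the value is ready, and `set_value()` runs it if the consumer was registered first.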
  24. Memory allocator
    + Not thread-safe!
      + Each core gets a private memory pool
    + Allocation back pressure
      + Allocator calls a callback when low on memory
      + Scylla evicts cache in response
    + Inter-core free() through message passing
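Allocation back pressure can be sketched as a per-core pool that calls a reclaim callback when a request does not fit. This is a byte-counting model only, under stated assumptions: `CorePool`, `allocate`, `release` and the callback signature are hypothetical and do not reflect Seastar's allocator interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical per-core pool. It tracks bytes only and takes no locks,
// because exactly one core is assumed to touch it.
class CorePool {
public:
    // reclaim(bytes_needed) asks the owner of reclaimable memory (for example
    // a cache) to evict, and returns how many bytes it actually released.
    CorePool(std::size_t capacity, std::function<std::size_t(std::size_t)> reclaim)
        : _free(capacity), _reclaim(std::move(reclaim)) {
        assert(_reclaim);
    }

    // Returns false if the request cannot be met even after reclaiming.
    bool allocate(std::size_t bytes) {
        if (bytes > _free) {
            _free += _reclaim(bytes - _free);  // back pressure: evict to make room
        }
        if (bytes > _free) {
            return false;
        }
        _free -= bytes;
        return true;
    }

    void release(std::size_t bytes) { _free += bytes; }

    std::size_t free_bytes() const { return _free; }

private:
    std::size_t _free;
    std::function<std::size_t(std::size_t)> _reclaim;
};
```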
  26. Linux page cache
    + 4k granularity
    + Thread-safe
    + Synchronous APIs
    [Diagram: Linux page cache in front of SSTables. Sequence across App thread, Kernel and SSD: Page fault, Suspend thread, Initiate I/O, Context switch, I/O completes, Interrupt, Context switch, Map page, Resume thread.]
  30. Linux page cache
    + 4k granularity
    + Thread-safe
    + Synchronous APIs
    + General-purpose
    + Lack of control
    + ...on the other hand
      + Exists
      + Hundreds of man-years
      + Handles lots of edge cases
  32. Unified cache
    Cassandra
    + Linux page cache
    + SSTables
    + Key cache
    + Row cache
    + On-heap / Off-heap
    + Complex Tuning
    Scylla
    + Unified cache
    + SSTables
  33. Yet another allocator: Issues with malloc/free
    + Memory gets fragmented over time
      + If the workload changes sizes of allocated objects
    + Allocating a large contiguous block requires evicting most of cache
  34. Log-structured memory allocation
    + Bump-pointer allocation to current segment
    + Frees leave holes in segments
    + Compaction will try to solve this
  35. Log-structured memory allocation
    + Teach allocator how to move objects around
      + Updating back-pointers
    + Garbage collect / Compact!
      + Starting with the most sparse segments
      + Lock to pin objects
    + Used mostly for the cache and memtables
      + Large majority of memory allocated
      + Small subset of allocation sites
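Bump-pointer allocation, holes left by frees, and compaction of the sparsest segment can be sketched with a small bookkeeping model. No real memory is handed out here. `LogAllocator`, `Location` and `location_of` are hypothetical names, object ids stand in for the back-pointers that a real allocator would update, and pinning is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: Location plays the role of an object's address.
struct Location { std::size_t segment; std::size_t offset; };

class LogAllocator {
public:
    explicit LogAllocator(std::size_t segment_size) : _segment_size(segment_size) {
        assert(segment_size > 0);
        _segments.push_back(Segment{});
    }
    // Bump-pointer allocation into the active segment. The returned id stays
    // valid even if compaction later moves the object.
    std::size_t alloc(std::size_t size) {
        std::size_t id = _objects.size();
        _objects.push_back(Object{size, true, Location{0, 0}});
        place(id);
        return id;
    }
    // Freeing only leaves a hole; the bytes come back through compaction.
    void free(std::size_t id) {
        Object& o = _objects[id];
        assert(o.live);
        o.live = false;
        _segments[o.where.segment].live_bytes -= o.size;
    }
    // Evacuate the sparsest non-active segment, then make it reusable.
    // Returns false when there is nothing to compact.
    bool compact() {
        std::size_t victim = _segments.size();
        for (std::size_t i = 0; i < _segments.size(); ++i) {
            if (i == _active || _segments[i].used == 0) continue;
            if (victim == _segments.size() ||
                _segments[i].live_bytes < _segments[victim].live_bytes) {
                victim = i;
            }
        }
        if (victim == _segments.size()) return false;
        std::vector<std::size_t> ids = std::move(_segments[victim].ids);
        for (std::size_t id : ids) {
            if (_objects[id].live) place(id);  // move object, update its location
        }
        _segments[victim] = Segment{};
        _free_segments.push_back(victim);
        return true;
    }
    Location location_of(std::size_t id) const { return _objects[id].where; }
    std::size_t segments_in_use() const {
        std::size_t n = 0;
        for (const Segment& s : _segments) if (s.used > 0) ++n;
        return n;
    }

private:
    struct Object { std::size_t size; bool live; Location where; };
    struct Segment {
        std::size_t used = 0;        // bump pointer: next free offset
        std::size_t live_bytes = 0;  // bytes still referenced
        std::vector<std::size_t> ids;
    };
    void place(std::size_t id) {
        Object& o = _objects[id];
        assert(o.size > 0 && o.size <= _segment_size);
        if (_segments[_active].used + o.size > _segment_size) {
            if (!_free_segments.empty()) {
                _active = _free_segments.back();
                _free_segments.pop_back();
            } else {
                _segments.push_back(Segment{});
                _active = _segments.size() - 1;
            }
        }
        Segment& seg = _segments[_active];
        o.where = Location{_active, seg.used};
        seg.ids.push_back(id);
        seg.used += o.size;
        seg.live_bytes += o.size;
    }
    std::size_t _segment_size;
    std::size_t _active = 0;
    std::vector<Segment> _segments;
    std::vector<std::size_t> _free_segments;
    std::vector<Object> _objects;
};
```

Because callers hold ids rather than raw addresses, moving an object only requires updating one `Location`. That is the role the slide assigns to back-pointers.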
  39. Autonomous: Control theory in practice
    [Diagram: the Seastar Scheduler with Memtable, Compaction, Query, Repair and Commitlog on one side and SSD, WAN and CPU on the other. A Compaction Backlog Controller and a Memory Controller each adjust priority.]
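A controller that adjusts priority from a measured backlog can be sketched as a simple proportional mapping from backlog to scheduler shares. This illustrates the general control-theory idea only: `BacklogController`, its fields and the linear, saturating response are assumptions made for the example, not the controller Scylla ships.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical proportional controller: the larger the compaction backlog,
// the more scheduler shares compaction receives, up to a ceiling.
struct BacklogController {
    double min_shares;      // shares granted when there is no backlog
    double max_shares;      // ceiling, reached at backlog_at_max
    double backlog_at_max;  // backlog at which the output saturates

    double shares_for(double backlog) const {
        assert(backlog >= 0 && backlog_at_max > 0 && min_shares <= max_shares);
        double fraction = std::min(backlog / backlog_at_max, 1.0);
        return min_shares + fraction * (max_shares - min_shares);
    }
};
```

Feeding the measured backlog through such a function on every tick lets the priority follow the workload without an operator choosing a fixed value.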
  40. Conclusions
    + Careful system design and control of the software stack can maximize throughput
      + Without sacrificing latency
      + Without requiring complex end-user tuning
      + While having a lot of fun
  41. How to interact: A potpourri of links
    + Download: http://www.scylladb.com
    + Twitter: @ScyllaDB
    + Source: http://github.com/scylladb/scylla
    + Mailing lists: [email protected]
    + Slack: ScyllaDB-Users
    + Blog: http://www.scylladb.com/blog
    + Join: http://www.scylladb.com/company/careers
  42. Thank You!
    United States: 1900 Embarcadero Road, Palo Alto, CA 94303
    Israel: 11 Galgalei Haplada, Herzelia, Israel
    www.scylladb.com
    @scylladb