Intro to Riak

A presentation looking into the core concepts of Riak.

Joel Jacobson

July 05, 2013

Transcript

  1. PROBLEMS? • Concurrency and latency at scale • Data consistency

    • Uptime/failover • Multi-tenancy • SLAs
  2. WHAT IS RIAK? • Key-Value store + extras • Distributed

    and horizontally scalable • Fault-tolerant • Highly available • Built for the web
  3. INSPIRED BY AMAZON DYNAMO • White paper released to describe

    a database system to be used for their shopping cart • Masterless, peer-coordinated replication • Dynamo-inspired data stores: Riak, Cassandra, Voldemort, etc. • Consistent hashing - no sharding :-) • Eventually consistent
  4. RIAK KEY-VALUE STORE • Simple operations - GET, PUT, DELETE

    • Value is opaque, with metadata • Extras, e.g. • Secondary Indexes (2i) • MapReduce • Full-text search
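
For example, a minimal Python sketch of those basic operations against Riak's HTTP interface, using the requests library; the host, port 8098, and the bucket/key names are assumptions:

    import requests                               # third-party HTTP client

    base = "http://127.0.0.1:8098/buckets/users/keys/joel"   # assumed host, bucket and key

    # PUT: store an opaque value together with its content type (metadata)
    requests.put(base, data='{"name": "Joel"}',
                 headers={"Content-Type": "application/json"})

    # GET: fetch the value back
    resp = requests.get(base)
    print(resp.status_code, resp.text)

    # DELETE: remove the key
    requests.delete(base)
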
  5. HORIZONTALLY SCALABLE • Near linear scalability • Query load and

    data are spread evenly • Add more nodes and get more: • ops/second • storage capacity • compute power (for Map/Reduce)
  6. FAULT TOLERANT • All nodes participate equally - no single

    point of failure (SPOF) • All data is replicated • Clusters self-heal - Handoff, Active Anti-Entropy • Cluster transparently survives... • node failure • network partitions • Built on Erlang/OTP (designed for fault tolerance)
  7. HIGHLY AVAILABLE • Any node can serve client requests •

    Fallbacks are used when nodes are down • Always accepts read and write requests • Per-request quorums
  8. QUORUMS - N/R/W • Tunable down to bucket level •

    n_val = 3 by default • w / r = 2 by default • w = 1 - Quicker response time, reads could be inconsistent in the short term • w = all - Slower response, increased data consistency
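
A sketch of how these quorum knobs might be exercised over the HTTP interface: bucket properties such as n_val are set through the bucket's props resource, and r/w can be overridden per request as query parameters (the bucket name, host, and values below are assumptions):

    import requests

    # Bucket-level tuning: n_val (replicas), default r and w quorums
    requests.put("http://127.0.0.1:8098/buckets/carts/props",
                 json={"props": {"n_val": 3, "r": 2, "w": 2}})

    key_url = "http://127.0.0.1:8098/buckets/carts/keys/cart-42"

    # Per-request overrides: w=1 favours latency, r=all favours consistency
    requests.put(key_url, params={"w": 1}, data="milk,eggs",
                 headers={"Content-Type": "text/plain"})
    print(requests.get(key_url, params={"r": "all"}).text)
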
  9. CAP THEOREM • C = Consistency • A = Availability

    • P = Partition Tolerance • The CAP theorem states that a distributed shared data system can support at most 2 of these 3 properties • (Diagram: three DB nodes and two clients separated by a network/data partition)
  10. REPLICATION • Replicated to 3 nodes by default (n_val = 3,

    which is configurable)
  11. DISASTER SCENARIO • Node fails • Request goes to fallback

    • Node comes back • Handoff - data returned to recovered node • Normal operations resume automatically
  12. DISASTER SCENARIO • (The same sequence as the previous slide, illustrated on the ring: hash(“user_id”) locates the data, which is served by a fallback node until handoff returns it)
  13. ACTIVE ANTI-ENTROPY • Automatically repair inconsistencies in data • Active

    Anti-Entropy, new in Riak 1.3.0, uses Merkle trees to compare data in partitions and periodically ensure consistency • Active Anti-Entropy runs as a background process • Can also be configured as a manual process
  14. CONFLICT RESOLUTION • Network partitions and concurrent actors modifying the

    same data cause data divergence • Riak provides two solutions to manage this, configurable at the bucket level: • Last Write Wins - the most recent write overwrites, an approach suited to some use cases • Vector Clocks - Retain “sibling” copies of data for merging
  15. VECTOR CLOCKS • Every node has an ID • Send

    last-seen vector clock in every “put” request • Can be viewed as a ‘commit history’, like Git • Lets you decide how to resolve conflicts
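
To make "send the last-seen vector clock on every put" concrete, a sketch of a read-modify-write over the HTTP interface, where the vector clock travels in the X-Riak-Vclock header; the bucket, key, and host are assumptions:

    import requests

    url = "http://127.0.0.1:8098/buckets/profiles/keys/joel"   # assumed bucket and key

    # Read: the causal context is returned in the X-Riak-Vclock header
    resp = requests.get(url)
    vclock = resp.headers.get("X-Riak-Vclock")

    # Write: sending the vector clock back marks this update as a descendant of what
    # was read, rather than a concurrent (sibling-creating) write
    headers = {"Content-Type": "application/json"}
    if vclock:                                  # absent on a brand-new key
        headers["X-Riak-Vclock"] = vclock
    requests.put(url, data='{"city": "London"}', headers=headers)
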
  16. SIBLING CREATION • Siblings can be created by: • Simultaneous writes (based on the same object version) • Network partitions • Writes to an existing key without submitting a vector clock • (Diagram: two writes diverge into vector clocks [{a,3}] and [{a,2},{b,1}], so both object versions are kept as siblings)
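
A sketch of detecting and resolving siblings over the HTTP interface, assuming the bucket has allow_mult enabled: a conflicted key answers 300 Multiple Choices with a list of sibling vtags, and writing a merged value back with the vector clock resolves them (the union-style merge and the names below are assumptions):

    import requests

    url = "http://127.0.0.1:8098/buckets/carts/keys/cart-42"   # assumed bucket and key

    resp = requests.get(url)
    if resp.status_code == 300:                 # siblings present
        vclock = resp.headers["X-Riak-Vclock"]
        # The body lists sibling vtags, one per line, after a "Siblings:" header line
        vtags = [t for t in resp.text.splitlines()[1:] if t]
        # Fetch each sibling and merge them; a simple union of cart items as an example
        values = {requests.get(url, params={"vtag": t}).text for t in vtags}
        merged = ",".join(sorted(values))
        # Writing the merged value back with the vector clock collapses the siblings
        requests.put(url, data=merged,
                     headers={"Content-Type": "text/plain",
                              "X-Riak-Vclock": vclock})
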
  17. BITCASK • A fast, append-only key-value store • In memory

    key lookup table (key_dir), with data on disk • Closed files are immutable • Merging cleans up old data • Developed by Basho Technologies • Suitable for bounded data, e.g. reference data
  18. LEVELDB • Key-Value storage developed by Google • Append-only for

    very large data sets • Multiple levels of SSTable-like data structures • Allows for more advanced querying (2i) • Includes compression (Snappy algorithm) • Suitable for unbounded data or advanced querying
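
Since 2i relies on the LevelDB backend, here is a sketch of tagging an object with a secondary index at write time and querying it over the HTTP interface; the index name, bucket, and values are assumptions:

    import requests

    # Store a user and tag it with a binary secondary index via a request header
    requests.put("http://127.0.0.1:8098/buckets/users/keys/joel",
                 data='{"name": "Joel", "city": "london"}',
                 headers={"Content-Type": "application/json",
                          "x-riak-index-city_bin": "london"})

    # Query the index: returns the keys of all objects tagged with city=london
    resp = requests.get("http://127.0.0.1:8098/buckets/users/index/city_bin/london")
    print(resp.json())        # e.g. {"keys": ["joel", ...]}
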
  19. MEMORY • Data is never persisted to disk • Typically

    used for “test” databases (unit tests... etc) • Definable memory limits per vnode • Configurable object expiry • Useful for highly transient data
  20. MULTI • Configure multiple storage engines for different types of

    data • Configure the “default” storage engine • Choose the storage engine on a per-bucket basis • No reason not to use it
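
A sketch of the per-bucket choice over the HTTP interface: with the multi backend enabled in the node's configuration, a bucket can be pointed at a named backend through its "backend" property; the backend name "memory_mult" is an assumption that would need to match a name defined in the node's config:

    import requests

    # Point the "sessions" bucket at a named backend from the multi-backend config
    requests.put("http://127.0.0.1:8098/buckets/sessions/props",
                 json={"props": {"backend": "memory_mult"}})

    # Confirm the bucket's properties
    print(requests.get("http://127.0.0.1:8098/buckets/sessions/props").json())
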
  21. CLIENT APIS • Riak supports two main client types: •

    REST-based HTTP interface • Easy to use from the command line and simple scripts • Useful if using an intermediate caching layer, e.g. Varnish • Protocol Buffers • Optimized binary encoding standard developed by Google • More performant than the HTTP interface
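
A sketch of the Protocol Buffers path, assuming Basho's Python client library (the "riak" package) and its 2.x API; the package, port, and constructor arguments are assumptions to check against the client's documentation:

    import riak                                   # Basho's official Python client (assumed)

    # Connect over Protocol Buffers (default port 8087) rather than HTTP
    client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

    bucket = client.bucket('session')
    obj = bucket.new('user-42', data={'cart': ['item-1']})    # serialized as JSON by the client
    obj.store()

    print(bucket.get('user-42').data)
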
  22. CLIENT LIBRARIES • Client libraries supported by Basho: • Community

    supported languages and frameworks: • C/C++, Clojure, Common Lisp, Dart, Django, Go, Grails, Griffon, Groovy, Erlang, Haskell, Java, .NET, Node.js, OCaml, Perl, PHP, Play, Python, Racket, Ruby, Scala, Smalltalk
  23. ROVIO • Using Riak as the datastore for all back-end systems supporting

    Angry Birds • Game-state storage, ID/login, payments, push notifications, analytics, advertisements • 9 clusters in use with over 100 nodes • 263 million monthly active users
  24. NHS • Spine2 project - storing patient data (80 million+) •

    500 complex messages per second • 20,000 integrated endpoints • 0 data loss • 99.9% availability SLA
  25. VOXER • Push-to-talk application • Billions of requests daily

    • > 50 dedicated servers • Everything stored in Riak • https://github.com/mranney/node_riak
  26. MULTI DATACENTER REPLICATION (MDC) • Allows data to be replicated

    between clusters in different data centers. Can handle larger latencies. • Two synchronization modes that can be used together: real-time and full sync • Set up as uni-directional or bi-directional replication • Can be used for global load-balancing, business continuity and back-ups
  27. RIAK-CS • Built on top of Riak and supports MDC

    • S3-compatible object storage • Supports multi-tenancy • Per-tenant usage data and statistics on network I/O • Supports objects of arbitrary content type, up to 5TB • Often used to build private cloud storage
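
Because Riak CS speaks the S3 API, any S3 client can be pointed at it. A sketch using the boto 2 library, with the endpoint, port, credentials, and bucket name as placeholders:

    from boto.s3.connection import S3Connection, OrdinaryCallingFormat

    # Point a standard S3 client at a Riak CS endpoint instead of AWS
    conn = S3Connection(aws_access_key_id="ACCESS_KEY",
                        aws_secret_access_key="SECRET_KEY",
                        host="riak-cs.example.com", port=8080,   # assumed endpoint
                        is_secure=False,
                        calling_format=OrdinaryCallingFormat())  # path-style addressing

    bucket = conn.create_bucket("demo")
    key = bucket.new_key("hello.txt")
    key.set_contents_from_string("hello from riak cs")
    print(key.get_contents_as_string())
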