
Dynamo

Dimos Raptis
October 23, 2018


During the last decade, Internet companies have achieved unprecedented growth. As a result, their systems needed to cope with workloads they had never experienced before. Existing data storage technologies were not capable of supporting these workloads because of inherent limitations, such as single points of failure and vertical-only scaling.
In an effort to solve these problems, companies started re-evaluating the architecture of existing systems and building a new generation of systems focused on a more distributed, scalable and highly available architecture. Of course, this came with a whole new set of technical challenges that needed to be addressed.
In this talk, we will look at a set of core techniques Amazon used to build their key-value store, referred to as Dynamo in the related paper. We will explain what problems they had to address and how these techniques helped solve them.
Given that this was amongst the seminal papers in the space of distributed systems, we will also visit some examples of open-source systems that leveraged some of these techniques.


Transcript

  1. Problems
     • Writes could still scale only vertically
     • Read-friendly (but write-unfriendly) architecture
     • High availability
     • Reduced latency
     • Needless overhead for simple K-V operations
     • Far from auto-scaling
  2. Simplified architecture
     [Diagram: the key space is partitioned into the ranges A-F, G-K, L-Q and R-Z, each assigned to a different server; the application routes get("Bob") / put("Bob", 15) to the server owning the corresponding range.]
  3. Techniques
     • Partitioning → Consistent Hashing
     • High availability for writes → Vector clocks with conflict resolution on read
     • Handling failures (temporary) → Hinted handoff
     • Recovering from failures (permanent) → Anti-entropy with Merkle trees
     • Membership & failure detection → Gossip-based protocol
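Since vector clocks get no dedicated slide below, here is a minimal sketch of the idea in Python. The node names and helper functions are illustrative, not Dynamo's actual interface: each replica bumps its own counter on a write, and two versions conflict when neither clock descends from the other, in which case both versions are returned to the reader for conflict resolution.

    # Minimal vector clock sketch (illustrative names, not Dynamo's API).
    def increment(clock, node):
        """Return a copy of the clock with this node's counter bumped."""
        new_clock = dict(clock)
        new_clock[node] = new_clock.get(node, 0) + 1
        return new_clock

    def descends(a, b):
        """True if clock a has seen every event that clock b has."""
        return all(a.get(node, 0) >= count for node, count in b.items())

    def conflict(a, b):
        """Versions conflict when neither clock descends from the other."""
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "node_a")   # write handled by node_a
    v2 = increment(v1, "node_b")   # later write, causally after v1
    v3 = increment(v1, "node_c")   # concurrent write, also after v1
    print(conflict(v1, v2))        # False: v2 supersedes v1
    print(conflict(v2, v3))        # True: concurrent -> resolve on read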
  4. Partitioning - Consistent Hashing
     Naive approach: chosen server → Hash(k) mod N. Note how adding a single server (N=3 → N=4) remaps almost every key:

     Key      Hash          Chosen server (N=3)   Chosen server (N=4)
     Bob      1633428562    1                     2
     Alice    7594634739    0                     3
     Nick     5000799125    2                     1
     George   9787173343    1                     3

     Consistent hashing: each server sᵢ is assigned a location on a ring [0, L], and the chosen server is the next server on the ring from hash(k) mod L.
     [Ring diagram: S1 (90), S5 (140), S2 (180), S3 (270), S4 (360)]
     Example: k = "Bob", Hash(k) = 360018072, Hash(k) mod 360 = 72 → server S1
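A small sketch of that ring lookup, with the server positions taken from the slide's example; the bisect-based lookup is an illustrative implementation choice, not what Dynamo itself does.

    import bisect

    L = 360  # ring size from the slide
    ring = [(90, "s1"), (140, "s5"), (180, "s2"), (270, "s3"), (360, "s4")]
    ring.sort()
    positions = [pos for pos, _ in ring]

    def choose_server(key_hash):
        point = key_hash % L
        # Walk clockwise to the next server; wrap around past position L.
        idx = bisect.bisect_left(positions, point) % len(ring)
        return ring[idx][1]

    print(choose_server(360018072))  # 360018072 mod 360 = 72 -> "s1" (at 90)

Unlike the mod-N table above, adding or removing a server here only remaps the keys between that server and its ring neighbour.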
  5. Quorums
     [Ring diagram: S1 (90), S5 (140), S2 (180), S3 (270), S4 (360)]
     N: replication factor
     R: read quorum
     W: write quorum
     Consistency condition: R + W > N
     Example: N = 5, R = 3, W = 3
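Why R + W > N matters: any set of R replicas must then intersect any set of W replicas in at least one node, so a read always contacts at least one replica holding the latest acknowledged write. A quick sketch of the worst-case overlap:

    def min_overlap(n, r, w):
        """Smallest possible intersection of a read set and a write set."""
        # Worst case: the two sets are as disjoint as n replicas allow.
        return max(0, r + w - n)

    print(min_overlap(5, 3, 3))  # 1: the slide's N=5, R=3, W=3 always overlap
    print(min_overlap(5, 2, 3))  # 0: a read may miss the latest write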
  6. Handling failures
     • Temporary failures → Hinted handoff
     • Permanent failures → Sync using Merkle trees: O(log N) data transfer in the worst case, instead of O(N)
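A toy version of the Merkle-tree comparison, assuming for simplicity that both replicas hold the same keys in the same layout (the tree shape and function names are illustrative): each replica hashes its key range into a binary tree, and the diff recurses only into subtrees whose hashes differ, so replicas exchange hashes proportional to the tree depth rather than shipping every entry.

    import hashlib

    def h(text):
        return hashlib.sha256(text.encode()).hexdigest()

    def build_tree(items):
        """items: sorted (key, value) pairs -> (hash, leaf, left, right)."""
        if len(items) == 1:
            key, value = items[0]
            return (h(key + value), items[0], None, None)
        mid = len(items) // 2
        left, right = build_tree(items[:mid]), build_tree(items[mid:])
        return (h(left[0] + right[0]), None, left, right)

    def diff(a, b):
        """Yield the leaf pairs where the two replicas disagree."""
        if a[0] == b[0]:
            return                    # identical subtree: skip entirely
        if a[2] is None:
            yield a[1], b[1]          # divergent leaf: needs syncing
            return
        yield from diff(a[2], b[2])
        yield from diff(a[3], b[3])

    r1 = build_tree([("alice", "1"), ("bob", "2"), ("carol", "3"), ("dave", "4")])
    r2 = build_tree([("alice", "1"), ("bob", "9"), ("carol", "3"), ("dave", "4")])
    print(list(diff(r1, r2)))  # only the "bob" entry differs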
  7. Membership & Failure detection
     • Nodes ping random nodes periodically
     • At every epoch, each node transmits its membership list to b nodes
     • If a node is reported as unavailable for more than M epochs, it is announced dead
     • Given no additional changes, the protocol converges in O(log_b N) steps (here b = 2)
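A toy simulation of that dissemination step (the epoch loop and parameters are illustrative, not the real protocol): each informed node forwards a piece of membership information to b random peers per epoch, and the number of epochs until all N nodes are informed grows roughly as log_b N.

    import random

    def epochs_to_converge(n, b=2):
        informed = {0}                 # node 0 observes a membership change
        epochs = 0
        while len(informed) < n:
            for node in list(informed):
                # Each informed node gossips to b randomly chosen peers.
                informed.update(random.sample(range(n), b))
            epochs += 1
        return epochs

    print(epochs_to_converge(1000))  # typically a little over log2(1000) ~ 10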
  8. Dynamo in the wild

     Technique                               Apache Cassandra       Riak
     Consistent Hashing with virtual nodes   ✓                      ✓
     Vector clocks                           ✗ (last-write-wins)    ✓
     Hinted handoff                          ✓                      ✓
     Anti-entropy with Merkle trees          ✓ (manual repair)      ✓
     Gossip-based protocol                   ✓                      ✓