Summary of "Dynamo: Amazon’s Highly Available Key-value Store"

Introduction to “Designing Data-Intensive Applications” ~ Use Case: “Dynamo” by
Amazon ~ January 17th, 2019 Kenju Wagatsuma

“Data-Intensive Application”
- data-intensive vs compute-intensive
- the complexity of data is the bottleneck
- e.g. database, cache, search index, stream/batch processing

3 Key Factors when Designing Data-Intensive Applications
- Reliability: hardware faults, software faults, human error
- Scalability: high load, high performance, high traffic
- Maintainability: operations, error detection, self-healing

From “ACID” to “BASE”
- ACID
  - Atomicity
  - Consistency
  - Isolation
  - Durability
- BASE
  - Basically Available
    - "In most cases, the system stays usable without losing functionality"
  - Soft state
    - "The state of the system can change frequently, and that is tolerated"
  - Eventual consistency
    - "Temporary inconsistency is tolerated as long as the data becomes consistent in the end"

But, how?
- Replication
  - single-leader/multi-leader/leaderless
  - synchronous/asynchronous/semi-synchronous
- Partitioning
  - secondary indexes
  - rebalancing/request-routing
- Transactions
  - two-phase locking/predicate locks/index range locks
  - MVCC (multi-version concurrency control)
- Consensus
  - quorum/membership services

Use Case: “Dynamo” by Amazon

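“Dynamo: Amazon’s Highly Available KVS”
- Developed at Amazon as a distributed KVS that combines high availability and scalability
- Used, among other things, as the backend of the shopping cart feature
- AWS DynamoDB is an AWS service developed on the basis of Dynamo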

Data Distribution among Partitions https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html

Sort Key for Order Guarantee https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.Partitions.html

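Problems
Problem | Why?
Partitioning | We want to distribute data across multiple regions around the world
High Availability for Writes | Given the nature of a shopping cart, writes must always succeed (we do not want users to click the buy button over and over)
Handling temporary failures | Data should still be replicated properly even when a node goes down temporarily
Recovering from permanent failures | The system as a whole should stay available even when a node is lost for good
Membership & failure detection | We want to detect failures without a central server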

Features
Problem | Technique | Why?
Partitioning | Consistent Hashing | Provides scalability that allows gradual scale-in/scale-out
High Availability for Writes | Vector clocks with reconciliation during reads | Always accepts writes and ensures eventual consistency
Handling temporary failures | Sloppy Quorum & hinted handoff | Keeps the system as a whole functional even when some nodes are down
Recovering from permanent failures | Anti-entropy using Merkle trees | Replicates data in the background
Membership & failure detection | Gossip-based membership protocol | Detects node failures without a central server (P2P-like)

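Consistent Hashing
- A technique for moving data efficiently, both when a node dies and its data has to be relocated and when a scale-out adds nodes and data has to be moved onto them
- Besides distributed KVS, it is applied to load balancing and serves as the basis for other algorithms such as distributed hash tables (DHT)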

Consistent Hashing [Figure: a hash ring with nodeX, nodeZ, nodeY and nodeW; keys a through h are added to the ring one slide at a time]

When Node Crashes... [Figure: the ring with keys a through h, shown first with nodeX, nodeZ, nodeY and nodeW, then with nodeZ gone]

When Node Added... [Figure: nodeU joins the ring of nodeX, nodeY and nodeW; key c is drawn next to nodeU]

Virtual Node [Figure: the ring with nodeX, nodeY, nodeW and nodeU, plus a virtual node labeled VNodeW]
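The ring slides above can be approximated in a few lines of Python. This is a minimal sketch, not Dynamo's implementation: the `HashRing` class, the choice of MD5 and the default of 8 virtual nodes per physical node are all assumptions made for illustration. It shows the property the slides describe: when nodeZ crashes, only the keys that nodeZ owned move to another node.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, vnodes: int = 8):
        self.vnodes = vnodes      # virtual nodes per physical node (assumed value)
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> physical node name

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the first virtual node."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing()
for name in ["nodeX", "nodeY", "nodeZ", "nodeW"]:
    ring.add_node(name)

keys = list("abcdefgh")
before = {k: ring.lookup(k) for k in keys}
ring.remove_node("nodeZ")                      # nodeZ crashes
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print("keys that changed owner:", moved)
```

Adding a node works the same way in reverse: after `ring.add_node("nodeU")`, the only keys that change owner are the ones that now belong to nodeU.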

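Vector Clocks (a.k.a. Version Vectors)
- Dynamo is eventually consistent, so it keeps every change by versioning objects
- Vector clocks are the technique that uses the causality between object versions (the cause and effect of changes) to apply the changes of each version so that the object eventually reaches the state it should have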

Vector Clocks [Figure: versions held by nodeX and nodeY, with the clocks (α:1, β:0, γ:0), (α:2, β:0, γ:0), (α:0, β:1, γ:0), and finally (α:2, β:1, γ:0) shown twice]
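The clock values on the slide can be replayed with a small Python sketch. The helper names `descends`, `compare` and `merge` are made up for this example, and clocks are modeled as plain dictionaries, which is an assumption and not how Dynamo stores them. The sketch shows how two versions are recognized as concurrent and how their clocks combine once the versions are reconciled.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship of clock a relative to clock b."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"      # neither version caused the other: a conflict


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, used for the clock of a reconciled version."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


v1 = {"α": 2, "β": 0, "γ": 0}
v2 = {"α": 0, "β": 1, "γ": 0}

print(compare(v1, {"α": 1, "β": 0, "γ": 0}))   # v1 supersedes the older version
print(compare(v1, v2))                         # the two versions conflict
print(merge(v1, v2))                           # clock after reconciliation
```

The merged clock equals (α:2, β:1, γ:0), the final value in the figure. Note that `merge` only combines the clocks; deciding what the reconciled object should contain is a separate step.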

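Sloppy Quorum and hinted handoff
- Sloppy Quorum = a "sloppy" quorum
  - Used to increase the availability of writes
  - Even when the network is unreachable, the write goes to local storage and is treated as a "success". Once the network is back, the already written object is sent to the other nodes together with its hinted handoff
- hinted handoff = a handoff that carries a hint
  - Something like metadata that records which node the locally written object would have gone to if the network had been reachable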
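As a rough illustration of the bullets above, here is a toy Python model of a write that is accepted while the intended node is down and handed over later. The `Node` class, `sloppy_write` and `deliver_hints` are hypothetical names, and the model ignores quorum sizes and real networking entirely.

```python
class Node:
    """Toy replica: a key-value store plus the hints it holds for other nodes."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.store = {}    # key -> value this node is responsible for
        self.hints = []    # (intended node name, key, value) kept for others


def sloppy_write(key, value, preferred: Node, fallback: Node) -> bool:
    """Write to the preferred node, or park the write on a fallback with a hint."""
    if preferred.up:
        preferred.store[key] = value
    else:
        fallback.hints.append((preferred.name, key, value))
    return True            # the write counts as a success either way


def deliver_hints(holder: Node, nodes_by_name: dict) -> None:
    """Hand hinted writes over to their intended nodes once they are reachable."""
    remaining = []
    for intended, key, value in holder.hints:
        target = nodes_by_name[intended]
        if target.up:
            target.store[key] = value
        else:
            remaining.append((intended, key, value))
    holder.hints = remaining


node_x, node_y = Node("nodeX"), Node("nodeY")
cluster = {n.name: n for n in (node_x, node_y)}

node_x.up = False                                   # nodeX is temporarily unreachable
sloppy_write("cart:42", ["book"], preferred=node_x, fallback=node_y)
node_x.up = True                                    # nodeX comes back
deliver_hints(node_y, cluster)
print(node_x.store)
```

The hint is what lets nodeY know that the object is not its own and where it should go once nodeX is reachable again.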

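Merkle trees
- Also known as “Hash Tree”
- A technique for efficiently verifying whether one set of data is consistent with another
- Besides distributed systems, it is used for detecting data tampering in P2P and Blockchain, and for detecting diffs in Git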

Merkle trees https://en.wikipedia.org/wiki/Merkle_tree

Git trees as Merkle trees https://blog.sourced.tech/post/difftree/

Dynamo and Merkle trees
- “To detect the inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees”
- [Pros] Only hash values need to be checked, so verifying consistency between replicas is faster than comparing the whole data
- [Pros] Only hash values need to be transferred, so network transfer is efficient as well

How Dynamo uses Merkle trees for anti-entropy
1. Each node maintains a separate Merkle tree for each key range
2. Two nodes exchange the root of the Merkle tree (corresponding to the key ranges that they host in common)
3. Traverse the tree and find out whether there are any differences
4. If any differences are found, start replication
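The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `build` and `diff` are hypothetical helpers, SHA-256 is an arbitrary hash choice, and both replicas are assumed to hold the same keys in the same order for the shared key range, so only values can differ.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(items):
    """Return the tree as a list of levels: leaf hashes first, root level last."""
    level = [h(f"{key}={value}".encode("utf-8")) for key, value in items]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]   # pair an odd last hash with itself
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(a, b, level=None, idx=0):
    """Descend from the root; return the indexes of leaves whose hashes differ."""
    if level is None:
        level = len(a) - 1                # start at the root level
    if a[level][idx] == b[level][idx]:
        return []                         # identical subtree, nothing to compare
    if level == 0:
        return [idx]
    out = []
    for child in (2 * idx, 2 * idx + 1):
        if child < len(a[level - 1]):
            out += diff(a, b, level - 1, child)
    return out


replica_x = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")]
replica_y = [("a", "1"), ("b", "2"), ("c", "stale"), ("d", "4"), ("e", "5")]

tree_x, tree_y = build(replica_x), build(replica_y)
print("roots match:", tree_x[-1] == tree_y[-1])
print("leaves to re-replicate:", diff(tree_x, tree_y))
```

When the roots match, the comparison ends after a single hash; otherwise only the branches whose hashes differ are followed, which is why little data has to cross the network.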

How Dynamo uses Merkle trees for anti-entropy [Figure: nodeX holds hash(x) = α and nodeY holds hash(x’) = β, where β != α]

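Gossip-based membership protocol
- Every node knows which node a request should be sent to
- It is hard to implement, but no central node is needed and the client does not have to implement logic for deciding which node to send a request to
- According to “DDIA”, Cassandra / Riak use this approach as well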
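How membership knowledge can spread without a central server is easier to see in a toy simulation. The Python sketch below is an assumption-laden illustration, not Dynamo's protocol: the `Member` class, the heartbeat counters and the rule of merging views with one random peer per round are all invented for this example.

```python
import random


class Member:
    """Toy cluster member that keeps its own view of who is in the cluster."""

    def __init__(self, name: str):
        self.name = name
        self.view = {name: 0}    # node name -> highest heartbeat counter seen

    def tick(self) -> None:
        self.view[self.name] += 1

    def merge(self, other_view: dict) -> None:
        """Keep the newest heartbeat known for every node."""
        for node, beat in other_view.items():
            if beat > self.view.get(node, -1):
                self.view[node] = beat


def gossip_round(members, rng) -> None:
    """Each member bumps its heartbeat and swaps views with one random peer."""
    for member in members:
        member.tick()
        peer = rng.choice([m for m in members if m is not member])
        peer.merge(member.view)
        member.merge(peer.view)


rng = random.Random(0)           # seeded only to make the run repeatable
members = [Member(name) for name in ("nodeX", "nodeY", "nodeZ", "nodeW")]
for _ in range(5):
    gossip_round(members, rng)
for member in members:
    print(member.name, "knows", sorted(member.view))
```

No member ever consults a coordinator; each one learns about the others purely from the views its peers pass along.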

Gossip-based membership protocol “Designing Data-Intensive Applications”, p. 215
