Slide 1

Slide 1 text

etcd - mission-critical key-value store
Brandon Philips @BrandonPhilips | [email protected]

Slide 2

Slide 2 text

Uncoordinated Upgrades

Slide 3

Slide 3 text

Uncoordinated Upgrades
Unavailable

Slide 4

Slide 4 text

Motivation
CoreOS cluster reboot lock
- Decrement a semaphore key atomically
- Reboot and wait...
- After reboot increment the semaphore key
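The reboot lock above can be sketched with a compare-and-swap loop. This is a minimal in-memory illustration, not the CoreOS implementation: `Store`, `Acquire`, `Release`, and the key name `reboot-semaphore` are hypothetical names, and a real deployment would run the same loop against a replicated store.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a toy in-memory stand-in for a key-value store with
// compare-and-swap. It is an assumption for illustration, not the etcd API.
type Store struct {
	mu   sync.Mutex
	data map[string]int
}

func NewStore() *Store { return &Store{data: map[string]int{}} }

func (s *Store) Get(key string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func (s *Store) Put(key string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = v
}

// CompareAndSwap sets key to next only if it currently holds prev.
func (s *Store) CompareAndSwap(key string, prev, next int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key] != prev {
		return false
	}
	s.data[key] = next
	return true
}

var ErrNoSlots = errors.New("no reboot slots available")

// Acquire atomically decrements the semaphore, retrying if another
// machine changed the value between the read and the swap.
func Acquire(s *Store, key string) error {
	for {
		cur := s.Get(key)
		if cur <= 0 {
			return ErrNoSlots
		}
		if s.CompareAndSwap(key, cur, cur-1) {
			return nil
		}
	}
}

// Release atomically increments the semaphore once the reboot is done.
func Release(s *Store, key string) {
	for {
		cur := s.Get(key)
		if s.CompareAndSwap(key, cur, cur+1) {
			return
		}
	}
}

func main() {
	s := NewStore()
	s.Put("reboot-semaphore", 1) // allow one machine to reboot at a time

	fmt.Println("machine A:", Acquire(s, "reboot-semaphore"))
	fmt.Println("machine B:", Acquire(s, "reboot-semaphore"))
	Release(s, "reboot-semaphore") // machine A is back up
	fmt.Println("slots free:", s.Get("reboot-semaphore"))
}
```

A machine that gets `ErrNoSlots` would wait and retry rather than reboot, which is what keeps the number of simultaneously rebooting machines bounded by the semaphore's initial value.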

Slide 5

Slide 5 text

CoreOS updates coordination (semaphore value: 3)

Slide 6

Slide 6 text

CoreOS updates coordination (semaphore value: 3)

Slide 7

Slide 7 text

CoreOS updates coordination (semaphore value: 2)

Slide 8

Slide 8 text

CoreOS updates coordination (semaphore value: 0)

Slide 9

Slide 9 text

CoreOS updates coordination (semaphore value: 0)

Slide 10

Slide 10 text

CoreOS updates coordination (semaphore value: 0)

Slide 11

Slide 11 text

CoreOS updates coordination (semaphore value: 0)

Slide 12

Slide 12 text

CoreOS updates coordination (semaphore value: 0)

Slide 13

Slide 13 text

CoreOS updates coordination (semaphore value: 0)

Slide 14

Slide 14 text

CoreOS updates coordination

Slide 15

Slide 15 text

Store Application Configuration
Diagram labels: config

Slide 16

Slide 16 text

Store Application Configuration
Diagram labels: config, Start / Restart

Slide 17

Slide 17 text

Store Application Configuration
Diagram labels: config, Update

Slide 18

Slide 18 text

Store Application Configuration
Diagram labels: config, Unavailable

Slide 19

Slide 19 text

Requirements
Strong Consistency - mutual exclusion at all times, for locking
Highly Available - resilient to single points of failure & network partitions
Watchable - push configuration updates to applications

Slide 20

Slide 20 text

Requirements
CAP
- We want CP
- We want something like Paxos
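Paxos-style consensus protocols make progress only while a majority of members can talk to each other, which is why they land on the CP side of CAP: the minority side of a partition stops accepting writes instead of diverging. The arithmetic can be sketched as follows; `Quorum` and `FaultTolerance` are illustrative helper names, not part of any library.

```go
package main

import "fmt"

// Quorum returns the smallest majority of an n-member cluster.
func Quorum(n int) int { return n/2 + 1 }

// FaultTolerance returns how many members can fail while a
// majority is still reachable.
func FaultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, Quorum(n), FaultTolerance(n))
	}
}
```

Note that going from 3 to 4 members raises the quorum without tolerating any more failures, which is the usual argument for odd cluster sizes.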

Slide 21

Slide 21 text

Common problem
Google - “All” infrastructure relies on Paxos
Diagram labels: GFS, Bigtable, Spanner, CFS, Chubby, Paxos

Slide 22

Slide 22 text

Common problem
Amazon - Replicated log powers EC2
Microsoft - Boxwood powers storage infrastructure
Hadoop - ZooKeeper is the heart of the ecosystem

Slide 23

Slide 23 text

COMMON PROBLEM #GIFEE and Cloud Native Solution

Slide 24

Slide 24 text

10,000 Stars on GitHub
250 contributors
Google, Red Hat, EMC, Cisco, Huawei, Baidu, Alibaba...

Slide 25

Slide 25 text

THE HEART OF CLOUD NATIVE Kubernetes, Cloud Foundry Diego, Project Calico, many others

Slide 26

Slide 26 text

ETCD KEY VALUE STORE Fully Replicated, Highly Available, Consistent

Slide 27

Slide 27 text

Key-value Operations
PUT(foo, bar), GET(foo), DELETE(foo)
Watch(foo)
CAS(foo, bar, bar1)

Slide 28

Slide 28 text

DEMO play.etcd.io

Slide 29

Slide 29 text

etcd Operability
- Runtime Reconfiguration
- Point-in-time Backup
- Extensive Metrics

Slide 30

Slide 30 text

ETCD v3 Successor of etcd v2

Slide 31

Slide 31 text

ETCD v3 Better Performance

Slide 32

Slide 32 text

ETCD v3 More Efficient APIs

Slide 33

Slide 33 text

Multi-Version
Put(foo, bar)
Put(foo, bar1)
Put(foo, bar2)
Get(foo) -> bar2

Slide 34

Slide 34 text

Multi-Version
Put(foo, bar)
Put(foo, bar1)
Put(foo, bar2)
Get(foo, 1) -> bar
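The multi-version behaviour on these two slides can be modelled by keeping every write together with the revision that produced it. The sketch below is a toy model under simplifying assumptions: `MVStore`, `GetAt`, and a revision counter that starts at 1 on the first write are illustrative choices, not etcd's storage engine or its exact revision numbering.

```go
package main

import "fmt"

// version is one historical value of a key.
type version struct {
	rev   int64
	value string
}

// MVStore keeps every revision of every key (hypothetical toy type).
type MVStore struct {
	rev  int64
	hist map[string][]version
}

func NewMVStore() *MVStore { return &MVStore{hist: map[string][]version{}} }

// Put records a new version and returns the store-wide revision it created.
func (s *MVStore) Put(key, value string) int64 {
	s.rev++
	s.hist[key] = append(s.hist[key], version{rev: s.rev, value: value})
	return s.rev
}

// Get returns the latest value of key.
func (s *MVStore) Get(key string) (string, bool) {
	return s.GetAt(key, s.rev)
}

// GetAt returns the value key held as of revision rev, scanning the
// key's history from newest to oldest.
func (s *MVStore) GetAt(key string, rev int64) (string, bool) {
	vs := s.hist[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].rev <= rev {
			return vs[i].value, true
		}
	}
	return "", false
}

func main() {
	s := NewMVStore()
	s.Put("foo", "bar")
	s.Put("foo", "bar1")
	s.Put("foo", "bar2")

	latest, _ := s.Get("foo")
	first, _ := s.GetAt("foo", 1)
	fmt.Println(latest, first)
}
```

Because old versions are never overwritten, a reader can ask for a consistent view of the past while writers keep appending new revisions.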

Slide 35

Slide 35 text

Mini-Transactions
Tx.If(
    Compare(Value("foo"), ">", "bar"),
    Compare(Version("foo"), "=", 2),
    ...
).Then(
    Put("ok", "true")...
).Else(
    Put("ok", "false")...
).Commit()
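To make the If/Then/Else semantics concrete, here is a minimal in-memory sketch of a mini-transaction: all guards are evaluated and exactly one branch is applied while a single lock is held, so no other writer can slip in between the check and the write. Every name here (`Store`, `Txn`, `ValueGreater`, `VersionEquals`, `Put`) is hypothetical and only mirrors the shape of the slide's pseudocode; it is not the etcd client API.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is one key's current state in this toy store.
type entry struct {
	value   string
	version int64 // number of writes made to this key
}

// Store is a hypothetical in-memory store, not the etcd client.
type Store struct {
	mu   sync.Mutex
	data map[string]entry
}

func NewStore() *Store { return &Store{data: map[string]entry{}} }

// Cmp is a guard evaluated against the store; Op is a write applied to it.
type Cmp func(d map[string]entry) bool
type Op func(d map[string]entry)

func ValueGreater(key, v string) Cmp {
	return func(d map[string]entry) bool { return d[key].value > v }
}

func VersionEquals(key string, n int64) Cmp {
	return func(d map[string]entry) bool { return d[key].version == n }
}

func Put(key, value string) Op {
	return func(d map[string]entry) {
		d[key] = entry{value: value, version: d[key].version + 1}
	}
}

// Txn collects guards and the two alternative branches.
type Txn struct {
	s         *Store
	cmps      []Cmp
	then, els []Op
}

func (s *Store) Txn() *Txn { return &Txn{s: s} }

func (t *Txn) If(c ...Cmp) *Txn  { t.cmps = c; return t }
func (t *Txn) Then(o ...Op) *Txn { t.then = o; return t }
func (t *Txn) Else(o ...Op) *Txn { t.els = o; return t }

// Commit checks every guard and applies one branch atomically.
// It reports whether the Then branch ran.
func (t *Txn) Commit() bool {
	t.s.mu.Lock()
	defer t.s.mu.Unlock()
	ok := true
	for _, c := range t.cmps {
		if !c(t.s.data) {
			ok = false
			break
		}
	}
	ops := t.els
	if ok {
		ops = t.then
	}
	for _, op := range ops {
		op(t.s.data)
	}
	return ok
}

// Value reads the current value of key.
func (s *Store) Value(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key].value
}

func main() {
	s := NewStore()
	s.Txn().Then(Put("foo", "bar")).Commit()  // version 1
	s.Txn().Then(Put("foo", "bar1")).Commit() // version 2

	ok := s.Txn().
		If(ValueGreater("foo", "bar"), VersionEquals("foo", 2)).
		Then(Put("ok", "true")).
		Else(Put("ok", "false")).
		Commit()
	fmt.Println(ok, s.Value("ok"))
}
```

A compare-and-swap is the special case with one guard and one write, which is why mini-transactions can replace a dedicated CAS operation.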

Slide 36

Slide 36 text

Leases
l = CreateLease(15 * second)
Put(foo, bar, l)
l.KeepAlive()
l.Revoke()

Slide 37

Slide 37 text

Streaming Watch
w = Watch(foo)
for {
    r = w.Recv()
    print(r.Event) // PUT
    print(r.KV)    // foo,bar
}
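The streaming watch can be pictured as a channel of events per watched key: every write pushes an event to the watchers instead of making them poll. This is a single-goroutine toy with hypothetical names (`WatchStore`, `Event`); the fixed channel buffer is an assumption that keeps the sketch simple, and a real stream would need flow control and concurrency safety.

```go
package main

import "fmt"

// Event describes one change to a key.
type Event struct {
	Type  string // "PUT" or "DELETE"
	Key   string
	Value string
}

// WatchStore is a toy store that notifies watchers on every change.
type WatchStore struct {
	data     map[string]string
	watchers map[string][]chan Event
}

func NewWatchStore() *WatchStore {
	return &WatchStore{
		data:     map[string]string{},
		watchers: map[string][]chan Event{},
	}
}

// Watch returns a stream of future events for key.
// The buffer of 16 is arbitrary: in this sketch a write blocks
// if a watcher falls more than 16 events behind.
func (s *WatchStore) Watch(key string) <-chan Event {
	ch := make(chan Event, 16)
	s.watchers[key] = append(s.watchers[key], ch)
	return ch
}

func (s *WatchStore) notify(e Event) {
	for _, ch := range s.watchers[e.Key] {
		ch <- e
	}
}

func (s *WatchStore) Put(key, value string) {
	s.data[key] = value
	s.notify(Event{Type: "PUT", Key: key, Value: value})
}

func (s *WatchStore) Delete(key string) {
	delete(s.data, key)
	s.notify(Event{Type: "DELETE", Key: key})
}

// Close ends every watch stream so receivers can stop ranging.
func (s *WatchStore) Close() {
	for _, chans := range s.watchers {
		for _, ch := range chans {
			close(ch)
		}
	}
	s.watchers = map[string][]chan Event{}
}

func main() {
	s := NewWatchStore()
	w := s.Watch("foo")

	s.Put("foo", "bar")
	s.Put("foo", "bar1")
	s.Delete("foo")
	s.Close()

	for r := range w {
		fmt.Println(r.Type, r.Key, r.Value)
	}
}
```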

Slide 38

Slide 38 text

Synchronization LoC

Slide 39

Slide 39 text

ETCD v2 machine coordination -> O(10k)

Slide 40

Slide 40 text

ETCD v3 app/container coordination -> O(1M)

Slide 41

Slide 41 text

Performance 1K keys

Slide 42

Slide 42 text

Performance
etcd2 - 600K keys
Snapshot caused performance degradation

Slide 43

Slide 43 text

Performance
etcd2 - 600K keys
Snapshot triggered elections

Slide 44

Slide 44 text

ZooKeeper Performance
- Non-blocking full snapshot
- Efficient memory management

Slide 45

Slide 45 text

Performance ZooKeeper default

Slide 46

Slide 46 text

Performance
ZooKeeper default
Snapshot triggered election

Slide 47

Slide 47 text

Performance
ZooKeeper default
Snapshot

Slide 48

Slide 48 text

Performance
ZooKeeper snapshot disabled
GC

Slide 49

Slide 49 text

Reliable Performance
- Similar to ZooKeeper with snapshot disabled
- Incremental snapshot
- No Garbage Collection Pauses
- Off-heap storage

Slide 50

Slide 50 text

Performance etcd3 /ZooKeeper snapshot disabled

Slide 51

Slide 51 text

Performance etcd3 /ZooKeeper snapshot disabled

Slide 52

Slide 52 text

Memory
512MB data - 2M 256B keys
Chart values: 10GB, 2.4GB, 0.8GB

Slide 53

Slide 53 text

Reliability
99% at small scale is easy
- Failure is infrequent and human manageable
99% at large scale is not enough
- Not manageable by humans
99.99% at large scale
- Reliable systems at bottom layer

Slide 54

Slide 54 text

HOW DO WE ACHIEVE RELIABILITY
WAL, Snapshots, Testing

Slide 55

Slide 55 text

Write Ahead Log
Append only
- Simple is good
Rolling CRC protected
- Storage & OSes can be unreliable

Slide 56

Slide 56 text

Snapshots
Torturing Databases for Fun and Profit (OSDI 2014)
- The simpler database is safer
- LMDB was the winner
BoltDB, an append-only B+tree
- A simpler LMDB written in Go

Slide 57

Slide 57 text

Testing Cluster Failures
Inject failures into running clusters
White box runtime checking
- Hash state of the system
- Progress of the system
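The "hash state of the system" check can be sketched as: compute a deterministic digest of each member's key-value state and require that all members agree once they have applied the same log. The function names and the use of SHA-256 over sorted, length-prefixed pairs are assumptions for illustration, not a description of etcd's actual hash check.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// StateHash returns a deterministic digest of one member's state.
func StateHash(kv map[string]string) string {
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort first

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each field so ("ab","c") and ("a","bc") differ.
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(kv[k]), kv[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Consistent reports whether every member hashes to the same state.
func Consistent(members []map[string]string) bool {
	for i := 1; i < len(members); i++ {
		if StateHash(members[i]) != StateHash(members[0]) {
			return false
		}
	}
	return true
}

func main() {
	a := map[string]string{"foo": "bar", "ok": "true"}
	b := map[string]string{"ok": "true", "foo": "bar"}
	c := map[string]string{"foo": "bar", "ok": "false"}

	fmt.Println(Consistent([]map[string]string{a, b}))
	fmt.Println(Consistent([]map[string]string{a, b, c}))
}
```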

Slide 58

Slide 58 text

Testing Cluster Health with Failures
- Issue lock operations across cluster
- Ensure the correctness of client library

Slide 59

Slide 59 text

TESTING CLUSTER dash.etcd.io

Slide 60

Slide 60 text

etcd/raft Reliability
Designed for testability and flexibility
Used by large scale db systems and others
- CockroachDB, TiKV, Dgraph

Slide 61

Slide 61 text

etcd vs others Do one thing

Slide 62

Slide 62 text

etcd vs others Only do the One Thing

Slide 63

Slide 63 text

etcd vs others Do it Really Well

Slide 64

Slide 64 text

etcd Reliability Do it Really Well

Slide 65

Slide 65 text

ETCD v3.0 BETA Efficient and Scalable

Slide 66

Slide 66 text

BETA AVAILABLE TODAY github.com/coreos/etcd

Slide 67

Slide 67 text

FUTURE WORK Proxy, Caching, Watch Coalescing, Secondary Index

Slide 68

Slide 68 text

GET INVOLVED github.com/coreos/etcd

Slide 69

Slide 69 text

Thank you!
etcd - mission-critical key-value store
Brandon Philips @BrandonPhilips | [email protected]