Next Steps with the Cornerstone of Distributed Systems
Brandon Philips
@brandonphilips | [email protected]
Demo Code: http://goo.gl/R6Og3Y
Free stickers @ podium!
Motivation
CoreOS cluster reboot lock
- Decrement a semaphore key atomically
- Reboot and wait...
- After reboot, increment the semaphore key
CoreOS updates coordination
[Animation across several slides: machines take a reboot-semaphore slot before rebooting and release it when they come back; the on-screen counter drops from 3 toward 0 and climbs back up as machines return]
Store Application Configuration
[Diagram across several slides: applications pull config from the store on start/restart, pick up config updates, and must cope when the config store is unavailable]
Requirements
Strong Consistency
- mutually exclusive access at any time, for locking purposes
Highly Available
- resilient to single points of failure & network partitions
Watchable
- push configuration updates to applications
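To make the Watchable requirement concrete, here is a minimal sketch (not from the talk) using the etcd v3 Go client: it assumes an etcd endpoint at localhost:2379 and an illustrative key name app/config, and simply prints every update pushed for that key.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and key name are illustrative assumptions for this sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// etcd pushes every change to the key onto this channel, so the
	// application can react to new configuration instead of polling.
	// The loop blocks for as long as the watch stays open.
	for resp := range cli.Watch(context.Background(), "app/config") {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q = %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```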
Requirements
CAP Theorem
- We want CP
- We want something like Paxos
Common Problem
Google - “All” infrastructure relies on Paxos
[Diagram: GFS, Big Table, Spanner, CFS, Chubby — all relying on Paxos]
Common Problem
Amazon - Replicated log for EC2
Microsoft - Boxwood for storage infrastructure
Hadoop - ZooKeeper is the heart of the ecosystem
#GIFEE and Cloud Native Solution
COMMON PROBLEM
10,000 stars on GitHub
250 contributors
CoreOS, Google, Red Hat, EMC, Cisco, Huawei, Baidu, Alibaba...
Kubernetes, Cloud Foundry Diego, Canal, many others
THE HEART OF CLOUD NATIVE
Fully Replicated, Highly Available, Consistent
ETCD KEY VALUE STORE
History of etcd
● 2013.8 Alpha release (v0.x)
● 2015.2 Stable release (v2.0+)
○ stable replication engine (new Raft implementation)
○ stable v2 API
● 2016.6 (v3.0+)
○ efficient, powerful API
○ highly scalable backend
Production Ready
long running failure injection tests
no known data loss issues
no known inconsistency issues
How does etcd work?
● Raft consensus algorithm
○ Using a replicated log to model a state machine
○ "In Search of an Understandable Consensus Algorithm" (Ongaro, 2014)
● Three key concepts
○ Leaders
○ Elections
○ Terms
How does etcd work?
● The cluster elects a leader for every given term
● All log appends (--> state machine changes) are decided by that leader and propagated to followers
● Much, much more at http://raft.github.io/
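As a concrete way to observe leaders and terms from a client, here is a small sketch (assumptions: the go.etcd.io/etcd/client/v3 import path and a member reachable at localhost:2379) that asks one member for its view of the current leader and Raft term via the maintenance Status call.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status reports, among other things, which member ID this member
	// currently considers the leader and which Raft term the cluster is in.
	status, err := cli.Status(ctx, "localhost:2379")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("leader=%x raftTerm=%d raftIndex=%d\n",
		status.Leader, status.RaftTerm, status.RaftIndex)
}
```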
How does etcd work?
● Written in Go, statically linked
● /bin/etcd
○ daemon
○ 2379 (client requests/HTTP + JSON API)
○ 2380 (peer-to-peer/HTTP + protobuf)
● /bin/etcdctl
○ command line client
● net/http, encoding/json, golang/protobuf, ...
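For illustration, a minimal Go client talking to the client port (2379); the endpoint, key, and value are assumptions for this sketch, not anything from the demo code.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Clients speak to 2379; peer traffic on 2380 is etcd-internal.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if _, err := cli.Put(ctx, "/demo/greeting", "hello etcd"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "/demo/greeting")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```

From the command line, the rough equivalent (v3 API) is etcdctl put /demo/greeting "hello etcd" followed by etcdctl get /demo/greeting.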
Clustering
ETCD BASICS
etcd cluster failure tolerance: the cluster needs a majority (quorum) to keep making progress, so 2n+1 members tolerate n member failures (3 members tolerate 1 failure, 5 tolerate 2)
locksmith
ETCD APPS
locksmith
● cluster wide reboot lock
○ "semaphore for reboots"
● CoreOS updates happen automatically
○ prevent all the machines restarting at once...
Cluster Wide Reboot Lock
● Need to reboot? Decrement the semaphore key (atomically) with etcd
● manager.Reboot() and wait...
● After reboot, increment the semaphore key in etcd (atomically)
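A sketch of the decrement step using an etcd v3 transaction as a compare-and-swap. The key name demo/reboot-semaphore and the plain-integer value are simplifications for illustration (locksmith's actual key layout differs), and the endpoint is assumed to be localhost:2379.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// tryAcquire attempts to atomically decrement an integer semaphore stored at
// key, returning true only if this client won a slot. The key is expected to
// already exist and hold a decimal integer; that layout is a simplification.
func tryAcquire(ctx context.Context, cli *clientv3.Client, key string) (bool, error) {
	get, err := cli.Get(ctx, key)
	if err != nil || len(get.Kvs) == 0 {
		return false, err
	}
	cur, err := strconv.Atoi(string(get.Kvs[0].Value))
	if err != nil || cur <= 0 {
		return false, err // malformed value or no free slots
	}
	// Only write the decremented value if the key still holds what we read:
	// a compare-and-swap expressed as an etcd transaction.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(key), "=", strconv.Itoa(cur))).
		Then(clientv3.OpPut(key, strconv.Itoa(cur-1))).
		Commit()
	if err != nil {
		return false, err
	}
	return txn.Succeeded, nil // false means another machine raced us; retry
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	ok, err := tryAcquire(ctx, cli, "demo/reboot-semaphore")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("got reboot slot:", ok)
	// After the machine comes back up, the same compare-and-swap pattern
	// increments the value again to release the slot.
}
```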
[Diagram: routing between hosts 192.168.1.10 and 192.168.1.40 for the subnets 10.0.0.0/24 and 10.0.1.0/24]
Canal Today
● virtual (overlay) network for constrained envs
● BGP for physical environments
● Connection policies
● Built for Kubernetes, useful in other systems with CNI
skydns
ETCD APPS
skydns
● Service discovery and DNS server
● backed by etcd for all configuration/records
vulcand
ETCD APPS
vulcand
● "programmatic, extendable proxy for
microservices"
● HTTP load balancer
● etcd for storing all configuration
confd
ETCD APPS
confd
● simple configuration templating
● for "dumb" applications
● watch etcd for changes, render templates with new values, reload applications
Kubernetes
ETCD APPS
Simple cluster operations
Secure and Simple API
Friendly operational tools
Reliability
● 99% at small scale is easy
○ Failure is infrequent and human manageable
● 99% at large scale is not enough
○ Not manageable by humans
● 99.99% at large scale
○ Reliable systems at the bottom layer
WAL, Snapshots, Testing
HOW DO WE ACHIEVE RELIABILITY
Write Ahead Log
● Append only
○ Simple is good
● Rolling CRC protected
○ Storage & OSes can be unreliable
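To illustrate the rolling-CRC idea (the chaining principle only, not etcd's actual WAL record format): each record's checksum is seeded with the previous record's checksum, so corruption or truncation anywhere breaks verification of everything after it. A small Go sketch with hash/crc32:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	// Castagnoli polynomial, a common choice for storage checksums.
	table := crc32.MakeTable(crc32.Castagnoli)

	records := [][]byte{
		[]byte("entry 1"),
		[]byte("entry 2"),
		[]byte("entry 3"),
	}

	// Each record's CRC is computed over its data seeded with the previous
	// record's CRC, so the checksums form a chain: verifying record N also
	// re-verifies the prefix of the log that led to it.
	var crc uint32
	for i, rec := range records {
		crc = crc32.Update(crc, table, rec)
		fmt.Printf("record %d rolling crc: %08x\n", i, crc)
	}
}
```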
Snapshots
● "Torturing Databases for Fun and Profit" (OSDI 2014)
○ The simpler database is safer
○ LMDB was the winner
● BoltDB, an append-only B+Tree
○ A simpler LMDB, written in Go
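For context, Bolt's API is deliberately tiny: open one file, then read and write inside transactions. A minimal sketch using the era's github.com/boltdb/bolt import path; the file, bucket, and key names are made up for illustration.

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// A single file holds the entire B+Tree; 0600 is the file mode.
	db, err := bolt.Open("snapshot.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes happen in a single serialized read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("keys"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Reads use cheap, concurrent read-only transactions.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("keys")).Get([]byte("foo"))
		fmt.Printf("foo = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```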
Testing Clusters with Failures
● Inject failures into running clusters
● White-box runtime checking
○ Hash the state of the system
○ Check the progress of the system
Testing Cluster Health with Failures
Issue lock operations across the cluster
Ensure the correctness of the client library
dash.etcd.io
TESTING CLUSTER
etcd/raft Reliability
● Designed for testability and flexibility
● Used by large-scale database systems and others
○ CockroachDB, TiKV, Dgraph
Do one thing
ETCD VS OTHERS
Only do the One Thing
ETCD VS OTHERS
Do it really well
ETCD VS OTHERS
Proxy, Caching, Watch Coalescing, Secondary Index
FUTURE WORK
github.com/coreos/etcd
GET INVOLVED
Linux
Runs on any platform
Consistency across clouds
Trivial dev, test, & prod
Faster time to market
Training
San Francisco September 13 & 14
New York City September 27 & 28
San Francisco October 11 & 12
New York City October 25 & 26
Seattle November 10 & 11
https://coreos.com/training
Build, Store and Distribute your Containers
quay.io