Next Steps with the Cornerstone of Distributed Systems
Brandon Philips
@brandonphilips | [email protected]
Demo Code: http://goo.gl/R6Og3Y
Free stickers @ podium!
Motivation
CoreOS cluster reboot lock
- Decrement a semaphore key atomically
- Reboot and wait...
- After reboot, increment the semaphore key
CoreOS updates coordination
[Animation across several slides: machines take a reboot-semaphore slot before rebooting and release it when they come back; the on-screen counter drops from 3 toward 0 and climbs back up as machines return]
Store Application Configuration
[Diagram across several slides: applications pull config from the store on start/restart, pick up config updates, and must cope when the config store is unavailable]
Requirements
Strong Consistency
- mutually exclusive access at any time, for locking purposes
Highly Available
- resilient to single points of failure & network partitions
Watchable
- push configuration updates to applications
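To make the Watchable requirement concrete, here is a minimal sketch (not from the talk) using the etcd v3 Go client: it assumes an etcd endpoint at localhost:2379 and an illustrative key name app/config, and simply prints every update pushed for that key.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and key name are illustrative assumptions for this sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// etcd pushes every change to the key onto this channel, so the
	// application can react to new configuration instead of polling.
	// The loop blocks for as long as the watch stays open.
	for resp := range cli.Watch(context.Background(), "app/config") {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q = %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```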
Requirements
CAP Theorem
- We want CP
- We want something like Paxos
Common Problem
Google - “All” infrastructure relies on Paxos
[Diagram: GFS, Big Table, Spanner, CFS, Chubby — all relying on Paxos]
Common Problem
Amazon - Replicated log for EC2
Microsoft - Boxwood for storage infrastructure
Hadoop - ZooKeeper is the heart of the ecosystem
#GIFEE and Cloud Native Solution
COMMON PROBLEM
10,000 stars on GitHub
250 contributors
CoreOS, Google, Red Hat, EMC, Cisco, Huawei, Baidu, Alibaba...
Kubernetes, Cloud Foundry Diego, Canal, many others
THE HEART OF CLOUD NATIVE
Fully Replicated, Highly Available, Consistent
ETCD KEY VALUE STORE
History of etcd
● 2013.8 Alpha release (v0.x)
● 2015.2 Stable release (v2.0+)
○ stable replication engine (new Raft implementation)
○ stable v2 API
● 2016.6 (v3.0+)
○ efficient, powerful API
○ highly scalable backend
Production Ready
long running failure injection tests
no known data loss issues
no known inconsistency issues
How does etcd work?
● Raft consensus algorithm
○ Using a replicated log to model a state machine
○ "In Search of an Understandable Consensus Algorithm" (Ongaro, 2014)
● Three key concepts
○ Leaders
○ Elections
○ Terms
How does etcd work?
● The cluster elects a leader for every given term
● All log appends (--> state machine changes) are decided by that leader and propagated to followers
● Much, much more at http://raft.github.io/
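As a concrete way to observe leaders and terms from a client, here is a small sketch (assumptions: the go.etcd.io/etcd/client/v3 import path and a member reachable at localhost:2379) that asks one member for its view of the current leader and Raft term via the maintenance Status call.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status reports, among other things, which member ID this member
	// currently considers the leader and which Raft term the cluster is in.
	status, err := cli.Status(ctx, "localhost:2379")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("leader=%x raftTerm=%d raftIndex=%d\n",
		status.Leader, status.RaftTerm, status.RaftIndex)
}
```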
How does etcd work?
● Written in Go, statically linked
● /bin/etcd
○ daemon
○ 2379 (client requests/HTTP + JSON API)
○ 2380 (peer-to-peer/HTTP + protobuf)
● /bin/etcdctl
○ command line client
● net/http, encoding/json, golang/protobuf, ...
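For illustration, a minimal Go client talking to the client port (2379); the endpoint, key, and value are assumptions for this sketch, not anything from the demo code.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Clients speak to 2379; peer traffic on 2380 is etcd-internal.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if _, err := cli.Put(ctx, "/demo/greeting", "hello etcd"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "/demo/greeting")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```

From the command line, the rough equivalent (v3 API) is etcdctl put /demo/greeting "hello etcd" followed by etcdctl get /demo/greeting.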
Clustering
ETCD BASICS
etcd cluster failure tolerance: the cluster needs a majority (quorum) to keep making progress, so 2n+1 members tolerate n member failures (3 members tolerate 1 failure, 5 tolerate 2)
locksmith
ETCD APPS
locksmith
● cluster wide reboot lock
○ "semaphore for reboots"
● CoreOS updates happen automatically
○ prevent all the machines restarting at once...
Cluster Wide Reboot Lock
● Need to reboot? Decrement the semaphore key (atomically) with etcd
● manager.Reboot() and wait...
● After reboot, increment the semaphore key in etcd (atomically)
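A sketch of the decrement step using an etcd v3 transaction as a compare-and-swap. The key name demo/reboot-semaphore and the plain-integer value are simplifications for illustration (locksmith's actual key layout differs), and the endpoint is assumed to be localhost:2379.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// tryAcquire attempts to atomically decrement an integer semaphore stored at
// key, returning true only if this client won a slot. The key is expected to
// already exist and hold a decimal integer; that layout is a simplification.
func tryAcquire(ctx context.Context, cli *clientv3.Client, key string) (bool, error) {
	get, err := cli.Get(ctx, key)
	if err != nil || len(get.Kvs) == 0 {
		return false, err
	}
	cur, err := strconv.Atoi(string(get.Kvs[0].Value))
	if err != nil || cur <= 0 {
		return false, err // malformed value or no free slots
	}
	// Only write the decremented value if the key still holds what we read:
	// a compare-and-swap expressed as an etcd transaction.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(key), "=", strconv.Itoa(cur))).
		Then(clientv3.OpPut(key, strconv.Itoa(cur-1))).
		Commit()
	if err != nil {
		return false, err
	}
	return txn.Succeeded, nil // false means another machine raced us; retry
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	ok, err := tryAcquire(ctx, cli, "demo/reboot-semaphore")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("got reboot slot:", ok)
	// After the machine comes back up, the same compare-and-swap pattern
	// increments the value again to release the slot.
}
```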
[Diagram: routing between hosts 192.168.1.10 and 192.168.1.40 for the subnets 10.0.0.0/24 and 10.0.1.0/24]
Canal Today
● virtual (overlay) network for constrained envs
● BGP for physical environments
● Connection policies
● Built for Kubernetes, useful in other systems with CNI
skydns
ETCD APPS
skydns
● Service discovery and DNS server
● backed by etcd for all configuration/records
vulcand
ETCD APPS
vulcand
● "programmatic, extendable proxy for
microservices"
● HTTP load balancer
● etcd for storing all configuration
confd
ETCD APPS
confd
● simple configuration templating
● for "dumb" applications
● watch etcd for changes, render templates with new values, reload applications
Kubernetes
ETCD APPS
Simple cluster operations
Secure and Simple API
Friendly operational tools
Reliability
● 99% at small scale is easy
○ Failure is infrequent and human manageable
● 99% at large scale is not enough
○ Not manageable by humans
● 99.99% at large scale
○ Reliable systems at the bottom layer
WAL, Snapshots, Testing
HOW DO WE ACHIEVE RELIABILITY
Write Ahead Log
● Append only
○ Simple is good
● Rolling CRC protected
○ Storage & OSes can be unreliable
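To illustrate the rolling-CRC idea (the chaining principle only, not etcd's actual WAL record format): each record's checksum is seeded with the previous record's checksum, so corruption or truncation anywhere breaks verification of everything after it. A small Go sketch with hash/crc32:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	// Castagnoli polynomial, a common choice for storage checksums.
	table := crc32.MakeTable(crc32.Castagnoli)

	records := [][]byte{
		[]byte("entry 1"),
		[]byte("entry 2"),
		[]byte("entry 3"),
	}

	// Each record's CRC is computed over its data seeded with the previous
	// record's CRC, so the checksums form a chain: verifying record N also
	// re-verifies the prefix of the log that led to it.
	var crc uint32
	for i, rec := range records {
		crc = crc32.Update(crc, table, rec)
		fmt.Printf("record %d rolling crc: %08x\n", i, crc)
	}
}
```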
Snapshots
● "Torturing Databases for Fun and Profit" (OSDI 2014)
○ The simpler database is safer
○ LMDB was the winner
● BoltDB, an append-only B+Tree
○ A simpler LMDB, written in Go
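For context, Bolt's API is deliberately tiny: open one file, then read and write inside transactions. A minimal sketch using the era's github.com/boltdb/bolt import path; the file, bucket, and key names are made up for illustration.

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// A single file holds the entire B+Tree; 0600 is the file mode.
	db, err := bolt.Open("snapshot.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes happen in a single serialized read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("keys"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Reads use cheap, concurrent read-only transactions.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("keys")).Get([]byte("foo"))
		fmt.Printf("foo = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```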
Testing Clusters with Failures
● Inject failures into running clusters
● White-box runtime checking
○ Hash the state of the system
○ Check the progress of the system
Testing Cluster Health with Failures
Issue lock operations across the cluster
Ensure the correctness of the client library
dash.etcd.io
TESTING CLUSTER
etcd/raft Reliability
● Designed for testability and flexibility
● Used by large-scale database systems and others
○ CockroachDB, TiKV, Dgraph
Do one thing
ETCD VS OTHERS
Only do the One Thing
ETCD VS OTHERS
Do it really well
ETCD VS OTHERS
Proxy, Caching, Watch Coalescing, Secondary Index
FUTURE WORK
github.com/coreos/etcd
GET INVOLVED
Linux
Runs on any platform
Consistency across clouds
Trivial dev, test, & prod
Faster time to market
Training
San Francisco September 13 & 14
New York City September 27 & 28
San Francisco October 11 & 12
New York City October 25 & 26
Seattle November 10 & 11
https://coreos.com/training
Build, Store and Distribute your Containers
quay.io