etcd: Next Steps with the Cornerstone of Distributed Systems

Brandon Philips
August 23, 2016

Transcript

  1. Next Steps with the Cornerstone of Distributed Systems. Brandon Philips, @brandonphilips | [email protected]. Demo code: http://goo.gl/R6Og3Y. Free stickers @ podium!
  2. Motivation: the CoreOS cluster reboot lock
     - Decrement a semaphore key atomically
     - Reboot and wait...
     - After reboot, increment the semaphore key
  3. Requirements
     - Strong consistency: mutual exclusion at any point in time, for locking
     - High availability: resilient to single points of failure and network partitions
     - Watchable: push configuration updates out to applications
  4. A Common Problem
     - Amazon: a replicated log for EC2
     - Microsoft: Boxwood for storage infrastructure
     - Hadoop: ZooKeeper is the heart of the ecosystem
  5. History of etcd
     ◦ 2013.8: alpha release (v0.x)
     ◦ 2015.2: stable release (v2.0+)
       ◦ stable replication engine (new Raft implementation)
       ◦ stable v2 API
     ◦ 2016.6: v3.0+
       ◦ efficient, powerful API
       ◦ highly scalable backend
  6. How does etcd work?
     • Raft consensus algorithm
       ◦ Uses a replicated log to model a state machine
       ◦ "In Search of an Understandable Consensus Algorithm" (Ongaro, 2014)
     • Three key concepts
       ◦ Leaders
       ◦ Elections
       ◦ Terms
  7. How does etcd work?
     • The cluster elects a leader for every given term
     • All log appends (that is, state machine changes) are decided by that leader and propagated to its followers
     • Much, much more at http://raft.github.io/
  8. How does etcd work?
     • Written in Go, statically linked
     • /bin/etcd
       ◦ the daemon
       ◦ port 2379: client requests (HTTP + JSON API)
       ◦ port 2380: peer-to-peer traffic (HTTP + protobuf)
     • /bin/etcdctl
       ◦ command-line client
     • Built on net/http, encoding/json, golang/protobuf, ...
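To make the client port concrete, here is a minimal Go program that writes and reads a key over port 2379 through the v3 API. It is a sketch: the github.com/coreos/etcd/clientv3 import path is the client bundled with etcd v3.0, and the localhost endpoint assumes a single local member.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect to a local etcd member on the client port (2379).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Write a key, then read it back.
	if _, err := cli.Put(ctx, "/hello", "world"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "/hello")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```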
  9. locksmith
     • A cluster-wide reboot lock: a "semaphore for reboots"
     • CoreOS updates happen automatically
       ◦ prevents all the machines from restarting at once...
  10. Cluster-Wide Reboot Lock
     • Need to reboot? Decrement the semaphore key (atomically) with etcd
     • manager.Reboot() and wait...
     • After reboot, increment the semaphore key in etcd (atomically)
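The atomic decrement maps directly onto an etcd v3 compare-and-swap transaction: read the semaphore, then write the decremented value only if the key has not changed in the meantime. Below is a minimal sketch of the idea; the key name and endpoint are invented for the example, and locksmith's real implementation differs.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// tryAcquire attempts to take one reboot slot by atomically
// decrementing the semaphore key. The transaction only commits if the
// key still holds the value we read, so two machines racing for the
// last slot cannot both win.
func tryAcquire(ctx context.Context, cli *clientv3.Client, key string) (bool, error) {
	resp, err := cli.Get(ctx, key)
	if err != nil || len(resp.Kvs) == 0 {
		return false, err
	}
	cur := string(resp.Kvs[0].Value)
	n, err := strconv.Atoi(cur)
	if err != nil || n <= 0 {
		return false, err // malformed counter, or no slots free
	}
	tresp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(key), "=", cur)).
		Then(clientv3.OpPut(key, strconv.Itoa(n-1))).
		Commit()
	if err != nil {
		return false, err
	}
	return tresp.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// "reboot-semaphore" is a placeholder key for this sketch.
	ok, err := tryAcquire(context.Background(), cli, "reboot-semaphore")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("acquired reboot slot:", ok)
}
```

If Succeeded comes back false, another machine raced us to the key, and the caller would typically watch the semaphore and retry. Incrementing after the reboot is the same transaction with n+1.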
  11. Canal Today
     • Virtual (overlay) network for constrained environments
     • BGP for physical environments
     • Connection policies
     • Built for Kubernetes; useful in other systems through CNI
  12. confd
     • Simple configuration templating
     • For "dumb" applications
     • Watch etcd for changes, render templates with the new values, reload the applications
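The watch, render, reload loop fits in a few lines against the v3 API. This is an illustrative sketch rather than confd's code: the watched key and template are made up, and a real reloader would write the rendered output to a config file and signal the application instead of printing it.

```go
package main

import (
	"context"
	"log"
	"os"
	"text/template"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// A toy config template; confd loads these from disk.
const tmpl = "upstream backend {\n    server {{.}};\n}\n"

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	t := template.Must(template.New("conf").Parse(tmpl))

	// Block on the watch stream and re-render on every change event.
	for wresp := range cli.Watch(context.Background(), "/services/backend/addr") {
		for _, ev := range wresp.Events {
			if err := t.Execute(os.Stdout, string(ev.Kv.Value)); err != nil {
				log.Print(err)
			}
		}
	}
}
```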
  13. Reliability
     • 99% at small scale is easy
       ◦ Failure is infrequent and manageable by humans
     • 99% at large scale is not enough
       ◦ Not manageable by humans
     • 99.99% at large scale
       ◦ Requires reliable systems at the bottom layer
  14. Write-Ahead Log
     • Append only
       ◦ Simple is good
     • Protected by a rolling CRC
       ◦ Storage and OSes can be unreliable
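A rolling CRC chains each record's checksum to the checksum of the record before it, so a flipped bit, a truncated tail, or a reordered record shows up as a mismatch on replay. The sketch below shows the chaining idea in Go; etcd's actual wal package uses its own on-disk record format.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

type record struct {
	data []byte
	crc  uint32
}

// appendRecord checksums the new data seeded with the previous
// record's CRC, chaining every record to the entire log before it.
func appendRecord(wal []record, data []byte) []record {
	var prev uint32
	if len(wal) > 0 {
		prev = wal[len(wal)-1].crc
	}
	h := crc32.NewIEEE()
	fmt.Fprintf(h, "%08x", prev) // seed with the previous CRC
	h.Write(data)
	return append(wal, record{data: data, crc: h.Sum32()})
}

// verify replays the log, recomputing the CRC chain; the first
// mismatch marks the end of the trustworthy prefix.
func verify(wal []record) bool {
	var replay []record
	for _, r := range wal {
		replay = appendRecord(replay, r.data)
		if replay[len(replay)-1].crc != r.crc {
			return false
		}
	}
	return true
}

func main() {
	var wal []record
	wal = appendRecord(wal, []byte("entry-1"))
	wal = appendRecord(wal, []byte("entry-2"))
	fmt.Println("intact:", verify(wal)) // true
	wal[0].data = []byte("flipped")
	fmt.Println("corrupt:", verify(wal)) // false
}
```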
  15. Snapshots
     • "Torturing Databases for Fun and Profit" (OSDI 2014)
       ◦ The simpler database is safer
       ◦ LMDB was the winner
     • BoltDB: an append-only B+tree
       ◦ A simpler LMDB, written in Go
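Part of BoltDB's appeal is how small its surface is: one file, read/write transactions, and buckets of key/value pairs. A quick taste of the API (the file, bucket, and key names here are arbitrary):

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// Open (or create) the single-file database; like LMDB, all
	// pages live in one memory-mapped file.
	db, err := bolt.Open("demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One writer at a time; every Update is a serialized
	// read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("keys"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Readers get a consistent snapshot and never block the writer.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("keys")).Get([]byte("foo"))
		fmt.Printf("foo = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```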
  16. Testing Cluster Failure
     • Inject failures into running clusters
     • White-box runtime checking
       ◦ Hash the state of the system
       ◦ Track the progress of the system
  17. etcd/raft Reliability
     • Designed for testability and flexibility
     • Used by large-scale database systems and others
       ◦ CockroachDB, TiKV, Dgraph
  18. Training
     San Francisco: September 13 & 14
     New York City: September 27 & 28
     San Francisco: October 11 & 12
     New York City: October 25 & 26
     Seattle: November 10 & 11
     https://coreos.com/training
  19. Thank you! Brandon Philips, @brandonphilips | [email protected] | coreos.com
      We're hiring in all departments! Email: [email protected] | Positions: coreos.com/careers