Motivation
CoreOS cluster reboot lock
- Decrement a semaphore key atomically
- Reboot and wait...
- After reboot, increment the semaphore key
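A minimal sketch of the decrement step, assuming an etcd v3 cluster and the clientv3 Go package (newer releases import it as go.etcd.io/etcd/client/v3); the semaphore key name and timeout are placeholders:

```go
package main

import (
	"context"
	"strconv"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// tryAcquire atomically decrements the semaphore key: the transaction only
// commits if the value is unchanged since we read it (compare-and-swap).
func tryAcquire(cli *clientv3.Client, key string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := cli.Get(ctx, key)
	if err != nil || len(resp.Kvs) == 0 {
		return false, err
	}
	cur, err := strconv.Atoi(string(resp.Kvs[0].Value))
	if err != nil {
		return false, err
	}
	if cur <= 0 {
		return false, nil // no reboot slots free; wait and retry
	}

	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(key), "=", strconv.Itoa(cur))).
		Then(clientv3.OpPut(key, strconv.Itoa(cur-1))).
		Commit()
	if err != nil {
		return false, err
	}
	return txn.Succeeded, nil
}
```

Releasing the lock after reboot is the mirror image: read the key and CAS it back to cur+1.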
Slides 5-14
CoreOS updates coordination
[Diagram animation: machines in the cluster acquire the reboot semaphore in turn; the key's value counts down from 3 to 0 as locks are taken.]
Slides 15-18
Store Application Configuration
[Diagram animation: applications read a config key from the store on Start / Restart, receive a pushed Update, and must cope when the store is Unavailable.]
Slide 19
Requirements
Strong Consistency
- Mutually exclusive access at any time, for locking
Highly Available
- Resilient to single points of failure & network partitions
Watchable
- Push configuration updates to applications
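A sketch of the watchable requirement with the clientv3 Go package; the key name and reload callback are hypothetical stand-ins for an application's config-reload hook:

```go
package main

import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3"
)

// watchConfig blocks forever, invoking reload with each new value the
// store pushes for the config key.
func watchConfig(cli *clientv3.Client, key string, reload func([]byte)) {
	for wresp := range cli.Watch(context.Background(), key) {
		for _, ev := range wresp.Events {
			if ev.Type == clientv3.EventTypePut {
				log.Printf("config changed, reloading")
				reload(ev.Kv.Value)
			}
		}
	}
}
```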
Slide 20
Requirements
CAP
- We want CP
- We want something like Paxos
Slide 21
Common problem
[Diagram: GFS, Big Table, Spanner, CFS, and Chubby, all resting on Paxos.]
Google - "All" infrastructure relies on Paxos
Slide 22
Common problem
Amazon - Replicated log powers EC2
Microsoft - Boxwood powers storage infrastructure
Hadoop - ZooKeeper is the heart of the ecosystem
Slide 23
COMMON PROBLEM
#GIFEE and Cloud Native Solution
Slide 24
10,000 Stars on GitHub
250 contributors
Google, Red Hat, EMC, Cisco, Huawei, Baidu, Alibaba...
Slide 25
THE HEART OF CLOUD NATIVE
Kubernetes, Cloud Foundry Diego, Project Calico, many others
Slide 26
ETCD KEY VALUE STORE
Fully Replicated, Highly Available, Consistent
Slide 27
Key-value Operations
PUT(foo, bar), GET(foo), DELETE(foo)
Watch(foo)
CAS(foo, bar, bar1)
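A minimal sketch of these operations with the clientv3 Go package (Watch(foo) is sketched earlier under Requirements); the endpoint, keys, and values are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// PUT(foo, bar) then GET(foo)
	if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s=%s\n", kv.Key, kv.Value)
	}

	// CAS(foo, bar, bar1): swap bar for bar1 only if the current value is bar
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("foo"), "=", "bar")).
		Then(clientv3.OpPut("foo", "bar1")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("CAS succeeded:", txn.Succeeded)

	// DELETE(foo)
	if _, err := cli.Delete(ctx, "foo"); err != nil {
		log.Fatal(err)
	}
}
```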
Reliable Performance
- Similar to ZooKeeper with snapshot disabled
- Incremental snapshot
- No Garbage Collection Pauses
- Off-heap storage
Slides 50-51
Performance: etcd3 vs ZooKeeper (snapshot disabled)
[Benchmark charts]
Slide 52
Memory
[Chart: memory used to store 512MB of data (2M 256B keys): 10GB vs 2.4GB vs 0.8GB]
Slide 53
Reliability
99% at small scale is easy
- Failures are infrequent and manageable by humans
99% at large scale is not enough
- Not manageable by humans
99.99% at large scale
- Requires reliable systems at the bottom layer
Slide 54
HOW DO WE ACHIEVE RELIABILITY
WAL, Snapshots, Testing
Slide 55
Write Ahead Log
Append only
- Simple is good
Rolling CRC protected
- Storage & OSes can be unreliable
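A toy illustration of the rolling-CRC idea (not etcd's actual wal package): each record's checksum chains from the previous one, so a corrupted or truncated entry breaks the chain and is detected on replay.

```go
package main

import (
	"encoding/binary"
	"hash/crc32"
	"io"
)

var table = crc32.MakeTable(crc32.Castagnoli)

// WAL is a toy append-only record writer with a rolling checksum.
type WAL struct {
	w   io.Writer
	crc uint32 // rolling CRC across all records written so far
}

// Append writes [len | rolling-crc | data]; the CRC covers every byte
// appended since the log was created, not just this record.
func (l *WAL) Append(data []byte) error {
	l.crc = crc32.Update(l.crc, table, data)
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(data)))
	binary.LittleEndian.PutUint32(hdr[4:8], l.crc)
	if _, err := l.w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := l.w.Write(data)
	return err
}
```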
Slide 56
Snapshots
"Torturing Databases for Fun and Profit" (OSDI 2014)
- The simpler database is safer
- LMDB was the winner
BoltDB, an append-only B+Tree
- A simpler LMDB written in Go
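A minimal sketch of BoltDB's transactional API (github.com/boltdb/bolt); the file, bucket, and key names are placeholders:

```go
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("snap.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// All writes happen in a single serialized read-write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("keys"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```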
Slide 57
Testing Clusters with Failures
Inject failures into running clusters
White box runtime checking
- Hash state of the system
- Progress of the system
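A toy illustration of the "hash the state" check: fold a member's key/value state into one digest and compare digests across members once injected failures heal; the map input is a hypothetical stand-in for a real KV dump.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// stateHash folds a member's key/value pairs into a single digest.
// Members with identical state must produce identical digests.
func stateHash(kvs map[string]string) uint64 {
	keys := make([]string, 0, len(kvs))
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // fixed order so equal states hash equally
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(kvs[k]))
	}
	return h.Sum64()
}
```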
Slide 58
Testing Cluster Health with Failures
Issue lock operations across cluster
Ensure the correctness of client library
Slide 59
TESTING CLUSTER
dash.etcd.io
Slide 60
etcd/raft Reliability
Designed for testability and flexibility
Used by large-scale database systems and others
- CockroachDB, TiKV, Dgraph
Slide 61
etcd vs others
Do one thing
Slide 62
etcd vs others
Only do the One Thing
Slide 63
etcd vs others
Do it Really Well
Slide 64
etcd Reliability
Do it Really Well
Slide 65
ETCD v3.0 BETA
Efficient and Scalable
Slide 66
BETA AVAILABLE TODAY
github.com/coreos/etcd
Slide 67
FUTURE WORK
Proxy, Caching, Watch Coalescing, Secondary Index