Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

16 – 17 November, Sofia ISTACON.ORG Running in multiple data centers By Nikolay Stoitsev

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

600+ cities

Slide 5

Slide 5 text

75+ countries

Slide 6

Slide 6 text

6 continents

Slide 7

Slide 7 text

2 000 000+ drivers

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

How the Internet Kept Humming During 2 Hurricanes https://www.nytimes.com/2017/09/18/us/harvey-irma-internet.html

Slide 10

Slide 10 text

Fault tolerance

Slide 11

Slide 11 text

Low latency

Slide 12

Slide 12 text

Compliance

Slide 13

Slide 13 text

Data locality

Slide 14

Slide 14 text

Under-utilized capacity

Slide 15

Slide 15 text

CAP

Slide 16

Slide 16 text

Continuous network partition

Slide 17

Slide 17 text

2 types of architecture

Slide 18

Slide 18 text

Active-Passive

Slide 19

Slide 19 text

DC 1 DC 2

Slide 20

Slide 20 text

DC 1 DC 2

Slide 21

Slide 21 text

Failover

Slide 22

Slide 22 text

DNS

Slide 23

Slide 23 text

Stateless service

Slide 24

Slide 24 text

Stateful service

Slide 25

Slide 25 text

DC 1 DC 2 DB 1 DB 2 Active-Passive example

Slide 26

Slide 26 text

DC 1 DC 2 DB 1 DB 2 Active-Passive example

Slide 27

Slide 27 text

DC 1 DC 2 DB 1 DB 2 Active-Passive example

Slide 28

Slide 28 text

DC 1 DC 2 Master Slave Slave Slave Real-life example

Slide 29

Slide 29 text

DC 1 DC 2 Master Slave Slave Slave HAProxy HAProxy Smart intermediary

Slide 30

Slide 30 text

DC 1 DC 2 Master Slave Slave Slave HAProxy HAProxy Smart intermediary

Slide 31

Slide 31 text

DC 1 DC 2 Slave Slave Master Slave HAProxy HAProxy Smart intermediary
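
The failover pictured on the "Smart intermediary" slides can be sketched as a toy model. This is an illustration only, not HAProxy configuration or its API: the names `Node`, `Intermediary`, and `db1` to `db4` are hypothetical, and the promotion rule (first healthy replica wins) is an assumption made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    dc: str
    role: str            # "master" or "slave", as labeled on the slides
    healthy: bool = True


class Intermediary:
    """Toy stand-in for a proxy that always sends writes to the current master."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def _healthy_master(self):
        for node in self.nodes:
            if node.role == "master" and node.healthy:
                return node
        return None

    def route_write(self) -> str:
        """Return the name of the node that should receive a write."""
        master = self._healthy_master()
        if master is None:
            master = self._promote()
        return master.name

    def _promote(self):
        # Demote the failed master, then promote the first healthy replica.
        for node in self.nodes:
            if node.role == "master":
                node.role = "slave"
        for node in self.nodes:
            if node.healthy:
                node.role = "master"
                return node
        raise RuntimeError("no healthy node left to promote")


if __name__ == "__main__":
    nodes = [
        Node("db1", "DC 1", "master"),
        Node("db2", "DC 1", "slave"),
        Node("db3", "DC 2", "slave"),
        Node("db4", "DC 2", "slave"),
    ]
    proxy = Intermediary(nodes)
    print(proxy.route_write())      # writes go to the master in DC 1
    for node in nodes:
        if node.dc == "DC 1":
            node.healthy = False    # simulate losing DC 1
    print(proxy.route_write())      # a replica in DC 2 is promoted
```

Because clients only talk to the intermediary, they do not need to know which data center currently hosts the master.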

Slide 32

Slide 32 text

All-active

Slide 33

Slide 33 text

DC 1 DC 2

Slide 34

Slide 34 text

Locality

Slide 35

Slide 35 text

Split traffic in groups

Slide 36

Slide 36 text

Global State

Slide 37

Slide 37 text

mod 2 DC 1 DC 2 user_id = 0 = 1 Partitioning

Slide 38

Slide 38 text

mod 3 DC 1 DC 2 user_id = 0 = 1 DC 3 = 2 Partitioning

Slide 39

Slide 39 text

Very inefficient
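
One likely reading of "very inefficient" is the reshuffling that modulo partitioning causes when a data center is added. The sketch below counts how many users change data center when moving from `user_id mod 2` to `user_id mod 3`. The function names and the sample range of user ids are assumptions made for illustration.

```python
def dc_for_user(user_id: int, num_dcs: int) -> int:
    """Return the 1-based data center for a user: remainder 0 maps to DC 1."""
    return user_id % num_dcs + 1


def moved_users(user_ids, old_dcs: int, new_dcs: int) -> int:
    """Count users whose data center changes when the number of DCs changes."""
    return sum(
        1
        for uid in user_ids
        if dc_for_user(uid, old_dcs) != dc_for_user(uid, new_dcs)
    )


if __name__ == "__main__":
    users = range(6000)  # hypothetical, evenly spread user ids
    moved = moved_users(users, old_dcs=2, new_dcs=3)
    print(f"{moved} of {len(users)} users change data center")
```

For evenly spread ids, two thirds of the users end up in a different data center after the third one is added, so most of the per-user state would have to move.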

Slide 40

Slide 40 text

Consistent hashing DC 1 DC 3 DC 2 DC 3

Slide 41

Slide 41 text

Consistent hashing DC 1 DC 3 DC 2 DC 3 user_id
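
A minimal sketch of a consistent hashing ring, assuming MD5 as the hash function and several ring positions per data center (the slide shows DC 3 at two positions). The `HashRing` class, the `replicas` parameter, and the key format are illustrative assumptions, not the implementation used in the talk.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), replicas: int = 100):
        self.replicas = replicas   # ring positions per data center
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> data center name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `replicas` positions for the data center on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key) -> str:
        """Return the data center owning the first position clockwise of the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(str(key)))
        return self._owner[self._points[idx % len(self._points)]]


if __name__ == "__main__":
    ring = HashRing(["DC 1", "DC 2"])
    before = {uid: ring.lookup(uid) for uid in range(2000)}
    ring.add("DC 3")
    moved = sum(1 for uid in before if ring.lookup(uid) != before[uid])
    print(f"{moved} of {len(before)} users moved, all of them to DC 3")
```

When a data center is added, only the keys that fall just before its new ring positions change owner, and they all move to the new data center. The rest of the users stay where they were.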

Slide 42

Slide 42 text

DNS load balancing

Slide 43

Slide 43 text

DC 1 DC 2 San Francisco Los Angeles New York Toronto

Slide 44

Slide 44 text

DC 1 DC 2 San Francisco Los Angeles New York Toronto

Slide 45

Slide 45 text

Database layer

Slide 46

Slide 46 text

No generic solution

Slide 47

Slide 47 text

Galera Cluster Synchronous multi-master database cluster http://galeracluster.com/

Slide 48

Slide 48 text

DC 1: Master, Slave; DC 2: Slave, Master; DC 3: Master, Slave; DC 4: Slave, Master

Slide 49

Slide 49 text

Apache Cassandra http://cassandra.apache.org/

Slide 50

Slide 50 text

Linear scalability, Fault-tolerance, Commodity hardware

Slide 51

Slide 51 text

Designed for multiple data centers

Slide 52

Slide 52 text

Apache Mesos http://mesos.apache.org/

Slide 53

Slide 53 text

Application Layer

Slide 54

Slide 54 text

Apache Kafka

Slide 55

Slide 55 text

uReplicator https://github.com/uber/uReplicator

Slide 56

Slide 56 text

https://eng.uber.com/ureplicator/

Slide 57

Slide 57 text

https://eng.uber.com/ureplicator/

Slide 58

Slide 58 text

Cherami https://github.com/uber/cherami-server

Slide 59

Slide 59 text

Multi-zone topics Producer Producer Topic Topic Consumer Group Consumer Group replication

Slide 60

Slide 60 text

Multi-zone consumers Producer Topic Topic Consumer Group Consumer Group replication offset sync

Slide 61

Slide 61 text

https://eng.uber.com/cherami/

Slide 62

Slide 62 text

Lessons learned

Slide 63

Slide 63 text

Total dev time Time thinking about failover

Slide 64

Slide 64 text

Total dev time Time thinking about failover

Slide 65

Slide 65 text

Failover testing

Slide 66

Slide 66 text

Failure testing

Slide 67

Slide 67 text

Super smart clients

Slide 68

Slide 68 text

“The best way to avoid failure is to fail constantly.” http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html

Slide 69

Slide 69 text

Thank you! @stoitsev Nikolay Stoitsev http://careersinfo.uber.com/sofia-engineering