Slide 1

Operating & Tuning Riak

Slide 2

Tom Santero @tsantero

Slide 3

Hector Castro @hectcastro

Slide 4

Let’s build a Database!

Slide 5

Desired Properties:
- High Availability
- Low Latency
- Scalable
- Fault Tolerance
- Ops-Friendly
- Predictable

Slide 6

No content

Slide 7

DYNAMO

Slide 8

1. Data Model 2. Persistence Mechanism

Slide 9

{"key": "value"} 1. Data Model 2. Persistence Mechanism

Slide 10

{"key": "value"} 1. Data Model 2. Persistence Mechanism

Slide 11

1. Data Model 2. Persistence Mechanism
[diagram: a table of key/value pairs]

Slide 12

1. Data Model 2. Persistence Mechanism
[diagram: a table of key/value pairs]
Bitcask, LevelDB
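
The two backends make different persistence trade-offs: Bitcask keeps an append-only log with an in-memory "keydir" pointing at each key's latest value (so all keys must fit in RAM), while LevelDB keeps sorted tables on disk. A toy Python sketch of Bitcask's core idea, with invented names and no framing, hint files, or merges:

import os

class ToyBitcask(object):
    def __init__(self, path):
        self.f = open(path, 'a+b')
        self.keydir = {}                  # key -> (offset, length) in the log

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)               # real Bitcask writes a framed, CRC'd entry
        self.f.flush()
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.keydir[key] # one seek + one read per lookup
        self.f.seek(offset)
        return self.f.read(length)

db = ToyBitcask('/tmp/toy.bitcask')
db.put('cities/atlanta', '{"name": "Atlanta"}')
print db.get('cities/atlanta')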

Slide 13

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ---->

Slide 14

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ----> index = hash(key) % count(servers)

Slide 15

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ----> index = hash(key) % count(servers) =>3

Slide 16

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ----> index = hash(key) % count(servers) E ---->

Slide 17

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ----> index = hash(key) % count(servers) E ----> =>4

Slide 18

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ----> index = hash(key) % count(servers) E ----> remap keys with every topology change???
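
A quick Python sketch (key names and counts are made up, not from the deck) of why this hurts: count how many of 10,000 keys land on a different server when a fifth server joins a four-server cluster.

NUM_KEYS = 10000
moved = 0
for i in range(NUM_KEYS):
    key = 'key-%d' % i
    # index = hash(key) % count(servers), before and after adding server E
    if hash(key) % 4 != hash(key) % 5:
        moved += 1
print '%d of %d keys must move (~%d%%)' % (moved, NUM_KEYS, 100 * moved / NUM_KEYS)

Typically around 80% of keys change owners, so nearly the whole dataset gets reshuffled for a single topology change.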

Slide 19

AIN’T NOBODY GOT TIME FOR THAT!

Slide 20

3. Distribution Logic 4. Replication A ----> B ----> C ----> D ----> E ----> SOLUTION: consistent hashing!

Slide 21

hash ring

Slide 22

tokenize it

Slide 23

node 0 node 1 node 2

Slide 24

node 0 node 1 node 2 hash(key)

Slide 25

node 0 node 1 node 2 node 3 + hash(key)

Slide 26

3. Distribution Logic 4. Replication node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions

Slide 27

3. Distribution Logic 4. Replication node 0 node 1 node 2 node 3 hash("cities/atlanta") Replicas are stored on the next N - 1 contiguous partitions
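
A minimal Python sketch of this placement scheme, loosely following Riak's approach; the partition count, node names, and modulo step are simplifications (Riak splits a 160-bit SHA-1 space into 64 partitions by default):

import hashlib

NUM_PARTITIONS = 8      # Riak defaults to 64; 8 keeps the demo readable
NODES = ['node0', 'node1', 'node2', 'node3']
N = 3                   # replication factor

# Round-robin claim: partition i belongs to node i mod number-of-nodes.
ring = [NODES[i % len(NODES)] for i in range(NUM_PARTITIONS)]

def preference_list(key):
    # Hash the key onto the ring to find its first partition...
    p = int(hashlib.sha1(key).hexdigest(), 16) % NUM_PARTITIONS
    # ...then the next N - 1 clockwise partitions hold the replicas.
    return [ring[(p + i) % NUM_PARTITIONS] for i in range(N)]

print preference_list('cities/atlanta')   # e.g. ['node2', 'node3', 'node0']

When a node joins or leaves, it only claims or gives up a share of partitions, instead of remapping nearly every key the way modulo placement does.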

Slide 28

5. CRUD 6. Global State http://basho.com/understanding-riaks-configurable-behaviors-part-1/ Quorum requests: N, R, W, PR/PW, DW (just open a socket)
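
Those quorum knobs can be tuned per request. A hedged sketch against Riak's HTTP interface (it assumes a node listening on localhost:8098; the bucket, key, and payload are illustrative):

import urllib2

# GET with r=1: answer as soon as a single replica responds.
print urllib2.urlopen(
    'http://localhost:8098/buckets/cities/keys/atlanta?r=1').read()

# PUT with w=3 and dw=2: all three replicas must acknowledge the write,
# two of them durably, before the client sees a 2xx.
req = urllib2.Request(
    'http://localhost:8098/buckets/cities/keys/atlanta?w=3&dw=2',
    '{"name": "Atlanta"}',
    {'Content-Type': 'application/json'})
req.get_method = lambda: 'PUT'
urllib2.urlopen(req)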

Slide 29

5. CRUD 6. Global State gossip protocol /var/lib/riak/data/ring

Slide 30

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2

Slide 31

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline

Slide 32

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline put("cities/atlanta")

Slide 33

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline put("cities/atlanta")

Slide 34

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline put("cities/atlanta") FALLBACK “SECONDARY”

Slide 35

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline FALLBACK “SECONDARY”

Slide 36

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline FALLBACK “SECONDARY” node 2

Slide 37

7. Fault Tolerance 8. Conflict Resolution node 0 node 1 node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline node 2 HINTED HANDOFF
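
A rough Python sketch of the idea, reusing the round-robin ring from the consistent-hashing example; the function name and shapes are illustrative, not Riak's API. The coordinator walks past the down primary to the next live node, which stores the replica with a "hint" naming the intended owner and hands the data off once node 2 returns:

def targets_for_put(ring, first_partition, n, down_nodes):
    targets = []
    i = 0
    while len(targets) < n and i < len(ring):
        node = ring[(first_partition + i) % len(ring)]
        i += 1
        if node in down_nodes or node in targets:
            continue      # skip offline primaries and already-chosen nodes
        targets.append(node)
    return targets

ring = ['node0', 'node1', 'node2', 'node3'] * 2   # 8 partitions, round robin
print targets_for_put(ring, 1, 3, set(['node2']))
# ['node1', 'node3', 'node0'] -> node0 stands in as node2's fallback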

Slide 38

7. Fault Tolerance 8. Conflict Resolution
Vector Clocks establish temporality:
- give us “happened before”
- easy to reason about
- provide a way to resolve conflicting writes

Slide 39

7. Fault Tolerance 8. Conflict Resolution
Vector Clocks establish temporality:
- give us “happened before”
- easy to reason about
- provide a way to resolve conflicting writes
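
A minimal vector clock sketch in Python (a toy, not Riak's implementation): each actor increments its own counter on update, and clock A "happened before" clock B when B contains every event in A.

def increment(clock, actor):
    clock = dict(clock)
    clock[actor] = clock.get(actor, 0) + 1
    return clock

def happened_before(a, b):
    # True when b contains every event in a (a <= b in causal order).
    return all(count <= b.get(actor, 0) for actor, count in a.items())

v1 = increment({}, 'x')    # {'x': 1}
v2 = increment(v1, 'y')    # extends v1
v3 = increment(v1, 'z')    # also extends v1, independently of v2

print happened_before(v1, v2)   # True: v2 supersedes v1
print happened_before(v2, v3), happened_before(v3, v2)
# False False: concurrent clocks, i.e. conflicting writes to resolve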

Slide 40

9. Anti-Entropy 10. Map/Reduce
- Merkle tree to track changes
- coordinated at the vnode level
- run as a background process
- exchange with neighbor vnodes for inconsistencies
- resolution semantics: trigger read-repair
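
A toy Python sketch of the exchange (Riak's real trees are persistent, incremental, and far deeper; the buckets and values here are invented): hash keys into buckets, hash buckets into a root, and only descend where the roots disagree.

import hashlib

def bucket_hashes(data, buckets=4):
    segs = [[] for _ in range(buckets)]
    for key in sorted(data):
        segs[hash(key) % buckets].append('%s=%s' % (key, data[key]))
    return [hashlib.sha1('|'.join(seg)).hexdigest() for seg in segs]

replica_a = {'cities/atlanta': 'v2', 'cities/boston': 'v1'}
replica_b = {'cities/atlanta': 'v1', 'cities/boston': 'v1'}  # stale replica

ha, hb = bucket_hashes(replica_a), bucket_hashes(replica_b)
root_a = hashlib.sha1(''.join(ha)).hexdigest()
root_b = hashlib.sha1(''.join(hb)).hexdigest()
if root_a != root_b:
    dirty = [i for i in range(4) if ha[i] != hb[i]]
    print 'roots differ; compare bucket(s) %s and read-repair their keys' % dirty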

Slide 41

9. Anti-Entropy 10. Map/Reduce
[diagram legend: highlighted = hashes marked “dirty”]

Slide 42

9. Anti-Entropy 10. Map/Reduce

Slide 43

9. Anti-Entropy 10. Map/Reduce

Slide 44

9. Anti-Entropy 10. Map/Reduce

Slide 45

9. Anti-Entropy 10. Map/Reduce

Slide 46

9. Anti-Entropy 10. Rich Queries
[diagram legend: highlighted = keys to read-repair]

Slide 47

9. Anti-Entropy 10. Rich Queries 2i Riak Search Map/Reduce

Slide 48

9. Anti-Entropy 10. Rich Queries 2i Yokozuna Riak Search Map/Reduce
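
As one example of these query paths, a MapReduce job can be posted to the HTTP /mapred endpoint. A hedged sketch (the bucket, key, and data are invented; Riak.mapValuesJson is one of the built-in JavaScript map functions):

import json
import urllib2

# Map over one explicit [bucket, key] input; feeding whole buckets in as
# inputs forces a full key listing, an anti-pattern this deck warns about.
job = {
    'inputs': [['cities', 'atlanta']],
    'query': [{'map': {'language': 'javascript',
                       'name': 'Riak.mapValuesJson'}}],
}
req = urllib2.Request('http://localhost:8098/mapred', json.dumps(job),
                      {'Content-Type': 'application/json'})
print urllib2.urlopen(req).read()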

Slide 49

What did we build?

Slide 50

A masterless, distributed key/value store with peer-to-peer replication, object versioning logic, and rich query capabilities that is resilient to failures and self-healing.

Slide 51

Problem?

Slide 52

is excessive friendliness unfriendly?

Slide 53

What we talk about when we talk about love Availability

Slide 54

Latency Node Liveness Network Partitions

Slide 55

Riak should have a flat latency curve. If your 95th and 99th percentiles are irregular, your developers are ****ing you.

Slide 56

packets

Slide 57

FAILURE

Slide 58

How many hosts do you need to survive “F” failures?

Slide 59

F + 1: fundamental minimum
2F + 1: a majority are alive
3F + 1: Byzantine Fault Tolerance
(For F = 1 that means 2, 3, and 4 hosts, respectively.)

Slide 60

“...widespread underestimation of the specific difficulties of size seems one of the major underlying causes of the current software failure.” --E.W. Dijkstra, Notes on Structured Programming, 1969

Slide 61

Common Mistakes

Slide 62

1. No Monitoring
2. List All The Keys!
3. hadoooooops!
4. reallllly BIG objects
5. 1 cluster, several AZs
6. No Firewall
7. No Capacity Planning
8. Wait too long to scale

Slide 63

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
Metric: Threshold
- CPU: 75% * num_cores
- Memory: 70% - buffers
- Disk Space: 75%
- Disk IO: 80% sustained
- Network: 70% sustained
- File Descriptors: 75% of ulimit
- Swap: > 0KB

Slide 64

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
Riak counters live at http://<host>:<port>/stats; graph them with:

Slide 65

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
Riak counters live at http://<host>:<port>/stats; graph them with:
or if you don't want to run your own monitoring service, there's an aaS for that...

Slide 66

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
Riak counters live at http://<host>:<port>/stats; graph them with:
or if you don't want to run your own monitoring service, there's an aaS for that...

import json
import socket
from urllib2 import urlopen
from time import sleep

UDP_ADDRESS = "carbon.hostedgraphite.com"
UDP_PORT = 2003
RIAK_STATS_URL = 'http://localhost:11098/stats'

HG_API_KEY = 'Your Api Key from HostedGraphite.com'

# Fetch the node's stats as JSON and derive a Graphite-safe node name.
stats = json.load(urlopen(RIAK_STATS_URL))
nn = stats['nodename'].replace('.', '-')

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # Internet / UDP

# Send every integer-valued stat to Hosted Graphite's Carbon endpoint.
for k in stats:
    if type(stats[k]) is int:
        message = '%s.%s.%s %s' % (HG_API_KEY, nn, k, stats[k])
        sock.sendto(message, (UDP_ADDRESS, UDP_PORT))
        # sleep(0.1)  # uncomment to throttle the send rate
        print message
print 'Sent %s' % len(stats)

Slide 67

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale

Slide 68

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
[diagram: the Erlang VM runs one SMP scheduler per CPU core, each draining its own run queue of Erlang processes, on top of the OS + kernel]

Slide 69

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale Riak != Hadoop

Slide 70

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale busy_dist_port got you down?

Slide 71

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale

Slide 72

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale http://aphyr.com/posts/224-do-not-expose-riak-to-the-internet

Slide 73

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale https://github.com/basho/basho_bench Basho Bench

Slide 74

1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale (actual footage of a cluster under load attempting handoff)

Slide 75

No content

Slide 76

Tuning Demo