
Operating and Tuning Riak

Presented at http://www.meetup.com/Atlanta-Riak-Meetup/events/116993012/ - mostly a discussion about latency.


Tom Santero

May 29, 2013

Transcript

  1. Operating & Tuning Riak

  2. Tom Santero @tsantero

  3. Hector Castro @hectcastro

  4. Let’s build a Database!

  5. Desired Properties High Availability Low Latency Scalable Fault Tolerance Ops-Friendly

    Predictable
  6. None
  7. DYNAMO

  8. 1. Data Model 2. Persistence Mechanism

  9. {“key”: “value”} 1. Data Model 2. Persistence Mechanism

  10. {“key”: “value”} 1. Data Model 2. Persistence Mechanism

  11. 1. Data Model 2. Persistence Mechanism key value key value

    key value key value key value key value key value key value key value
  12. 1. Data Model 2. Persistence Mechanism key value key value

    key value key value key value key value key value key value key value Bitcask LevelDB
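The data model above is just keys mapping to opaque values, persisted by a pluggable backend such as Bitcask or LevelDB. A minimal Python sketch of that interface, purely illustrative (the class and method names are not Riak's API):

    class KVBackend(object):
        """Toy stand-in for a persistence backend such as Bitcask or LevelDB."""
        def __init__(self):
            self._data = {}            # key -> opaque value (bytes)

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)

        def delete(self, key):
            self._data.pop(key, None)

    store = KVBackend()
    store.put("cities/atlanta", b'{"name": "Atlanta"}')
    print(store.get("cities/atlanta"))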
  13. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ---->
  14. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ----> index = hash(key) % count(servers)
  15. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ----> index = hash(key) % count(servers) =>3
  16. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ----> index = hash(key) % count(servers) E ---->
  17. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ----> index = hash(key) % count(servers) E ----> =>4
  18. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ----> index = hash(key) % count(servers) E ----> remap keys with every topology change???
  19. AIN’T NOBODY GOT TIME FOR THAT!
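To see why index = hash(key) % count(servers) hurts, here is a quick Python sketch (the keys and the choice of SHA-1 are illustrative; any stable hash will do) that counts how many keys change servers when a fifth node E joins:

    import hashlib

    def server_for(key, num_servers):
        # naive placement: index = hash(key) % count(servers)
        digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return digest % num_servers

    keys = ["key-%d" % i for i in range(10000)]
    moved = sum(1 for k in keys if server_for(k, 4) != server_for(k, 5))
    print("%.1f%% of keys remap going from 4 to 5 servers" % (100.0 * moved / len(keys)))
    # roughly 80% of keys move, which is why this scheme falls apart on topology changes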

  20. 3. Distribution Logic 4. Replication A ----> B ----> C

    ----> D ----> E ----> SOLUTION: consistent hashing!
  21. hash ring

  22. tokenize it

  23. node 0 node 1 node 2

  24. node 0 node 1 node 2 hash(key)

  25. node 0 node 1 node 2 node 3 + hash(key)

  26. 3. Distribution Logic 4. Replication node 0 node 1 node

    2 node 3 Replicas are stored on the next N - 1 contiguous partitions
  27. 3. Distribution Logic 4. Replication node 0 node 1 node

    2 node 3 hash(“cities/atlanta”) Replicas are stored on the next N - 1 contiguous partitions
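A minimal sketch of the ring above, assuming a 64-partition ring and 4 nodes claiming partitions round-robin; preference_list hashes a bucket/key onto a partition and takes the next N - 1 contiguous partitions for the replicas. The names and partition count are illustrative, not Riak internals:

    import hashlib

    RING_SIZE = 2 ** 160           # SHA-1 keyspace, like Riak's ring
    NUM_PARTITIONS = 64            # hypothetical ring size
    N = 3                          # replicas per object

    interval = RING_SIZE // NUM_PARTITIONS
    nodes = ["node0", "node1", "node2", "node3"]
    # claim partitions round-robin so each node owns every 4th partition
    owner = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

    def preference_list(bkey, n=N):
        # hash lands the key on one partition; replicas go to the next n - 1
        # partitions clockwise, which fall on distinct nodes under this claim
        h = int(hashlib.sha1(bkey.encode()).hexdigest(), 16)
        first = h // interval
        partitions = [(first + i) % NUM_PARTITIONS for i in range(n)]
        return [(p, owner[p]) for p in partitions]

    print(preference_list("cities/atlanta"))
    # three contiguous partitions, landing on three distinct nodes

Adding a node only changes ownership of the partitions it claims, so only a small fraction of keys move, unlike the modulo scheme.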
  28. 5. CRUD 6. Global State http://basho.com/understanding-riaks-configurable-behaviors-part-1/ Quorum requests N R

    W PR/PW DW (just open a socket)
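The tunables on this slide (N, R, W, plus PR/PW and DW) are per-request quorum knobs. A small illustrative check of just R and W, assuming the common defaults N=3, R=2, W=2 (see the Basho link above for the full semantics):

    N, R, W = 3, 2, 2     # replicas, read quorum, write quorum (assumed defaults)

    def read_ok(replies_received):
        return replies_received >= R      # enough vnodes answered the GET

    def write_ok(acks_received):
        return acks_received >= W         # enough vnodes acknowledged the PUT

    # With R + W > N, any read quorum overlaps the most recent write quorum,
    # so at least one replica in the read saw the last successful write.
    print(R + W > N)      # True for 3/2/2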
  29. 5. CRUD 6. Global State gossip protocol /var/lib/riak/data/ring

  30. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2
  31. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline
  32. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline put(“cities/atlanta”)
  33. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline put(“cities/atlanta”)
  34. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline put(“cities/atlanta”) FALLBACK “SECONDARY”
  35. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline FALLBACK “SECONDARY”
  36. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline FALLBACK “SECONDARY” node 2
  37. 7. Fault Tolerance 8. Conflict Resolution node 0 node 1

    node 2 node 3 Replicas are stored on the next N - 1 contiguous partitions node 2 offline node 2 HINTED HANDOFF
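Roughly what slides 31-37 walk through, as a sketch: if a primary node in the preference list is down, the coordinator picks a live fallback ("secondary") node, which remembers which primary it is covering for and hands the data back once that node returns (hinted handoff). This reuses the illustrative preference_list from the ring sketch above; the fallback choice here is deliberately naive:

    def plan_put(bkey, up_nodes):
        """Return (partition, node, hint) triples for a put; hint names the
        primary a fallback is covering for, or None when the primary is up."""
        plan = []
        for partition, primary in preference_list(bkey):
            if primary in up_nodes:
                plan.append((partition, primary, None))
            else:
                fallback = sorted(up_nodes)[0]          # naive fallback selection
                plan.append((partition, fallback, primary))
        return plan

    # node2 offline: its vnode's share of the write goes to a fallback
    print(plan_put("cities/atlanta", up_nodes={"node0", "node1", "node3"}))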
  38. 7. Fault Tolerance 8. Con!ict Resolution Vector Clocks establish temporality

    - gives us “happened before” - easy to reason about - provide a way for resolving conflicting writes
  39. 7. Fault Tolerance 8. Con!ict Resolution Vector Clocks establish temporality

    - gives us “happened before” - easy to reason about - provide a way for resolving conflicting writes
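A minimal vector clock sketch in Python (a dict of actor -> counter), just to make "happened before" and conflicting (concurrent) writes concrete; Riak's actual vclock encoding differs:

    def increment(vclock, actor):
        vc = dict(vclock)
        vc[actor] = vc.get(actor, 0) + 1
        return vc

    def descends(a, b):
        # True when clock `a` has seen everything in `b`: b "happened before" a
        return all(a.get(actor, 0) >= count for actor, count in b.items())

    def concurrent(a, b):
        # neither clock descends the other: the writes conflict (siblings)
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "client-x")        # {"client-x": 1}
    v2 = increment(v1, "client-y")        # descends v1, so it supersedes it
    v3 = increment(v1, "client-z")        # also descends v1, but not v2
    print(descends(v2, v1))               # True  -- happened before
    print(concurrent(v2, v3))             # True  -- conflicting writes to resolve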
  40. 9. Anti-Entropy 10. Map/Reduce Merkle tree to track changes; coordinated

    at the vnode level; run as a background process; exchange with neighbor vnodes for inconsistencies; resolution semantics: trigger read-repair
  41. 9. Anti-Entropy 10. Map/Reduce = hashes marked “dirty”

  42. 9. Anti-Entropy 10. Map/Reduce

  43. 9. Anti-Entropy 10. Map/Reduce

  44. 9. Anti-Entropy 10. Map/Reduce

  45. 9. Anti-Entropy 10. Map/Reduce

  46. 9. Anti-Entropy 10. Rich Queries = keys to read-repair
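A toy version of the anti-entropy exchange sketched above: each vnode keeps a hash tree over its keys (flattened to one level of buckets here), neighbors compare bucket hashes, walk only the "dirty" buckets, and hand mismatched keys to read-repair. The bucket count and hashing are illustrative, not Riak's actual tree layout:

    import hashlib

    def sha(data):
        return hashlib.sha1(data).hexdigest()

    def bucket_hashes(store, num_buckets=16):
        # one level of a Merkle tree: bucket the keys, hash each bucket
        buckets = [[] for _ in range(num_buckets)]
        for key, value in sorted(store.items()):
            buckets[int(sha(key.encode()), 16) % num_buckets].append((key, value))
        return [sha(repr(bucket).encode()) for bucket in buckets], buckets

    def exchange(local, remote):
        # compare bucket hashes; only differing ("dirty") buckets are walked
        lhashes, lbuckets = bucket_hashes(local)
        rhashes, rbuckets = bucket_hashes(remote)
        dirty_keys = []
        for i in range(len(lhashes)):
            if lhashes[i] != rhashes[i]:
                lkeys, rkeys = dict(lbuckets[i]), dict(rbuckets[i])
                for key in set(lkeys) | set(rkeys):
                    if lkeys.get(key) != rkeys.get(key):
                        dirty_keys.append(key)     # hand off to read-repair
        return dirty_keys

    vnode_a = {"cities/atlanta": b"v2", "cities/boston": b"v1"}
    vnode_b = {"cities/atlanta": b"v1", "cities/boston": b"v1"}
    print(exchange(vnode_a, vnode_b))              # ['cities/atlanta']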

  47. 9. Anti-Entropy 10. Rich Queries 2i Riak Search Map/Reduce

  48. 9. Anti-Entropy 10. Rich Queries 2i Yokozuna Riak Search Map/Reduce

  49. What did we build?

  50. A masterless, distributed key/value store with peer-to-peer replication, object versioning

    logic, and rich query capabilities; it’s resilient to failures and self-healing.
  51. Problem?

  52. is excessive friendliness unfriendly?

  53. What we talk about when we talk about love Availability

  54. Latency Node Liveness Network Partitions

  55. Riak should have a flat latency curve. If your 95th & 99th

    percentiles are irregular, your developers are ****ing you.
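One way to keep yourself honest about that curve: compute the percentiles yourself from client-side latency samples rather than trusting averages. A small sketch (the sample numbers are made up):

    def percentile(samples, p):
        # nearest-rank percentile; fine for eyeballing a latency distribution
        ordered = sorted(samples)
        index = int(round((p / 100.0) * (len(ordered) - 1)))
        return ordered[index]

    latencies_ms = [4, 5, 5, 6, 7, 5, 250, 6, 5, 8]    # hypothetical GET latencies
    for p in (50, 95, 99):
        print("p%d = %s ms" % (p, percentile(latencies_ms, p)))
    # a flat curve keeps p95/p99 close to the median; one slow outlier shows up immediately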
  56. packets

  57. FAILURE

  58. How many hosts do you need to survive “F” failures?

  59. F + 1: fundamental minimum; 2F + 1: a majority

    are alive; 3F + 1: Byzantine Fault Tolerance
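As a quick worked example of those formulas for, say, F = 2 simultaneous failures:

    F = 2  # failures to survive
    print("crash-stop minimum:        %d hosts" % (F + 1))      # 3
    print("majority stays alive:      %d hosts" % (2 * F + 1))  # 5
    print("Byzantine fault tolerance: %d hosts" % (3 * F + 1))  # 7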
  60. “...widespread underestimation of the specific difficulties of size seems one

    of the major underlying causes of the current software failure.” --E. W. Dijkstra, Notes on Structured Programming, 1969
  61. Common Mistakes

  62. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
  63. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale Metric / Threshold: CPU 75% * num_cores; Memory 70% - buffers; Disk Space 75%; Disk IO 80% sustained; Network 70% sustained; File Descriptors 75% of ulimit; Swap > 0KB
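Those thresholds are easy to turn into alerts. A sketch with hypothetical metric names (how you collect the values is up to your monitoring stack; the host size and ulimit here are assumptions):

    NUM_CORES = 8          # assumed host size
    FD_ULIMIT = 65536      # assumed `ulimit -n`

    # (alert name, hypothetical key in your collected metrics, threshold)
    THRESHOLDS = [
        ("CPU",              "cpu_percent",               75 * NUM_CORES),
        ("Memory",           "mem_percent_minus_buffers", 70),
        ("Disk Space",       "disk_space_percent",        75),
        ("Disk IO",          "disk_io_percent_sustained", 80),
        ("Network",          "net_percent_sustained",     70),
        ("File Descriptors", "open_fds",                  0.75 * FD_ULIMIT),
        ("Swap",             "swap_used_kb",              0),
    ]

    def breached(metrics):
        # return the names of any metrics over their threshold
        return [name for name, key, limit in THRESHOLDS if metrics[key] > limit]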
  64. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale http://<riakip>:<port>/stats Riak Counters graph them with: <insert monitoring tool>
  65. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale http://<riakip>:<port>/stats Riak Counters graph them with: <insert monitoring tool> or if you don’t want to run your own monitoring service, there’s an aaS for that...
  66. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale http://<riakip>:<port>/stats Riak Counters graph them with: <insert monitoring tool> or if you don’t want to run your own monitoring service, there’s an aaS for that...

    # Push Riak's /stats counters to Hosted Graphite over UDP (Python 2).
    import json
    import socket
    from urllib2 import urlopen

    UDP_ADDRESS = "carbon.hostedgraphite.com"
    UDP_PORT = 2003
    RIAK_STATS_URL = 'http://localhost:11098/stats'

    HG_API_KEY = 'Your Api Key from HostedGraphite.com'

    # Fetch the full stats blob from Riak's HTTP stats endpoint.
    stats = json.load(urlopen(RIAK_STATS_URL))

    nn = stats['nodename'].replace('.', '-')
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # AF_INET + SOCK_DGRAM = UDP

    # Send every integer-valued stat as "apikey.nodename.metric value".
    for k in stats:
        if type(1) == type(stats[k]):
            message = '%s.%s.%s %s' % (HG_API_KEY, nn, k, stats[k])
            sock.sendto(message, (UDP_ADDRESS, UDP_PORT))
            print message
    print 'Sent %s' % len(stats)
  67. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
  68. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale [diagram: the Erlang VM runs one SMP scheduler and run queue per CPU core, scheduling Erlang processes on top of the OS + kernel]
  69. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale Riak != Hadoop
  70. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale busy_dist_port got you down?
  71. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
  72. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale http://aphyr.com/posts/224-do-not-expose-riak-to-the-internet
  73. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale https://github.com/basho/basho_bench Basho Bench
  74. 1. No Monitoring 2. List All The Keys! 3. hadoooooops!

    4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale (actual footage of a cluster under load attempting handoff)
  75. None
  76. Tuning Demo