Operating and Tuning Riak

Presented at http://www.meetup.com/Atlanta-Riak-Meetup/events/116993012/ - mostly a discussion about latency.

Tom Santero

May 29, 2013

Transcript

  1. 1. Data Model 2. Persistence Mechanism [diagram: rows of key/value pairs]
  2. 1. Data Model 2. Persistence Mechanism [diagram: rows of key/value pairs, persisted by Bitcask or LevelDB]
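
    Conceptually, the data model is nothing more than opaque values addressed by keys. A toy sketch of that interface in Python (a made-up class, not Riak's client API):

        # toy key/value store: the whole data model is put/get by key
        class KV(object):
            def __init__(self):
                self.data = {}  # a real store persists this via Bitcask or LevelDB

            def put(self, key, value):
                self.data[key] = value

            def get(self, key):
                return self.data.get(key)

        store = KV()
        store.put('cities/atlanta', '{"population": 443775}')
        print(store.get('cities/atlanta'))
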
  3. 3. Distribution Logic 4. Replication. Servers A ----> B ----> C ----> D ---->; index = hash(key) % count(servers)
  4. 3. Distribution Logic 4. Replication. index = hash(key) % count(servers) => 3
  5. 3. Distribution Logic 4. Replication. A fifth server joins: A ----> B ----> C ----> D ----> E ---->
  6. 3. Distribution Logic 4. Replication. The same key now maps to a different server: index = hash(key) % count(servers) => 4
  7. 3. Distribution Logic 4. Replication. Remap keys with every topology change???
  8. 3. Distribution Logic 4. Replication. SOLUTION: consistent hashing!
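
    To see why index = hash(key) % count(servers) forces a remap, count how many keys change servers when one server joins. A quick Python sketch (the keys are made up; any uniform hash shows the same effect):

        import hashlib

        def naive_index(key, num_servers):
            # hash the key, then take it modulo the server count
            h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
            return h % num_servers

        keys = ['cities/%d' % i for i in range(10000)]
        moved = sum(1 for k in keys if naive_index(k, 4) != naive_index(k, 5))
        print('%.0f%% of keys move going from 4 to 5 servers' % (100.0 * moved / len(keys)))
        # expect roughly 80% to move; consistent hashing bounds this to ~1/5
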
  9. 3. Distribution Logic 4. Replication. [ring diagram: partitions claimed by node 0, node 1, node 2, node 3] Replicas are stored on the next N - 1 contiguous partitions
  10. 3. Distribution Logic 4. Replication. hash(“cities/atlanta”) selects a partition on the ring; replicas are stored on the next N - 1 contiguous partitions
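
    A minimal sketch of the ring and preference list, assuming a simplified round-robin partition claim (the node names, N, and key are illustrative; the real logic lives in riak_core):

        import hashlib

        RING_SIZE = 64   # number of partitions (64 is Riak's default ring size)
        NODES = ['node0', 'node1', 'node2', 'node3']
        N = 3            # replication factor

        # simplified claim: partitions striped round-robin across the nodes
        owner = dict((p, NODES[p % len(NODES)]) for p in range(RING_SIZE))

        def partition(key):
            # map the key onto the ring (Riak hashes with SHA-1)
            h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
            return h % RING_SIZE

        def preflist(key):
            # the primary partition plus the next N - 1 contiguous partitions
            p = partition(key)
            return [(q % RING_SIZE, owner[q % RING_SIZE]) for q in range(p, p + N)]

        print(preflist('cities/atlanta'))
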
  11. 7. Fault Tolerance 8. Conflict Resolution. [ring diagram: node 0, node 1, node 2, node 3] Replicas are stored on the next N - 1 contiguous partitions
  12. 7. Fault Tolerance 8. Conflict Resolution. node 2 goes offline
  13. 7. Fault Tolerance 8. Conflict Resolution. put(“cities/atlanta”) arrives while node 2 is offline
  15. 7. Fault Tolerance 8. Conflict Resolution. Another node accepts the write as a FALLBACK (“SECONDARY”) vnode
  16. 7. Fault Tolerance 8. Conflict Resolution. The FALLBACK (“SECONDARY”) holds the replica while node 2 is offline
  17. 7. Fault Tolerance 8. Conflict Resolution. node 2 comes back online
  18. 7. Fault Tolerance 8. Conflict Resolution. The fallback hands the data back to node 2: HINTED HANDOFF
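
    Continuing the ring sketch above, a rough model of fallback selection: walk past the offline owner to the next nodes on the ring, and remember that the stand-ins owe the data back. This is a simplification of Riak's actual preflist code:

        def preflist_with_fallbacks(key, down):
            # skip offline owners while walking the ring; the substitutes are
            # fallback ("secondary") vnodes, which later return the writes to
            # the primaries via hinted handoff
            p = partition(key)
            chosen, q = [], p
            while len(chosen) < N:
                node = owner[q % RING_SIZE]
                if node not in down:
                    chosen.append((q % RING_SIZE, node))
                q += 1
            return chosen

        print(preflist_with_fallbacks('cities/atlanta', down={'node2'}))
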
  19. 7. Fault Tolerance 8. Conflict Resolution. Vector Clocks establish temporality: they give us “happened before”, are easy to reason about, and provide a way to resolve conflicting writes
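
    A toy model of the “happened before” comparison (the actor names are invented; Riak stores a vclock alongside each object):

        def descends(a, b):
            # a descends b when a has seen everything b has: every per-actor
            # counter in b is <= the matching counter in a
            return all(a.get(actor, 0) >= n for actor, n in b.items())

        v1 = {'client_x': 1}                  # original write
        v2 = {'client_x': 1, 'client_y': 1}   # client_y read v1, then updated
        v3 = {'client_x': 2}                  # client_x updated without seeing v2

        print(descends(v2, v1))                      # True: v2 happened after v1
        print(descends(v2, v3) or descends(v3, v2))  # False: concurrent, a conflict
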
  21. 9. Anti-Entropy 10. Map/Reduce. Merkle trees track changes; coordinated at the vnode level; run as a background process; exchanged with neighbor vnodes to find inconsistencies; resolution semantics: trigger read-repair
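
    A toy Merkle-tree exchange to show the mechanics (four made-up values per tree; real anti-entropy trees are persistent and vastly larger):

        import hashlib

        def sha(s):
            return hashlib.sha1(s.encode()).hexdigest()

        def merkle(leaves):
            # hash each value, then repeatedly hash pairs up to a single root
            level = [sha(v) for v in leaves]
            tree = [level]
            while len(level) > 1:
                pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
                level = [sha(''.join(p)) for p in pairs]
                tree.append(level)
            return tree  # tree[-1][0] is the root hash

        a = merkle(['v1', 'v2', 'v3', 'v4'])  # one vnode's objects
        b = merkle(['v1', 'vX', 'v3', 'v4'])  # a neighbor's replicas
        # matching roots mean the replicas agree; on a mismatch the exchange
        # walks down the tree to the divergent key and triggers read-repair
        print(a[-1][0] == b[-1][0])  # False: leaf 1 differs
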
  22. A masterless, distributed key/value store with peer-to-peer replication, object versioning, and rich query capabilities that is resilient to failures and self-healing.
  23. Riak should have a flat latency curve. If your 95th & 99th percentiles are irregular, your developers are ****ing you.
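
    A quick way to eyeball your own tail, using nearest-rank percentiles over made-up GET latencies:

        def percentile(samples, p):
            # nearest-rank percentile of a list of samples
            xs = sorted(samples)
            rank = max(1, int(round(p / 100.0 * len(xs))))
            return xs[rank - 1]

        get_latencies_ms = [4, 5, 5, 6, 6, 7, 8, 9, 12, 250]  # hypothetical
        for p in (50, 95, 99):
            print('p%d = %dms' % (p, percentile(get_latencies_ms, p)))
        # a p95/p99 far above the median is the irregular tail to chase down
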
  24. F + 1: the fundamental minimum. 2F + 1: a majority are alive. 3F + 1: Byzantine Fault Tolerance. (To tolerate F = 2 failures, that is 3, 5, or 7 replicas respectively.)
  25. “...widespread underestimation of the specific difficulties of size seems one of the major underlying causes of the current software failure.” --E.W. Dijkstra, Notes on Structured Programming, 1969
  26. 1. No Monitoring 2. List All The Keys! 3. hadoooooops! 4. reallllly BIG objects 5. 1 cluster, several AZs 6. No Firewall 7. No Capacity Planning 8. Wait too long to scale
  27. 1. No Monitoring. Thresholds to alert on:

        Metric             Threshold
        CPU                75% * num_cores
        Memory             70% (excluding buffers)
        Disk Space         75%
        Disk IO            80% sustained
        Network            70% sustained
        File Descriptors   75% of ulimit
        Swap               > 0KB
  28. 1. No Monitoring. Riak counters are exposed at http://<riakip>:<port>/stats; graph them with: <insert monitoring tool>
  29. 1. No Monitoring. ...or, if you don't want to run your own monitoring service, there's an aaS for that...
  30. 1. No Monitoring. A short Python script that pushes Riak's numeric /stats counters to Hosted Graphite over UDP:

        import json
        import socket
        from urllib2 import urlopen  # Python 2

        UDP_ADDRESS = "carbon.hostedgraphite.com"
        UDP_PORT = 2003
        RIAK_STATS_URL = 'http://localhost:11098/stats'
        HG_API_KEY = 'Your Api Key from HostedGraphite.com'

        # fetch the full stats blob from Riak's HTTP /stats endpoint
        stats = json.load(urlopen(RIAK_STATS_URL))

        # Graphite uses dots as metric path separators, so escape the node name
        nn = stats['nodename'].replace('.', '-')
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP

        # ship every integer-valued stat as "<api key>.<node>.<stat> <value>"
        sent = 0
        for k in stats:
            if isinstance(stats[k], int):
                message = '%s.%s.%s %s' % (HG_API_KEY, nn, k, stats[k])
                sock.sendto(message, (UDP_ADDRESS, UDP_PORT))
                sent += 1
                print message
        print 'Sent %s' % sent
  31. 2. List All The Keys!
  32. 2. List All The Keys! [diagram: the Erlang VM's SMP schedulers, one per CPU core, each with its own run queue of processes, running on the OS + kernel]
  33. 3. hadoooooops! Riak != Hadoop
  34. 4. reallllly BIG objects. busy_dist_port got you down?
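
    busy_dist_port means the Erlang distribution buffers between nodes are full, which shipping really big objects will do. A commonly cited mitigation is raising the distribution buffer busy limit in vm.args; the value below is an illustrative starting point, not a tuned recommendation:

        ## vm.args: distribution buffer busy limit in KB (the default is 1024)
        +zdbbl 32768
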
  35. 5. 1 cluster, several AZs
  36. 6. No Firewall. http://aphyr.com/posts/224-do-not-expose-riak-to-the-internet
  37. 7. No Capacity Planning. Basho Bench: https://github.com/basho/basho_bench
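
    A minimal Basho Bench config sketch, patterned on the example configs that ship with the tool; the driver and generator settings here are illustrative, so check them against your version:

        %% ten minutes of max-rate load with an 80/20 get/update mix
        {mode, max}.
        {duration, 10}.
        {concurrent, 8}.
        {driver, basho_bench_driver_riakc_pb}.
        {riakc_pb_ips, [{127,0,0,1}]}.
        {key_generator, {int_to_bin_bigendian, {uniform_int, 100000}}}.
        {value_generator, {fixed_bin, 1000}}.
        {operations, [{get, 4}, {update, 1}]}.
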
  38. 8. Wait too long to scale. (actual footage of a cluster under load attempting handoff)