DevOps: A Cognitive Bias

Tom Santero
November 27, 2012

Slides from my talk at http://phillydevops.org/

Transcript

  1. DevOps: A Cognitive Bias. Tom Santero (@tsantero), Nov 27, 2012, Philly DevOps Meetup. Wednesday, November 28, 12

  2. @tsantero: Tom Santero, Technical Evangelist, Basho Technologies

  3. Basho makes Riak

  4. Riak: a distributed, masterless, highly available key-value store
    PROS: high read/write availability; predictable latency; minimal maintenance required
    CONS: I/O bound; network is very chatty; permissive API

  5. Fun Facts: Basho was originally a SaaS. We originally built Riak for said SaaS. In 2009 we open-sourced it and shifted our focus to Riak itself.

  6. Riak: designed for SCALE

  7. In production at... ...and thousands more.

  8. Basho is unique in that we build a piece of infrastructure

  9. deploy. monitor. manage.

  10. We build Riak for YOU

  11. The Issues: Basho is a software vendor; we don’t operate large-scale systems. Every single deployment is different. dat Full Stack

  12. It passed the unit tests... SHIP IT! (this is Artie, the Lone Testing Ranger)

  13. what if your test has bugs?

  14. module 'mapred_test'
          mapred_test:76: compat_basic1_test_...*failed*
        ::{badmatch,{error,{could_not_reach_node,nonode@nohost}}}
    the build passes but the test fails... WTF?!?!

  15. (image only)

  16. Innumeracy: the inability of human beings to reason with numbers.
    Cognitive Bias: the inability of human beings to reason intuitively about orders of magnitude.

  17. So, what can we do? Make assumptions. Make educated guesses. Test the shit out of everything. And recognize that no matter what we do, something will ALWAYS BREAK.

  18. this is TRUE of ALL SOFTWARE

  19. Customer Driven Development: "Here, take this code and install it as a critical piece of your stack." "Oh, it broke because of XYZ? Here’s a patch / bugfix, sorry about that..."

  20. A Tale from Production

  21. Situation: You have a cluster. Things are great. It’s time to add capacity.

  22. Solution: Add a new node.

  23. Hostnames: the customer named nodes after drinks: Aston, IPA, Highball, Gin, Framboise, ESB

  24. riak-admin: adding a node is easy!
        on aston:
        $ riak-admin join ipa@192.168.10.223 *
    * replaced by riak-admin cluster join <node> in Riak 1.2+

  25. uhh...wat?

  26. Basho Support: Stop the handoff between nodes. On every node we ran:
        riak attach
        application:set_env(riak_core, handoff_concurrency, 0).

  27. Monitor

  28. for signs of...

  29. stabilization

  30. What happened?
    1. New node added
    2. Ring must rebalance
    3. Nodes claim partitions
    4. Handoff of data begins
    5. Disks fill up

  31. triage time
        $ riak-admin member_status
        ================================= Membership ==================================
        Status     Ring    Pending    Node
        -------------------------------------------------------------------------------
        valid       4.3%     16.8%    riak@aston
        valid      18.8%     16.8%    riak@esb
        valid      19.1%     16.8%    riak@framboise
        valid      19.5%     16.8%    riak@gin
        valid      19.1%     16.4%    riak@highball
        valid      19.1%     16.4%    riak@ipa
        -------------------------------------------------------------------------------
        Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

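
The member_status table above shows how far each node's current ring share is from its pending share. A minimal Python sketch of that comparison follows. The row layout (status, ring %, pending %, node) is assumed from this slide only, the function name `pending_deltas` is hypothetical, and rows that print something other than a percentage in the Pending column are simply skipped.

```python
import re

# Rows copied from the slide; header, separator and summary lines are
# included to show that the parser ignores them.
MEMBER_STATUS = """\
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid       4.3%     16.8%    riak@aston
valid      18.8%     16.8%    riak@esb
valid      19.1%     16.8%    riak@framboise
valid      19.5%     16.8%    riak@gin
valid      19.1%     16.4%    riak@highball
valid      19.1%     16.4%    riak@ipa
-------------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
"""

# Assumed row shape: <status> <ring>% <pending>% <node>
ROW = re.compile(r"^(\w+)\s+([\d.]+)%\s+([\d.]+)%\s+(\S+)$")


def pending_deltas(text):
    """Map node name -> (pending % - ring %), rounded to one decimal."""
    deltas = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            _status, ring, pending, node = match.groups()
            deltas[node] = round(float(pending) - float(ring), 1)
    return deltas


if __name__ == "__main__":
    # Positive delta: the node still has data to receive.
    # Negative delta: the node still has data to hand off.
    ranked = sorted(pending_deltas(MEMBER_STATUS).items(), key=lambda kv: kv[1])
    for node, delta in ranked:
        print(f"{node:16s} {delta:+.1f}")
```

Read this way, aston (the new node) is the only member still waiting to receive data, and every other node has a share to give up.
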
  32. triage time: Let’s try to relieve the pressure a bit. Focus on the node with the least disk space left.
        gin:~$ riak attach
        application:set_env(riak_core, forced_ownership_handoff, 0).
        application:set_env(riak_core, vnode_inactivity_timeout, 300000).
        application:set_env(riak_core, handoff_concurrency, 1).
        riak_core_vnode:trigger_handoff(element(2,
            riak_core_vnode_master:get_vnode_pid(411047335499316445744786359201454599278231027712, riak_kv_vnode))).

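
The long integer passed to get_vnode_pid is a partition index: an offset into Riak's 2^160 hash ring, which is divided into equal partitions. The Python sketch below is only an illustration of that arithmetic. The helper names are hypothetical, and the customer's actual ring size is not stated in the slides, so the code reports the smallest power-of-two ring size that is consistent with the index.

```python
RING_TOP = 2 ** 160  # size of the hash space the ring covers


def min_ring_size(index, max_size=2 ** 16):
    """Smallest power-of-two ring size for which `index` is a partition boundary."""
    size = 1
    while size <= max_size:
        if index % (RING_TOP // size) == 0:
            return size
        size *= 2
    raise ValueError("index is not a partition boundary for any ring size tried")


def partition_number(index, ring_size):
    """Zero-based position of the partition on a ring of `ring_size` equal partitions."""
    return index // (RING_TOP // ring_size)


# Partition index from the trigger_handoff call on this slide.
INDEX = 411047335499316445744786359201454599278231027712

if __name__ == "__main__":
    smallest = min_ring_size(INDEX)
    print("smallest consistent ring size:", smallest)
    # Position of the same index on a 64-partition ring (assumed size, for illustration).
    print("position on a 64-partition ring:", partition_number(INDEX, 64))
```

Any larger power-of-two ring size is equally consistent with this one index, so the result is a lower bound rather than the cluster's configured value.
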
  33. triage time: It took 20 minutes to transfer the vnode
        (riak@gin)7> 19:34:00.574 [info] Starting handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston

        gin:~$ sudo netstat -nap | fgrep 10.36.18.245
        tcp        0   1065 10.36.110.79:40532   10.36.18.245:8099    ESTABLISHED 27124/beam.smp
        tcp        0      0 10.36.110.79:46345   10.36.18.245:53664   ESTABLISHED 27124/beam.smp

        (riak@gin)7> 19:54:56.721 [info] Handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston completed: sent 3805730 objects in 1256.14 seconds

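
The two log lines above are enough to cross-check the "20 minutes" figure and to estimate handoff throughput. The Python sketch below does that arithmetic. The helper names are hypothetical, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%H:%M:%S.%f"):
    """Seconds between two same-day log timestamps such as '19:34:00.574'."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def objects_per_second(objects, seconds):
    """Average transfer rate over the whole handoff."""
    return objects / seconds


# Values copied from the handoff log lines on this slide.
elapsed = elapsed_seconds("19:34:00.574", "19:54:56.721")
rate = objects_per_second(3805730, 1256.14)

print(f"elapsed: {elapsed:.3f} s ({elapsed / 60:.1f} min)")
print(f"rate: {rate:.0f} objects/s")
```

The wall-clock gap between the two log lines agrees with the 1256.14 seconds Riak reports, and the average works out to roughly 3,000 objects per second for this one vnode.
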
  34. triage time: And the vnode had arrived at Aston from Gin
        aston:/data/riak/bitcask/205523667749658222872393179600727299639115513856-132148847970820$ ls -la
        total 7305344
        drwxr-xr-x   2 riak riak       4096 2011-11-11 18:05 .
        drwxr-xr-x 258 riak riak      36864 2011-11-11 18:56 ..
        -rw-------   1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
        -rw-r--r--   1 riak riak   86614226 2011-11-11 17:53 1321055508.bitcask.hint
        -rw-------   1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
        -rw-r--r--   1 riak riak   55333675 2011-11-11 19:50 1321055611.bitcask.hint
        -rw-------   1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
        -rw-r--r--   1 riak riak   99390277 2011-11-11 18:03 1321056070.bitcask.hint
        -rw-------   1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
        -rw-r--r--   1 riak riak   56509595 2011-11-11 18:05 1321056214.bitcask.hint
        -rw-------   1 riak riak        119 2011-11-11 17:53 bitcask.write.lock

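
To get a feel for how much disk a single vnode occupies, the listing above can be totalled per file type. This is a small Python sketch over the text of the slide, not a tool from the talk. The function name `sizes_by_suffix` is hypothetical, and it assumes the `ls -la` column order shown here (size in the fifth field, name last).

```python
LISTING = """\
total 7305344
drwxr-xr-x   2 riak riak       4096 2011-11-11 18:05 .
drwxr-xr-x 258 riak riak      36864 2011-11-11 18:56 ..
-rw-------   1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
-rw-r--r--   1 riak riak   86614226 2011-11-11 17:53 1321055508.bitcask.hint
-rw-------   1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
-rw-r--r--   1 riak riak   55333675 2011-11-11 19:50 1321055611.bitcask.hint
-rw-------   1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
-rw-r--r--   1 riak riak   99390277 2011-11-11 18:03 1321056070.bitcask.hint
-rw-------   1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
-rw-r--r--   1 riak riak   56509595 2011-11-11 18:05 1321056214.bitcask.hint
-rw-------   1 riak riak        119 2011-11-11 17:53 bitcask.write.lock
"""


def sizes_by_suffix(listing):
    """Sum regular-file sizes (bytes) per file-name suffix, e.g. 'data' or 'hint'."""
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip the 'total' line, directories, and anything that is not a file row.
        if len(fields) < 8 or not fields[0].startswith("-"):
            continue
        size, name = int(fields[4]), fields[-1]
        suffix = name.rsplit(".", 1)[-1]
        totals[suffix] = totals.get(suffix, 0) + size
    return totals


if __name__ == "__main__":
    totals = sizes_by_suffix(LISTING)
    for suffix, size in sorted(totals.items()):
        print(f"{suffix:5s} {size:>12d} bytes")
    print(f"all   {sum(totals.values()) / 2 ** 30:.2f} GiB")
```

For this directory the files add up to about 7 GiB, nearly all of it in the `.data` files, which is why leftover vnode directories matter so much for free disk space.
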
  35. Eureka!!! You gonna eat that inode? Data was not being cleaned up after handoff. This would eventually eat up all disk space.

  36. Solution: We already had a bugfix for the next release that detected this problem. Hot patch it!

  37. Hot Patch: We patched their live production system while it was still under load.
        (on all nodes) riak attach
        l(riak_kv_bitcask_backend).
        m(riak_kv_bitcask_backend).
        Module riak_kv_bitcask_backend compiled: Date: November 12 2011, Time: 04.18
        Compiler options:  [{outdir,"ebin"},
                            debug_info,warnings_as_errors,
                            {parse_transform,lager_transform},
                            {i,"include"}]
        Object file: /usr/lib/riak/lib/riak_kv-1.0.1/ebin/riak_kv_bitcask_backend.beam
        Exports:
        api_version/0          is_empty/1
        callback/3             key_counts/0
        delete/4               key_counts/1
        drop/1                 module_info/0
        fold_buckets/4         module_info/1
        fold_keys/4            put/5
        fold_objects/4         start/2
        get/3                  status/1...

  38. Yay!!!! And the new code did what we expected.
        {ok, R} = riak_core_ring_manager:get_my_ring().
        [riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) ||
            {Partition,_} <- riak_core_ring:all_owners(R)].

        (riak@gin)19> [riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode)
            || {Partition,_} <- riak_core_ring:all_owners(R)].
        22:48:07.423 [notice] Unused data directories exist for partition "11417981541647679048466287755595961091061972992": "/data/riak/bitcask/11417981541647679048466287755595961091061972992"
        22:48:07.785 [notice] Unused data directories exist for partition "582317058624031631471780675535394015644160622592": "/data/riak/bitcask/582317058624031631471780675535394015644160622592"
        22:48:07.829 [notice] Unused data directories exist for partition "782131735602866014819940711258323334737745149952": "/data/riak/bitcask/782131735602866014819940711258323334737745149952"
        [{ok,<0.30093.11>}, ...

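
The notices above boil down to a set difference: partition directories that exist on disk but belong to partitions the node does not need. The Python sketch below shows that idea only. It is not Riak's implementation, the function name is hypothetical, and the "owned" partition in the example is a made-up value for illustration.

```python
def unused_partition_dirs(dirs_on_disk, owned_partitions):
    """Directory names that look like partition ids but are not in `owned_partitions`.

    Non-numeric names (for example a backup folder) are ignored.
    The result is sorted numerically by partition id.
    """
    owned = {str(p) for p in owned_partitions}
    leftovers = (d for d in dirs_on_disk if d.isdigit() and d not in owned)
    return sorted(leftovers, key=int)


if __name__ == "__main__":
    # The three ids come from the log notices on this slide; "0" stands in
    # for a partition the node still owns (hypothetical).
    on_disk = [
        "782131735602866014819940711258323334737745149952",
        "0",
        "11417981541647679048466287755595961091061972992",
        "582317058624031631471780675535394015644160622592",
        "manual_cleanup",
    ]
    for name in unused_partition_dirs(on_disk, owned_partitions=[0]):
        print(name)
```

In practice `dirs_on_disk` could come from `os.listdir` on the bitcask data directory, and the owned list from the ring, but anything flagged this way deserves a backup before removal, as the next slide describes.
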
  39. Manual Cleanup: So we backed up those vnodes with unused data on Gin to another system and manually removed them.
        gin:/data/riak/bitcask$ ls manual_cleanup/
        11417981541647679048466287755595961091061972992   782131735602866014819940711258323334737745149952
        582317058624031631471780675535394015644160622592
        gin:/data/riak/bitcask$ rm -rf manual_cleanup

  40. Status Improves

  41. Open the Tap: On Gin only: reset to defaults, re-enable handoffs
        on gin:
        application:unset_env(riak_core, forced_ownership_handoff).
        application:set_env(riak_core, vnode_inactivity_timeout, 60000).
        application:set_env(riak_core, handoff_concurrency, 1).

  42. GIN handoff -> IPA

  43. Highball's Turn: Highball was next lowest. Now that Gin was handing data off, time to restart it too.
        on highball
        application:unset_env(riak_core, forced_ownership_handoff).
        application:set_env(riak_core, vnode_inactivity_timeout, 60000).
        application:set_env(riak_core, handoff_concurrency, 1).
        on gin
        application:set_env(riak_core, handoff_concurrency, 4).  % the default setting
        riak_core_vnode_manager:force_handoffs().

  44. Rebalancing....

  45. Rebalancing....

  46. Rebalancing....

  47. Rebalancing....

  48. Rebalanced.

  49. Minimal Impact: 6ms variance for 99th % (32ms to 38ms); 0.68s variance for 100th % (0.12s to 0.8s)

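
The "variance" figures on this slide read as the spread between two latency readings rather than a statistical variance. Under that assumption, the Python sketch below reproduces the arithmetic and adds the relative change. The helper name `spread` is hypothetical.

```python
def spread(before, after):
    """Absolute change and percentage change between two latency readings."""
    delta = after - before
    return round(delta, 3), round(100 * delta / before, 2)


# Readings copied from the slide.
p99_delta_ms, p99_pct = spread(32, 38)        # 99th percentile, milliseconds
p100_delta_s, p100_pct = spread(0.12, 0.8)    # 100th percentile, seconds

print(f"99th percentile:  +{p99_delta_ms} ms ({p99_pct}%)")
print(f"100th percentile: +{p100_delta_s} s ({p100_pct}%)")
```

The absolute differences match the slide (6 ms and 0.68 s). The worst-case latency moved far more in relative terms than the 99th percentile did, even though both stayed under a second.
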
  50. What Have We Learned? Fallible humans write imperfect code. Have shut-off valves ready for when inevitable bad behavior strikes. Construct your system so that you can do triage without major downtime.

  51. Thanks! Oh yeah, go download Riak :) Questions? Comments? Criticisms? @tsantero