Slide 1

Slide 1 text

DevOps A Cognitive Bias Tom Santero @tsantero Nov 27, 2012 - Philly DevOps Meetup Wednesday, November 28, 12

Slide 2

Slide 2 text

@tsantero Tom Santero Technical Evangelist, Basho Technologies

Slide 3

Slide 3 text

Basho makes Riak

Slide 4

Slide 4 text

Riak: a distributed, masterless, highly available key-value store
PROS: high read/write availability; predictable latency; minimal maintenance required
CONS: I/O bound; network is very chatty; permissive API

Slide 5

Slide 5 text

Fun Facts:
Basho was originally a SaaS
We originally built Riak for said SaaS
In 2009 we open-sourced Riak and shifted our focus to Riak itself

Slide 6

Slide 6 text

Riak: designed for SCALE

Slide 7

Slide 7 text

In production at... ...and thousands more.

Slide 8

Slide 8 text

Basho is unique in that we build a piece of infrastructure

Slide 9

Slide 9 text

deploy. monitor. manage.

Slide 10

Slide 10 text

We build Riak for YOU

Slide 11

Slide 11 text

The Issues
Basho is a software vendor; we don’t operate large-scale systems.
Every single deployment is different.
dat Full Stack

Slide 12

Slide 12 text

It passed the unit tests... SHIP IT!
This is Artie, the Lone Testing Ranger

Slide 13

Slide 13 text

what if your test has bugs?

Slide 14

Slide 14 text

module 'mapred_test'
  mapred_test:76: compat_basic1_test_...*failed*
::{badmatch,{error,{could_not_reach_node,nonode@nohost}}}
The build passes but the test fails... WTF?!?!

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Innumeracy: the inability of human beings to reason with numbers
Cognitive Bias: the inability of human beings to reason intuitively about orders of magnitude

Slide 17

Slide 17 text

So, what can we do?
Make Assumptions
Make Educated Guesses
Test The Shit Out Of Everything
and recognize that no matter what we do, something will ALWAYS BREAK

Slide 18

Slide 18 text

this is TRUE of ALL SOFTWARE

Slide 19

Slide 19 text

Customer Driven Development
here, take this code and install it as a critical piece of your stack
oh, it broke because of XYZ?
here’s a patch / bugfix, sorry about that...

Slide 20

Slide 20 text

A Tale from Production

Slide 21

Slide 21 text

Situation
You have a cluster
Things are great
It’s time to add capacity

Slide 22

Slide 22 text

Solution
Add a new node

Slide 23

Slide 23 text

Hostnames
Customer named nodes after drinks:
Aston, IPA, Highball, Gin, Framboise, ESB

Slide 24

Slide 24 text

riak-admin
adding a node is easy!
on aston:
$ riak-admin join [email protected]
(replaced by riak-admin cluster join in Riak 1.2+)

Slide 25

Slide 25 text

uhh...wat?

Slide 26

Slide 26 text

Basho Support
Stop the handoff between nodes
on every node we:
riak attach
application:set_env(riak_core, handoff_concurrency, 0).

Slide 27

Slide 27 text

Monitor

Slide 28

Slide 28 text

for signs of...

Slide 29

Slide 29 text

stabilization

Slide 30

Slide 30 text

What happened?
1. New node added
2. Ring must rebalance
3. Nodes claim partitions
4. Handoff of data begins
5. Disks fill up

Slide 31

Slide 31 text

triage time
$ riak-admin member_status
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid       4.3%     16.8%    riak@aston
valid      18.8%     16.8%    riak@esb
valid      19.1%     16.8%    riak@framboise
valid      19.5%     16.8%    riak@gin
valid      19.1%     16.4%    riak@highball
valid      19.1%     16.4%    riak@ipa
-------------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
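One way to read the member_status output is to turn the percentages into partition counts. The figures are consistent with a ring of 256 partitions (16.8% is about 43/256, 16.4% about 42/256), but the slides never state the ring size, so treat it as an assumption. A minimal Python sketch; RING_SIZE, partitions and pending_moves are illustrative names, not part of Riak:

```python
# Assumption: the ring has 256 partitions. This is inferred from the
# percentages on the slide, not stated anywhere in the talk.
RING_SIZE = 256

# (current ring %, pending ring %) per node, copied from member_status.
MEMBERS = {
    "riak@aston": (4.3, 16.8),
    "riak@esb": (18.8, 16.8),
    "riak@framboise": (19.1, 16.8),
    "riak@gin": (19.5, 16.8),
    "riak@highball": (19.1, 16.4),
    "riak@ipa": (19.1, 16.4),
}


def partitions(percent, ring_size=RING_SIZE):
    """Convert a ring percentage into a whole number of partitions."""
    return round(percent / 100 * ring_size)


def pending_moves(members):
    """Return {node: pending - current}; a positive value means the node gains partitions."""
    return {
        node: partitions(pending) - partitions(current)
        for node, (current, pending) in members.items()
    }


moves = pending_moves(MEMBERS)
for node, delta in sorted(moves.items()):
    print(f"{node:16s} {delta:+d} partitions")
```

Under that assumption the new node, aston, still has to receive most of its share, while every other node has to give a handful of partitions away, which is why handoff traffic hits the whole cluster at once.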

Slide 32

Slide 32 text

triage time
Let’s try to relieve the pressure a bit
Focus on the node with the least disk space left.
gin:~$ riak attach
application:set_env(riak_core, forced_ownership_handoff, 0).
application:set_env(riak_core, vnode_inactivity_timeout, 300000).
application:set_env(riak_core, handoff_concurrency, 1).
riak_core_vnode:trigger_handoff(element(2, riak_core_vnode_master:get_vnode_pid(411047335499316445744786359201454599278231027712, riak_kv_vnode))).

Slide 33

Slide 33 text

triage time
It took 20 minutes to transfer the vnode
(riak@gin)7> 19:34:00.574 [info] Starting handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston
gin:~$ sudo netstat -nap | fgrep 10.36.18.245
tcp        0   1065 10.36.110.79:40532   10.36.18.245:8099    ESTABLISHED 27124/beam.smp
tcp        0      0 10.36.110.79:46345   10.36.18.245:53664   ESTABLISHED 27124/beam.smp
(riak@gin)7> 19:54:56.721 [info] Handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston completed: sent 3805730 objects in 1256.14 seconds
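The two log lines are enough to sanity-check the "20 minutes" figure and to estimate the handoff rate. A small Python sketch, using only the timestamps and counts shown on the slide; the function and constant names are illustrative:

```python
from datetime import datetime

# Values copied from the handoff log lines on this slide.
STARTED = "19:34:00.574"
FINISHED = "19:54:56.721"
OBJECTS_SENT = 3805730
REPORTED_SECONDS = 1256.14


def elapsed_seconds(start, end, fmt="%H:%M:%S.%f"):
    """Seconds between two log timestamps taken on the same day."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


elapsed = elapsed_seconds(STARTED, FINISHED)
rate = OBJECTS_SENT / REPORTED_SECONDS

print(f"elapsed between log lines: {elapsed:.3f} s ({elapsed / 60:.1f} min)")
print(f"handoff throughput: {rate:.0f} objects/s")
```

The gap between the two timestamps agrees with the 1256.14 seconds Riak reports, and the rate works out to roughly 3,000 objects per second for this single vnode with handoff_concurrency set to 1.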

Slide 34

Slide 34 text

triage time
And the vnode had arrived at Aston from Gin
aston:/data/riak/bitcask/205523667749658222872393179600727299639115513856-132148847970820$ ls -la
total 7305344
drwxr-xr-x   2 riak riak       4096 2011-11-11 18:05 .
drwxr-xr-x 258 riak riak      36864 2011-11-11 18:56 ..
-rw-------   1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
-rw-r--r--   1 riak riak   86614226 2011-11-11 17:53 1321055508.bitcask.hint
-rw-------   1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
-rw-r--r--   1 riak riak   55333675 2011-11-11 19:50 1321055611.bitcask.hint
-rw-------   1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
-rw-r--r--   1 riak riak   99390277 2011-11-11 18:03 1321056070.bitcask.hint
-rw-------   1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
-rw-r--r--   1 riak riak   56509595 2011-11-11 18:05 1321056214.bitcask.hint
-rw-------   1 riak riak        119 2011-11-11 17:53 bitcask.write.lock
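To get a feel for how much disk a single vnode directory occupies, the sizes in the listing can simply be added up. A short Python sketch using only the byte counts shown above; the dictionary and function names are illustrative:

```python
# File sizes in bytes, copied from the ls -la listing on this slide.
DATA_FILES = {
    "1321055508.bitcask.data": 2147479761,
    "1321055611.bitcask.data": 1120382399,
    "1321056070.bitcask.data": 2035568266,
    "1321056214.bitcask.data": 1879298219,
}
HINT_FILES = {
    "1321055508.bitcask.hint": 86614226,
    "1321055611.bitcask.hint": 55333675,
    "1321056070.bitcask.hint": 99390277,
    "1321056214.bitcask.hint": 56509595,
}
GIB = 1024 ** 3


def total_bytes(files):
    """Sum the sizes of a {filename: bytes} mapping."""
    return sum(files.values())


data_bytes = total_bytes(DATA_FILES)
hint_bytes = total_bytes(HINT_FILES)
vnode_gib = (data_bytes + hint_bytes) / GIB

print(f"data files: {data_bytes} bytes")
print(f"hint files: {hint_bytes} bytes")
print(f"vnode directory: about {vnode_gib:.2f} GiB")
```

That is close to 7 GiB for one partition, which lines up with the "total 7305344" line (ls reports it in 1 KiB blocks) and shows why leftover copies of handed-off vnodes add up quickly.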

Slide 35

Slide 35 text

Eureka!!!
You gonna eat that inode?
Data was not being cleaned up after handoff.
This would eventually eat up all disk space.

Slide 36

Slide 36 text

Solution
We already had a bugfix for the next release that detected this problem
hot patch it!

Slide 37

Slide 37 text

Hot Patch
We patched their live production system while it was still under load.
(on all nodes) riak attach
l(riak_kv_bitcask_backend).
m(riak_kv_bitcask_backend).
Module riak_kv_bitcask_backend compiled: Date: November 12 2011, Time: 04.18
Compiler options:  [{outdir,"ebin"},
                    debug_info,warnings_as_errors,
                    {parse_transform,lager_transform},
                    {i,"include"}]
Object file: /usr/lib/riak/lib/riak_kv-1.0.1/ebin/riak_kv_bitcask_backend.beam
Exports:
api_version/0          is_empty/1
callback/3             key_counts/0
delete/4               key_counts/1
drop/1                 module_info/0
fold_buckets/4         module_info/1
fold_keys/4            put/5
fold_objects/4         start/2
get/3                  status/1...

Slide 38

Slide 38 text

Yay!!!!
And the new code did what we expected.
{ok, R} = riak_core_ring_manager:get_my_ring().
[riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].
(riak@gin)19> [riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].
22:48:07.423 [notice] Unused data directories exist for partition "11417981541647679048466287755595961091061972992": "/data/riak/bitcask/11417981541647679048466287755595961091061972992"
22:48:07.785 [notice] Unused data directories exist for partition "582317058624031631471780675535394015644160622592": "/data/riak/bitcask/582317058624031631471780675535394015644160622592"
22:48:07.829 [notice] Unused data directories exist for partition "782131735602866014819940711258323334737745149952": "/data/riak/bitcask/782131735602866014819940711258323334737745149952"
[{ok,<0.30093.11>},
...
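The long partition numbers in these log lines are positions in Riak's 160-bit hash ring, so they are easier to reason about as small ring indices. A minimal Python sketch, assuming a ring size of 256 (inferred from the member_status percentages, not stated on the slides); ring_index is a hypothetical helper, not a Riak function:

```python
# Riak's ring spans the 160-bit SHA-1 hash space.
KEYSPACE = 2 ** 160
# Assumption: 256 partitions. The slides do not state the ring size.
RING_SIZE = 256
PARTITION_WIDTH = KEYSPACE // RING_SIZE

# Partition IDs copied from the "Unused data directories" log lines.
UNUSED = [
    11417981541647679048466287755595961091061972992,
    582317058624031631471780675535394015644160622592,
    782131735602866014819940711258323334737745149952,
]


def ring_index(partition_id, width=PARTITION_WIDTH):
    """Return the 0-based position of a partition on the ring."""
    index, remainder = divmod(partition_id, width)
    if remainder:
        raise ValueError("partition id is not aligned to the assumed ring size")
    return index


for pid in UNUSED:
    print(f"partition {pid} -> ring index {ring_index(pid)}")
```

If an ID raised ValueError here, the assumed ring size would be wrong; all three IDs on this slide divide evenly, which supports the 256-partition reading.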

Slide 39

Slide 39 text

Manual Cleanup
So we backed up those vnodes with unused data on Gin to another system and manually removed them.
gin:/data/riak/bitcask$ ls manual_cleanup/
11417981541647679048466287755595961091061972992
582317058624031631471780675535394015644160622592
782131735602866014819940711258323334737745149952
gin:/data/riak/bitcask$ rm -rf manual_cleanup

Slide 40

Slide 40 text

Status Improves

Slide 41

Slide 41 text

Open the Tap
On Gin only: reset to defaults, re-enable handoffs
on gin:
application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).

Slide 42

Slide 42 text

GIN handoff -> IPA

Slide 43

Slide 43 text

Highball’s Turn
Highball was next lowest. Now that Gin was handing data off, it was time to restart Highball too.
on highball
application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).
on gin
application:set_env(riak_core, handoff_concurrency, 4). % the default setting
riak_core_vnode_manager:force_handoffs().

Slide 44

Slide 44 text

Rebalancing....

Slide 45

Slide 45 text

Rebalancing....

Slide 46

Slide 46 text

Rebalancing....

Slide 47

Slide 47 text

Rebalancing....

Slide 48

Slide 48 text

Rebalanced.

Slide 49

Slide 49 text

Minimal Impact
6ms variance for the 99th percentile (32ms to 38ms)
0.68s variance for the 100th percentile (0.12s to 0.8s)

Slide 50

Slide 50 text

What Have We Learned?
Fallible humans write imperfect code.
Have shut-off valves ready for when inevitable bad behavior strikes.
Construct your system so that you can do triage without major downtime.

Slide 51

Slide 51 text

Thanks!
Oh yeah, go download Riak :)
Questions? Comments? Criticisms?
@tsantero