Planning to Fail

Dave Gardner
February 23, 2013


How to build resilient and reliable services by embracing failure.


Transcript

  1. Planning

  2. dave!

  3. the taxi app!

  4. Planning

  5. Planning

  6. Planning

  7. The beginning

  8. <?php

  9. My website: single VPS running PHP + MySQL

  10. No growth, low volume, simple functionality, one engineer (me!)

  11. Large growth, high volume, complex functionality, lots of engineers

  12. • Launched in London November 2011
      • Now in 5 cities in 3 countries (30%+ growth every month)
      • A Hailo hail is accepted around the world every 5 seconds
  13. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
      http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
  14. • SOA (10+ services)
      • AWS (3 regions, 9 AZs, lots of instances)
      • 10+ engineers building services
      and you?! (hailo is hiring)!
  15. Our overall reliability is in danger

  16. Embracing failure (a coping strategy)

  17. None
  18. VPS (running PHP+MySQL) reliable?!

  19. Reliable !== Resilient

  20. None
  21. Choosing a stack

  22. “Hailo” (running PHP+MySQL) reliable?!

  23. Service, Service, Service, Service: each service does one job well! (Service Oriented Architecture)
  24. • Fewer lines of code
      • Fewer responsibilities
      • Changes less frequently
      • Can swap entire implementation if needed
  25. Service (running PHP+MySQL) reliable?!

  26. Service + MySQL: MySQL running on a different box

  27. Service + MySQL + MySQL: MySQL running in Multi-Master mode
  28. Going  global  

  29. Separating concerns: CRUD, locking, search, analytics, ID generation (also queuing…!) split out of MySQL
  30. At Hailo we look for technologies that are:
      • Distributed: run on more than one machine
      • Homogeneous: all nodes look the same
      • Resilient: can cope with the loss of node(s) with no loss of data
  31. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
      http://blog.b3k.us/2012/01/24/some-rules.html
  32. • Highly performant, scalable and resilient data store
      • Underpins much of what we do at Hailo
      • Makes multi-DC easy!
  33. ZooKeeper
      • Highly reliable distributed coordination
      • We implement locking and leadership election on top of ZK and use sparingly
  34. • Distributed, RESTful search engine built on top of Apache Lucene
      • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
  35. NSQ
      • Realtime message processing system designed to handle billions of messages per day
      • Fault tolerant, highly available with reliable message delivery guarantee
  36. Cruftflake
      • Distributed ID generation with no coordination required
      • Rock solid
  37. • All these technologies have similar properties of distribution and resilience
      • They are designed to cope with failure
      • They are not broken by design
  38. Lessons learned

  39. Minimise the critical path

  40. What is the minimum viable service?

  41. class HailoMemcacheService
      {
          private $mc = null;

          public function __call($method, $args)
          {
              $mc = $this->getInstance();
              // do stuff
          }

          private function getInstance()
          {
              if ($this->mc === null) {
                  $this->mc = new \Memcached;
                  $this->mc->addServers($s); // $s: configured server list
              }
              return $this->mc;
          }
      }
      Lazy-init instances; connect on use
  42. Configure clients carefully

  43. $this->mc = new \Memcached;
      $this->mc->addServers($s);
      $this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
      $this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
      Make sure timeouts are configured
  44. Choose timeouts based on data here?!

  45. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
      http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  46. 95th percentile here?!
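
      Slides 44–46 argue for choosing timeouts from measured latency (e.g. the 95th percentile) rather than guesswork. A minimal sketch of that calculation; the sample data and the 2x headroom factor are hypothetical:

      ```php
      <?php
      // Sketch: derive a client timeout from observed latencies, per the
      // slides' advice to "choose timeouts based on data". The sample data
      // and the 2x headroom factor are illustrative, not Hailo's numbers.

      function percentile(array $samples, $p)
      {
          sort($samples);
          $index = (int) ceil($p * count($samples)) - 1;
          return $samples[max(0, $index)];
      }

      // Hypothetical Memcached round-trip latencies, in milliseconds.
      $latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 8, 40];

      $p95 = percentile($latenciesMs, 0.95);

      // Leave headroom above the p95 so healthy requests never hit the
      // timeout, while a hung server still fails fast.
      $timeoutMs = (int) ceil($p95 * 2);

      echo "p95={$p95}ms, timeout={$timeoutMs}ms\n"; // prints "p95=40ms, timeout=80ms"
      ```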

  47. Test  

  48. • Kill memcache on box A, measure impact on application
      • Kill memcache on box B, measure impact on application
      All fine.. we’ve got this covered!
  49. FAIL  

  50. • Box A, running in AWS, locks up
      • Any parts of application that touch Memcache stop working
  51. Things fail in exotic ways

  52. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j REJECT
      $ php test-memcache.php
      Working OK!
      Packets rejected and source notified by ICMP. Expect fast fails.
  53. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j DROP
      $ php test-memcache.php
      Working OK!
      Packets silently dropped. Expect long timeouts.
  54. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 \
          -m state --state ESTABLISHED \
          -j DROP
      $ php test-memcache.php
      Hangs! Uh oh.
  55. • When AWS instances hang they appear to accept connections but drop packets
      • Bug!
      https://bugs.launchpad.net/libmemcached/+bug/583031
  56. Fix, rinse, repeat

  57. It would be nice if we could automate this
  58. Automate!  

  59. • Hailo run a dedicated automated test environment
      • Powered by bash, JMeter and Graphite
      • Continuous automated testing with failure simulations
  60. Fix attempt 1: bad timeouts configured

  61. Fix attempt 2: better timeouts

  62. Simulate in system tests

  63. Simulate failure
      Assert monitoring endpoint picks this up
      Assert features still work
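
      The steps on slide 63 can be sketched as a runnable system test: simulate a dependency failure, assert that monitoring would notice, and assert the feature still works. FlakyCache and Feature are illustrative stand-ins, not Hailo code:

      ```php
      <?php
      // Sketch: a system test that simulates a dependency failure, then
      // asserts the degradation is observable and the feature still works.
      // All class and method names here are hypothetical.

      class FlakyCache
      {
          public $down = false;

          public function get($key)
          {
              if ($this->down) {
                  throw new \RuntimeException('cache unreachable');
              }
              return null; // cache miss in the healthy case
          }
      }

      class Feature
      {
          private $cache;
          public $degraded = false; // exposed via a monitoring endpoint in real life

          public function __construct(FlakyCache $cache)
          {
              $this->cache = $cache;
          }

          public function greeting($user)
          {
              try {
                  $cached = $this->cache->get("greeting:$user");
              } catch (\RuntimeException $e) {
                  $this->degraded = true; // monitoring picks this up
                  $cached = null;
              }
              // Fall back to computing the value; the feature keeps working.
              return $cached !== null ? $cached : "Hello, $user";
          }
      }

      $cache = new FlakyCache();
      $feature = new Feature($cache);

      $cache->down = true; // simulate failure
      assert($feature->greeting('dave') === 'Hello, dave'); // feature still works
      assert($feature->degraded === true); // monitoring endpoint picks this up
      ```
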
  64. In conclusion

  65. “the best way to avoid failure is to fail constantly.”
      http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
  66. TIMED BLOCK ALL THE THINGS
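
      A minimal sketch of what a "timed block" could look like: every external call is wrapped so its latency is always measured and its failure is contained. The names and the fallback approach are illustrative, not Hailo's actual implementation:

      ```php
      <?php
      // Sketch of a "timed block": wrap an external call with a stopwatch
      // and an exception guard. All names here are hypothetical.

      function timedBlock($name, callable $fn, $fallback = null)
      {
          $start = microtime(true);
          try {
              $result = $fn();
          } catch (\Exception $e) {
              // Contain the failure: degrade to a fallback instead of propagating.
              $result = $fallback;
          }
          $elapsedMs = (microtime(true) - $start) * 1000;
          // In production this measurement would feed a metrics system (e.g. Graphite).
          error_log(sprintf('%s took %.2f ms', $name, $elapsedMs));
          return $result;
      }

      // A healthy call returns its value; a failing call returns the fallback.
      $ok = timedBlock('memcache.get', function () { return 'cached-value'; });
      $degraded = timedBlock('memcache.get', function () {
          throw new \RuntimeException('simulated failure');
      }, null);
      ```

      Wrapping calls this way means timing data exists for every dependency before it fails, which is exactly what the earlier timeout-tuning slides need as input.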

  67. Thanks
      Software used at Hailo:
      http://cassandra.apache.org/
      http://zookeeper.apache.org/
      http://www.elasticsearch.org/
      http://www.acunu.com/acunu-analytics.html
      https://github.com/bitly/nsq
      https://github.com/davegardnerisme/cruftflake
      https://github.com/davegardnerisme/nsqphp
      Plus a load of other things I’ve not mentioned.
  68. Further reading
      Hystrix: Latency and Fault Tolerance for Distributed Systems
      https://github.com/Netflix/Hystrix
      Timelike: a network simulator
      http://aphyr.com/posts/277-timelike-a-network-simulator
      Notes on distributed systems for young bloods
      http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
      Stream de-duplication (relevant to NSQ)
      http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
      ID generation in distributed systems
      http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems