
Planning to Fail

Dave Gardner
February 23, 2013


How to build resilient and reliable services by embracing failure.




  1. Planning

  2. dave!

  3. the taxi app!

  4. Planning

  5. Planning

  6. Planning

  7. The beginning

  8. <?php

  9. My website: single VPS running PHP + MySQL

  10. No growth, low volume, simple functionality, one engineer (me!)

  11. Large growth, high volume, complex functionality, lots of engineers

  12. • Launched in London, November 2011
      • Now in 5 cities in 3 countries (30%+ growth every month)
      • A Hailo hail is accepted around the world every 5 seconds
  13. “.. Brooks [1] reveals that the complexity of a software project grows
      as the square of the number of engineers and Leveson [17] cites evidence
      that most failures in complex systems result from unexpected
      inter-component interaction rather than intra-component bugs, we
      conclude that less machinery is (quadratically) better.”
      http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
  14. • SOA (10+ services)
      • AWS (3 regions, 9 AZs, lots of instances)
      • 10+ engineers building services
      and you?! (Hailo is hiring)
  15. Our overall reliability is in danger

  16. Embracing failure (a coping strategy)

  17. None
  18. VPS (running PHP+MySQL) reliable?!

  19. Reliable !== Resilient

  20. None
  21. Choosing a stack

  22. “Hailo” (running PHP+MySQL) reliable?!

  23. Service Oriented Architecture
      (diagram: four Service boxes; each service does one job well!)
  24. • Fewer lines of code
      • Fewer responsibilities
      • Changes less frequently
      • Can swap entire implementation if needed
  25. Service (running PHP+MySQL) reliable?!

  26. Service → MySQL (MySQL running on a different box)

  27. Service → MySQL, MySQL (MySQL running in Multi-Master)

  28. Going global

  29. MySQL: separating concerns
      CRUD, locking, search, analytics, ID generation
      also queuing…!
  30. At Hailo we look for technologies that are:
      • Distributed: run on more than one machine
      • Homogeneous: all nodes look the same
      • Resilient: can cope with the loss of node(s) with no loss of data
  31. “There is no such thing as standby infrastructure: there is stuff you
      always use and stuff that won’t work when you need it.”
      http://blog.b3k.us/2012/01/24/some-rules.html
  32. • Highly performant, scalable and resilient data store
      • Underpins much of what we do at Hailo
      • Makes multi-DC easy!
      Cassandra
  33. • Highly reliable distributed coordination
      • We implement locking and leadership election on top of ZK and use sparingly
      ZooKeeper
  34. • Distributed, RESTful search engine built on top of Apache Lucene
      • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
      Elasticsearch
  35. • Realtime message processing system designed to handle billions of
        messages per day
      • Fault tolerant, highly available with reliable message delivery guarantee
      NSQ
  36. • Distributed ID generation with no coordination required
      • Rock solid
      Cruftflake
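The deck doesn't show Cruftflake's actual ID layout, so as a rough illustration of coordination-free ID generation, here is a Twitter-Snowflake-style sketch in Python. The class name, bit widths and epoch are assumptions for illustration, not Cruftflake's real scheme: timestamp bits give rough ordering, a per-machine id avoids cross-node coordination, and a sequence counter disambiguates IDs minted in the same millisecond.

```python
import threading
import time

class SnowflakeSketch:
    # Hypothetical layout (not necessarily Cruftflake's):
    # 41 bits of millisecond timestamp | 10 bits machine id | 12 bits sequence
    EPOCH_MS = 1325376000000  # assumed custom epoch: 2012-01-01 UTC

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024, "machine id must fit in 10 bits"
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()  # one generator per process

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                # same millisecond: bump the 12-bit sequence counter
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 ids already minted this ms; spin to the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH_MS) << 22) \
                | (self.machine_id << 12) \
                | self.sequence
```

The only coordination needed is handing each machine a unique id up front; after that every node mints IDs independently, which is what makes the approach resilient to node loss.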
  37. • All these technologies have similar properties of distribution
        and resilience
      • They are designed to cope with failure
      • They are not broken by design
  38. Lessons learned

  39. Minimise the critical path

  40. What is the minimum viable service?

  41. class HailoMemcacheService
      {
          private $mc = null;
          private $servers = array(/* memcache host/port pairs */);

          public function __call($method, $args)
          {
              $mc = $this->getInstance();
              // do stuff
          }

          private function getInstance()
          {
              if ($this->mc === null) {
                  $this->mc = new \Memcached;
                  $this->mc->addServers($this->servers);
              }
              return $this->mc;
          }
      }
      Lazy-init instances; connect on use
  42. Configure clients carefully

  43. $this->mc = new \Memcached;
      $this->mc->addServers($s);
      $this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
      $this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
      Make sure timeouts are configured
  44. Choose timeouts based on data (here?!)

  45. “Fail Fast: Set aggressive timeouts such that failing components don’t
      make the entire system crawl to a halt.”
      http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  46. 95th percentile (here?!)
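One way to pick those numbers: measure real request latencies (e.g. pulled out of Graphite) and set the client timeout a little above the 95th percentile, so the slowest healthy requests still complete while a hung server fails fast. A minimal sketch in Python; the sample latencies and the 1.5x headroom factor are made-up illustrations (not Hailo data), and the percentile uses the nearest-rank method:

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample >= pct% of all samples."""
    ranked = sorted(samples)
    k = int(len(ranked) * pct / 100.0 + 0.5) - 1  # nearest-rank index
    return ranked[max(0, k)]

# measured request latencies in milliseconds (illustrative numbers)
latencies = [4, 4, 5, 5, 5, 6, 6, 6, 7, 7,
             7, 8, 8, 9, 9, 10, 11, 12, 14, 80]

p95 = percentile(latencies, 95)  # 14ms: ignores the one 80ms outlier
timeout_ms = int(p95 * 1.5)      # headroom above p95 -> 21ms
```

Setting the timeout from p95 rather than the maximum is the point: the one pathological 80ms request should time out and fail fast, not drag every client's budget up with it.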

  47. Test  

  48. • Kill memcache on box A, measure impact on application
      • Kill memcache on box B, measure impact on application
      All fine.. we’ve got this covered!
  49. FAIL  

  50. • Box A, running in AWS, locks up
      • Any parts of application that touch Memcache stop working
  51. Things fail in exotic ways

  52. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j REJECT
      $ php test-memcache.php
      Working OK!
      Packets rejected and source notified by ICMP. Expect fast fails.
  53. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j DROP
      $ php test-memcache.php
      Working OK!
      Packets silently dropped. Expect long timeouts.
  54. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 \
          -m state --state ESTABLISHED \
          -j DROP
      $ php test-memcache.php
      Hangs! Uh oh.
  55. • When AWS instances hang they appear to accept connections
        but drop packets
      • Bug!
      https://bugs.launchpad.net/libmemcached/+bug/583031
  56. Fix, rinse, repeat

  57. It would be nice if we could automate this
  58. Automate!  

  59. • Hailo run a dedicated automated test environment
      • Powered by bash, JMeter and Graphite
      • Continuous automated testing with failure simulations
  60. Fix attempt 1: bad timeouts configured

  61. Fix attempt 2: better timeouts

  62. Simulate in system tests

  63. Simulate failure
      Assert monitoring endpoint picks this up
      Assert features still work
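The three steps on slide 63 can be modelled even without the full bash/JMeter harness. Here is a hypothetical in-process sketch in Python (MemcacheStub, Service and the "degraded" health string are invented for illustration, not Hailo's real API): inject a cache failure, assert the health check reports it, and assert the feature still answers via its fallback path.

```python
class MemcacheStub:
    """Stand-in for a memcache client; 'failed' simulates a dead node."""
    def __init__(self):
        self.failed = False
        self.data = {}

    def get(self, key):
        if self.failed:
            raise IOError("connection timed out")
        return self.data.get(key)

class Service:
    """A feature with a cache fast path and a database fallback."""
    def __init__(self, cache):
        self.cache = cache
        self.db = {"driver:1": "available"}  # pretend persistent store

    def health(self):
        try:
            self.cache.get("__ping__")
            return "ok"
        except IOError:
            return "degraded: memcache unreachable"

    def driver_status(self, driver_id):
        key = "driver:%d" % driver_id
        try:
            cached = self.cache.get(key)
            if cached is not None:
                return cached
        except IOError:
            pass  # cache is down: fall through to the database
        return self.db[key]

cache = MemcacheStub()
service = Service(cache)
cache.failed = True                                # simulate failure
assert service.health().startswith("degraded")     # monitoring picks it up
assert service.driver_status(1) == "available"     # feature still works
```

The same shape works at system-test scale: swap the stub's flag for an iptables rule like the ones above, and the asserts for HTTP checks against the real monitoring endpoint.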
  64. In conclusion

  65. “the best way to avoid failure is to fail constantly.”
      http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

  67. Thanks
      Software used at Hailo:
      http://cassandra.apache.org/
      http://zookeeper.apache.org/
      http://www.elasticsearch.org/
      http://www.acunu.com/acunu-analytics.html
      https://github.com/bitly/nsq
      https://github.com/davegardnerisme/cruftflake
      https://github.com/davegardnerisme/nsqphp
      Plus a load of other things I’ve not mentioned.
  68. Further reading
      Hystrix: Latency and Fault Tolerance for Distributed Systems
      https://github.com/Netflix/Hystrix
      Timelike: a network simulator
      http://aphyr.com/posts/277-timelike-a-network-simulator
      Notes on distributed systems for young bloods
      http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
      Stream de-duplication (relevant to NSQ)
      http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
      ID generation in distributed systems
      http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems