Planning to Fail (PHPNE)

Slightly longer version of my talk at PHP North East, on avoiding failure by failing all the time.

Dave Gardner

March 19, 2013

Transcript

  1. 99.9% (three nines)
     Downtime: 43.8 minutes per month, 8.76 hours per year
  2. 99.99% (four nines)
     Downtime: 4.32 minutes per month, 52.56 minutes per year
  3. 99.999% (five nines)
     Downtime: 25.9 seconds per month, 5.26 minutes per year
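
The downtime figures on the first three slides follow from (1 - availability) × period. Below is a minimal PHP sketch of that arithmetic, assuming a 365-day year and a 30-day month; the function and constant names are illustrative and not from the talk. The slides appear to mix two month lengths: 43.8 minutes for three nines matches one twelfth of a year, while 4.32 minutes and 25.9 seconds match a 30-day month, so this sketch gives about 43.2 minutes per month for three nines.

```php
<?php
// Illustrative helper (not from the talk): seconds of downtime permitted
// by an availability target over a period of $periodSeconds.
function allowedDowntimeSeconds(float $availabilityPercent, int $periodSeconds): float
{
    return (1 - $availabilityPercent / 100) * $periodSeconds;
}

// Assumed period lengths: a 365-day year and a 30-day month.
const SECONDS_PER_YEAR = 365 * 24 * 3600;
const SECONDS_PER_MONTH = 30 * 24 * 3600;

foreach ([99.9, 99.99, 99.999] as $target) {
    printf(
        "%s%%: %.2f minutes per month, %.2f minutes per year\n",
        $target,
        allowedDowntimeSeconds($target, SECONDS_PER_MONTH) / 60,
        allowedDowntimeSeconds($target, SECONDS_PER_YEAR) / 60
    );
}
```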
  4. • Launched in London, November 2011
     • Now in 5 cities in 3 countries (30%+ growth every month)
     • A Hailo hail is accepted around the world every 5 seconds
  5. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
     http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
  6. • SOA (10+ services)
     • AWS (3 regions, 9 AZs, lots of instances)
     • 10+ engineers building services
     and you?! (Hailo is hiring)!
  7. Service Oriented Architecture
     each service does one job well!
     [Diagram: four boxes, each labelled "Service"]
  8. • Fewer lines of code
     • Fewer responsibilities
     • Changes less frequently
     • Can swap entire implementation if needed
  9. Separating concerns
     MySQL
     • CRUD
     • Locking
     • Search
     • Analytics
     • ID generation
     also queuing…!
  10. At Hailo we look for technologies that are:
      • Distributed: run on more than one machine
      • Homogenous: all nodes look the same
      • Resilient: can cope with the loss of node(s) with no loss of data
  11. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
      http://blog.b3k.us/2012/01/24/some-rules.html
  12. • Highly performant, scalable and resilient data store
      • Underpins much of what we do at Hailo
      • Makes multi-DC easy!
  13. ZooKeeper
      • Highly reliable distributed coordination
      • We implement locking and leadership election on top of ZK and use sparingly
  14. • Distributed, RESTful, Search Engine built on top of Apache Lucene
      • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
  15. NSQ
      • Realtime message processing system designed to handle billions of messages per day
      • Fault tolerant, highly available with reliable message delivery guarantee
  16. • Real time incremental analytics platform, backed by Apache Cassandra
      • Powerful SQL-like interface
      • Scalable and highly available
  17. • All these technologies have similar properties of distribution and resilience
      • They are designed to cope with failure
      • They are not broken by design
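  18. Lazy-init instances; connect on use

      class HailoMemcacheService
      {
          private $mc = null;
          private $servers; // list of [host, port] pairs

          public function __construct(array $servers)
          {
              $this->servers = $servers;
          }

          // Magic method: every proxied call goes through the lazy getter
          public function __call($name, $arguments)
          {
              $mc = $this->getInstance();
              // do stuff
          }

          private function getInstance()
          {
              // Create the client only on first use
              if ($this->mc === null) {
                  $this->mc = new \Memcached;
                  $this->mc->addServers($this->servers);
              }
              return $this->mc;
          }
      }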
  19. Make sure timeouts are configured

      $this->mc = new \Memcached;
      $this->mc->addServers($s);
      $this->mc->setOption(
          \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
      $this->mc->setOption(
          \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(
          \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(
          \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
  20. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
      http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  21. • Kill memcache on box A, measure impact on application
      • Kill memcache on box B, measure impact on application
      All fine.. we’ve got this covered!
  22. • Box A, running in AWS, locks up
      • Any parts of application that touch Memcache stop working
  23. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j REJECT
      $ php test-memcache.php

      Working OK!
      Packets rejected and source notified by ICMP. Expect fast fails.
  24. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j DROP
      $ php test-memcache.php

      Working OK!
      Packets silently dropped. Expect long timeouts.
  25. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 \
          -m state --state ESTABLISHED \
          -j DROP
      $ php test-memcache.php

      Hangs! Uh oh.
  26. • When AWS instances hang they appear to accept connections but drop packets
      • Bug!
      https://bugs.launchpad.net/libmemcached/+bug/583031
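
The talk does not show test-memcache.php, so the following is only a sketch of what such a probe might look like. `probe()` and `probeMemcached()` are hypothetical names; the host, port, key, 100 ms connect timeout and 0.5 s budget are placeholders. It times one operation and reports whether it finished within a latency budget, which is enough to tell the REJECT and DROP cases apart.

```php
<?php
/**
 * Hypothetical helper in the spirit of test-memcache.php (the original
 * script is not shown in the talk): time an operation and report whether
 * it completed without error inside a latency budget.
 */
function probe(callable $operation, float $budgetSeconds): array
{
    $start = microtime(true);
    $error = null;
    $result = null;
    try {
        $result = $operation();
    } catch (Throwable $e) {
        $error = $e->getMessage();
    }
    $elapsed = microtime(true) - $start;

    return [
        'result'       => $result,
        'error'        => $error,
        'elapsed'      => $elapsed,
        'withinBudget' => $error === null && $elapsed <= $budgetSeconds,
    ];
}

/**
 * Assumed usage against memcached; requires the memcached extension.
 * Host, port, key and timeout values are placeholders.
 */
function probeMemcached(string $host = '127.0.0.1', int $port = 11211): array
{
    $mc = new Memcached();
    $mc->addServer($host, $port);
    $mc->setOption(Memcached::OPT_CONNECT_TIMEOUT, 100); // milliseconds

    return probe(function () use ($mc) {
        $mc->get('probe-key');
        // A "not found" reply still proves the server answered.
        $code = $mc->getResultCode();
        if ($code !== Memcached::RES_SUCCESS && $code !== Memcached::RES_NOTFOUND) {
            throw new RuntimeException($mc->getResultMessage());
        }
        return $code;
    }, 0.5);
}
```

A probe like this can only report after the call returns. In the hanging case on slide 25 it would never return, so the script itself needs an outer limit, for example running it under the coreutils `timeout` command (`timeout 10 php test-memcache.php`).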
  27. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 5672 \
          -m state --state ESTABLISHED \
          -j DROP
      $ php test-rabbitmq.php

      Fantastic! Block AMQP port, client times out
  28. “RabbitMQ clusters do not tolerate network partitions well.”
      http://www.rabbitmq.com/partitions.html
  29. $ epmd -names
      epmd: up and running on port 4369 with data:
      name rabbit at port 60278

      Each node listens on a port assigned by EPMD
  30. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 60278 \
          -m state --state ESTABLISHED \
          -j DROP
      $ php test-rabbitmq.php

      Hangs! Uh oh.
  31. Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got
          {inconsistent_database, running_partitioned_network,
           'rabbit@dmzutilities01-global01-test'}

      application: rabbitmq_management
      exited: shutdown
      type: temporary

      RabbitMQ logs show partitioned network error; nodes shut down
  32. PHP library didn’t have any time limit on reading a frame

      while ($read < $n
          && !feof($this->sock->real_sock())
          && (false !== ($buf = fread(
              $this->sock->real_sock(), $n - $read)))) {
          $read += strlen($buf);
          $res .= $buf;
      }
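
One way to bound a read loop like the one above is to wait for readability with `stream_select()` against an overall deadline. This is a sketch under assumptions, not the fix the library actually adopted: `readWithDeadline()` is a hypothetical function, it takes a plain stream resource rather than the library's socket wrapper, and the demo uses a local socket pair in place of a broker connection.

```php
<?php
/**
 * Hypothetical helper: read exactly $n bytes from $sock, giving up once
 * $timeoutSeconds have elapsed in total.
 */
function readWithDeadline($sock, int $n, float $timeoutSeconds): string
{
    $deadline = microtime(true) + $timeoutSeconds;
    $res = '';

    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException(
                sprintf('Timed out after reading %d of %d bytes', strlen($res), $n)
            );
        }

        // Wait until the socket is readable or the remaining time runs out.
        $read = [$sock];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($read, $write, $except, $sec, $usec);
        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing readable yet; the deadline check runs again
        }

        $buf = fread($sock, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($sock))) {
            throw new RuntimeException('Connection closed before frame was complete');
        }
        $res .= $buf;
    }

    return $res;
}

// Demo on a local socket pair standing in for a broker connection.
[$client, $server] = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
fwrite($server, 'frame');
echo readWithDeadline($client, 5, 1.0), "\n";

try {
    // The peer sends nothing more, so this gives up after about 0.2 seconds.
    readWithDeadline($client, 5, 0.2);
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

The deadline covers the whole frame rather than each individual read, so a peer that trickles a byte at a time cannot keep the client waiting indefinitely.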
  33. • Hailo run a dedicated automated test environment
      • Powered by bash, JMeter and Graphite
      • Continuous automated testing with failure simulations
  34. “the best way to avoid failure is to fail constantly.”
      http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
  35. You should test for failure
      How does the software react?
      How does the PHP client react?
  36. Thanks
      Software used at Hailo
      http://cassandra.apache.org/
      http://zookeeper.apache.org/
      http://www.elasticsearch.org/
      http://www.acunu.com/acunu-analytics.html
      https://github.com/bitly/nsq
      https://github.com/davegardnerisme/cruftflake
      https://github.com/davegardnerisme/nsqphp
      Plus a load of other things I’ve not mentioned.
  37. Further reading
      Hystrix: Latency and Fault Tolerance for Distributed Systems
      https://github.com/Netflix/Hystrix
      Timelike: a network simulator
      http://aphyr.com/posts/277-timelike-a-network-simulator
      Notes on distributed systems for young bloods
      http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
      Stream de-duplication (relevant to NSQ)
      http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
      ID generation in distributed systems
      http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems