Slide 1

Slide 1 text

Planning

Slide 2

Slide 2 text

dave!

Slide 3

Slide 3 text

the taxi app!

Slide 4

Slide 4 text

Planning

Slide 5

Slide 5 text

Planning

Slide 6

Slide 6 text

Planning

Slide 7

Slide 7 text

Why?
http://en.wikipedia.org/wiki/High_availability

Slide 8

Slide 8 text

99.9% (three nines)
Downtime: 43.8 minutes per month, 8.76 hours per year

Slide 9

Slide 9 text

99.99% (four nines)
Downtime: 4.32 minutes per month, 52.56 minutes per year

Slide 10

Slide 10 text

99.999% (five nines)
Downtime: 25.9 seconds per month, 5.26 minutes per year
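The downtime figures on these three slides follow from simple arithmetic: allowed downtime is the unavailable fraction multiplied by the length of the period. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is illustrative, not from the talk). Note that 4.32 minutes and 25.9 seconds match a 30-day month, while 43.8 minutes for three nines corresponds to an average month of 365/12 days; a 30-day month gives 43.2 minutes.

```php
<?php
// Allowed downtime, in seconds, for a given availability over a period.
function allowedDowntimeSeconds(float $availabilityPercent, float $periodSeconds): float
{
    return (1.0 - $availabilityPercent / 100.0) * $periodSeconds;
}

const SECONDS_PER_DAY = 86400;

// Period lengths are assumptions; the slides do not state which they use.
$periods = [
    '30-day month' => 30 * SECONDS_PER_DAY,
    '365-day year' => 365 * SECONDS_PER_DAY,
];

foreach ([99.9, 99.99, 99.999] as $availability) {
    foreach ($periods as $label => $seconds) {
        printf(
            "%s%% over a %s: %.2f minutes\n",
            $availability,
            $label,
            allowedDowntimeSeconds($availability, $seconds) / 60
        );
    }
}
```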

Slide 11

Slide 11 text

www.whoownsmyavailability.com ?

Slide 12

Slide 12 text

www.whoownsmyavailability.com YOU

Slide 13

Slide 13 text

The beginning

Slide 14

Slide 14 text

Slide 15

Slide 15 text

My website: single VPS running PHP + MySQL

Slide 16

Slide 16 text

No growth, low volume, simple functionality, one engineer (me!)

Slide 17

Slide 17 text

Large growth, high volume, complex functionality, lots of engineers

Slide 18

Slide 18 text

• Launched in London November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds

Slide 19

Slide 19 text

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
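One way to see where the "square" in the quote comes from: among n components (or engineers) there are n(n-1)/2 distinct pairs, so the number of possible inter-component interactions grows quadratically. A small sketch with an illustrative function name:

```php
<?php
// Number of distinct pairs among $n components: n * (n - 1) / 2.
function pairwiseInteractions(int $n): int
{
    return intdiv($n * ($n - 1), 2);
}

foreach ([1, 2, 5, 10, 20] as $n) {
    printf(
        "%2d components -> %3d possible pairwise interactions\n",
        $n,
        pairwiseInteractions($n)
    );
}
```

Doubling the component count from 10 to 20 roughly quadruples the pairs (45 to 190), which is the sense in which less machinery is quadratically better.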

Slide 20

Slide 20 text

• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
and you?! (hailo is hiring)!

Slide 21

Slide 21 text

Our overall reliability is in danger
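A rough illustration of why more moving parts endanger overall reliability: if a request needs every one of several services, and their failures are independent (both are simplifying assumptions, not claims from the talk), the combined availability is the product of the individual ones.

```php
<?php
// Combined availability of components that must ALL be up, assuming
// independent failures. Inputs are fractions, e.g. 0.999 for 99.9%.
function serialAvailability(array $availabilities): float
{
    $combined = 1.0;
    foreach ($availabilities as $availability) {
        $combined *= $availability;
    }
    return $combined;
}

// Hypothetical system: ten services, each at three nines.
$tenServices = array_fill(0, 10, 0.999);
printf(
    "10 services at 99.9%% each: %.2f%% combined\n",
    serialAvailability($tenServices) * 100
);
```

Ten three-nines services in series land at roughly 99%, which is why the rest of the deck focuses on coping with failure rather than hoping each part stays up.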

Slide 22

Slide 22 text

Embracing failure
(a coping strategy)

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

VPS
(running PHP+MySQL)
reliable?!

Slide 25

Slide 25 text

Reliable !== Resilient

Slide 26

Slide 26 text

Choosing a stack

Slide 27

Slide 27 text

“Hailo”
(running PHP+MySQL)
reliable?!

Slide 28

Slide 28 text

Service Oriented Architecture
[Diagram: four "Service" boxes]
each service does one job well!

Slide 29

Slide 29 text

• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed

Slide 30

Slide 30 text

Service
(running PHP+MySQL)
reliable?!

Slide 31

Slide 31 text

[Diagram: Service, MySQL]
MySQL running on different box

Slide 32

Slide 32 text

[Diagram: Service, MySQL, MySQL]
MySQL running in Multi-Master mode

Slide 33

Slide 33 text

Going global

Slide 34

Slide 34 text

Separating concerns
MySQL: CRUD, Locking, Search, Analytics, ID generation
also queuing…!

Slide 35

Slide 35 text

At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data

Slide 36

Slide 36 text

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

Slide 37

Slide 37 text

Cassandra
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!

Slide 38

Slide 38 text

ZooKeeper
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use sparingly

Slide 39

Slide 39 text

Elasticsearch
• Distributed, RESTful, Search Engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Slide 40

Slide 40 text

NSQ
• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant, highly available with reliable message delivery guarantee

Slide 41

Slide 41 text

Acunu Analytics
• Real time incremental analytics platform, backed by Apache Cassandra
• Powerful SQL-like interface
• Scalable and highly available

Slide 42

Slide 42 text

Cruftflake
• Distributed ID generation with no coordination required
• Rock solid
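To show how IDs can be generated without coordination, here is a minimal sketch of a Snowflake-style 64-bit layout: a millisecond timestamp, a per-machine ID and a per-millisecond sequence packed into one integer. The bit widths and function names are assumptions for illustration and are not taken from Cruftflake's source. It needs a 64-bit PHP build.

```php
<?php
// Assumed layout (Snowflake-style, hypothetical):
// milliseconds since a custom epoch | 10-bit machine id | 12-bit sequence.
const MACHINE_BITS = 10;
const SEQUENCE_BITS = 12;

function composeId(int $millisSinceEpoch, int $machineId, int $sequence): int
{
    if ($machineId < 0 || $machineId >= (1 << MACHINE_BITS)) {
        throw new InvalidArgumentException('machine id out of range');
    }
    if ($sequence < 0 || $sequence >= (1 << SEQUENCE_BITS)) {
        throw new InvalidArgumentException('sequence out of range');
    }
    return ($millisSinceEpoch << (MACHINE_BITS + SEQUENCE_BITS))
        | ($machineId << SEQUENCE_BITS)
        | $sequence;
}

function decomposeId(int $id): array
{
    return [
        'millis'   => $id >> (MACHINE_BITS + SEQUENCE_BITS),
        'machine'  => ($id >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1),
        'sequence' => $id & ((1 << SEQUENCE_BITS) - 1),
    ];
}

$id = composeId(1000, 3, 7);
print_r(decomposeId($id));
```

Because each machine owns its own ID bits and only increments a local sequence, two machines can never produce the same value, and IDs sort roughly by time without any shared state.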

Slide 43

Slide 43 text

• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design

Slide 44

Slide 44 text

Lessons learned

Slide 45

Slide 45 text

Minimise the critical path

Slide 46

Slide 46 text

What is the minimum viable service?

Slide 47

Slide 47 text

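class HailoMemcacheService {
    private $mc = null;
    private $servers;

    public function __construct(array $servers) {
        $this->servers = $servers; // [[host, port], ...]
    }

    public function __call($name, $arguments) {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance() {
        // Connect only on first use
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use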
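The lazy-init idea on this slide can be demonstrated without a cache server. The sketch below uses a hypothetical `LazyConnection` wrapper and a counter in place of a real Memcached client, so it only shows when the connection would be made, not how.

```php
<?php
// Hypothetical helper: wraps any connection factory and defers calling it
// until the connection is first needed.
final class LazyConnection
{
    private $factory;
    private $instance = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    public function get()
    {
        if ($this->instance === null) {
            $this->instance = ($this->factory)();
        }
        return $this->instance;
    }
}

$connects = 0;
$cache = new LazyConnection(function () use (&$connects) {
    $connects++;              // stands in for creating and configuring a client
    return new ArrayObject(); // placeholder connection object
});

echo "Connections after construction: {$connects}\n";
$cache->get();
$cache->get();
echo "Connections after two uses: {$connects}\n";
```

A request that never touches the cache never pays for, or fails on, the connection, which keeps the dependency off the critical path.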

Slide 48

Slide 48 text

Configure clients carefully

Slide 49

Slide 49 text

$this->mc = new \Memcached;
$this->mc->addServers($s);
// Connect and poll timeouts are in milliseconds
$this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
// Send and receive timeouts are in microseconds
$this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);

Make sure timeouts are configured

Slide 50

Slide 50 text

Choose timeouts based on data
here?!

Slide 51

Slide 51 text

“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

Slide 52

Slide 52 text

95th percentile
here?!
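Choosing a timeout from data means looking at measured latencies rather than guessing. A minimal sketch of a nearest-rank percentile over latency samples; the sample values are made up, and the slides do not say exactly how the 95th percentile should map to a timeout, so treat it as a starting point.

```php
<?php
// Nearest-rank percentile: the smallest sample such that at least
// $p percent of all samples are less than or equal to it.
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('need samples and 0 < p <= 100');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Hypothetical latency samples in milliseconds, with one slow outlier.
$latenciesMs = [12, 15, 11, 14, 13, 250, 16, 12, 13, 18,
                14, 15, 12, 13, 17, 14, 13, 12, 16, 15];

printf("p95 = %.0f ms\n", percentile($latenciesMs, 95));
printf("max = %.0f ms\n", percentile($latenciesMs, 100));
```

A timeout near the 95th percentile ignores the rare outlier that a timeout based on the maximum would accommodate, which is the fail-fast trade-off.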

Slide 53

Slide 53 text

Test  

Slide 54

Slide 54 text

• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

Slide 55

Slide 55 text

FAIL  

Slide 56

Slide 56 text

• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working

Slide 57

Slide 57 text

Things fail in exotic ways

Slide 58

Slide 58 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

Slide 59

Slide 59 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!

Packets silently dropped. Expect long timeouts.

Slide 60

Slide 60 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.
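The slides do not show `test-memcache.php`. As a rough idea of what such a probe could do, the hypothetical helper below runs one client operation, records whether it succeeded, and measures how long it took, so that fast fails and long timeouts can be told apart. A stand-in closure replaces the real cache call.

```php
<?php
// Hypothetical probe: runs one client operation and reports how it ended
// and how long it took.
function probe(callable $operation): array
{
    $start = microtime(true);
    try {
        $operation();
        $outcome = 'ok';
    } catch (Throwable $e) {
        $outcome = 'failed: ' . $e->getMessage();
    }
    return ['outcome' => $outcome, 'seconds' => microtime(true) - $start];
}

// Stand-in operation; a real test would call the cache client here.
$result = probe(function () {
    throw new RuntimeException('connection refused');
});

printf("%s after %.3f s\n", $result['outcome'], $result['seconds']);
```

A call that hangs forever never returns to the probe, so the third scenario can only be detected from outside the process, for example by running the script under a watchdog with a hard time limit.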

Slide 61

Slide 61 text

• When AWS instances hang they appear to accept connections but drop packets
• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031

Slide 62

Slide 62 text

Fix, rinse, repeat

Slide 63

Slide 63 text

[Diagram: Service connected over AMQP (port 5672) to an HA cluster of three RabbitMQ nodes]

Slide 64

Slide 64 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php

Fantastic! Block AMQP port, client times out

Slide 65

Slide 65 text

FAIL  

Slide 66

Slide 66 text

“RabbitMQ clusters do not tolerate network partitions well.”

http://www.rabbitmq.com/partitions.html

Slide 67

Slide 67 text

$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278

Each node listens on a port assigned by EPMD

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Hangs! Uh oh.

Slide 70

Slide 70 text

Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network,
     'rabbit@dmzutilities01-global01-test'}
application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show partitioned network error; nodes shut down

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}

PHP library didn’t have any time limit on reading a frame
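One way to put a time limit on a read loop like this is to track an overall deadline and bound each `fread()` by the time remaining. The sketch below is a generic illustration on a plain PHP socket stream; `readWithDeadline` is a hypothetical name, and this is not the fix that was applied to the AMQP library.

```php
<?php
// Read exactly $n bytes from a blocking socket stream, or throw once
// $timeoutSeconds have elapsed in total.
function readWithDeadline($sock, int $n, float $timeoutSeconds): string
{
    $deadline = microtime(true) + $timeoutSeconds;
    $res = '';
    while (strlen($res) < $n) {
        $remainingMicros = (int) (($deadline - microtime(true)) * 1000000);
        if ($remainingMicros <= 0) {
            throw new RuntimeException('Timed out reading frame');
        }
        // Bound this single fread() by whatever time is left overall.
        stream_set_timeout(
            $sock,
            intdiv($remainingMicros, 1000000),
            $remainingMicros % 1000000
        );
        $buf = fread($sock, $n - strlen($res));
        if (stream_get_meta_data($sock)['timed_out']) {
            throw new RuntimeException('Timed out reading frame');
        }
        if ($buf === false || $buf === '') {
            throw new RuntimeException('Connection closed before frame was complete');
        }
        $res .= $buf;
    }
    return $res;
}
```

With a deadline in place, a peer that accepts the connection but then goes silent produces an exception the caller can handle instead of a process that hangs.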

Slide 73

Slide 73 text

Fix, rinse, repeat

Slide 74

Slide 74 text

It would be nice if we could automate this

Slide 75

Slide 75 text

Automate!  

Slide 76

Slide 76 text

• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations

Slide 77

Slide 77 text

Fix attempt 1: bad timeouts configured

Slide 78

Slide 78 text

Fix attempt 2: better timeouts

Slide 79

Slide 79 text

Simulate in system tests

Slide 80

Slide 80 text

Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
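The three steps on this slide can be sketched as a tiny self-contained test. Everything here is a hypothetical stand-in: `FakeCache` plays the failing dependency, `Service::health()` plays the monitoring endpoint, and `Service::lookup()` plays a feature that should keep working without the cache.

```php
<?php
// Hypothetical dependency that can be switched off to simulate failure.
final class FakeCache
{
    public $up = true;
    private $data = [];

    public function get(string $key)
    {
        if (!$this->up) {
            throw new RuntimeException('cache unavailable');
        }
        return $this->data[$key] ?? null;
    }
}

final class Service
{
    private $cache;

    public function __construct(FakeCache $cache)
    {
        $this->cache = $cache;
    }

    // Feature under test: falls back to computing the value if the cache fails.
    public function lookup(string $key): string
    {
        try {
            $hit = $this->cache->get($key);
        } catch (RuntimeException $e) {
            $hit = null;
        }
        return $hit ?? "computed:{$key}";
    }

    // Monitoring endpoint: reports the health of the dependency.
    public function health(): array
    {
        try {
            $this->cache->get('healthcheck');
            return ['cache' => 'ok'];
        } catch (RuntimeException $e) {
            return ['cache' => 'down'];
        }
    }
}

$cache = new FakeCache();
$service = new Service($cache);

$cache->up = false;                                                // 1. simulate failure
$monitoringSeesIt = $service->health()['cache'] === 'down';        // 2. monitoring picks it up
$featureStillWorks = $service->lookup('fare') === 'computed:fare'; // 3. feature still works

var_dump($monitoringSeesIt, $featureStillWorks);
```

In a real system test the failure would be injected at the network level, as with the iptables rules earlier in the deck, and the assertions would run against the live monitoring endpoint.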

Slide 81

Slide 81 text

In conclusion

Slide 82

Slide 82 text

“the best way to avoid failure is to fail constantly.”

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

Slide 83

Slide 83 text

You should test for failure

How does the software react?
How does the PHP client react?

Slide 84

Slide 84 text

Automation makes continuous failure testing feasible

Slide 85

Slide 85 text

Systems that cope well with failure are easier to operate

Slide 86

Slide 86 text

TIMED BLOCK ALL THE THINGS

Slide 87

Slide 87 text

Thanks

Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.

Slide 88

Slide 88 text

Further reading

Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems