Slide 1

Slide 1 text

Planning

Slide 2

Slide 2 text

dave!

Slide 3

Slide 3 text

the taxi app!

Slide 4

Slide 4 text

Planning

Slide 5

Slide 5 text

Planning

Slide 6

Slide 6 text

Planning

Slide 7

Slide 7 text

The beginning

Slide 8

Slide 8 text

Slide 9

Slide 9 text

My website: single VPS running PHP + MySQL

Slide 10

Slide 10 text

No growth, low volume, simple functionality, one engineer (me!)

Slide 11

Slide 11 text

Large growth, high volume, complex functionality, lots of engineers

Slide 12

Slide 12 text

• Launched in London November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds

Slide 13

Slide 13 text

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

Slide 14

Slide 14 text

• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
And you?! (Hailo is hiring!)

Slide 15

Slide 15 text

Our overall reliability is in danger

Slide 16

Slide 16 text

Embracing failure (a coping strategy)

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

VPS (running PHP+MySQL) reliable?!

Slide 19

Slide 19 text

Reliable !== Resilient

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Choosing a stack

Slide 22

Slide 22 text

“Hailo” (running PHP+MySQL) reliable?!

Slide 23

Slide 23 text

Service Oriented Architecture: each service does one job well!

Slide 24

Slide 24 text

• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed

Slide 25

Slide 25 text

Service (running PHP+MySQL) reliable?!

Slide 26

Slide 26 text

Service, MySQL
MySQL running on different box

Slide 27

Slide 27 text

Service, MySQL, MySQL
MySQL running in Multi-Master mode

Slide 28

Slide 28 text

Going global

Slide 29

Slide 29 text

Separating concerns
MySQL: CRUD, Locking, Search, Analytics, ID generation (also queuing…!)

Slide 30

Slide 30 text

At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data

Slide 31

Slide 31 text

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html

Slide 32

Slide 32 text

• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!

Slide 33

Slide 33 text

• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use it sparingly
ZooKeeper

Slide 34

Slide 34 text

• Distributed, RESTful, Search Engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Slide 35

Slide 35 text

• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant, highly available with reliable message delivery guarantee
NSQ

Slide 36

Slide 36 text

• Distributed ID generation with no coordination required
• Rock solid
Cruftflake
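
To make "no coordination required" concrete, here is a minimal sketch of a Snowflake-style generator in PHP. It is not Cruftflake's actual code: the class name, bit layout (41 bits of time, 10 bits of machine ID, 12 bits of sequence) and epoch are assumptions for illustration. Each machine only needs its own distinct machine ID; nothing is shared at generation time. It assumes 64-bit PHP 7.4 or later.

```php
<?php
declare(strict_types=1);

/**
 * Snowflake-style ID generator (illustrative sketch, hypothetical names).
 * Layout: milliseconds since a custom epoch | 10-bit machine ID | 12-bit sequence.
 */
final class IdGenerator
{
    private const EPOCH_MS = 1325376000000; // 2012-01-01T00:00:00Z, an arbitrary choice

    private int $machineId;
    private int $lastMs = -1;
    private int $sequence = 0;
    /** @var callable returns the current time in milliseconds */
    private $clock;

    public function __construct(int $machineId, ?callable $clock = null)
    {
        if ($machineId < 0 || $machineId > 1023) {
            throw new InvalidArgumentException('machineId must fit in 10 bits');
        }
        $this->machineId = $machineId;
        $this->clock = $clock ?? fn(): int => (int) floor(microtime(true) * 1000);
    }

    public function next(): int
    {
        $now = ($this->clock)();
        if ($now < $this->lastMs) {
            // Refuse to risk duplicate IDs if the clock jumps back.
            throw new RuntimeException('clock moved backwards');
        }
        if ($now === $this->lastMs) {
            $this->sequence = ($this->sequence + 1) & 0xFFF;
            if ($this->sequence === 0) {
                // 4096 IDs already issued this millisecond: wait for the next one.
                while (($now = ($this->clock)()) <= $this->lastMs) {
                    usleep(100);
                }
            }
        } else {
            $this->sequence = 0;
        }
        $this->lastMs = $now;

        return (($now - self::EPOCH_MS) << 22) | ($this->machineId << 12) | $this->sequence;
    }
}

// Usage: machine 1 generates an ID without talking to any other node.
$generator = new IdGenerator(1);
$id = $generator->next();
```

IDs from one generator increase over time, and two generators with different machine IDs can never collide, which is why no lock or central counter is needed.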

Slide 37

Slide 37 text

• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design

Slide 38

Slide 38 text

Lessons learned

Slide 39

Slide 39 text

Minimise the critical path

Slide 40

Slide 40 text

What is the minimum viable service?

Slide 41

Slide 41 text

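class HailoMemcacheService
{
    private $mc = null;
    private $servers = []; // [host, port] pairs; assumed to be populated elsewhere
    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }
    private function getInstance()
    {
        // Create the client on first use only; constructing the service
        // never touches the network.
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}
Lazy-init instances; connect on use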

Slide 42

Slide 42 text

Configure clients carefully

Slide 43

Slide 43 text

$this->mc = new \Memcached;
$this->mc->addServers($s);
$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
Make sure timeouts are configured

Slide 44

Slide 44 text

Choose timeouts based on data (here?!)

Slide 45

Slide 45 text

“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

Slide 46

Slide 46 text

95th percentile (here?!)
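
One way to turn measured latencies into a timeout, sketched in PHP: take the 95th percentile of observed response times and add some headroom. The function names, the sample latencies and the 2x headroom factor are assumptions for illustration, not Hailo's real numbers.

```php
<?php
declare(strict_types=1);

/**
 * Nearest-rank percentile of a non-empty list of numbers.
 */
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0.0 || $p > 100.0) {
        throw new InvalidArgumentException('need samples and 0 < p <= 100');
    }
    sort($samples);
    // Nearest-rank method: rank = ceil(p * N / 100), counted from 1.
    $rank = (int) ceil($p * count($samples) / 100.0);

    return (float) $samples[$rank - 1];
}

/**
 * Hypothetical helper: timeout = 95th percentile latency times a headroom factor.
 */
function suggestTimeoutMs(array $latenciesMs, float $headroom = 2.0): int
{
    return (int) ceil(percentile($latenciesMs, 95.0) * $headroom);
}

// Made-up request latencies in milliseconds, including one slow outlier.
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 250];

$timeoutMs = suggestTimeoutMs($latenciesMs); // 36 for these samples
```

Using a percentile rather than the maximum keeps a single 250 ms outlier from dictating the timeout, which is the point of failing fast.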

Slide 47

Slide 47 text

Test  

Slide 48

Slide 48 text

• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!

Slide 49

Slide 49 text

FAIL  

Slide 50

Slide 50 text

• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working

Slide 51

Slide 51 text

Things fail in exotic ways

Slide 52

Slide 52 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!
Packets rejected and source notified by ICMP. Expect fast fails.

Slide 53

Slide 53 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!
Packets silently dropped. Expect long timeouts.

Slide 54

Slide 54 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.

Slide 55

Slide 55 text

• When AWS instances hang they appear to accept connections but drop packets
• Bug!
https://bugs.launchpad.net/libmemcached/+bug/583031
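
The failure mode here is a connection that is established but silent, so only a receive timeout can rescue the caller. A minimal PHP sketch of that situation follows. It uses plain PHP streams and a local socket pair rather than libmemcached or a real hung instance, so it is an analogy under stated assumptions; it needs a Unix-like platform.

```php
<?php
declare(strict_types=1);

// Simulate a peer that accepted the connection but never replies:
// a connected socket pair whose "server" end stays silent.
[$client, $silentServer] = stream_socket_pair(
    STREAM_PF_UNIX,
    STREAM_SOCK_STREAM,
    STREAM_IPPROTO_IP
);

// Set a short receive timeout so the read below cannot block for long.
stream_set_timeout($client, 0, 200000); // 0 seconds + 200000 microseconds

fwrite($client, "get some_key\r\n");    // the write itself succeeds

$start = microtime(true);
$reply = fread($client, 1024);          // gives up once the timeout expires
$elapsedMs = (microtime(true) - $start) * 1000;

// The stream metadata records that the read ended because of the timeout.
$timedOut = stream_get_meta_data($client)['timed_out'];

fclose($client);
fclose($silentServer);
```

A connect timeout alone would not help in this case, because the connection already exists; the read is what needs a deadline.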

Slide 56

Slide 56 text

Fix, rinse, repeat

Slide 57

Slide 57 text

It would be nice if we could automate this

Slide 58

Slide 58 text

Automate!  

Slide 59

Slide 59 text

• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations

Slide 60

Slide 60 text

Fix attempt 1: bad timeouts configured

Slide 61

Slide 61 text

Fix attempt 2: better timeouts

Slide 62

Slide 62 text

Simulate in system tests

Slide 63

Slide 63 text

Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
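
Those three steps can be sketched as a tiny self-contained PHP test. Everything named here (Cache, FailingCache, FareLookup, the health array) is hypothetical and stands in for a real service, a real failure injection and a real monitoring endpoint. Assumes PHP 7.4 or later.

```php
<?php
declare(strict_types=1);

interface Cache
{
    public function get(string $key): ?string;
}

/** Test double standing in for an unreachable cache node. */
final class FailingCache implements Cache
{
    public function get(string $key): ?string
    {
        throw new RuntimeException('cache unavailable (simulated)');
    }
}

/** Hypothetical feature: try the cache first, fall back to the slower source. */
final class FareLookup
{
    private Cache $cache;
    private bool $cacheHealthy = true;

    public function __construct(Cache $cache)
    {
        $this->cache = $cache;
    }

    public function fareFor(string $route): string
    {
        try {
            $hit = $this->cache->get($route);
            $this->cacheHealthy = true;
            if ($hit !== null) {
                return $hit;
            }
        } catch (RuntimeException $e) {
            // Degrade instead of failing the request, and remember the failure.
            $this->cacheHealthy = false;
        }

        return 'fare-for-' . $route; // stand-in for a database call
    }

    /** What a monitoring endpoint might expose. */
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'failing'];
    }
}

// Simulate failure, then exercise the feature once.
$service = new FareLookup(new FailingCache());
$fare = $service->fareFor('soho-to-shoreditch');

// Monitoring should report the failure; the feature should still answer.
$health = $service->health();
```

Checking both conditions matters: a system that keeps working but hides the failure from monitoring is only one more fault away from an outage nobody saw coming.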

Slide 64

Slide 64 text

In conclusion

Slide 65

Slide 65 text

“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

Slide 66

Slide 66 text

TIMED BLOCK ALL THE THINGS
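
Reading "timed block" as wrapping each external call in a timer and reporting the duration (an interpretation, since the slide gives no code), a minimal PHP sketch could look like this. The timedBlock function and the $report sink are hypothetical; in practice the sink might forward timings to a metrics system.

```php
<?php
declare(strict_types=1);

/**
 * Runs $block, reports how long it took in milliseconds (even if it
 * throws), and returns the block's result.
 */
function timedBlock(string $name, callable $block, callable $report)
{
    $start = microtime(true);
    try {
        return $block();
    } finally {
        $report($name, (microtime(true) - $start) * 1000.0);
    }
}

// Hypothetical sink: collect timings in memory, keyed by name.
$timings = [];
$report = function (string $name, float $ms) use (&$timings): void {
    $timings[$name][] = $ms;
};

$result = timedBlock('cache.get', function (): string {
    usleep(5000); // stand-in for a network call
    return 'value';
}, $report);
```

Reporting from the finally clause means slow failures are recorded too, and those are exactly the data needed to choose timeouts.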

Slide 67

Slide 67 text

Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://[email protected]/
http://www.acunu.com/acunu-[email protected]
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.

Slide 68

Slide 68 text

Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix
Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator
Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems