Planning to Fail (PHPNE)

Slide 1

Slide 1 text

Planning

Slide 2

Slide 2 text

dave!

Slide 3

Slide 3 text

the taxi app !

Slide 4

Slide 4 text

Planning

Slide 5

Slide 5 text

Planning

Slide 6

Slide 6 text

Planning

Slide 7

Slide 7 text

Why? http://en.wikipedia.org/wiki/High_availability

Slide 8

Slide 8 text

99.9% (three nines) Downtime: 43.8 minutes per month, 8.76 hours per year

Slide 9

Slide 9 text

99.99% (four nines) Downtime: 4.32 minutes per month, 52.56 minutes per year

Slide 10

Slide 10 text

99.999% (five nines) Downtime: 25.9 seconds per month, 5.26 minutes per year
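
The per-year downtime figures on the three slides above follow directly from the availability percentage. Here is a minimal sketch of that arithmetic, assuming a 365-day year. The helper name `allowedDowntimeSeconds` is made up for illustration and is not from the talk. The per-month figures on the slides depend on which month length is assumed, so only the per-year values are computed here.

```php
<?php
// Allowed downtime for a given availability over a period (in seconds).
// Hypothetical helper for illustration only.
function allowedDowntimeSeconds(float $availabilityPercent, float $periodSeconds): float
{
    return (1 - $availabilityPercent / 100) * $periodSeconds;
}

$year = 365 * 24 * 3600; // assumption: a 365-day year

foreach ([99.9, 99.99, 99.999] as $availability) {
    $minutes = allowedDowntimeSeconds($availability, $year) / 60;
    printf("%s%%: %.2f minutes of downtime per year\n", $availability, $minutes);
}
```

Each extra nine divides the downtime budget by ten, which is why five nines leaves only a few minutes per year for every failure, deploy and restart combined.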

Slide 11

Slide 11 text

www.whoownsmyavailability.com ?

Slide 12

Slide 12 text

www.whoownsmyavailability.com YOU

Slide 13

Slide 13 text

The beginning

Slide 14

Slide 14 text

Slide 15

Slide 15 text

My website: single VPS running PHP + MySQL

Slide 16

Slide 16 text

No growth, low volume, simple functionality, one engineer (me!)

Slide 17

Slide 17 text

Large growth, high volume, complex functionality, lots of engineers

Slide 18

Slide 18 text

• Launched in London November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds

Slide 19

Slide 19 text

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

Slide 20

Slide 20 text

• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
and you?! (Hailo is hiring)!

Slide 21

Slide 21 text

Our overall reliability is in danger

Slide 22

Slide 22 text

Embracing failure (a coping strategy)

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

VPC (running PHP+MySQL) reliable?!

Slide 25

Slide 25 text

Reliable !== Resilient

Slide 26

Slide 26 text

Choosing a stack

Slide 27

Slide 27 text

“Hailo” (running PHP+MySQL) reliable?!

Slide 28

Slide 28 text

Service Oriented Architecture: each service does one job well!

Slide 29

Slide 29 text

• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed

Slide 30

Slide 30 text

Service (running PHP+MySQL) reliable?!

Slide 31

Slide 31 text

Service + MySQL: MySQL running on a different box

Slide 32

Slide 32 text

Service + MySQL, MySQL: MySQL running in Multi-Master mode

Slide 33

Slide 33 text

Going global

Slide 34

Slide 34 text

Separating concerns
MySQL: CRUD, Locking, Search, Analytics, ID generation (also queuing…!)

Slide 35

Slide 35 text

At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data

Slide 36

Slide 36 text

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html

Slide 37

Slide 37 text

• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!

Slide 38

Slide 38 text

ZooKeeper
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use sparingly

Slide 39

Slide 39 text

• Distributed, RESTful, Search Engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Slide 40

Slide 40 text

NSQ
• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant, highly available with reliable message delivery guarantee

Slide 41

Slide 41 text

• Real-time incremental analytics platform, backed by Apache Cassandra
• Powerful SQL-like interface
• Scalable and highly available

Slide 42

Slide 42 text

Cruftflake
• Distributed ID generation with no coordination required
• Rock solid
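
To illustrate what "no coordination required" can mean, here is a sketch of a Snowflake-style generator: each node combines a millisecond timestamp, its own pre-assigned machine ID and a per-millisecond sequence, so nodes never need to talk to each other. This is not Cruftflake's code. The class name, the bit widths (10 machine bits, 12 sequence bits) and the custom epoch are assumptions chosen for illustration, and it assumes 64-bit PHP integers.

```php
<?php
// Snowflake-style ID generator sketch (illustrative; not Cruftflake's code).
final class SketchIdGenerator
{
    private const EPOCH_MS = 1325376000000; // arbitrary custom epoch (assumption)
    private const MACHINE_BITS = 10;        // assumption
    private const SEQUENCE_BITS = 12;       // assumption

    private int $machineId;
    private int $lastMs = -1;
    private int $sequence = 0;

    public function __construct(int $machineId)
    {
        if ($machineId < 0 || $machineId >= (1 << self::MACHINE_BITS)) {
            throw new InvalidArgumentException('machine ID out of range');
        }
        $this->machineId = $machineId;
    }

    private function nowMs(): int
    {
        return (int) floor(microtime(true) * 1000) - self::EPOCH_MS;
    }

    public function next(): int
    {
        $nowMs = $this->nowMs();
        if ($nowMs < $this->lastMs) {
            // Refuse to issue IDs that could collide with earlier ones.
            throw new RuntimeException('clock moved backwards');
        }
        if ($nowMs === $this->lastMs) {
            $this->sequence = ($this->sequence + 1) & ((1 << self::SEQUENCE_BITS) - 1);
            if ($this->sequence === 0) {
                // Sequence exhausted for this millisecond: wait for the next one.
                while ($nowMs <= $this->lastMs) {
                    $nowMs = $this->nowMs();
                }
            }
        } else {
            $this->sequence = 0;
        }
        $this->lastMs = $nowMs;

        // Layout: timestamp | machine ID | sequence
        return ($nowMs << (self::MACHINE_BITS + self::SEQUENCE_BITS))
            | ($this->machineId << self::SEQUENCE_BITS)
            | $this->sequence;
    }
}

$generator = new SketchIdGenerator(1);
$first = $generator->next();
$second = $generator->next();
```

Uniqueness across nodes rests entirely on every node having a distinct machine ID, and ordering within a node rests on the clock not going backwards, which is why the sketch refuses to generate IDs in that case.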

Slide 43

Slide 43 text

• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design

Slide 44

Slide 44 text

Lessons learned

Slide 45

Slide 45 text

Minimise the critical path

Slide 46

Slide 46 text

What is the minimum viable service?

Slide 47

Slide 47 text

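class HailoMemcacheService
{
    private $mc = null;
    private $servers; // server list, standing in for the slide's undefined $s

    public function __construct(array $servers)
    {
        $this->servers = $servers;
    }

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    // Create and connect the client only on first use
    private function getInstance()
    {
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use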

Slide 48

Slide 48 text

Configure clients carefully

Slide 49

Slide 49 text

$this->mc = new \Memcached;
$this->mc->addServers($s);
$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Slide 50

Slide 50 text

Choose timeouts based on data here?!

Slide 51

Slide 51 text

“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

Slide 52

Slide 52 text

95th percentile here?!
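
One way to act on "choose timeouts based on data": record the latencies a client call actually shows, take a high percentile, and derive the timeout from that instead of guessing. The sketch below uses the nearest-rank percentile method. The sample latencies are invented, and the 2x headroom factor is an illustrative assumption, not a recommendation from the talk.

```php
<?php
// Nearest-rank percentile of a list of numeric samples.
function percentile(array $samples, float $p): float
{
    if ($samples === [] || $p <= 0 || $p > 100) {
        throw new InvalidArgumentException('need samples and 0 < p <= 100');
    }
    sort($samples); // sorts a local copy; the caller's array is untouched
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Invented latency samples in milliseconds, including one slow outlier.
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 18, 22, 40, 250];

$p95 = percentile($latenciesMs, 95);

// Assumption: allow some headroom above the 95th percentile.
$timeoutMs = (int) ceil($p95 * 2);

printf("p95 = %.0f ms, timeout = %d ms\n", $p95, $timeoutMs);
```

Basing the timeout on a percentile rather than the maximum keeps a single pathological request (the 250 ms outlier here) from dictating how long every caller is prepared to wait.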

Slide 53

Slide 53 text

Test

Slide 54

Slide 54 text

• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!

Slide 55

Slide 55 text

FAIL

Slide 56

Slide 56 text

• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working

Slide 57

Slide 57 text

Things fail in exotic ways

Slide 58

Slide 58 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

Slide 59

Slide 59 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!

Packets silently dropped. Expect long timeouts.

Slide 60

Slide 60 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.

Slide 61

Slide 61 text

• When AWS instances hang they appear to accept connections but drop packets
• Bug! https://bugs.launchpad.net/libmemcached/+bug/583031

Slide 62

Slide 62 text

Fix, rinse, repeat

Slide 63

Slide 63 text

Service connects over AMQP (port 5672) to an HA cluster of three RabbitMQ nodes

Slide 64

Slide 64 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Fantastic!

Block AMQP port, client times out

Slide 65

Slide 65 text

FAIL

Slide 66

Slide 66 text

“RabbitMQ clusters do not tolerate network partitions well.”
http://www.rabbitmq.com/partitions.html

Slide 67

Slide 67 text

$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278

Each node listens on a port assigned by EPMD

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Hangs! Uh oh.

Slide 70

Slide 70 text

Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@dmzutilities01-global01-test'}
application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show partitioned network error; nodes shut down

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))
) {
    $read += strlen($buf);
    $res .= $buf;
}

PHP library didn’t have any time limit on reading a frame
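
The loop above blocks for as long as the peer keeps the connection open without sending anything. One possible way to bound it is to wait for readability with `stream_select` and give up once a deadline passes. This is a sketch only, not the fix that went into the library. The function name `readWithDeadline` and its parameters are hypothetical.

```php
<?php
// Read exactly $n bytes from $sock, or throw once $timeoutSec has elapsed.
// Hypothetical helper for illustration.
function readWithDeadline($sock, int $n, float $timeoutSec): string
{
    $deadline = microtime(true) + $timeoutSec;
    $res = '';

    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException('timed out reading frame');
        }

        // Wait until the socket is readable, but never past the deadline.
        $read = [$sock];
        $write = null;
        $except = null;
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        $ready = stream_select($read, $write, $except, $sec, $usec);

        if ($ready === false) {
            throw new RuntimeException('stream_select failed');
        }
        if ($ready === 0) {
            continue; // nothing arrived yet; the loop re-checks the deadline
        }

        $buf = fread($sock, $n - strlen($res));
        if ($buf === false || ($buf === '' && feof($sock))) {
            throw new RuntimeException('connection closed while reading frame');
        }
        $res .= $buf;
    }

    return $res;
}
```

With a deadline in place, the "established connection, packets dropped" case from the iptables test turns into an exception the caller can handle rather than a hung PHP process.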

Slide 73

Slide 73 text

Fix, rinse, repeat

Slide 74

Slide 74 text

It would be nice if we could automate this

Slide 75

Slide 75 text

Automate!

Slide 76

Slide 76 text

• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations

Slide 77

Slide 77 text

Fix attempt 1: bad timeouts configured

Slide 78

Slide 78 text

Fix attempt 2: better timeouts

Slide 79

Slide 79 text

Simulate in system tests

Slide 80

Slide 80 text

Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
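
The three steps above can be sketched as a tiny in-process test. Everything here is a hypothetical stand-in: `FakeCache` plays the failing dependency, `GreetingService` plays the feature under test, and `health()` plays the monitoring endpoint. A real system test would break a real dependency (for example with the iptables rules shown earlier) and query a real endpoint.

```php
<?php
// Hypothetical stand-ins for illustration; not Hailo code.
final class FakeCache
{
    public bool $down = false;
    private array $data = [];

    public function get(string $key): ?string
    {
        if ($this->down) {
            throw new RuntimeException('cache unavailable');
        }
        return $this->data[$key] ?? null;
    }

    public function set(string $key, string $value): void
    {
        if ($this->down) {
            throw new RuntimeException('cache unavailable');
        }
        $this->data[$key] = $value;
    }
}

final class GreetingService
{
    private FakeCache $cache;

    public function __construct(FakeCache $cache)
    {
        $this->cache = $cache;
    }

    // Feature: the cache is an optimisation, so failures are tolerated.
    public function greet(string $name): string
    {
        try {
            $cached = $this->cache->get($name);
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            // fall through and compute the value without the cache
        }
        $greeting = "Hello, $name";
        try {
            $this->cache->set($name, $greeting);
        } catch (RuntimeException $e) {
            // ignore: the feature must still work
        }
        return $greeting;
    }

    // Monitoring endpoint: reports the health of each dependency.
    public function health(): array
    {
        try {
            $this->cache->get('healthcheck');
            return ['cache' => 'ok'];
        } catch (RuntimeException $e) {
            return ['cache' => 'failed'];
        }
    }
}

$cache = new FakeCache();
$service = new GreetingService($cache);

$cache->down = true;                 // 1. simulate failure
$health = $service->health();        // 2. monitoring should report it
$greeting = $service->greet('Dave'); // 3. the feature should still work
```

The second assertion matters as much as the third: a system that silently degrades without its monitoring noticing hides the failure until something else breaks.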

Slide 81

Slide 81 text

In conclusion

Slide 82

Slide 82 text

“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

Slide 83

Slide 83 text

You should test for failure
How does the software react?
How does the PHP client react?

Slide 84

Slide 84 text

Automation makes continuous failure testing feasible

Slide 85

Slide 85 text

Systems that cope well with failure are easier to operate

Slide 86

Slide 86 text

TIMED BLOCK ALL THE THINGS

Slide 87

Slide 87 text

Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.

Slide 88

Slide 88 text

Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix
Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator
Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems