Planning to Fail (PHPNE)

Planning

the taxi app!

Planning

Why? http://en.wikipedia.org/wiki/High_availability

99.9% (three nines) Downtime: 43.8 minutes per month, 8.76 hours per year

99.99% (four nines) Downtime: 4.32 minutes per month, 52.56 minutes per year

99.999% (five nines) Downtime: 25.9 seconds per month, 5.26 minutes per year
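
The downtime figures above follow from simple arithmetic. The sketch below is an illustration, not part of the original slides; the function name `allowedDowntimeSeconds` is made up for this example. Note that the monthly figure depends on the month length you assume: for three nines a 30-day month gives 43.2 minutes, while 365/12 days gives 43.8 minutes.

```php
<?php
/**
 * Allowed downtime, in seconds, for a given availability over a period.
 * Hypothetical helper for illustration only.
 */
function allowedDowntimeSeconds(float $availabilityPercent, float $periodDays): float
{
    $periodSeconds = $periodDays * 24 * 60 * 60;
    // The unavailable fraction is whatever is left after the "nines"
    return $periodSeconds * (1 - $availabilityPercent / 100);
}

foreach ([99.9, 99.99, 99.999] as $availability) {
    printf(
        "%s%%: %.2f minutes per 30-day month, %.2f minutes per 365-day year\n",
        $availability,
        allowedDowntimeSeconds($availability, 30) / 60,
        allowedDowntimeSeconds($availability, 365) / 60
    );
}
```

Each extra nine divides the downtime budget by ten, which is why the jump from three to five nines is so expensive.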

www.whoownsmyavailability.com ?

www.whoownsmyavailability.com YOU

The beginning

My website: single VPS running PHP + MySQL

No growth, low volume, simple functionality, one engineer (me!)

Large growth, high volume, complex functionality, lots of engineers

• Launched in London November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
And you?! (Hailo is hiring!)

Our overall reliability is in danger

Embracing failure (a coping strategy)

VPC (running PHP+MySQL) reliable?!

Reliable !== Resilient

Choosing a stack

“Hailo” (running PHP+MySQL) reliable?!

Service Oriented Architecture
Each service does one job well!

• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed

Service (running PHP+MySQL) reliable?!

Service → MySQL: MySQL running on a different box

Service → MySQL, MySQL, MySQL: MySQL running in Multi-Master mode

Going global

Separating concerns
MySQL: CRUD, locking, search, analytics, ID generation (also queuing…!)

At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html

Cassandra
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!

ZooKeeper
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use it sparingly

Elasticsearch
• Distributed, RESTful search engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

NSQ
• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant, highly available with reliable message delivery guarantee

Acunu Analytics
• Real time incremental analytics platform, backed by Apache Cassandra
• Powerful SQL-like interface
• Scalable and highly available

Cruftflake
• Distributed ID generation with no coordination required
• Rock solid
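
One common way to generate IDs without coordination is to pack a timestamp, a per-machine identifier and a per-millisecond sequence number into a single 64-bit integer, as Twitter Snowflake does. The sketch below illustrates that idea only. The bit layout (41/10/12), the epoch and the function names are assumptions for this example, not taken from the slides or from the Cruftflake source, and it needs a 64-bit PHP build.

```php
<?php
// Assumed custom epoch: 2012-01-01 00:00:00 UTC, in milliseconds (arbitrary choice)
const EPOCH_MS = 1325376000000;

/**
 * Pack timestamp (41 bits), machine ID (10 bits) and sequence (12 bits)
 * into one integer. Hypothetical layout borrowed from Twitter Snowflake.
 */
function composeId(int $timestampMs, int $machineId, int $sequence): int
{
    $elapsed = $timestampMs - EPOCH_MS;
    if ($elapsed < 0 || $elapsed >= (1 << 41)) {
        throw new InvalidArgumentException('timestamp out of range');
    }
    if ($machineId < 0 || $machineId > 1023) {
        throw new InvalidArgumentException('machine ID must fit in 10 bits');
    }
    if ($sequence < 0 || $sequence > 4095) {
        throw new InvalidArgumentException('sequence must fit in 12 bits');
    }
    return ($elapsed << 22) | ($machineId << 12) | $sequence;
}

/** Split an ID back into its three parts. */
function decomposeId(int $id): array
{
    return [
        'timestampMs' => ($id >> 22) + EPOCH_MS,
        'machineId'   => ($id >> 12) & 0x3FF,
        'sequence'    => $id & 0xFFF,
    ];
}
```

Because the timestamp occupies the high bits, IDs from any machine sort roughly by creation time, and two machines with different machine IDs can never produce the same value.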

• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design

Lessons learned

Minimise the critical path

What is the minimum viable service?

class HailoMemcacheService
{
    private $mc = null;
    private $servers = array(); // list of array(host, port) pairs

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        // Nothing is created or connected until the first call
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use

Configure clients carefully

$this->mc = new \Memcached;
$this->mc->addServers($s);
$this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Choose timeouts based on data: here?!

“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

95th percentile: here?!
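
To make "choose timeouts based on data" concrete, here is a minimal sketch that derives a candidate timeout from measured latencies using the nearest-rank percentile method. The function name `percentile` and the sample values are invented for this example. In practice the samples would come from your own measurements, and you may want headroom above the raw percentile.

```php
<?php
/**
 * Nearest-rank percentile: the smallest sample such that at least
 * $p percent of all samples are less than or equal to it.
 */
function percentile(array $samples, float $p): float
{
    if (count($samples) === 0) {
        throw new InvalidArgumentException('no samples');
    }
    if ($p <= 0 || $p > 100) {
        throw new InvalidArgumentException('percentile must be in (0, 100]');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[$rank - 1];
}

// Hypothetical response times in milliseconds, including one slow outlier
$latenciesMs = [12, 15, 11, 14, 13, 18, 16, 12, 250, 17];
$candidateTimeoutMs = percentile($latenciesMs, 95);
echo "Candidate timeout: {$candidateTimeoutMs} ms\n";
```

The choice of percentile is a trade-off: a lower one fails faster but cuts off more legitimate slow requests, a higher one tolerates outliers at the cost of slower failure detection.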

• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!

• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working

Things fail in exotic ways

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!

Packets silently dropped. Expect long timeouts.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.

• When AWS instances hang they appear to accept connections but drop packets
• Bug! https://bugs.launchpad.net/libmemcached/+bug/583031

Fix, rinse, repeat

Service → AMQP (port 5672) → RabbitMQ HA cluster (RabbitMQ, RabbitMQ, RabbitMQ)

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Fantastic!

Block AMQP port, client times out

“RabbitMQ clusters do not tolerate network partitions well.”
http://www.rabbitmq.com/partitions.html

$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278

Each node listens on a port assigned by EPMD

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Hangs! Uh oh.

Mnesia('rabbit@dmzutilities03-global01-test'): ** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network,
     'rabbit@dmzutilities01-global01-test'}
application: rabbitmq_management
    exited: shutdown
    type: temporary

RabbitMQ logs show partitioned network error; nodes shut down

while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread($this->sock->real_sock(), $n - $read)))
) {
    $read += strlen($buf);
    $res .= $buf;
}

PHP library didn’t have any time limit on reading a frame
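
One way to put a time limit on a read loop like the one above is to track an overall deadline and bound each blocking `fread()` with `stream_set_timeout()`. This is a sketch only, not the fix that was applied to the PHP AMQP library. The function name `readWithDeadline` is hypothetical, and it assumes `$sock` is a blocking socket stream.

```php
<?php
/**
 * Read exactly $n bytes from a socket stream, giving up once $timeout
 * seconds have passed in total. Hypothetical helper for illustration.
 */
function readWithDeadline($sock, int $n, float $timeout): string
{
    $deadline = microtime(true) + $timeout;
    $res = '';
    while (strlen($res) < $n) {
        $remaining = $deadline - microtime(true);
        if ($remaining <= 0) {
            throw new RuntimeException('read timed out');
        }
        // Bound this fread() by the time left until the overall deadline
        $sec = (int) floor($remaining);
        $usec = (int) (($remaining - $sec) * 1000000);
        if ($sec === 0 && $usec === 0) {
            $usec = 1000; // avoid passing a zero timeout
        }
        stream_set_timeout($sock, $sec, $usec);

        $buf = fread($sock, $n - strlen($res));
        $meta = stream_get_meta_data($sock);
        if ($meta['timed_out']) {
            throw new RuntimeException('read timed out');
        }
        if ($buf === false || ($buf === '' && feof($sock))) {
            throw new RuntimeException('connection closed');
        }
        $res .= $buf;
    }
    return $res;
}
```

With a limit like this, a peer that accepts the connection but silently drops packets produces an exception after the deadline instead of a hung process, and the caller can decide whether to retry, fail over or degrade.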

Fix, rinse, repeat

It would be nice if we could
automate this

Automate!

• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations

Fix attempt 1: bad timeouts configured

Fix attempt 2: better timeouts

Simulate in system tests

Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
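
The three steps above can be sketched as a small self-contained test. Everything here is hypothetical: `Cache`, `FailingCache`, `QuoteService`, its `health()` method and the `check()` helper are invented for this example, with an in-memory fake standing in for the real failure simulation and monitoring endpoint.

```php
<?php
interface Cache
{
    public function get(string $key): ?string;
}

/** Stand-in for a cache node that has failed. */
class FailingCache implements Cache
{
    public function get(string $key): ?string
    {
        throw new RuntimeException('cache unreachable');
    }
}

/** Hypothetical service that treats its cache as optional. */
class QuoteService
{
    private $cache;
    private $cacheHealthy = true;

    public function __construct(Cache $cache)
    {
        $this->cache = $cache;
    }

    public function quote(string $route): string
    {
        try {
            $cached = $this->cache->get($route);
            $this->cacheHealthy = true;
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            // Record the failure for monitoring, then carry on without the cache
            $this->cacheHealthy = false;
        }
        return 'computed:' . $route;
    }

    /** What a monitoring endpoint might report. */
    public function health(): array
    {
        return ['cache' => $this->cacheHealthy ? 'ok' : 'failing'];
    }
}

function check(bool $ok, string $message): void
{
    if (!$ok) {
        throw new LogicException('Assertion failed: ' . $message);
    }
}

// 1. Simulate failure
$service = new QuoteService(new FailingCache());
$result = $service->quote('A-B');
// 2. Assert monitoring picks this up
check($service->health()['cache'] === 'failing', 'monitoring reports the cache failure');
// 3. Assert features still work
check($result === 'computed:A-B', 'quotes are still served without the cache');
```

In a real system test the failure would be injected at the network or process level and the health check would be an HTTP call, but the shape of the test stays the same.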

In conclusion

“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

You should test for failure
How does the software react? How does the PHP client react?

Automation makes continuous failure testing feasible

Systems that cope well with failure are easier
to operate

TIMED BLOCK ALL THE THINGS

Thanks

Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.

Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix
Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator
Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems


More Decks by Dave Gardner
