Planning to Fail

Dave Gardner
February 23, 2013


How to build resilient and reliable services by embracing failure.


Transcript

  1. Planning

  2. dave!

  3. the taxi app!

  4. Planning

  5. Planning

  6. Planning

  7. The beginning

  8. <?php

  9. My website: single VPS running PHP + MySQL

  10. No growth, low volume, simple functionality, one engineer (me!)

  11. Large growth, high volume, complex functionality, lots of engineers

  12. • Launched in London November 2011
      • Now in 5 cities in 3 countries (30%+ growth every month)
      • A Hailo hail is accepted around the world every 5 seconds
  13. “.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
      http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
  14. • SOA (10+ services)
      • AWS (3 regions, 9 AZs, lots of instances)
      • 10+ engineers building services
      and you?! (hailo is hiring)!
  15. Our overall reliability is in danger

  16. Embracing failure (a coping strategy)

  17. None
  18. VPS (running PHP+MySQL) reliable?!

  19. Reliable !== Resilient

  20. None
  21. Choosing a stack

  22. “Hailo” (running PHP+MySQL) reliable?!

  23. Service, Service, Service, Service: each service does one job well! (Service Oriented Architecture)
  24. • Fewer lines of code
      • Fewer responsibilities
      • Changes less frequently
      • Can swap entire implementation if needed
  25. Service (running PHP+MySQL) reliable?!

  26. Service + MySQL: MySQL running on a different box

  27. Service + MySQL + MySQL: MySQL running in Multi-Master mode
  28. Going  global  

  29. Separating concerns: CRUD, locking, search, analytics, ID generation (also queuing…!) split out of MySQL
  30. At Hailo we look for technologies that are:
      • Distributed: run on more than one machine
      • Homogeneous: all nodes look the same
      • Resilient: can cope with the loss of node(s) with no loss of data
  31. “There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
      http://blog.b3k.us/2012/01/24/some-rules.html
  32. • Highly performant, scalable and resilient data store
      • Underpins much of what we do at Hailo
      • Makes multi-DC easy!
  33. ZooKeeper
      • Highly reliable distributed coordination
      • We implement locking and leadership election on top of ZK and use sparingly
  34. • Distributed, RESTful search engine built on top of Apache Lucene
      • Replaced basic foo LIKE ‘%bar%’ queries (so much better)
  35. NSQ
      • Realtime message processing system designed to handle billions of messages per day
      • Fault tolerant, highly available with reliable message delivery guarantee
  36. Cruftflake
      • Distributed ID generation with no coordination required
      • Rock solid
  37. • All these technologies have similar properties of distribution and resilience
      • They are designed to cope with failure
      • They are not broken by design
  38. Lessons learned

  39. Minimise the critical path

  40. What is the minimum viable service?

  41. class HailoMemcacheService
      {
          private $mc = null;

          public function __call($method, $args)
          {
              $mc = $this->getInstance();
              // do stuff
          }

          private function getInstance()
          {
              if ($this->mc === null) {
                  $this->mc = new \Memcached;
                  $this->mc->addServers($s); // $s: configured server list
              }
              return $this->mc;
          }
      }
      Lazy-init instances; connect on use
  42. Configure clients carefully

  43. $this->mc = new \Memcached;
      $this->mc->addServers($s);
      $this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
      $this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
      $this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
      Make sure timeouts are configured
  44. Choose timeouts based on data here?!

  45. “Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
      http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  46. 95th percentile here?!
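
      Slides 44–46 argue for choosing timeouts from measured latency (e.g. the 95th percentile) rather than guesswork. A minimal sketch of that calculation; the sample data and the 2x headroom factor are hypothetical:

      ```php
      <?php
      // Sketch: derive a client timeout from observed latencies, per the
      // slides' advice to "choose timeouts based on data". The sample data
      // and the 2x headroom factor are illustrative, not Hailo's numbers.

      function percentile(array $samples, $p)
      {
          sort($samples);
          $index = (int) ceil($p * count($samples)) - 1;
          return $samples[max(0, $index)];
      }

      // Hypothetical Memcached round-trip latencies, in milliseconds.
      $latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 8, 40];

      $p95 = percentile($latenciesMs, 0.95);

      // Leave headroom above the p95 so healthy requests never hit the
      // timeout, while a hung server still fails fast.
      $timeoutMs = (int) ceil($p95 * 2);

      echo "p95={$p95}ms, timeout={$timeoutMs}ms\n"; // prints "p95=40ms, timeout=80ms"
      ```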

  47. Test  

  48. • Kill memcache on box A, measure impact on application
      • Kill memcache on box B, measure impact on application
      All fine.. we’ve got this covered!
  49. FAIL  

  50. • Box A, running in AWS, locks up
      • Any parts of application that touch Memcache stop working
  51. Things fail in exotic ways

  52. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j REJECT
      $ php test-memcache.php
      Working OK!
      Packets rejected and source notified by ICMP. Expect fast fails.
  53. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 -j DROP
      $ php test-memcache.php
      Working OK!
      Packets silently dropped. Expect long timeouts.
  54. $ iptables -A INPUT -i eth0 \
          -p tcp --dport 11211 \
          -m state --state ESTABLISHED \
          -j DROP
      $ php test-memcache.php
      Hangs! Uh oh.
  55. • When AWS instances hang they appear to accept connections but drop packets
      • Bug!
      https://bugs.launchpad.net/libmemcached/+bug/583031
  56. Fix, rinse, repeat

  57. It would be nice if we could automate this
  58. Automate!  

  59. • Hailo run a dedicated automated test environment
      • Powered by bash, JMeter and Graphite
      • Continuous automated testing with failure simulations
  60. Fix attempt 1: bad timeouts configured

  61. Fix attempt 2: better timeouts

  62. Simulate in system tests

  63. Simulate failure
      Assert monitoring endpoint picks this up
      Assert features still work
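
      The steps on slide 63 can be sketched as a runnable system test: simulate a dependency failure, assert that monitoring would notice, and assert the feature still works. FlakyCache and Feature are illustrative stand-ins, not Hailo code:

      ```php
      <?php
      // Sketch: a system test that simulates a dependency failure, then
      // asserts the degradation is observable and the feature still works.
      // All class and method names here are hypothetical.

      class FlakyCache
      {
          public $down = false;

          public function get($key)
          {
              if ($this->down) {
                  throw new \RuntimeException('cache unreachable');
              }
              return null; // cache miss in the healthy case
          }
      }

      class Feature
      {
          private $cache;
          public $degraded = false; // exposed via a monitoring endpoint in real life

          public function __construct(FlakyCache $cache)
          {
              $this->cache = $cache;
          }

          public function greeting($user)
          {
              try {
                  $cached = $this->cache->get("greeting:$user");
              } catch (\RuntimeException $e) {
                  $this->degraded = true; // monitoring picks this up
                  $cached = null;
              }
              // Fall back to computing the value; the feature keeps working.
              return $cached !== null ? $cached : "Hello, $user";
          }
      }

      $cache = new FlakyCache();
      $feature = new Feature($cache);

      $cache->down = true; // simulate failure
      assert($feature->greeting('dave') === 'Hello, dave'); // feature still works
      assert($feature->degraded === true); // monitoring endpoint picks this up
      ```
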
  64. In conclusion

  65. “the best way to avoid failure is to fail constantly.”
      http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
  66. TIMED BLOCK ALL THE THINGS
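
      A minimal sketch of what a "timed block" could look like: every external call is wrapped so its latency is always measured and its failure is contained. The names and the fallback approach are illustrative, not Hailo's actual implementation:

      ```php
      <?php
      // Sketch of a "timed block": wrap an external call with a stopwatch
      // and an exception guard. All names here are hypothetical.

      function timedBlock($name, callable $fn, $fallback = null)
      {
          $start = microtime(true);
          try {
              $result = $fn();
          } catch (\Exception $e) {
              // Contain the failure: degrade to a fallback instead of propagating.
              $result = $fallback;
          }
          $elapsedMs = (microtime(true) - $start) * 1000;
          // In production this measurement would feed a metrics system (e.g. Graphite).
          error_log(sprintf('%s took %.2f ms', $name, $elapsedMs));
          return $result;
      }

      // A healthy call returns its value; a failing call returns the fallback.
      $ok = timedBlock('memcache.get', function () { return 'cached-value'; });
      $degraded = timedBlock('memcache.get', function () {
          throw new \RuntimeException('simulated failure');
      }, null);
      ```

      Wrapping calls this way means timing data exists for every dependency before it fails, which is exactly what the earlier timeout-tuning slides need as input.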

  67. Thanks
      Software used at Hailo:
      http://cassandra.apache.org/
      http://zookeeper.apache.org/
      http://www.elasticsearch.org/
      http://www.acunu.com/acunu-analytics.html
      https://github.com/bitly/nsq
      https://github.com/davegardnerisme/cruftflake
      https://github.com/davegardnerisme/nsqphp
      Plus a load of other things I’ve not mentioned.
  68. Further reading
      Hystrix: Latency and Fault Tolerance for Distributed Systems
      https://github.com/Netflix/Hystrix
      Timelike: a network simulator
      http://aphyr.com/posts/277-timelike-a-network-simulator
      Notes on distributed systems for young bloods
      http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
      Stream de-duplication (relevant to NSQ)
      http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
      ID generation in distributed systems
      http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems