Breaking Point: Building Scalable, Resilient APIs

Mark Hibberd
February 10, 2015

The default position of a distributed system is failure. Networks fail. Machines fail. Systems fail.

The problem is that APIs are, at their core, complex distributed systems. At some point in their lifetime, APIs will likely have to scale: to high volume, large data sets, a large number of clients, or simply to rapid change. When this happens, we want our systems to bend, not break.

This talk is a tour of how systems fail, combining analysis of how complex systems break at scale with anecdotes capturing the lighter side of catastrophic failure. We then ground this in a set of practical tools and techniques for building and testing complex systems for reliability.

Transcript

  1. @markhibberd
    Breaking Point
    Building Scalable,
    Resilient APIs

  2. “A common mistake that people make when
    trying to design something completely
    foolproof was to underestimate the
    ingenuity of complete fools.”
    Douglas Adams,
    Mostly Harmless (1992)

  3. How Did We Get Here

  4. THE API

  5. THE API

  6. THE API

  7. THE API

  8. THE API

  9. THE API

  10. THE API

  11. THE API

  12. THE API
    G

  13. THE API
    G

  14. THE API
    G

  15. THE API
    G

  16. THE API
    G

  17. THE API
    G

  18. THE API
    G

  19. THE API
    G
    $ $ $ $

  20. THE API
    G
    $ $ $ $

  21. THE API
    G
    $ $ $ $

  22. failure
    is inevitable

  23. How Systems Fail

  24. “You live and learn. At any rate, you live.”
    Douglas Adams,
    Mostly Harmless (1992)

  25. The Crash
    one

  26.

  27.

  28.

  29.

  30. Systems Never
    Fail Cleanly

  31. Cascading Failures
    two

  32.

  33.

  34.

  35.

  36.

  37.

  38. OCTOBER 27, 2012
    Bug triggers
    Cascading Failures
    Causes Major Amazon Outage
    https://aws.amazon.com/message/680342/

  39. At 10:00AM PDT Monday, a small number of Amazon Elastic
    Block Store (EBS) volumes in one of our five Availability
    Zones in the US-East Region began seeing degraded
    performance, and in some cases, became “stuck”

  40. Can Be
    Triggered By
    As Little As A
    Performance
    Issue

  41. Don’t Listen To Programmers,
    Performance Matters

  42. (well, at least asymptotics matter)

  43. A Failure is Indistinguishable from a
    Slow Response

  44. Chain Reactions
    three

  45.

  46. 12.5%

  47. 14.3%

  48. 17%

  49. 17%
    100k req/s
    12.5k >>> 14.3k >>> 17k req/s

  50. 17%
    100k req/s
    12.5k >>> 14.3k >>> 17k req/s
    4.5% extra traffic
    means a 36% load
    increase on each
    server

  51. 25%
    100k req/s
    12.5% extra traffic
    means a 300% load
    increase on each
    server
    12.5k >>> 14.3k >>> 17k >>> 50k req/s
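
To make the chain-reaction arithmetic concrete, here is a minimal sketch (not from the talk) that recomputes the per-server share of a fixed 100k req/s load as an eight-server pool loses members. The 12.5k >>> 14.3k >>> 17k progression, and the growing burden on each survivor, fall straight out of the division.

    # Sketch: how per-server load grows as servers drop out of a fixed-traffic pool.
    # Assumes 100k req/s spread evenly over 8 identical servers (illustrative numbers).

    TOTAL_RPS = 100_000
    INITIAL_SERVERS = 8

    baseline = TOTAL_RPS / INITIAL_SERVERS  # 12.5k req/s per server

    for remaining in range(INITIAL_SERVERS, 1, -1):
        per_server = TOTAL_RPS / remaining
        share = per_server / TOTAL_RPS * 100
        increase = (per_server - baseline) / baseline * 100
        print(f"{remaining} servers: {per_server / 1000:.1f}k req/s each "
              f"({share:.1f}% of traffic, +{increase:.0f}% over baseline)")

    # 8 servers -> 12.5k each, 7 -> 14.3k, 6 -> 16.7k (the talk rounds this to 17k,
    # roughly a 36% increase); by the time only 2 remain, each carries 50k, +300%.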

  52. Capacity Skew
    four

  53.

  54.

  55.

  56.

  57. JUNE 11, 2010
    A Perfect Storm.....of Whales
    Heavy Loads
    Causes Series of Twitter Outages
    During World Cup
    http://engineering.twitter.com/2010/06/perfect-stormof-whales.html

  58. Since Saturday, Twitter has experienced several
    incidences of poor site performance and a high number of
    errors due to one of our internal sub-networks being
    over-capacity.

  59. Self Denial of Service
    five

  60. Clients
    Server

  61. Clients
    Server

  62. Clients
    Server

  63. Clients
    Server

  64. Clients
    Server

  65. a very painful experience
    The Quiet Time

  66. critical licensing service, 100 million+ active users a day,
    millions of $$$.
    A couple of “simple” services. Thick clients, non-updatable,
    load-balanced on client.

  67. server client

  68. /call
    server client
    on-demand

  69. /call
    server client
    on-demand

  70. /call
    server client
    /check
    on-demand
    periodically

  71. /call
    server client
    /check
    on-demand
    periodically

  72. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  73. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  74. /call
    server
    /check
    /check2
    /check2z
    /v3check

  75. System Collusion
    six

  76.

  77.

  78.

  79. NOVEMBER 14, 2014
    Link Imbalance
    Mystery Issue
    Causes Metastable Failure State
    https://code.facebook.com/posts/1499322996995183/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/

  80. Bonded Link
    Should Evenly
    Utilise Each
    Network Pipe

  81. multiple causes
    for imbalance

  82. systems couldn’t
    correct
    themselves

  83. individually each
    component was
    behaving correctly

  84. Temporary
    Latency To Db
    Caused Skew
    Connection Pool
    Started Favouring
    Overloaded Link

  85. failure
    is not clean

  86. Incident Reports Are For
    You and Your Customers

  87. A Good Incident Report Helps Others
    Learn From Your Mistake, And Ensures
    You Really Understand What Went
    Wrong

  88. 5 Steps To A Good Incident Report
    1. Summary & Impact
    2. Timeline
    3. Root Cause
    4. Resolution and Recovery
    5. Corrective and Preventative Measures
    https://sysadmincasts.com/episodes/20-how-to-write-an-incident-report-postmortem

  89. How To Control Failure

  90. “Anything that happens, happens.
    Anything that, in happening, causes something
    else to happen, causes something else to happen.
    Anything that, in happening, causes itself to
    happen again, happens again.
    It doesn’t necessarily do it in chronological
    order, though.”
    Douglas Adams,
    Mostly Harmless (1992)

  91. Timeouts
    one

  92. Other Systems Will Always Be Your
    Most Vulnerable Failure Modes

  93. Never Make
    a Network Call, or Go to Disk,
    Without a Timeout

  94. Pay Particular Attention to
    Reusable & Pooled Resources

  95. Backoff Requests
    That Time Out
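
A minimal sketch of the two rules above, using only the standard library (the URL and retry budget are placeholders, not anything prescribed by the talk): every call carries a hard timeout, and calls that fail back off exponentially with jitter instead of retrying immediately.

    # Sketch: never make a network call without a timeout, and back off requests
    # that time out so a fleet of clients does not hammer a struggling server.
    import random
    import time
    import urllib.request

    def call_with_backoff(url, attempts=4, timeout=2.0, base_delay=0.5):
        """GET with a hard timeout; on failure, sleep with exponential backoff plus jitter."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    return response.read()
            except OSError:  # URLError, socket timeouts and connection errors all derive from OSError
                if attempt == attempts - 1:
                    raise  # out of retries: surface the failure to the caller
                delay = base_delay * (2 ** attempt)
                time.sleep(random.uniform(0, delay))  # jitter avoids synchronised retries

    # Usage (hypothetical endpoint):
    # body = call_with_backoff("https://api.example.com/check")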

  96. Heartbeats
    two

  97. If You Know Something Is Failing,
    Fail Fast

  98. /status
    {
      "name": 123,
      "version": "mth",
      "stats": {…},
      "status": "ok"
    }
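
As a sketch of how such a heartbeat can be served and used to fail fast, here is a minimal /status endpoint built on the standard library. The service name, version and the dependency check are illustrative placeholders; the important part is that an unhealthy node answers with a non-200 status so callers and load balancers stop sending it traffic.

    # Sketch: a /status heartbeat that fails fast when the node knows it is unhealthy.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def dependencies_healthy():
        """Placeholder: real checks would probe pools, disks and downstream services."""
        return True

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            healthy = dependencies_healthy()
            body = json.dumps({
                "name": "licensing-api",   # illustrative service name
                "version": "1.0.0",        # illustrative version
                "stats": {},               # request counters, pool sizes, ...
                "status": "ok" if healthy else "failing",
            }).encode()
            self.send_response(200 if healthy else 503)  # fail fast: advertise ill health
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), StatusHandler).serve_forever()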

  99.

  100.

  101.

  102.

  103.

  104.

  105.

  106. Circuit Breakers
    three

  107.

  108.

  109.

  110. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      },
      "friends": [
        191,
        1
      ]
    }

  111. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      },
      "friends": [
        191,
        1
      ]
    }

  112. {
      "id": 123,
      "username": "mth",
      "profile": {
        "bio": "…",
        "image": "…"
      }
    }

  113. Requires Co-Ordination to Manage
    Degradation Of Service
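
A minimal circuit-breaker sketch in the spirit of the slides above: the breaker guards the friends service, and when it is open the profile is returned without the "friends" field (slide 112) rather than failing the whole request. Thresholds and the fetch functions are illustrative, and real deployments need the co-ordination the slide mentions so every caller degrades the same way.

    # Sketch: circuit breaker around one dependency, degrading the response when open.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means closed: calls flow normally

        def allow(self):
            if self.opened_at is None:
                return True
            # Half-open: after the reset timeout, let a call through to probe recovery.
            return time.monotonic() - self.opened_at >= self.reset_timeout

        def record_success(self):
            self.failures = 0
            self.opened_at = None

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

    friends_breaker = CircuitBreaker()

    def get_profile(user_id, fetch_profile, fetch_friends):
        """Return the profile; drop 'friends' rather than fail the whole request."""
        profile = fetch_profile(user_id)
        if friends_breaker.allow():
            try:
                profile["friends"] = fetch_friends(user_id)
                friends_breaker.record_success()
            except Exception:
                friends_breaker.record_failure()  # degrade instead of propagating the error
        return profile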

  114. Partitioning
    four

  115.

  116. traffic
    Spike

  117. traffic
    Spike

  118. traffic
    Spike

  119.

  120.

  121. traffic
    Spike

  122. traffic
    Spike

  123. survives
    another
    day

  124. Partitioning Can Also Be Performed
    Within Services Via Limited Thread &
    Resource Pools
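
A sketch of that in-process partitioning (a bulkhead), assuming two request classes like the /shout and /download examples on the next slides: each class gets its own bounded thread pool, so slow downloads cannot starve the fast requests. Pool sizes and handlers are illustrative.

    # Sketch: separate, bounded worker pools per request class inside one service.
    from concurrent.futures import ThreadPoolExecutor

    POOLS = {
        "/shout": ThreadPoolExecutor(max_workers=32),    # fast, cheap requests
        "/download": ThreadPoolExecutor(max_workers=4),  # slow, heavy requests
    }

    def handle_shout(payload):
        return payload.upper()

    def handle_download(payload):
        return b"..."  # stand-in for a long-running transfer

    HANDLERS = {"/shout": handle_shout, "/download": handle_download}

    def dispatch(path, payload):
        """Submit work to the pool for its request class; each class queues and fails alone."""
        return POOLS[path].submit(HANDLERS[path], payload)

    # Even if every /download worker is busy, /shout still has its own 32 threads.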

  125. multiple types
    of request
    /shout /download

  126. multiple types
    of request
    /shout /download

  127. multiple types
    of request
    /shout /download
    fast

  128. multiple types
    of request
    /shout /download
    fast slow

  129. multiple types
    of request
    /shout /download
    fast really
    really
    slow

  130. multiple types
    of request
    /shout /download

  131. multiple types
    of request
    /shout /download

  132. multiple types
    of request
    /shout /download

  133. Backpressure
    five

  134. Fail One Request,
    Instead of Failing All Requests

  135.

  136. Measure And
    Signal Slow
    Requests

  137. Upstream
    Drops Requests
    To Allow For
    Recovery
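
A sketch of backpressure at a service boundary, with an illustrative in-flight limit: admission is bounded, and once the service is saturated new requests are rejected immediately (failing one request) so upstream callers can drop or retry later and the service gets room to recover.

    # Sketch: bounded admission; shed excess requests instead of letting everything slow down.
    import threading

    class AdmissionLimiter:
        def __init__(self, max_in_flight=100):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def try_handle(self, work, *args):
            # Non-blocking acquire: if no slot is free, signal "rejected" right away.
            if not self._slots.acquire(blocking=False):
                return ("rejected", None)
            try:
                return ("ok", work(*args))
            finally:
                self._slots.release()

    limiter = AdmissionLimiter(max_in_flight=100)

    def handle_request(query):
        return query[::-1]  # stand-in for real work

    status, result = limiter.try_handle(handle_request, "hello")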

  138. Taper Limits

  139. Taper Limits

  140. Taper Limits
    200k Db
    Requests

  141. Taper Limits
    200k Db
    Requests
    50k Server
    Requests

  142. Taper Limits
    200k Db
    Requests
    50k Server
    Requests
    45k Proxy
    Requests
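
The limits taper as you move towards the client: using the slide's numbers, the database can take 200k requests, the servers admit only 50k, and the proxy admits slightly less again at 45k, so overload is shed at the edge rather than deep in the stack. A tiny sketch of checking that property in configuration (the layer names come from the slides; the check itself is illustrative):

    # Sketch: each layer closer to the client admits less than the layer behind it.
    LIMITS = {              # requests/second each layer will admit
        "db": 200_000,
        "server": 50_000,
        "proxy": 45_000,
    }

    def check_taper(limits, order=("db", "server", "proxy")):
        for deeper, closer in zip(order, order[1:]):
            assert limits[closer] < limits[deeper], (
                f"{closer} limit must taper below {deeper} capacity")

    check_taper(LIMITS)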

  143. failure
    can be mitigated

  144.

  145. How To Prevent Failure

  146. “A beach house isn't just real estate. It's a
    state of mind.”
    Douglas Adams,
    Mostly Harmless (1992)

  147. one
    Testing

  148. the testing you probably
    don’t want to do is the
    testing you need to do most

  149. Speed
    Scalability
    Stability

  150. tools help:
    charles proxy
    ipfw / pf netem
    monitoring tools
    simian army
    ab / siege

  151. two
    Measure Everything

  152. every result computed
    should have traceability
    back to the code & data

  153. gather metadata for
    everything that
    touches a request

  154. services: {
      auth: {…}
    }

  155. services: {
      auth: {…},
      profile: {…},
      recommend: {…}
    }

  156. services: {
      auth: {…},
      profile: {…},
      recommend: {…},
      friends: {…}
    }

  157. services: {
      auth: {…},
      profile: {…},
      recommend: {…},
      friends: {…}
    }
    {
      version: {…},
      stats: {…},
      source: {…}
    }
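
A sketch of carrying that metadata with the response, in the shape of slides 154-157: every service that touches the request records which code version produced its part of the answer, some stats, and the data source it read, so any computed result can be traced back to code and data. Service names mirror the slides; the versions and sources are placeholders.

    # Sketch: per-service version/stats/source metadata attached to each response.
    import time

    def service_metadata(version, source):
        return {"version": version, "stats": {"started_at": time.time()}, "source": source}

    def handle(request_id):
        response = {"result": {"request_id": request_id}, "services": {}}
        # Each hop records what code (version) and what data (source) it used.
        response["services"]["auth"] = service_metadata("2.3.1", {"store": "users-db"})
        response["services"]["profile"] = service_metadata("1.9.0", {"store": "profiles-db"})
        response["services"]["recommend"] = service_metadata("0.4.2", {"model": "2015-02-01"})
        response["services"]["friends"] = service_metadata("3.0.0", {"store": "graph-db"})
        return response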

  158. statistics work,
    measurements over time
    will find errors

  159. deviation:
    percentiles:
    90:
    95:
    histogram:
    20x: 121
    30x: 12
    40x: 13
    50x: 121311313

  160. statistics work,
    we can use them to automate
    corrective actions
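
A sketch of the kind of rolling statistics behind slide 159, plus one automated corrective action; the window size and thresholds are illustrative, not from the talk.

    # Sketch: rolling latency percentiles and a status-class histogram, driving
    # an automated "fail fast" decision when 5xx responses dominate.
    from collections import Counter, deque
    from statistics import quantiles, stdev

    WINDOW = 1000
    latencies = deque(maxlen=WINDOW)   # seconds, most recent requests
    status_classes = Counter()         # "20x", "30x", "40x", "50x"

    def observe(latency, status_code):
        latencies.append(latency)
        status_classes[f"{status_code // 100}0x"] += 1

    def summary():
        p90 = p95 = None
        if len(latencies) >= 20:
            cuts = quantiles(latencies, n=20)   # 5% steps; cuts[17] is p90, cuts[18] is p95
            p90, p95 = cuts[17], cuts[18]
        return {
            "deviation": stdev(latencies) if len(latencies) > 1 else 0.0,
            "percentiles": {"90": p90, "95": p95},
            "histogram": dict(status_classes),
        }

    def should_fail_fast():
        # Corrective action: if most recent responses are 5xx, report unhealthy so
        # heartbeats fail and traffic is routed away while the service recovers.
        total = sum(status_classes.values())
        return total > 100 and status_classes["50x"] / total > 0.5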

  161. three
    Production In Development

  162. production quality data
    automation of environments
    lots of testing

  163. production quality data
    automation of environments
    lots of testing
    Rather Old Hat

  164. four
    Development in Production

  165. yes, really.
    i want to ship your worst,
    un-tried, experimental
    code to production

  166. @ambiata
    we deal with ingesting and processing lots of data
    100s TB / per day / per customer
    scientific experiment and measurement is key
    experiments affect users directly
    researchers / non-specialist engineers produce code

  167. query
    /chord

  168. query
    /chord {id: ab123}
    datastore
    ;chord

  169. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result

  170. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client

  171. split environments

  172. query
    /chord {id: ab123}

  173. query
    /chord {id: ab123}
    production:live

  174. /chord {id: ab123}
    production:live
    proxy
    query

  175. /chord {id: ab123}
    production:exp
    proxy
    query

  176. /chord {id: ab123}
    production:*
    proxy
    query query
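
A sketch of the routing idea on slides 174-176: the proxy serves every /chord request from the live query service and mirrors the same request to an experimental one running alongside it in production, but only the live answer is ever returned to the client. The backend and record functions are placeholders, not Ambiata's implementation.

    # Sketch: shadow experimental code with real production traffic, without
    # letting it affect the response the client sees.
    import threading

    def proxy_chord(request, live_backend, experimental_backend, record):
        def shadow():
            try:
                record(request, experimental_backend(request))  # kept for crosschecking only
            except Exception:
                record(request, None)  # experimental failures must never reach the client
        threading.Thread(target=shadow, daemon=True).start()
        return live_backend(request)  # the client only ever sees the live result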

  177. implemented through machine level acls
    experiment
    live
    control

  178. implemented through machine level acls
    experiment
    live
    control
    write
    read

  179. implemented through machine level acls
    experiment
    live
    control

  180. implemented through machine level acls
    experiment
    live
    control
    write
    read

  181. implemented through machine level acls
    experiment
    live
    control
    write
    read

  182. checkpoints

  183. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  184. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  185. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  186. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    x
    x
    behaviour change
    through in-production testing

  187. query
    /chord {id: ab123}
    datastore
    ;chord
    report
    ;result
    /chord/ab123
    client
    x
    x

  188. deep implementation,
    intra- and inter-process
    crosschecks
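
A sketch of a checkpoint-style crosscheck: results recorded from the live and experimental paths are compared field by field, and a divergence flags the experiment rather than ever affecting the client. The comparison rule and alert hook are illustrative.

    # Sketch: compare live and experimental results captured at a checkpoint.
    def crosscheck(request_id, live_result, experimental_result, alert):
        """Return True when the experimental path agrees with the live path."""
        if experimental_result is None:
            alert(request_id, "experiment produced no result")
            return False
        mismatched = [
            key for key in live_result
            if live_result[key] != experimental_result.get(key)
        ]
        if mismatched:
            alert(request_id, f"experiment diverged on {sorted(mismatched)}")
            return False
        return True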

  189. tandem deployments

  190. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  191. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  192. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  193. staged deployments

  194.

  195. /chord {id: ab123}
    production:*
    proxy
    query query
    x x

  196. /chord {id: ab123}
    production:*
    proxy
    query
    x

  197. /chord {id: ab123}
    production:*
    proxy
    query
    x

  198.

  199.

  200.

  201.

  202. failure
    is inevitable

  203. failure
    is not clean

  204. failure
    can be mitigated

  205.

  206. Unmodified. CC BY 2.0 (https://creativecommons.org/licenses/by/2.0/)
    https://www.flickr.com/photos/timothymorgan/75288582/
    https://www.flickr.com/photos/timothymorgan/75288583/
    https://www.flickr.com/photos/timothymorgan/75294154/
    https://www.flickr.com/photos/timothymorgan/75593155/