Scalability and resilient engineering

SCALABILITY AND RESILIENT ENGINEERING 27.3.2014, Rubyslava

.me
• sysadmin by choice
• the OPS in devops
• 8+ years of experience
• primarily Linux
• ex-DIGMIA, currently at ESET

some buzzwords are in order

Scalability?
• adapt quickly enough to cover the demand
• dynamic – auto-scaling, elastic
• manual – do it when necessary
• scaling
  • replacing a CPU with a faster one
  • load-balancing traffic to 10 webservers

scale up vs. scale out

Resilient engineering?
• the system is designed to sustain damage, or better yet to avoid it, without going through complete failure
• know your environment, expect the unexpected and build for it

quick refresh
• scalability – adapt to demand
• resilience – durable system, can sustain failure

what happens if you don’t?

words of wisdom
• good advice
• best practices
• common guidelines

design scale into solution
• design for 20x capacity
• implement for 3x capacity
• deploy for 1.5x capacity
you will be ready and confident.
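The 20x / 3x / 1.5x rule is easy to turn into concrete numbers. A minimal sketch, assuming a current peak of 200 requests per second; that figure and the helper name `capacity_targets` are illustrative and not from the talk:

```python
def capacity_targets(current_peak_rps):
    """Return design/implement/deploy targets for a given peak load.

    The multipliers come from the slide; the load figure is an assumption.
    """
    multipliers = {"design": 20, "implement": 3, "deploy": 1.5}
    return {stage: current_peak_rps * m for stage, m in multipliers.items()}


targets = capacity_targets(200)  # 200 req/s is an assumed example figure
for stage, rps in targets.items():
    print(f"{stage}: {rps:g} req/s")
```

Read this way, the architecture should not need a redesign before 4000 req/s, the code should cope with 600 req/s, and the hardware actually running should handle 300 req/s.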

do the absolute minimum
• don’t do the same thing more than once
• serve from cache until the inputs change
• this run generates the same thing? avoid it!
• embrace caching
  • client-side – in browser (js, css, images..)
  • server-side – reverse-proxy, key/value store
  • CDN – build it vs. buy it dilemma
more (visitors) with less (hardware).
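"Serve from cache until the inputs change" can be sketched by keying the cache on a fingerprint of the inputs. This is one possible approach under stated assumptions, not the speaker's implementation; `expensive_render`, `cached_render` and the in-process dictionary are hypothetical stand-ins for real rendering and a real key/value store:

```python
import hashlib

_cache = {}  # maps input fingerprint -> rendered output
render_calls = 0  # counts how often the expensive work really runs


def expensive_render(template, data):
    """Stand-in for slow work such as rendering a page."""
    global render_calls
    render_calls += 1
    return template.format(**data)


def cached_render(template, data):
    """Render only when this exact combination of inputs is new."""
    fingerprint = hashlib.sha256(
        repr((template, sorted(data.items()))).encode("utf-8")
    ).hexdigest()
    if fingerprint not in _cache:
        _cache[fingerprint] = expensive_render(template, data)
    return _cache[fingerprint]


page = cached_render("Hello {name}", {"name": "Rubyslava"})
page_again = cached_render("Hello {name}", {"name": "Rubyslava"})  # cache hit
other = cached_render("Hello {name}", {"name": "ESET"})  # inputs changed
print(render_calls)  # the expensive function ran twice for three requests
```

The dictionary here grows without bound; a production setup would cap or expire entries, and would more likely live in a reverse proxy or key/value store as the slide suggests.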

don’t over-engineer the solution
• complex systems are more prone to FAIL
• debugging and visibility are hard
• scaling is difficult
keep the system simple and easy.

isolate things
• put different parts into separate tiers
• be granular with components
• too much is too much!
• isolation is a good thing
a failure of one component doesn’t affect another component.

scale out datacenters
• be multi-homed in at least 2 zones
• active/active, don’t waste resources
a catastrophe doesn’t bring your system down.
PS: BACKUPS! (tried restoring them? :)

monitor ALL the things!
• measure sensible metrics and ACT on them
• know “normal” behaviour
• prevent failures
• compare values from various timeframes
you will get a reliable system and can sleep safe at night.
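Knowing "normal" behaviour and comparing timeframes can be as simple as checking the current value against a baseline from an earlier period. A rough sketch; the three-standard-deviation threshold, the function name `is_anomalous` and the sample data are all assumptions made for illustration:

```python
from statistics import mean, stdev


def is_anomalous(history, current, tolerance=3.0):
    """Flag `current` if it lies more than `tolerance` standard deviations
    from the mean of `history` (the "normal" behaviour)."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > tolerance * spread


# Response times in ms for the same hour over the previous week (made-up data).
last_week = [120, 118, 125, 122, 119, 121, 124]

print(is_anomalous(last_week, 123))  # within the usual range
print(is_anomalous(last_week, 480))  # far outside it, time to ACT
```

Real monitoring systems offer richer comparisons (seasonality, percentiles), but even this check needs the historical values to exist, which is the point of measuring everything.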

embrace failure
• discuss and learn from failures
• post-mortems (root cause analysis)
• never let a good failure go to waste!
why has something happened and how to avoid it in the future.

common mistakes
what we have seen and what you should try to avoid

mistake #1
• Dev: “It works on my laptop, just not on the server..” (= it MUST be the server!)
• Me: “Ok, backup your mail, we're putting your laptop into production!”
https://twitter.com/oising/status/298464920717099009

mistake #1 – to consider
• know your requirements
  • app versions
  • configuration variables
• don’t expect every environment to be exactly like your laptop
• minimize custom dependencies
• custom environments SUCK
  • PITA to re-create
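One way to make requirements explicit is to have the application list the configuration it needs and check it before doing anything else. A small sketch; the variable names in `REQUIRED_VARS` and the helper `missing_config` are hypothetical:

```python
# Hypothetical variable names; replace with whatever the application needs.
REQUIRED_VARS = ["DATABASE_URL", "CACHE_HOST", "APP_ENV"]


def missing_config(environ, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# An environment that "works on my laptop" but is incomplete.
example_env = {"DATABASE_URL": "postgres://db.example/app", "APP_ENV": ""}
print(missing_config(example_env))
```

At startup the same function could be called with `os.environ`, refusing to start when the returned list is not empty, so a missing setting fails loudly on every environment instead of surfacing later on the server.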

mistake #2
• “optimistic programming”
• you expect every component to work all the time
• at the same speed (timeout, anyone?)
• and you don’t consider corner cases
• check response for errors

mistake #2 - avoidance
• be ready for partial or full unavailability
  • db read only
  • max.clients – unable to connect
  • memcache slow
• ON/OFF feature switch
• set timeouts
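The timeout and feature-switch advice combines naturally: wrap the call to an optional dependency so that a slow or dead backend degrades the page instead of breaking it. A sketch under stated assumptions; the `FEATURES` dictionary, the recommendation service and its one-line `GET` protocol are invented for illustration:

```python
import socket

FEATURES = {"recommendations": True}  # hypothetical ON/OFF feature switch


def fetch_recommendations(host, port, timeout=0.5):
    """Ask a (hypothetical) recommendation service, never waiting long.

    The timeout applies to connecting and to each send/receive call.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"GET\n")
        return conn.recv(4096).decode("utf-8").split(",")


def recommendations_or_default(host, port):
    """Degrade gracefully: return an empty list instead of failing the page."""
    if not FEATURES["recommendations"]:
        return []  # feature switched off, do not even try
    try:
        return fetch_recommendations(host, port)
    except OSError:  # covers refused connections and timeouts
        return []
```

Calling `recommendations_or_default` against a host that is down returns `[]` after at most the timeout, and flipping `FEATURES["recommendations"]` to `False` removes the dependency entirely during an incident.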

could you cope with this?

mistake #3 - scenario
• colleague makes a change on prod
• monitoring is green
• a few hours/days later something happens, nobody knows why – an effect of the change made long ago
• colleague “fixes it”..
• ..since only she knew what she did!

mistake #3 - issues
• human SPOF
• difficult to track
• no documentation
• no testing after a change
• testing scenario should go through all parts of business logic

mistake #3 - tips
• versioning – git, svn, whatever works for you
• documentation
• split environments – dev, staging, prod
• testing
  • automated load testing
  • automated user testing (are all elements there?)
  • can I successfully buy a product?
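"Can I successfully buy a product?" is a test that walks the whole business path rather than a single page. A toy sketch of the idea; `FakeShop` is an in-memory stand-in invented for this example, and a real check would drive the staging site over HTTP or a browser:

```python
class FakeShop:
    """In-memory stand-in for the real shop (hypothetical)."""

    def __init__(self):
        self.stock = {"t-shirt": 3}
        self.orders = []

    def add_to_cart(self, cart, item):
        if self.stock.get(item, 0) < 1:
            raise ValueError(f"{item} is out of stock")
        cart.append(item)

    def checkout(self, cart):
        for item in cart:
            self.stock[item] -= 1
        self.orders.append(list(cart))
        return len(self.orders)  # order number


def check_purchase_flow(shop):
    """Walk the full purchase path: cart, checkout, stock update."""
    cart = []
    shop.add_to_cart(cart, "t-shirt")
    order_id = shop.checkout(cart)
    assert order_id == 1, "checkout did not create the first order"
    assert shop.stock["t-shirt"] == 2, "stock was not reduced"
    return order_id


check_purchase_flow(FakeShop())
```

Run after every change to prod, a check like this catches the "monitoring is green but nobody can buy anything" case from the scenario above.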

this is how we roll

Links and resources

• http://highscalability.com
• http://www.velocityconf.com
• “scaling reddit from 1 million to 1 billion” [http://www.infoq.com/presentations/scaling-reddit]
• “scaling instagram” [http://www.slideshare.net/iammutex/scaling-instagram]
• “Scaling Twitter: Making Twitter 10000 Percent Faster” [http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster]


bon
