Scalability and resilient engineering

bon
March 27, 2014

How to make your product survive the Slashdot effect: basic principles, best practices, common pitfalls, real-world examples and more.

Transcript

  1. .me
    • sysadmin by choice
    • the OPS in devops
    • 8+ years of experience
    • primarily Linux
    • ex-DIGMIA, currently at ESET
  2. Scalability?
    • adapt quickly enough to cover the demand
        • dynamic – auto-scaling, elastic
        • manual – do it when necessary
    • scaling
        • replacing a CPU with a faster one (scaling up)
        • load balancing traffic across 10 web servers (scaling out)
  3. Resilient engineering?
    • the system is designed in such a way as to sustain damage, or better yet to avoid it, without going through complete failure
    • know your environment, expect the unexpected and build for it
  4. design scale into the solution
    • design for 20x capacity
    • implement for 3x capacity
    • deploy for 1.5x capacity
    you will be ready and confident.
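As a rough illustration of the three factors on this slide, the sketch below turns a current peak load into design, implementation and deployment targets. The baseline of 2000 requests per second and the helper name `capacity_targets` are assumptions made up for the example; only the 20x / 3x / 1.5x factors come from the slide.

```python
def capacity_targets(current_peak_rps: float) -> dict[str, float]:
    """Apply the 20x / 3x / 1.5x rule from the slide to a measured peak load."""
    factors = {"design": 20.0, "implement": 3.0, "deploy": 1.5}
    return {stage: current_peak_rps * factor for stage, factor in factors.items()}


# 2000 req/s is an invented baseline, not a figure from the talk.
targets = capacity_targets(2000)
for stage, rps in targets.items():
    print(f"{stage}: {rps:.0f} req/s")
```

With that baseline, the architecture would have to hold up on paper at 40000 req/s, the code at 6000 req/s, and the deployed hardware at 3000 req/s.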
  5. do the absolute minimum
    • don’t do the same thing more than once
        • serve from cache until the inputs change
        • this run generates the same thing? avoid it!
    • embrace caching
        • client-side – in browser (js, css, images..)
        • server-side – reverse-proxy, key/value store
        • CDN – build it vs. buy it dilemma
    more (visitors) with less (hardware).
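"Serve from cache until the inputs change" can be sketched as a cache keyed on a hash of the inputs. This is a minimal in-process illustration, not what the talk used: `render_page` is a hypothetical stand-in for the expensive work, and a real setup would more likely put a reverse proxy or key/value store in this role.

```python
import hashlib
import json

_cache: dict[str, str] = {}
stats = {"renders": 0}  # counts how often the expensive work really runs


def render_page(inputs: dict) -> str:
    """Hypothetical stand-in for expensive work (templates, DB queries, ...)."""
    stats["renders"] += 1
    return f"<h1>{inputs['title']}</h1>"


def cached_render(inputs: dict) -> str:
    # Same inputs produce the same key, so the page is not rendered again.
    key = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = render_page(inputs)
    return _cache[key]


cached_render({"title": "Hello"})
cached_render({"title": "Hello"})    # served from cache
cached_render({"title": "Changed"})  # inputs changed, rendered again
print(stats["renders"])
```

Three requests cause only two renders; the second request never touches the expensive path.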
  6. don’t over-engineer the solution
    • complex systems are more prone to FAIL
    • debugging and visibility are hard
    • scaling is difficult
    keep the system simple and easy.
  7. isolate things
    • put different parts into separate tiers
    • be granular with components
        • too much is too much!
    • isolation is a good thing
    a failure of one component doesn’t affect another component.
  8. scale out datacenters
    • be multi-homed in at least 2 zones
    • active/active, don’t waste resources
    a catastrophe doesn’t bring your system down.
    PS: BACKUPS! (tried restoring them? :)
  9. monitor ALL the things!
    • measure sensible metrics and ACT on them
    • know “normal” behaviour
        • prevent failures
        • compare values from various timeframes
    you will get a reliable system and can sleep safely at night.
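One possible way to "know normal behaviour" is to compare the current value of a metric with the same metric from earlier timeframes. The sketch below flags a value that lies more than three standard deviations from the historical mean; the threshold, the sample numbers and the function name `is_abnormal` are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_abnormal(history: list[float], current: float, max_sigma: float = 3.0) -> bool:
    """Flag `current` if it is more than `max_sigma` standard deviations from the historical mean."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > max_sigma * spread


# Invented samples: requests per second at the same hour on the previous 7 days.
last_week_same_hour = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0, 123.0]
normal = is_abnormal(last_week_same_hour, 124.0)
spike = is_abnormal(last_week_same_hour, 480.0)
print(normal, spike)
```

A check like this only helps if somebody (or something) acts on the result, which is the point of the first bullet.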
  10. embrace failure
    • discuss and learn from failures
    • post-mortems (root cause analysis)
    • never let a good failure go to waste!
    why has something happened and how to avoid it in the future.
  11. mistake #1
    • Dev: “It works on my laptop, just not on the server..” (= it MUST be the server!)
    • Me: “Ok, backup your mail, we're putting your laptop into production!”
    https://twitter.com/oising/status/298464920717099009
  12. mistake #1 – to consider
    • know your requirements
        • app versions
        • configuration variables
    • don’t expect every environment to be exactly like your laptop
    • minimize custom dependencies
    • custom environments SUCK
        • PITA to re-create
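Knowing your requirements can be made explicit with a startup check that refuses to run in an environment that does not match. This is only a sketch: the variable names `DATABASE_URL` and `CACHE_HOST` and the minimum Python version are hypothetical, chosen for the example.

```python
import os
import sys

REQUIRED_VARS = ("DATABASE_URL", "CACHE_HOST")  # hypothetical configuration variables
MIN_PYTHON = (3, 9)                             # assumed minimum runtime version


def check_environment(env: dict[str, str], python_version: tuple = sys.version_info[:2]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the environment is usable."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append(f"python {python_version[0]}.{python_version[1]} is older than required")
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"missing configuration variable: {name}")
    return problems


problems = check_environment(dict(os.environ))
for problem in problems:
    print("environment problem:", problem)
```

Running the same check on the laptop, in staging and in production makes differences between environments visible before they turn into "it works on my laptop".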
  13. mistake #2
    • “optimistic programming”
    • you expect every component to work all the time
    • at the same speed (timeout, anyone?)
    • and you don’t consider corner cases
    • check response for errors
  14. mistake #2 - avoidance
    • be ready for partial or full unavailability
        • db read only
        • max.clients – unable to connect
        • memcache slow
    • ON/OFF feature switch
    • set timeouts
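The combination of a feature switch and a timeout could look roughly like the sketch below. The slow dependency is simulated with `time.sleep`, and `FEATURES`, `slow_recommendations` and `get_recommendations` are hypothetical names; a real service would set the timeout on its database or memcache client instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

FEATURES = {"recommendations": True}  # hypothetical ON/OFF feature switch
_pool = ThreadPoolExecutor(max_workers=2)


def slow_recommendations(user_id: int) -> list[str]:
    """Stand-in for a dependency that is currently slow (e.g. an overloaded cache)."""
    time.sleep(0.3)
    return ["item-1", "item-2"]


def get_recommendations(user_id: int, timeout_s: float = 0.05) -> list[str]:
    if not FEATURES["recommendations"]:
        return []  # feature switched off: skip the dependency entirely
    future = _pool.submit(slow_recommendations, user_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return []  # too slow: render the page without the optional block
    except Exception:
        return []  # dependency failed: degrade instead of failing the request


degraded = get_recommendations(1)               # 0.05 s budget, dependency needs 0.3 s
full = get_recommendations(1, timeout_s=2.0)    # generous budget, result arrives
print(degraded, full)
```

The page loses an optional block when the dependency misbehaves, but the request itself still succeeds.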
  15. mistake #3 - scenario
    • a colleague makes a change on prod
    • monitoring is green
    • a few hours/days later something happens, nobody knows why – an effect of the change made long ago
    • colleague “fixes it”..
    • ..since only she knew what she did!
  16. mistake #3 - issues
    • human SPOF
    • difficult to track
    • no documentation
    • no testing after a change
        • a testing scenario should go through all parts of the business logic
  17. mistake #3 - tips
    • versioning – git, svn, whatever works for you
    • documentation
    • split environments – dev, staging, prod
    • testing
        • automated load testing
        • automated user testing (are all elements there?)
        • can I successfully buy a product?
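"Can I successfully buy a product?" can be turned into an automated check that walks the purchase path end to end. In the sketch below, `Shop` is a hypothetical in-memory stand-in for the real application; an actual test would drive the staging site through its HTTP interface or a browser.

```python
class Shop:
    """Hypothetical in-memory stand-in for the real application."""

    def __init__(self, stock: dict[str, int]):
        self.stock = dict(stock)
        self.orders: list[dict] = []

    def buy(self, product: str, quantity: int = 1) -> dict:
        if self.stock.get(product, 0) < quantity:
            raise ValueError(f"{product} is out of stock")
        self.stock[product] -= quantity
        order = {"id": len(self.orders) + 1, "product": product, "quantity": quantity}
        self.orders.append(order)
        return order


def smoke_test_purchase(shop: Shop) -> bool:
    """Buy one item and verify every visible effect of the purchase."""
    before = shop.stock["t-shirt"]
    order = shop.buy("t-shirt")
    return (
        order["product"] == "t-shirt"
        and shop.stock["t-shirt"] == before - 1
        and order in shop.orders
    )


shop = Shop({"t-shirt": 3})
purchase_ok = smoke_test_purchase(shop)
print("purchase path OK:", purchase_ok)
```

Run after every change, a scenario like this catches the case where monitoring stays green while the business logic is broken.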
  18. Links and resources
    • http://highscalability.com
    • http://www.velocityconf.com
    • “scaling reddit from 1 million to 1 billion” [http://www.infoq.com/presentations/scaling-reddit]
    • “scaling instagram” [http://www.slideshare.net/iammutex/scaling-instagram]
    • “Scaling Twitter: Making Twitter 10000 Percent Faster” [http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster]