Nat's "History" of "Web Scale"

Nat Welch
November 25, 2013

A walk from one server to a million.

Video: http://youtu.be/vajzYBDg14c

Transcript

  1. Google Confidential and Proprietary
    Hi, I’m Nat
    • Site Reliability Engineer on Google Compute Engine
    • Cal Poly CSC Grad, 2011
    • twitter.com/icco
    • plus.google.com/+NatWelch
    • natwelch.com
  2. Simple: One box for all of the work
    • Network traffic
    • Batch Jobs
    • All the work
  3. Common Issues
    • One process stealing resources from another
    • No redundancy
    • Growth ceiling is equivalent to how much money you have
  4. Common Issues
    • Capacity planning
    • Network latency
    • DNS becomes super important
    • Caching
    • Load Balancing
    • Regional outages
  5. Specialized Servers
    • Database machines
    • Distributed Cache machines
    • Dedicated hardware load balancers
    • Computation machines
    • Distributed Lock Managers
    • Log Servers
    As you scale...
  6. Common Issues
    • Region distribution
      ◦ Database just in US, app servers all around world
    • Network latency
      ◦ If you don’t account for machines not being next to each other
    • Caching
    • Load Balancing
    • Regional outages
    • Config management
  7. “Google [recorded] a whopping $2.29 billion in capital expenditures in the third quarter of 2013 [...]. The spending is driven by a massive expansion of Google’s global data center network, which represents perhaps the largest construction effort in the history of the data center industry.”
    - Rich Miller, Data Center Knowledge, 2013
  8. “Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis.”
    - John Allspaw, High Scalability Blog, 2009
  9. Site Reliability Engineers
    • 50% Operations, 50% Engineering
    • Use exactly the same tools and processes as Developers
      ◦ Unit tests, Peer review, Coding standards
      ◦ Configuration treated as code
      ◦ Develop, test, release, iterate
      ◦ Write documentation
    • Learn lessons, continually improve
      ◦ Formal 'post mortems' after service-impacting incidents
    • But different
      ◦ Typically not feature-driven workload.
      ◦ Typically not front-end related development.
      ◦ Typically greater choice in what you work on day-to-day.
  10. Two main ideas for scaling up
    Failure Modes
    • As your service grows, what will start to fail and how will that show itself?
    • What failure modes can you prevent before they start?
    • How can you keep serving and growing your service in spite of failures?
    Limiting Factors
    • What will limit your ability to grow your application?
    • Which resource constraints are important to pay attention to?
    • How can you design to push past these limits?
  11. 10 Principles for scaling
    1. Keep servers simple
    2. Prefer smaller, stateless jobs
    3. Retry safely
    4. Bound resource usage and fail gracefully
    5. Don't Crash/Assert/Exit on exceptions
    6. Be Transparent
    7. Avoid lazy initialization
    8. Maintain operational flexibility
    9. Anticipate the future
    10. Check the user experience
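The slide does not spell out what "Retry safely" means. One common reading is to bound the number of attempts and back off exponentially with jitter, so that many clients do not retry in lockstep against a struggling server. The sketch below works under that assumption; `retry_with_backoff` and all of its parameters are hypothetical names for illustration, not something taken from the talk.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is reached.

    Assumes operation is idempotent, so running it more than once is harmless.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # A real service would catch only errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # bounded: give up so the caller can fail gracefully
            # Exponential backoff capped at max_delay, scaled by random jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * rng())
```

The `sleep` and `rng` arguments exist only so the function can be exercised without real waiting or randomness; in normal use they would be left at their defaults.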
  12. Examples
    • If a hard disk has an MTBF of 3 years, and you have 10,000 such drives, expect 9 to fail every day.
    • All our software is designed to cater to small or large component failures. (e.g. server failure, DC failure).
    • Monitor Everything. Always.
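The drive figure on the slide above follows from simple arithmetic, assuming failures are spread evenly over the MTBF (a simplification; real drives do not fail at a constant rate over their lifetime). The function name below is hypothetical and only illustrates the calculation.

```python
def expected_failures_per_day(num_drives, mtbf_years, days_per_year=365):
    """Expected drive failures per day, assuming each drive fails at a constant rate of 1/MTBF."""
    mtbf_days = mtbf_years * days_per_year
    return num_drives / mtbf_days


# 10,000 drives / (3 years * 365 days) is roughly 9 failures per day.
print(round(expected_failures_per_day(10_000, 3), 2))  # 9.13
```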
  13. Two paths
    PAAS
    • Platform as a Service
    • “Here’s my code. Scale it, maybe?”
    • Google App Engine, Heroku, others
    IAAS
    • Infrastructure as a Service
    • “Linux Servers, IN THE CLOUD!”
    • Google Compute Engine, Amazon Elastic Compute Cloud, OpenStack Nova
  14. IAAS
    Cons
    • Lots more work than a PAAS
    • Free tiers usually provide inconsistent performance
    • A lot more expensive than using your own hardware if you don’t have a dynamic workload and large scale
    Pros
    • Run anything in any way you want
    • Easier to migrate to and off of
    • Cheaper at larger scale
  15. PAAS
    Cons
    • Expensive
    • Usually requires you to trust provider
    • Restrictions on what you can host and/or build
    Pros
    • Low maintenance
    • Great developer communities
    • Scaling is just a slider
  16. References
    • http://highscalability.com/blog/2009/6/29/how-to-succeed-at-capacity-planning-without-really-trying-an.html
    • https://www.google.com/about/datacenters/inside/locations/index.html
    • https://www.google.com/about/datacenters/gallery
    • https://secure.flickr.com/photos/icco/4676902541/
    • https://plus.google.com/photos/111401917971052287374/albums/5943215427176305713
    • http://googleresearch.blogspot.ch/2012/07/site-reliability-engineers-solving-most.html
    • http://googleforstudents.blogspot.ch/2012/06/site-reliability-engineers-worlds-most.html
    • http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
    • http://queue.acm.org/detail.cfm?id=2371516
    • http://www.reddit.com/r/IAmA/comments/177267/we_are_the_google_site_reliability_team_we_make/
    • https://en.wikipedia.org/wiki/Unix_philosophy
    • http://company.zynga.com/news/engineering-blog/what-powers-play-zynga
    • http://company.zynga.com/about/press/engineering-blog/building-scalable-game-server
    • http://www.datacenterknowledge.com/archives/2013/10/18/google-data-center-spending-continues-to-soar/
    • https://en.wikipedia.org/wiki/Distributed_lock_manager
    • http://googleresearch.blogspot.de/2009/06/speed-matters.html