Nat's "History" of "Web Scale"

Google Confidential and Proprietary Developers

Nat's “History” of “Web Scale”
Nat Welch
Google Compute Engine SRE

Hi I’m Nat
• Site Reliability Engineer on Google Compute Engine
• Cal Poly CSC Grad, 2011
• twitter.com/icco
• plus.google.com/+NatWelch
• natwelch.com

"I'm gonna build a website."
- Nat Welch, alone in his room, 2004

Council Bluffs, Iowa

Simple: One box for all of the work
• Network traffic
• Batch Jobs
• All the work

Common Issues
• One process stealing resources from another
• No redundancy
• Growth ceiling is equivalent to how much money you have

Tactical Decision to Scale

iFixit, 2010

Common Issues
• Capacity planning
• Network latency
• DNS becomes super important
• Caching
• Load Balancing
• Regional outages

As you scale...
Specialized Servers
• Database machines
• Distributed Cache machines
• Dedicated hardware load balancers
• Computation machines
• Distributed Lock Managers
• Log Servers

Common Issues
• Region distribution
  ◦ Database just in US, app servers all around world
• Network latency
  ◦ If you don’t account for machines not being next to each other
• Caching
• Load Balancing
• Regional outages
• Config management
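The region-distribution and network-latency bullets above can be made concrete with a little arithmetic. The following is only a rough sketch: the function name, the query count, and the round-trip times are illustrative assumptions, not figures from the talk. It assumes a page issues its database queries one after another, so each query pays a full network round trip.

```python
def page_latency_ms(sequential_queries: int, round_trip_ms: float) -> float:
    """Time spent waiting on the network when queries run one after another."""
    return sequential_queries * round_trip_ms


# Hypothetical numbers: 20 sequential queries per page view.
same_datacenter = page_latency_ms(20, 0.5)   # app server next to the database
cross_ocean = page_latency_ms(20, 150.0)     # app server on another continent

print(f"same datacenter: {same_datacenter} ms")
print(f"cross ocean:     {cross_ocean} ms")
```

Under these assumptions the same code path goes from roughly ten milliseconds to several seconds, which is why caching near the app servers or batching queries tends to matter once machines are no longer next to each other.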

“Google [recorded] a whopping $2.29 billion in capital expenditures in the third quarter of 2013 [...]. The spending is driven by a massive expansion of Google’s global data center network, which represents perhaps the largest construction effort in the history of the data center industry.”
- Rich Miller, Data Center Knowledge, 2013

“Web applications can fail in all sorts of dramatic ways, and you're not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis.”
- John Allspaw, High Scalability Blog, 2009

Site Reliability Engineers
• 50% Operations, 50% Engineering
• Use exactly the same tools and processes as Developers
  ◦ Unit tests, Peer review, Coding standards
  ◦ Configuration treated as code
  ◦ Develop, test, release, iterate
  ◦ Write documentation
• Learn lessons, continually improve
  ◦ Formal 'post mortems' after service-impacting incidents
• But different
  ◦ Typically not feature-driven workload.
  ◦ Typically not front-end related development.
  ◦ Typically greater choice in what you work on day-to-day.

Two main ideas for scaling up
Failure Modes
• As your service grows, what will start to fail and how will that show itself?
• What failure modes can you prevent before they start?
• How can you keep serving and growing your service in spite of failures?
Limiting Factors
• What will limit your ability to grow your application?
• Which resource constraints are important to pay attention to?
• How can you design to push past these limits?

10 Principles for scaling
1. Keep servers simple
2. Prefer smaller, stateless jobs
3. Retry safely
4. Bound resource usage and fail gracefully
5. Don't Crash/Assert/Exit on exceptions
6. Be Transparent
7. Avoid lazy initialization
8. Maintain operational flexibility
9. Anticipate the future
10. Check the user experience
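As one possible reading of principles 3 and 4, here is a minimal sketch of a bounded retry helper. It is not taken from the talk: `TransientError` and `retry_with_backoff` are hypothetical names, and the sketch assumes the wrapped operation is idempotent, so repeating it is safe. Capped exponential backoff with jitter is a common technique, used here as an assumption about what "retry safely" could look like.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(); retry on TransientError with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Bounded: give up instead of retrying forever.
            # Exponential backoff with a cap, plus random jitter so that
            # many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("temporarily unavailable")
        return "ok"

    print(retry_with_backoff(flaky), "after", attempts["count"], "attempts")
```

Only errors explicitly marked as transient are retried; anything else propagates immediately, and the attempt limit keeps a struggling backend from being hammered indefinitely.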

Examples
• If a hard disk has an MTBF of 3 years, and you have 10,000 such drives, expect 9 to fail every day.
• All our software is designed to cater to small or large component failures (e.g. server failure, DC failure).
• Monitor Everything. Always.
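The drive-failure figure above follows from simple arithmetic. This short sketch reproduces it, assuming a constant failure rate and a 365-day year; the function name is made up for illustration.

```python
def expected_failures_per_day(drive_count: int, mtbf_years: float) -> float:
    """Expected drive failures per day, assuming a constant failure rate."""
    mtbf_days = mtbf_years * 365
    return drive_count / mtbf_days


# 10,000 drives with a 3-year MTBF, as in the slide.
print(round(expected_failures_per_day(10_000, 3), 1))  # 9.1
```

The point is that a failure which is rare for one machine becomes a daily event for a fleet, so the software has to treat it as routine.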

“Hope is not a strategy.”
- SRE Motto, Google, 1999

The Future?!?

Google Cloud Platform

Two paths
PAAS
• Platform as a Service
• “Here’s my code. Scale it, maybe?”
• Google App Engine, Heroku, others
IAAS
• Infrastructure as a Service
• “Linux Servers, IN THE CLOUD!”
• Google Compute Engine, Amazon Elastic Compute Cloud, OpenStack Nova

IAAS
Cons
• Lots more work than a PAAS
• Free tiers usually provide inconsistent performance
• A lot more expensive than using your own hardware if you don’t have a dynamic workload and large scale
Pros
• Run anything in any way you want
• Easier to migrate to and off of
• Cheaper at larger scale

PAAS
Cons
• Expensive
• Usually requires you to trust provider
• Restrictions on what you can host and/or build
Pros
• Low maintenance
• Great developer communities
• Scaling is just a slider

Thanks! (Check out cloud.google.com)

References
• http://highscalability.com/blog/2009/6/29/how-to-succeed-at-capacity-planning-without-really-trying-an.html
• https://www.google.com/about/datacenters/inside/locations/index.html
• https://www.google.com/about/datacenters/gallery
• https://secure.flickr.com/photos/icco/4676902541/
• https://plus.google.com/photos/111401917971052287374/albums/5943215427176305713
• http://googleresearch.blogspot.ch/2012/07/site-reliability-engineers-solving-most.html
• http://googleforstudents.blogspot.ch/2012/06/site-reliability-engineers-worlds-most.html
• http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
• http://queue.acm.org/detail.cfm?id=2371516
• http://www.reddit.com/r/IAmA/comments/177267/we_are_the_google_site_reliability_team_we_make/
• https://en.wikipedia.org/wiki/Unix_philosophy
• http://company.zynga.com/news/engineering-blog/what-powers-play-zynga
• http://company.zynga.com/about/press/engineering-blog/building-scalable-game-server
• http://www.datacenterknowledge.com/archives/2013/10/18/google-data-center-spending-continues-to-soar/
• https://en.wikipedia.org/wiki/Distributed_lock_manager
• http://googleresearch.blogspot.de/2009/06/speed-matters.html
