The Two Sides to Google Infrastructure for Everyone Else

My talk from Velocity Santa Clara. The format was a debate with myself, looking at the pros and cons of adopting software and practices from other organisations wholesale, using the GIFEE meme as an example.


Gareth Rushgrove

June 22, 2016


  1. The Two Sides of Google Infrastructure for Everyone Else (Puppet, Gareth Rushgrove)

  2. @garethr

  3. Gareth Rushgrove

  4. Introduction: A strange format for a

  5. This is a debate

  6. I’ll be debating both sides

  7. Taking opposing viewpoints on the same issue, as a way of exploring it in depth

  8. The talk is split into two parts: a For part and an Against part

  9. I’d like to explore:
     - Technical practice evolution
     - How we adopt software
     - The organisational context

  10. This house believes…

  11. Successful companies will look like Google in the future, so we should adopt Google-like software and practices today

  12. Important disclaimer: I’ve never worked for Google

  13. For

  14. You’re probably:
     1. Struggling with distributed systems
     2. Missing out on machine learning
     3. Wondering how to scale operations

  15. [Google] have a 10+ year head start

  16. [Google] publish research that influences our industry

  17. MapReduce

  18. Chubby

  19. Borg

  20. [Google] releases (and inspires) software we use

  22. Go

  23. from

  24. GFS = HDFS
      BigTable = HBase
      Protocol Buffers = Thrift or Avro (serialization)
      Stubby = Thrift or Avro (RPC)
      ColumnIO = Parquet
      Dremel = Impala
      Omega = Mesos
      Blaze = Pants or Buck
      FlumeJava = Crunch
      Logsaver = Scribe or Flume
      Millwheel = Storm or Samza?
      Borgmon/Monarch = Graphite
      Dapper = Zipkin
      From @avibryant, @joshwills, @skamille, @marius, @wickman, 2014
  25. We have a term for this: #GIFEE

  26. Google Infrastructure for Everyone Else

  27. Distributed systems are hard

  28. Building your own in-house framework is likely a waste of time

  29. From Adrian Colyer, Accel, https://speakerdeck.com/acolyer/making-sense-of-it-all

  30. Kubernetes is the 3rd generation of Google’s cluster management software

  31. The Kubernetes API provides primitives that make doing the right thing easier

  32. Orchestration, logging, configuration, self-healing, storage, load balancing, service discovery, scaling, batch workloads, lots more

  33. Exposed via a modern API

  34. Machine learning is going to be massive

  35. “Soon We Won’t Program Computers. We’ll Train Them Like Dogs”

  36. TensorFlow is an open source software library for numerical computation

  37. …

  38. Nearest neighbour, linear regression, recurrent neural networks, multilayer perceptron, lots more

  39. Introductory ML docs

  40. “How do I do devops?” (Everyone ever)

  41. [Google] explain how they work too

  43. “SRE: Have software engineers do operations” (Dan Luu, ex Google, http://danluu.com/google-sre-book/)

  44. Dev, SRE, Ops. From http://web.devopstopologies.com/ by Matthew Skelton

  45. The familiar:
     - Capacity planning
     - Performance
     - Change management
     - Monitoring

  46. The unfamiliar:
     - Error budget
     - Strong software engineering skills
     - 50% operations work cap

  47. A growing ecosystem

  48. Friendly vendors

  49. More friendly vendors

  50. Even more nice vendors

  51. Summing up For

  52. “infrastructure” is shifting to a higher level of abstraction

  53. It’s fine to just be a consumer

  54. You should be standing on the shoulders of giants

  55. You should be standing on the shoulders of [Google]

  56. Against

  57. Your organisation doesn’t look like Google

  59. Could your organisation look like Google?

  60. How many employees do you have? Google have about 60,000

  61. What proportion of your organisation are software engineers or operations?

  62. 50 percent? Based on the Google annual report December 2014

  63. How much do you pay software engineers?

  64. Data from Glassdoor, June 2016, based on 14k

  65. The $3 million engineer?

  67. Build your own chips?

  68. Could your organisation really look like Google?

  69. “So much of the information in the SRE book makes PERFECT sense if you’re Google” (John Vincent, Ops Hero)

  70. The reality outside Google

  71. <1% of US workers are software engineers or programmers. US Bureau of Labor Statistics 2002: 1,069,000 jobs in a working-age population of 185 million

  72. Strategic vendor relationships

  73. Different application constraints as well as different organisational constraints

  74. “Goal of SRE team isn’t zero outages – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity” (Dan Luu, ex Google, http://danluu.com/google-sre-book/)

  75. What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages

  76. John Vincent SRE review

  77. “bringing a software engineering perspective to a problem isn’t always the best or right solution” (John Vincent, Ops Hero)

  78. Many of Google’s conclusions to operations problems are not unique

  81. “Innovation happens elsewhere” applies as much to Google as to other organisations

  82. Summing up Against

  83. “If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow” (Carla Geisser, Google SRE)

  84. What is normal for Google may not be suitable for your organisation

  85. “Your startup with a single-purpose application does not have the luxury of having your operations team say I’m sorry you’re over your error budget” (John Vincent, Ops Hero)

  87. Conclusions If all you take away

  88. Who votes… For

  89. Who votes… Against

  90. Who thinks it’s the wrong question?

  91. Context is king

  93. “The Overwhelming power of context” (Charity Majors, Ops Person Extraordinaire)

  94. The technology we run, and how we run it, are interlinked

  95. The field of Sociotechnical Systems suggests that all human systems include both a technical system and a social system (https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution)

  96. Better outcomes are usually obtained by a reciprocal process of joint optimization, through which both the technical system and the social system change (https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution)

  97. “Containers will not fix your broken culture” (Bridget Kromhout, World’s nicest Ops Person)

  98. “Awesome culture will not fix your broken containers” (Me, paraphrasing Bridget)

  99. We are all collectively evolving the practice of operations

  100. Keep sharing, because it’s a pretty amazing ride

  101. Questions, and thanks for listening
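
A note on the error budget mentioned in slides 46, 74 and 85: the slides do not define it, so the following is only a minimal sketch under a common assumption, namely that the budget is the unavailability a service level objective (SLO) allows over a fixed window. The function names, the 30-day window and the example figures are all hypothetical and do not come from the talk.

```python
# Hypothetical sketch: error budget as the downtime an availability SLO allows.
# Names, window length and figures are assumptions, not taken from the slides.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an SLO such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already observed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_ship(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Feature releases continue only while some budget is left."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # An SLO of 100% (the air traffic control case) leaves no budget to spend.
    print(can_ship(1.0, 0.0))
```

Under this assumption a team with a 99.9% target can trade about 43 minutes a month for feature velocity, while a system whose goal is closer to zero outages has almost nothing to spend, which is the contrast slides 74 and 75 draw.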