The Two Sides to Google Infrastructure for Everyone Else

My talk from Velocity Santa Clara. The format was a debate with myself, looking at the pros and cons of adopting software and practices from other organisations wholesale, using the GIFEE meme as an example.

Gareth Rushgrove

June 22, 2016

Transcript

  1. The Two Sides of Google Infrastructure for Everyone Else
    Puppet
    Gareth Rushgrove

  2. @garethr

  3. Gareth Rushgrove

  4. Introduction
    A strange format for a talk

  5. This is a debate

  6. I’ll be debating both sides

  7. Taking opposing viewpoints on
    the same issue, as a way of
    exploring it in depth

  8. The talk is split into two parts:
    a For part and an Against part

  9. I’d like to explore:
    - Technical practice evolution
    - How we adopt software
    - The organisational context

  10. This house believes…

  11. Successful companies will look
    like Google in the future, so we
    should adopt Google-like
    software and practices today

  12. Important disclaimer
    I’ve never worked for Google

  13. For

  14. You’re probably:
    1 Struggling with distributed systems
    2 Missing out on machine learning
    3 Wondering how to scale operations

  15. Google have a 10+ year head start

  16. Google publish research that
    influences our industry

  17. MapReduce

  18. Chubby

  19. Borg

  20. Google releases (and inspires)
    software we use

  22. Go

  23. from

  24. GFS = HDFS
    BigTable = HBase
    Protocol Buffers = Thrift or Avro (serialization)
    Stubby = Thrift or Avro (RPC)
    ColumnIO = Parquet
    Dremel = Impala
    Omega = Mesos
    Blaze = Pants or Buck
    FlumeJava = Crunch
    Logsaver = Scribe or Flume
    Millwheel = Storm or Samza?
    Borgmon/Monarch = Graphite
    Dapper = Zipkin
    2014 from @avibryant, @joshwills, @skamille, @marius, @wickman

  25. We have a term for this: #GIFEE

  26. Google Infrastructure for
    Everyone Else

  27. Distributed systems are hard

  28. Building your own in-house
    framework is likely a waste of time

  29. From Adrian Colyer, Accel, https://speakerdeck.com/acolyer/making-sense-of-it-all

  30. Kubernetes is the 3rd generation
    of Google’s cluster management
    software

  31. The Kubernetes API provides
    primitives that make doing the
    right thing easier

  32. - Orchestration
    - Logging
    - Configuration
    - Self-healing
    - Storage
    - Load balancing
    - Service discovery
    - Scaling
    - Batch workloads
    - Lots more

  33. Exposed via a modern API

  34. Machine learning is going
    to be massive

  35. Soon We Won’t Program
    Computers. We’ll Train
    Them Like Dogs

  36. TensorFlow is an open source
    software library for numerical
    computation

  38. - Nearest neighbour
    - Linear regression
    - Recurrent neural networks
    - Multilayer perceptron
    - Lots more

  39. Introductory ML docs

  40. How do I do devops?
    Everyone ever

  41. Google explain how they work too

  43. SRE: Have software engineers
    do operations
    Dan Luu, ex Google
    http://danluu.com/google-sre-book/

  44. Dev SRE Ops
    From http://web.devopstopologies.com/ by Matthew Skelton

  45. The familiar:
    - Capacity planning
    - Performance
    - Change management
    - Monitoring

  46. The unfamiliar:
    - Error budget
    - Strong software engineering skills
    - 50% operations work cap

  47. A growing ecosystem

  48. Friendly vendors

  49. More friendly vendors

  50. Even more nice vendors

  51. Summing up
    For

  52. “infrastructure” is shifting to a
    higher level of abstraction

  53. It’s fine to just be a consumer

  54. You should be standing on the
    shoulders of giants

  55. You should be standing on the
    shoulders of Google

  56. Against

  57. Your organisation doesn’t
    look like Google

  58. YOUR
    ORGANISATION
    DOESN’T LOOK
    LIKE GOOGLE

  59. Could your organisation
    look like Google?

  60. How many employees do you
    have? Google have about 60,000

  61. What proportion of your
    organisation are software
    engineers or operations?

  62. 50 percent?
    Based on the Google annual report December 2014

  63. How much do you pay
    software engineers?

  64. Data from Glassdoor, June 2016, based on 14k salaries

  65. The $3 million engineer?

  67. Build your own chips?

  68. Could your organisation
    really look like Google?

  69. So much of the information in
    the SRE book makes PERFECT
    sense if you’re Google
    John Vincent, Ops Hero

  70. The reality outside Google

  71. <1% of US workers are software
    engineers or programmers
    US Bureau of Labor Statistics 2002. 1,069,000 jobs in working age population of 185 million

  72. Strategic vendor relationships

  73. Different application
    constraints as well as different
    organisational constraints

  74. Goal of SRE team isn’t zero
    outages – SRE and product devs
    are incentive aligned to spend the
    error budget to get maximum
    feature velocity
    Dan Luu, ex Google
    http://danluu.com/google-sre-book/

  75. What if you’re operating an air
    traffic control system or a nuclear
    power station? Your goal is
    probably closer to zero outages

  76. John Vincent SRE review

  77. bringing a software engineering
    perspective to a problem isn’t
    always the best or right solution
    John Vincent, Ops Hero

  78. Many of Google’s conclusions to
    operations problems are not unique

  81. “Innovation happens elsewhere”
    applies as much to Google as to
    other organisations

  82. Summing up
    Against

  83. If a human operator needs to touch
    your system during normal
    operations, you have a bug. The
    definition of normal changes as
    your systems grow
    Carla Geisser, Google SRE

  84. What is normal for Google
    may not be suitable for
    your organisation

  85. Your startup with a single-purpose
    application does not have the
    luxury of having your operations
    team say I’m sorry you’re over
    your error budget
    John Vincent, Ops Hero

  87. Conclusions
    If all you take away is…

  88. Who votes…
    For

  89. Who votes…
    Against

  90. Who thinks it’s the wrong question?

  91. Context is king

  93. The Overwhelming power
    of context
    Charity Majors, Ops Person Extraordinaire

  94. The technology we run, and how
    we run it, are interlinked

  95. The field of Sociotechnical
    Systems suggests that all human
    systems include both a technical
    system and a social system
    https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution

  96. Better outcomes are usually
    obtained by a reciprocal process
    of joint optimization, through
    which both the technical system
    and the social system change
    https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution

  97. Containers will not fix your
    broken culture
    Bridget Kromhout, World’s nicest Ops Person

  98. Awesome culture will not fix your
    broken containers
    Me, paraphrasing Bridget

  99. We are all collectively evolving the
    practice of operations

  100. Keep sharing, because it’s a
    pretty amazing ride

  101. Questions
    And thanks for listening
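
The error budget that comes up on slides 46, 74 and 85 can be made concrete with a little arithmetic. What follows is a minimal sketch, not anything taken from the talk or the SRE book: the `ErrorBudget` class, its field names, the 99.9% availability target and the 30-day window are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Availability error budget over a fixed window (hypothetical helper)."""

    slo: float             # target availability as a fraction, e.g. 0.999 (assumed)
    window_minutes: float  # length of the measurement window, in minutes

    @property
    def allowed_downtime(self) -> float:
        """Minutes of downtime the target tolerates over the window."""
        return (1.0 - self.slo) * self.window_minutes

    def remaining(self, downtime_minutes: float) -> float:
        """Minutes of budget left after the downtime observed so far."""
        return self.allowed_downtime - downtime_minutes

    def can_ship(self, downtime_minutes: float) -> bool:
        """True while some budget is left to spend on risky changes."""
        return self.remaining(downtime_minutes) > 0


# Assumed example: a 99.9% target measured over a 30-day window.
budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)

print(round(budget.allowed_downtime, 1))  # 43.2 minutes of tolerated downtime
print(budget.can_ship(20.0))              # True: budget left, keep releasing
print(budget.can_ship(50.0))              # False: over budget, slow down
```

Under these assumptions the whole budget is about 43 minutes a month. That is the number the Against side is pointing at: whether spending it on feature velocity is acceptable depends on the organisation and the system being run.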