Tracking and automating software infrastructure with GitHub

Talk at GitHub Universe 2017

John Arthorne

October 19, 2017

Transcript

  1. Tracking and automating software infrastructure with GitHub • GitHub Universe, San Francisco, October 12, 2017 • John Arthorne, @jarthorne, github.com/jarthorn • Shopify Production Engineering

  2. • Background • The Problem • Implementation

  3. 500k Shopify Merchants • $65M Daily merchant sales (GMV) • 2k Employees • 80K HTTP RPS

  4. Shopify Tech: Rails Monolith

  5. Shopify Tech: Rails Monolith, Other Rails Apps

  6. Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps

  7. Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang

  8. Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang • MySQL, Redis, Kafka, Elastic Search

  9. Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang • MySQL, Redis, Kafka, Elastic Search

  10. Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang • MySQL, Redis, Kafka, Elastic Search • Colocated Data Centers, 3rd Party Clouds

  11. The Problem

  12. Service infrastructure

  13. Service infrastructure examples • Deployment automation • CI pipeline • Dev time setup automation • Uptime monitoring • Bug monitoring • Log retention • Data backups • SSL certificates • Domains • Load testing • Metrics instrumentation • On call rotation • Failover automation

  14. 300 Services x 12 Infrastructure concerns

  15. Spreadsheet defined infrastructure

  16. Three main goals • Ownership: Establish ownership for all running services/apps at Shopify • Measurement: Be able to measure how well we are doing on operational infrastructure for a given service • Automation: Provide tools to make it easier to build out and maintain service infrastructure ➢ Create a tool to track everything in one place, get out of spreadsheet hell

  17. Goal #1: Ownership

  18. Why have owners?

  19. What kind of ownership do we want?
    Collective: Ownership in common • Ability to deliver with high speed • Works well in small teams • No specialized roles
    Authoritarian: No change without permission • Bureaucratic, slow, safe • The norm in massive orgs • Highly specialized roles
    Shopify 2015 • Shopify 2017

  20. Ownership as code • Owners tracked in Git for each service • Pull request to change owner • Deliberate decision, with retained history

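One way to picture "ownership as code" is a small record checked into each service's repository and validated in CI. The following is only a sketch under stated assumptions: the JSON layout, the `owners` and `tier` fields, and the service name `checkout-web` are hypothetical and are not taken from the talk.

```python
import json

# Hypothetical ownership record; field names and values are assumptions.
EXAMPLE_RECORD = """
{
  "service": "checkout-web",
  "owners": ["team-checkout"],
  "tier": 2
}
"""

def load_owners(text):
    """Parse an ownership record and return its list of owners.

    Raises ValueError when the record names no owner, so that a pull
    request removing the last owner could be rejected by a CI step.
    """
    record = json.loads(text)
    owners = record.get("owners") or []
    if not owners:
        raise ValueError(f"service {record.get('service')!r} has no owner")
    return owners

print(load_owners(EXAMPLE_RECORD))  # prints ['team-checkout']
```

Because the record lives in Git, changing an owner is an ordinary pull request, and the commit history shows who owned the service at any point in time.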
  21. Goal #2: Measurement • What do we have running today? • Are things getting better or worse? • My team has a lot of applications, where should we focus efforts on improving infrastructure? • Classifying services to make sure we put an appropriate level of work into surrounding infrastructure

  22. All infrastructure information in one place

  23. Figuring out what “good enough” looks like • All services placed in tiers based on level of impact • Tier is set by the owner of the service • Higher infrastructure expectations as you go up in tiers

  24. Service tiers (Tier | Impact | Needs)
    1 | Critical | Playbooks, defined SLO, resiliency patterns, DC failover, scheduled load tests, security reviews
    2 | Important | On call, monitoring with alerts, metrics instrumentation, dedicated DB, load tested, rolling deploy (preboot)
    3 | Useful | >1 owner, deploy automation, CI, standard dev setup, uptime monitor, bugsnag, log retention, backups, SSL
    4 | Experiments | Owner, Security bugs

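The tier table can be read as data from which a per-service score is computed. The sketch below is illustrative only: it assumes that needs accumulate as a service moves from tier 4 up to tier 1 (the slides say expectations get higher but do not state that they are strictly cumulative), and the scoring rule, a plain fraction of needs met, is an assumption rather than the scorecard shown in the talk.

```python
# Needs per tier, copied from the table above with shortened names.
TIER_NEEDS = {
    4: ["owner", "security bugs"],
    3: [">1 owner", "deploy automation", "CI", "standard dev setup",
        "uptime monitor", "bugsnag", "log retention", "backups", "SSL"],
    2: ["on call", "monitoring with alerts", "metrics instrumentation",
        "dedicated DB", "load tested", "rolling deploy"],
    1: ["playbooks", "defined SLO", "resiliency patterns", "DC failover",
        "scheduled load tests", "security reviews"],
}

def expected_needs(tier):
    """Return every need expected at `tier`.

    Assumption: needs accumulate from tier 4 (experiments) up to `tier`.
    """
    needs = []
    for level in range(4, tier - 1, -1):
        needs.extend(TIER_NEEDS[level])
    return needs

def score(tier, satisfied):
    """Fraction of the expected needs that the service already satisfies."""
    needs = expected_needs(tier)
    met = [need for need in needs if need in satisfied]
    return len(met) / len(needs)

# A tier 3 service that so far has an owner, CI and SSL in place.
print(round(score(3, {"owner", "CI", "SSL"}), 2))
```

Sorting services or teams by such a score is one simple way a leaderboard could be derived from the same data.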
  25. Service scorecard

  26. Leaderboards!

  27. Goal #3: Automation

  28. Automatic issue reporting … and closing!
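Opening an issue when a check fails and closing it once the check passes again can be modelled as a reconciliation step. This is a minimal sketch, not the tool from the talk: the `(service, check)` keys and the example names are hypothetical, and the calls that would actually create or close GitHub issues are deliberately left out.

```python
def reconcile(check_results, open_issues):
    """Decide which issues to open and which to close.

    check_results: dict mapping (service, check) -> True when the check passes.
    open_issues:   set of (service, check) pairs that already have an open issue.
    Returns (to_open, to_close) as sorted lists of (service, check) pairs.
    """
    # Open an issue for every failing check that has no issue yet.
    to_open = sorted(
        key for key, passed in check_results.items()
        if not passed and key not in open_issues
    )
    # Close an issue only when its check is known to pass now.
    to_close = sorted(
        key for key in open_issues if check_results.get(key) is True
    )
    return to_open, to_close

results = {
    ("shop-api", "ssl"): False,
    ("shop-api", "uptime monitor"): True,
    ("billing", "ssl"): False,
}
already_open = {("shop-api", "uptime monitor"), ("billing", "ssl")}

to_open, to_close = reconcile(results, already_open)
print("open:", to_open)
print("close:", to_close)
```

Running the same function on every check cycle keeps the issue tracker in step with the current state without creating duplicate issues for a check that is still failing.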

  29. Fighting the email bots (with bots)

  30. One click infrastructure automation

  31. Automated code authoring • Pull requests for routine software updates • Pull requests for infrastructure configuration changes

  32. Implementation

  33. Architecture (diagram labels): Services DB • Services • GitHub API • Web Hooks • Repos • Users • Teams • Issues • Runtimes • Tools • Web App • Chat App

  34. Checks and Events (diagram labels)
    Checks: Team Owner? • Uptime Monitor? • Load Tests? • SSL?
    Events: Emails • Slack Commands • Quota Hit • Downtime
    Outputs: GitHub Issues • Pull Requests • Slack Announcements • /dev/null
    cluster us-t2

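The checks named on this slide can be thought of as small yes/no functions run against a service record. The sketch below is an assumption-heavy illustration: the record fields (`owner_team`, `uptime_monitor_url`, `last_load_test`, `ssl_expires`), the 90-day load test window, and the example service are all hypothetical and do not come from the talk.

```python
from datetime import date

def has_team_owner(service, today):
    return bool(service.get("owner_team"))

def has_uptime_monitor(service, today):
    return bool(service.get("uptime_monitor_url"))

def has_recent_load_test(service, today, max_age_days=90):
    # Assumption: a load test counts only if it ran in the last 90 days.
    last = service.get("last_load_test")
    return last is not None and (today - last).days <= max_age_days

def has_valid_ssl(service, today):
    expires = service.get("ssl_expires")
    return expires is not None and expires > today

# Check names follow the slide; the functions behind them are illustrative.
CHECKS = {
    "Team Owner?": has_team_owner,
    "Uptime Monitor?": has_uptime_monitor,
    "Load Tests?": has_recent_load_test,
    "SSL?": has_valid_ssl,
}

def run_checks(service, today):
    """Run every registered check and return a name -> bool mapping."""
    return {name: check(service, today) for name, check in CHECKS.items()}

example_service = {
    "owner_team": "team-checkout",
    "uptime_monitor_url": None,
    "last_load_test": date(2017, 9, 1),
    "ssl_expires": date(2018, 3, 1),
}
report = run_checks(example_service, today=date(2017, 10, 12))
print(report)
```

Passing `today` explicitly keeps the checks deterministic, which makes them easy to test and to replay against historical data.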
  35. Automating Library Upgrades (diagram labels): Services DB • Security Advisories • Deprecations • Important Libraries • Services • Repos • Pull Requests • Bundler

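A first step toward automated library upgrades is finding which services pin an outdated version of a library. The sketch below reads the `specs:` entries of a `Gemfile.lock` as plain text and does not invoke Bundler itself. It assumes simple dotted numeric versions (no prerelease or platform suffixes), and the service names and lockfile contents are made up for illustration.

```python
import re

# Top-level gem entries under "specs:" are indented by exactly four spaces,
# e.g. "    rails (5.1.4)". Their dependencies are indented further.
SPEC_LINE = re.compile(r"^ {4}([A-Za-z0-9_.-]+) \(([^)]+)\)$", re.MULTILINE)

def locked_versions(lock_text):
    """Map gem name -> locked version string from Gemfile.lock text."""
    return {name: version for name, version in SPEC_LINE.findall(lock_text)}

def parse_version(version):
    """Turn '5.1.4' into (5, 1, 4). Assumes numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))

def services_needing_upgrade(lockfiles, gem, minimum):
    """Return the services whose locked `gem` version is below `minimum`."""
    floor = parse_version(minimum)
    stale = []
    for service, lock_text in lockfiles.items():
        version = locked_versions(lock_text).get(gem)
        if version is not None and parse_version(version) < floor:
            stale.append(service)
    return sorted(stale)

# Hypothetical lockfile contents for two hypothetical services.
LOCKFILES = {
    "svc-a": "GEM\n  specs:\n    rails (5.0.1)\n      actionpack (= 5.0.1)\n",
    "svc-b": "GEM\n  specs:\n    rails (5.1.4)\n      actionpack (= 5.1.4)\n",
}

stale_services = services_needing_upgrade(LOCKFILES, "rails", "5.1.0")
print(stale_services)  # prints ['svc-a']
```

Each service in the resulting list would be a candidate for an automatically authored pull request that bumps the library.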
  36. Future directions • Automate more infrastructure tasks • Library upgrades for other languages • Defining and tracking Service Level Objectives (SLOs) • Tracking incident post-mortems and action items

  37. Takeaways • Be deliberate about ownership. Know who is taking care of each running service and what that implies. • Think of infra investment in terms of trade-offs. More is not always better; aim for just enough investment to meet quality goals. • Measure progress. • Be aware of manual steps involved in creating and maintaining services. Automation is the only way to stay ahead of the growth curve.

  38. Thanks! John Arthorne, @jarthorne, github.com/jarthorn • Shopify Production Engineering • GitHub Universe, San Francisco, October 12, 2017