Tracking and automating software infrastructure with GitHub

Talk at GitHub Universe 2017

John Arthorne

October 19, 2017

Transcript

  1. Tracking and automating software infrastructure with GitHub
    GitHub Universe, San Francisco, October 12, 2017
    John Arthorne, @jarthorne, github.com/jarthorn
    Shopify Production Engineering
  2. Shopify Tech
    Rails Monolith, Other Rails Apps, Python Apps, Golang, MySQL, Redis, Kafka, Elastic Search, Colocated Data Centers, 3rd Party Clouds
  3. Service infrastructure examples
    Deployment automation, CI pipeline, Dev time setup automation, Uptime monitoring, Bug monitoring, Log retention, Data backups, SSL certificates, Domains, Load testing, Metrics instrumentation, On call rotation, Failover automation
  4. Three main goals
    • Ownership: Establish ownership for all running services/apps at Shopify
    • Measurement: Be able to measure how well we are doing on operational infrastructure for a given service
    • Automation: Provide tools to make it easier to build out and maintain service infrastructure
    ➢ Create a tool to track everything in one place, get out of spreadsheet hell
  5. What kind of ownership do we want?
    • Collective: Ownership in common. Ability to deliver with high speed. Works well in small teams. No specialized roles.
    • Authoritarian: No change without permission. Bureaucratic, slow, safe. The norm in massive orgs. Highly specialized roles.
    • Marked on the slide: Shopify 2015, Shopify 2017
  6. Ownership as code
    • Owners tracked in Git for each service
    • Pull request to change owner
    • Deliberate decision, with retained history
  7. Goal #2: Measurement
    • What do we have running today?
    • Are things getting better or worse?
    • My team has a lot of applications, where should we focus efforts on improving infrastructure?
    • Classifying services to make sure we put an appropriate level of work into surrounding infrastructure
  8. Figuring out what “good enough” looks like
    • All services placed in tiers based on level of impact
    • Tier is set by the owner of the service
    • Higher infrastructure expectations as you go up in tiers
  9. Service tiers
    Tier | Impact      | Needs
    1    | Critical    | Playbooks, defined SLO, resiliency patterns, DC failover, scheduled load tests, security reviews
    2    | Important   | On call, monitoring with alerts, metrics instrumentation, dedicated DB, load tested, rolling deploy (preboot)
    3    | Useful      | >1 owner, deploy automation, CI, standard dev setup, uptime monitor, Bugsnag, log retention, backups, SSL
    4    | Experiments | Owner, security bugs
  10. Automated code authoring
    • Pull requests for routine software updates
    • Pull requests for infrastructure configuration changes
  11. Architecture
    Diagram labels: Services DB, Services, GitHub API, Web Hooks, Repos, Users, Teams, Issues, Runtimes, Tools, Web App, Chat App
  12. Checks and Events
    • Checks: Team Owner? Uptime Monitor? Load Tests? SSL?
    • Other diagram labels: Emails, Slack Commands, Quota Hit, Downtime, Events, GitHub Issues, Pull Requests, Slack Announcements, /dev/null, cluster us-t2
  13. Future directions
    • Automate more infrastructure tasks
    • Library upgrades for other languages
    • Defining and tracking Service Level Objectives (SLOs)
    • Tracking incident post-mortems and action items
  14. Takeaways
    • Be deliberate about ownership. Know who is taking care of each running service and what that implies.
    • Think of infra investment in terms of trade-offs. More is not always better; aim for just enough investment to meet quality goals.
    • Measure progress.
    • Be aware of manual steps involved in creating and maintaining services. Automation is the only way to stay ahead of the growth curve.
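
The tier model from slides 8, 9 and 12 can be illustrated with a small sketch. This is not the tool described in the talk: the check names, the `Service` class, and the helper functions are all hypothetical, and the sketch assumes that each tier inherits the requirements of the lower-impact tiers beneath it, which the slides suggest ("higher infrastructure expectations as you go up in tiers") but do not state outright.

```python
from dataclasses import dataclass, field

# Hypothetical check names, loosely based on the "Needs" column of slide 9.
# Tier 1 is the most critical, tier 4 the least.
TIER_REQUIREMENTS = {
    4: {"owner"},
    3: {"multiple_owners", "deploy_automation", "ci", "uptime_monitor", "ssl"},
    2: {"on_call", "alerting", "metrics"},
    1: {"playbooks", "slo", "dc_failover", "load_tests"},
}


@dataclass
class Service:
    """A running service, its owner-assigned tier, and the checks it passes."""
    name: str
    tier: int
    passing: set = field(default_factory=set)


def required_checks(tier):
    """Return the checks for this tier plus all lower-impact tiers.

    Assumption: requirements are cumulative across tiers.
    """
    required = set()
    for level, checks in TIER_REQUIREMENTS.items():
        if level >= tier:
            required |= checks
    return required


def missing_checks(service):
    """Return the sorted names of required checks the service does not pass."""
    return sorted(required_checks(service.tier) - service.passing)


if __name__ == "__main__":
    # "example-app" is a made-up service used only for illustration.
    app = Service("example-app", tier=3, passing={"owner", "ci", "ssl"})
    for check in missing_checks(app):
        print(f"{app.name}: missing {check}")
```

In a setup like the one the talk describes, each missing check could plausibly become an event such as a GitHub issue or a Slack announcement; the sketch stops at computing the gap.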