Tracking and automating software infrastructure with GitHub

GitHub Universe, San Francisco, October 12, 2017
John Arthorne, @jarthorne, github.com/jarthorn
Shopify Production Engineering

• Background
• The Problem
• Implementation

• 500k Shopify merchants
• $65M daily merchant sales (GMV)
• 2k employees
• 80K HTTP RPS

Shopify Tech
• Rails Monolith
• Other Rails Apps
• Python Apps
• Golang

MySQL, Redis, Kafka, Elasticsearch
Colocated Data Centers, 3rd Party Clouds

The Problem

Service infrastructure

Service infrastructure examples
• Deployment automation
• CI pipeline
• Dev time setup automation
• Uptime monitoring
• Bug monitoring
• Log retention
• Data backups
• SSL certificates
• Domains
• Load testing
• Metrics instrumentation
• On call rotation
• Failover automation

300 services x 12 infrastructure concerns

Spreadsheet defined infrastructure

Three main goals
• Ownership: Establish ownership for all running services/apps at Shopify
• Measurement: Be able to measure how well we are doing on operational infrastructure for a given service
• Automation: Provide tools to make it easier to build out and maintain service infrastructure
➢ Create a tool to track everything in one place, get out of spreadsheet hell

Goal #1: Ownership

Why have owners?

What kind of ownership do we want?
Collective
• Ownership in common
• Ability to deliver with high speed
• Works well in small teams
• No specialized roles
Authoritarian
• No change without permission
• Bureaucratic, slow, safe
• The norm in massive orgs
• Highly specialized roles
Shopify 2015, Shopify 2017

Ownership as code
• Owners tracked in Git for each service
• Pull request to change owner
• Deliberate decision, with retained history
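A minimal sketch of what "ownership as code" could look like, assuming one small JSON record per service kept in a Git repository. The file paths, field names, and team names here are hypothetical and are not taken from the talk; the point is only that a plain file under version control can be parsed and checked for services with no owner.

```python
import json

# Hypothetical ownership records, one per service, as they might be stored
# in a Git repository. The field names are assumptions, not Shopify's schema.
OWNERSHIP_FILES = {
    "services/checkout.json": '{"service": "checkout", "owners": ["team-payments"]}',
    "services/reports.json": '{"service": "reports", "owners": []}',
}


def load_owners(files):
    """Parse each record and return a mapping of service name to owner list."""
    owners = {}
    for path, text in files.items():
        record = json.loads(text)
        owners[record["service"]] = list(record.get("owners", []))
    return owners


def unowned_services(owners):
    """Return the sorted names of services that have no owner recorded."""
    return sorted(name for name, who in owners.items() if not who)


owners = load_owners(OWNERSHIP_FILES)
print(unowned_services(owners))  # prints ['reports']
```

Because the records are ordinary files, changing an owner is a diff that can go through a pull request, and the Git log keeps the history.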

Goal #2: Measurement
• What do we have running today?
• Are things getting better or worse?
• My team has a lot of applications, where should we focus efforts on improving infrastructure?
• Classifying services to make sure we put an appropriate level of work into surrounding infrastructure

All infrastructure information in one place

Figuring out what “good enough” looks like
• All services placed in tiers based on level of impact
• Tier is set by the owner of the service
• Higher infrastructure expectations as you go up in tiers

Service tiers
Tier | Impact | Needs
1 | Critical | Playbooks, defined SLO, resiliency patterns, DC failover, scheduled load tests, security reviews
2 | Important | On call, monitoring with alerts, metrics instrumentation, dedicated DB, load tested, rolling deploy (preboot)
3 | Useful | >1 owner, deploy automation, CI, standard dev setup, uptime monitor, Bugsnag, log retention, backups, SSL
4 | Experiments | Owner, Security bugs
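The service tiers above can be expressed as data and used to list what a service is still missing. This is a rough sketch, not the tool described in the talk. It assumes that each tier inherits the needs of the lower-impact tiers, which the slides suggest ("higher infrastructure expectations as you go up in tiers") but do not state outright. The function names are hypothetical.

```python
# Needs per tier, copied from the service tiers above.
TIER_NEEDS = {
    4: ["owner", "security bugs"],
    3: [">1 owner", "deploy automation", "CI", "standard dev setup",
        "uptime monitor", "Bugsnag", "log retention", "backups", "SSL"],
    2: ["on call", "monitoring with alerts", "metrics instrumentation",
        "dedicated DB", "load tested", "rolling deploy (preboot)"],
    1: ["playbooks", "defined SLO", "resiliency patterns", "DC failover",
        "scheduled load tests", "security reviews"],
}


def required_needs(tier):
    """Assumption: a tier inherits the needs of every lower-impact tier."""
    needs = []
    for level in range(4, tier - 1, -1):  # 4, 3, ..., tier
        needs.extend(TIER_NEEDS[level])
    return needs


def missing_needs(tier, satisfied):
    """List the needs for this tier that the service has not yet met."""
    return [need for need in required_needs(tier) if need not in satisfied]


# A tier 4 experiment that only has an owner so far.
print(missing_needs(4, {"owner"}))  # prints ['security bugs']
```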

Service scorecard

Leaderboards!
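The slides do not say how scorecard values are computed, so the following is only one plausible sketch: a service's score is the percentage of its checks that pass, and the leaderboard sorts services by that score. The service names, check names, and scoring rule are all assumptions made for illustration.

```python
def score(checks):
    """Percentage of passing checks, rounded to a whole number."""
    if not checks:
        return 0
    return round(100 * sum(checks.values()) / len(checks))


def leaderboard(services):
    """Rank services from highest to lowest score, ties broken by name."""
    scored = [(score(checks), name) for name, checks in services.items()]
    return [name for _, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical check results: True means the check passes.
SERVICES = {
    "checkout": {"owner": True, "CI": True, "SSL": True, "backups": True},
    "reports": {"owner": True, "CI": False, "SSL": True, "backups": False},
    "blog": {"owner": True, "CI": True, "SSL": True, "backups": False},
}

print(leaderboard(SERVICES))  # prints ['checkout', 'blog', 'reports']
```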

Goal #3: Automation

Automatic issue reporting … and closing!
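Automatic reporting and closing can be thought of as reconciling check results against the set of issues that are currently open. The sketch below only plans the actions and does not call the GitHub API. The data shapes and the function name are hypothetical.

```python
def plan_issue_actions(check_results, open_issues):
    """Decide which issues to open or close.

    check_results: {(service, check): passed}
    open_issues: set of (service, check) pairs with a currently open issue
    """
    actions = []
    for key, passed in sorted(check_results.items()):
        if not passed and key not in open_issues:
            actions.append(("open", key))   # new failure, nothing filed yet
        elif passed and key in open_issues:
            actions.append(("close", key))  # fixed, issue is now stale
    return actions


results = {
    ("blog", "SSL"): False,
    ("blog", "CI"): True,
    ("shop", "uptime monitor"): False,
}
already_open = {("blog", "CI"), ("shop", "uptime monitor")}

for action in plan_issue_actions(results, already_open):
    print(action)
```

A failing check that already has an open issue produces no action, so running the reconciliation repeatedly does not file duplicates.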

Fighting the email bots (with bots)

One click infrastructure automation

Automated code authoring
• Pull requests for routine software updates
• Pull requests for infrastructure configuration changes

Implementation

Architecture (diagram labels): Services DB, Services, GitHub API, Web Hooks, Repos, Users, Teams, Issues, Runtimes, Tools, Web App, Chat App

Checks and Events (diagram labels)
Checks: Team Owner? Uptime Monitor? Load Tests? SSL?
Events: Emails, Slack Commands, Quota Hit, Downtime
GitHub Issues, Pull Requests, Slack Announcements, /dev/null
cluster us-t2
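One way to read the diagram above is as a router: checks and events come in, and each is sent to a GitHub issue, a pull request, a Slack announcement, or discarded. The slide does not say which input goes to which destination, so the routing table below is entirely hypothetical and only illustrates the shape of such a dispatcher.

```python
# Hypothetical routing table; the slide does not say which input goes where.
ROUTES = {
    "failed_check": "github_issue",
    "quota_hit": "github_issue",
    "downtime": "slack_announcement",
    "email": "/dev/null",
}


def route(event):
    """Return (destination, message) for an event; unknown kinds are discarded."""
    destination = ROUTES.get(event["kind"], "/dev/null")
    return destination, f'{event["service"]}: {event["detail"]}'


print(route({"kind": "downtime", "service": "checkout", "detail": "not responding"}))
```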

Automating Library Upgrades (diagram labels): Services DB, Security Advisories, Deprecations, Important Libraries, Services, Repos, Pull Requests, Bundler
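The decision step behind automated library upgrades can be sketched as a join between the libraries each service pins and a list of advisories. Everything here is an assumption for illustration: the data shapes, the library names, and the rule that an advisory names the first fixed version. The sketch handles purely numeric dotted versions only, and it stops at listing candidates rather than running Bundler or opening pull requests.

```python
def parse_version(text):
    """Turn '4.2.10' into (4, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def upgrades_needed(services, advisories):
    """Return (service, library, current, fixed) for every vulnerable pin.

    services: {service: {library: pinned version}}
    advisories: {library: first fixed version}
    """
    found = []
    for service, libs in sorted(services.items()):
        for library, current in sorted(libs.items()):
            fixed = advisories.get(library)
            if fixed and parse_version(current) < parse_version(fixed):
                found.append((service, library, current, fixed))
    return found


# Hypothetical data: library names and versions are made up.
SERVICE_LIBS = {
    "blog": {"libfoo": "4.2.9", "libbar": "1.8.0"},
    "shop": {"libfoo": "4.2.10"},
}
ADVISORIES = {"libfoo": "4.2.10"}

print(upgrades_needed(SERVICE_LIBS, ADVISORIES))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "4.2.9" sorts after "4.2.10", which would hide the upgrade.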

Future directions
• Automate more infrastructure tasks
• Library upgrades for other languages
• Defining and tracking Service Level Objectives (SLOs)
• Tracking incident post-mortems and action items

Takeaways
• Be deliberate about ownership. Know who is taking care of each running service and what that implies.
• Think of infra investment in terms of trade-offs. More is not always better; aim for just enough investment to meet quality goals. Measure progress.
• Be aware of manual steps involved in creating and maintaining services. Automation is the only way to stay ahead of the growth curve.

Thanks!
John Arthorne, @jarthorne, github.com/jarthorn
Shopify Production Engineering
