Tracking and automating software infrastructure with GitHub

Slide 1

Slide 1 text

Tracking and automating software infrastructure with GitHub
GitHub Universe, San Francisco, October 12, 2017
John Arthorne, @jarthorne, github.com/jarthorn
Shopify Production Engineering

Slide 2

Slide 2 text

● Background
● The Problem
● Implementation

Slide 3

Slide 3 text

500k Shopify merchants
$65M daily merchant sales (GMV)
2k employees
80K HTTP RPS

Slide 4

Slide 4 text

Shopify Tech: Rails Monolith

Slide 5

Slide 5 text

Shopify Tech: Rails Monolith, Other Rails Apps

Slide 6

Slide 6 text

Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps

Slide 7

Slide 7 text

Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang

Slide 8

Slide 8 text

Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang, MySQL, Redis, Kafka, Elasticsearch

Slide 9

Slide 9 text

Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang, MySQL, Redis, Kafka, Elasticsearch

Slide 10

Slide 10 text

Shopify Tech: Rails Monolith, Other Rails Apps, Python Apps, Golang, MySQL, Redis, Kafka, Elasticsearch, Colocated Data Centers, 3rd Party Clouds

Slide 11

Slide 11 text

The Problem

Slide 12

Slide 12 text

Service infrastructure

Slide 13

Slide 13 text

Service infrastructure examples: Deployment automation, CI pipeline, Dev time setup automation, Uptime monitoring, Bug monitoring, Log retention, Data backups, SSL certificates, Domains, Load testing, Metrics instrumentation, On call rotation, Failover automation

Slide 14

Slide 14 text

300 services x 12 infrastructure concerns

Slide 15

Slide 15 text

Spreadsheet-defined infrastructure

Slide 16

Slide 16 text

Three main goals
● Ownership: Establish ownership for all running services/apps at Shopify
● Measurement: Be able to measure how well we are doing on operational infrastructure for a given service
● Automation: Provide tools to make it easier to build out and maintain service infrastructure
➢ Create a tool to track everything in one place, get out of spreadsheet hell

Slide 17

Slide 17 text

Goal #1: Ownership

Slide 18

Slide 18 text

Why have owners?

Slide 19

Slide 19 text

What kind of ownership do we want?
Collective: Ownership in common. Ability to deliver with high speed. Works well in small teams. No specialized roles.
Authoritarian: No change without permission. Bureaucratic, slow, safe. The norm in massive orgs. Highly specialized roles.
Shopify 2015, Shopify 2017

Slide 20

Slide 20 text

Ownership as code
● Owners tracked in Git for each service
● Pull request to change owner
● Deliberate decision, with retained history
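As a minimal sketch of what "owners tracked in Git" could look like, assume each service repository carries a small JSON ownership record. The record layout and field names below are hypothetical and not taken from the talk. Changing an owner would then be an ordinary pull request against that record, and a CI step could reject records that have no owner.

```python
import json
from typing import List

# Hypothetical ownership record, as it might be committed to a service's repo.
# The field names ("service", "owners", "tier") are assumptions for illustration.
EXAMPLE_RECORD = """
{
  "service": "checkout-api",
  "owners": ["team-payments"],
  "tier": 2
}
"""


def validate_ownership(raw: str) -> List[str]:
    """Return a list of problems found in an ownership record (empty if valid)."""
    record = json.loads(raw)
    problems = []
    owners = record.get("owners")
    if not isinstance(owners, list) or not owners:
        problems.append("at least one owner is required")
    if not record.get("service"):
        problems.append("service name is required")
    return problems


if __name__ == "__main__":
    # An empty list means the record passed validation.
    print(validate_ownership(EXAMPLE_RECORD))
```

Because the record lives in Git, the history of who owned a service and when comes for free from the commit log.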

Slide 21

Slide 21 text

Goal #2: Measurement
What do we have running today?
Are things getting better or worse?
My team has a lot of applications, where should we focus efforts on improving infrastructure?
Classifying services to make sure we put an appropriate level of work into surrounding infrastructure

Slide 22

Slide 22 text

All infrastructure information in one place

Slide 23

Slide 23 text

Figuring out what “good enough” looks like
● All services placed in tiers based on level of impact
● Tier is set by the owner of the service
● Higher infrastructure expectations as you go up in tiers

Slide 24

Slide 24 text

Service tiers
Tier | Impact | Needs
1 | Critical | Playbooks, defined SLO, resiliency patterns, DC failover, scheduled load tests, security reviews
2 | Important | On call, monitoring with alerts, metrics instrumentation, dedicated DB, load tested, rolling deploy (preboot)
3 | Useful | >1 owner, deploy automation, CI, standard dev setup, uptime monitor, Bugsnag, log retention, backups, SSL
4 | Experiments | Owner, Security bugs
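One way to read the tiers is that expectations accumulate, so a tier 2 service would also need everything listed for tiers 3 and 4. The slides do not state this explicitly, so the sketch below treats it as an assumption. The need names are abbreviated from the tier list, and the `required_needs` and `score` helpers are hypothetical.

```python
# Needs per tier, abbreviated from the service tiers slide.
TIER_NEEDS = {
    4: ["owner", "security bugs"],
    3: [">1 owner", "deploy automation", "CI", "standard dev setup",
        "uptime monitor", "bugsnag", "log retention", "backups", "SSL"],
    2: ["on call", "monitoring with alerts", "metrics instrumentation",
        "dedicated DB", "load tested", "rolling deploy"],
    1: ["playbooks", "defined SLO", "resiliency patterns", "DC failover",
        "scheduled load tests", "security reviews"],
}


def required_needs(tier):
    """Assumption: a service must meet its own tier's needs plus all lower tiers'."""
    needs = []
    for t in range(4, tier - 1, -1):  # walk from tier 4 up to the given tier
        needs.extend(TIER_NEEDS[t])
    return needs


def score(tier, satisfied):
    """Fraction of required needs that are satisfied, between 0.0 and 1.0."""
    needs = required_needs(tier)
    met = [n for n in needs if n in satisfied]
    return len(met) / len(needs)
```

A per-service score of this kind is one plausible input for a scorecard or a leaderboard; the actual scoring used in the talk is not described in the slides.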

Slide 25

Slide 25 text

Service scorecard

Slide 26

Slide 26 text

Leaderboards!

Slide 27

Slide 27 text

Goal #3: Automation

Slide 28

Slide 28 text

Automatic issue reporting … and closing!
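A sketch of how a failing check could open an issue and a passing check could close it again. The two GitHub REST API endpoints used here (creating an issue with POST /repos/{owner}/{repo}/issues, and updating one with PATCH /repos/{owner}/{repo}/issues/{issue_number}) exist; the function name, the issue title, and the decision logic are assumptions for illustration. The function only describes the request, so it runs without network access or credentials.

```python
def plan_issue_action(repo, check_name, passed, open_issue_number=None):
    """Decide which GitHub API request, if any, a check result should trigger.

    repo is "owner/name"; open_issue_number is the number of an existing open
    issue for this check, or None. Returns a dict describing the request, or
    None when nothing needs to change.
    """
    if not passed and open_issue_number is None:
        # Check fails and nothing is filed yet: open a new issue.
        return {
            "method": "POST",
            "path": f"/repos/{repo}/issues",
            "json": {"title": f"Failing infrastructure check: {check_name}"},
        }
    if passed and open_issue_number is not None:
        # Check passes again: close the issue that was filed earlier.
        return {
            "method": "PATCH",
            "path": f"/repos/{repo}/issues/{open_issue_number}",
            "json": {"state": "closed"},
        }
    return None  # Already in the desired state.
```

Keeping the decision separate from the HTTP call makes the behavior easy to test and avoids filing duplicate issues when a check keeps failing.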

Slide 29

Slide 29 text

Fighting the email bots (with bots)

Slide 30

Slide 30 text

One-click infrastructure automation

Slide 31

Slide 31 text

Automated code authoring
Pull requests for routine software updates
Pull requests for infrastructure configuration changes

Slide 32

Slide 32 text

Implementation

Slide 33

Slide 33 text

Architecture
Diagram labels: Services DB, Services, GitHub API, Web Hooks, Repos, Users, Teams, Issues, Runtimes, Tools, Web App, Chat App

Slide 34

Slide 34 text

Checks and Events
Checks: Team Owner? Uptime Monitor? Load Tests? SSL?
Events: Emails, Slack Commands, Quota Hit, Downtime
Other diagram labels: GitHub Issues, Pull Requests, Slack Announcements, /dev/null, cluster us-t2

Slide 35

Slide 35 text

Automating Library Upgrades
Diagram labels: Services DB, Security Advisories, Deprecations, Important Libraries, Services Repos, Pull Requests, Bundler
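A minimal sketch of the first step such automation might take: reading the gem versions that Bundler has locked in a service's Gemfile.lock and flagging services that sit below a minimum version named in an advisory. The helper names and the example lockfile are hypothetical, and the version comparison assumes plain dot-separated integer versions. Opening the actual pull request is left out.

```python
import re

# Matches top-level gem entries in a Gemfile.lock "specs:" section,
# e.g. "    rails (5.1.4)". Dependency lines are indented further and skipped.
SPEC_LINE = re.compile(r"^ {4}(\S+) \(([^)]+)\)$", re.MULTILINE)


def locked_versions(lockfile_text):
    """Map gem name to locked version string."""
    return dict(SPEC_LINE.findall(lockfile_text))


def as_tuple(version):
    """Assumption: versions are dot-separated integers such as "5.1.4"."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(lockfile_text, gem, minimum):
    """True if the gem is locked below the minimum version."""
    locked = locked_versions(lockfile_text).get(gem)
    return locked is not None and as_tuple(locked) < as_tuple(minimum)


# Hypothetical lockfile excerpt used for illustration.
EXAMPLE_LOCK = """GEM
  remote: https://rubygems.org/
  specs:
    rack (2.0.3)
    rails (5.1.4)
      rack (>= 2.0)
"""
```

Running `needs_upgrade` across every service repository would yield the list of services that should receive an upgrade pull request.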

Slide 36

Slide 36 text

Future directions
Automate more infrastructure tasks
Library upgrades for other languages
Defining and tracking Service Level Objectives (SLOs)
Tracking incident post-mortems and action items

Slide 37

Slide 37 text

Takeaways
Be deliberate about ownership. Know who is taking care of each running service and what that implies.
Think of infra investment in terms of trade-offs. More is not always better; aim for just enough investment to meet quality goals.
Measure progress.
Be aware of manual steps involved in creating and maintaining services. Automation is the only way to stay ahead of the growth curve.

Slide 38

Slide 38 text

Thanks!
John Arthorne, @jarthorne, github.com/jarthorn
Shopify Production Engineering
GitHub Universe, San Francisco, October 12, 2017
