Slide 1

Slide 1 text

Lessons from scaling to millions of users
Tammy Butow
Empress of Chaos
Principal SRE, Gremlin
@tammybutow

Slide 2

Slide 2 text

Lessons from scaling to hundreds of millions of users
Tammy Butow
Empress of Chaos
Principal SRE, Gremlin
@tammybutow

Slide 3

Slide 3 text

I was previously an SRE Manager @ Dropbox, leading Databases, Magic Pocket and Code Workflows (Dev Tools).
Prior to that I worked @ DigitalOcean, National Australia Bank, Queensland University of Technology + more.
I’m now a Principal SRE @ Gremlin.

Slide 4

Slide 4 text

0 - 500M

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

There is always a beginning: Engineering before launch (0 customers)

Slide 7

Slide 7 text

Make life easier for future you. “It just works” is the goal.

Slide 8

Slide 8 text

Always prioritise reliability, performance and durability.
Achieved through automation, monitoring, tooling and engineers :)
Have a clear “Engineering Principles” paper for your company.

Slide 9

Slide 9 text

What will you: Build, borrow, buy and break?

Slide 10

Slide 10 text

If I was to begin today: Engineering Before Launch

Slide 11

Slide 11 text

Infra Engineering Before Launch
• Kubernetes (3 primaries & 3 nodes)
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring
• Security
• Alerts
• Capacity Planning
• Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers
• Load balancing
• Private networking

Engineering Tools
• Chaos Engineering (Gremlin)
• Git
• GitHub / Phabricator
• CircleCI
• Code Search (LiveGrep)
• Infra Automation (Terraform)

Coding
Choose two to three approved languages for use:
1. Rust - Systems
2. Python - Scripting/Tools
3. Go - Services

Slide 12

Slide 12 text

Infra Engineering Before Launch
https://kubernetes.io/docs/admin/high-availability/building/#replicated-api-servers

Slide 13

Slide 13 text

Infra Engineering Before Launch
https://kubernetes.io/docs/admin/high-availability/building/#replicated-api-servers

Slide 14

Slide 14 text

Product Engineering Before Launch

Web
React (built by Facebook)
React is used by:
- Everyone :P
Most big tech companies in the bay area use React or are moving to React

Mobile*
Native Development:
- iOS
- Android
Native Development will give you better performance
* Only if you are mobile first

Desktop*
Electron JS (built by GitHub)
Electron is used by:
- GitHub
- Slack
Most big tech companies in the bay area are moving to Electron
* Only if you are desktop first

API
Swagger (built by )
Swagger is used by:
- Gremlin
Most big tech companies in the bay area have an API. Launching with an API makes sense!

Slide 15

Slide 15 text

Product Engineering Before Launch

Slide 16

Slide 16 text

Monorepo

Slide 17

Slide 17 text

Code Search

Slide 18

Slide 18 text

Then the growth begins: Engineering with 5 Enterprise Customers

Slide 19

Slide 19 text

You want to retain customers and scale fast!

Slide 20

Slide 20 text

Continue to prioritise reliability, perf and durability. Use your monitoring, logging and observability tooling.

Slide 21

Slide 21 text

Build self-healing systems and automate your infrastructure so you don’t get paged.
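
A minimal sketch of the self-healing idea, assuming a service that exposes some health check and can be restarted automatically. The function name `supervise` and its parameters are illustrative and not from the talk; the health check and restart action are passed in as callables so the loop itself stays generic. Kubernetes liveness probes apply a similar restart-on-repeated-failure pattern at the container level.

```python
import time
from typing import Callable


def supervise(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_failures: int = 3,
    checks: int = 10,
    interval_s: float = 0.0,
) -> int:
    """Run `checks` health checks and restart after repeated failures.

    A restart is triggered once `max_failures` consecutive checks fail.
    Returns the number of restarts performed.
    """
    consecutive_failures = 0
    restarts = 0
    for _ in range(checks):
        if is_healthy():
            # Any successful check resets the failure streak.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                restart()
                restarts += 1
                consecutive_failures = 0
        time.sleep(interval_s)
    return restarts
```

Requiring several consecutive failures before restarting is one way to avoid reacting to a single slow or dropped check; the threshold here is an arbitrary example value.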

Slide 22

Slide 22 text

Infra and product engineers should continue to meet and speak with customers to get feedback.

Slide 23

Slide 23 text

Infra Engineering With 5 Enterprise Customers
• Kubernetes
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring
• Security
• Alerts
• Capacity Planning
• Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers
• Load balancing
• Private networking

Engineering Tools
• Chaos Engineering (Gremlin)
• Git
• GitHub / Phabricator
• CircleCI
• Code Search (LiveGrep)
• Infra Automation (Terraform)

Coding
Choose two to three approved languages for use:
1. Rust - Systems
2. Python - Scripting/Tools
3. Go - Services

Small Data
• Mixpanel

Specific Infra
• Based on your product

Slide 24

Slide 24 text

NOW LET’S GET TO THE MILLIONS
A DIFFERENT WORLD

Slide 25

Slide 25 text

Now you have millions of users: Engineering with 3 million users

Slide 26

Slide 26 text

You can start to think about big data. You can do experiments at scale.

Slide 27

Slide 27 text

You will have started to build out infra specific for your product features and optimised for your own workload.

Slide 28

Slide 28 text

Infra Engineering With 3 Million Users
• Kubernetes
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring
• Security
• Alerts
• Capacity Planning
• Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers
• Load balancing
• Private networking

Engineering Tools
• Chaos Engineering (Gremlin)
• Git
• GitHub / Phabricator
• CircleCI
• Code Search (LiveGrep)
• Infra Automation (Terraform)

Coding
Choose two to three approved languages for use:
1. Rust - Systems
2. Python - Scripting/Tools
3. Go - Services

Big Data / Analytics
• Hadoop
• Spark, Pig

Caching
• Memcache

Moar Specific Infra
• Based on your product

Slide 29

Slide 29 text

Engineering with 50 million users

Slide 30

Slide 30 text

Infra Engineering With 50 Million Users
• Kubernetes
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring
• Security
• Alerts
• Capacity Planning
• Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers
• Load balancing
• Private networking

Engineering Tools
• Chaos Engineering (Gremlin)
• Git
• GitHub / Phabricator
• CircleCI
• Code Search (LiveGrep)
• Infra Automation (Terraform)

Coding
Choose two to three approved languages for use:
1. Rust - Systems
2. Python - Scripting/Tools
3. Go - Services

Big Data / Analytics
• Hadoop
• Spark, Pig

Caching
• Memcache

Moar Specific Infra
• Based on your product

Slide 31

Slide 31 text

From 400 million to 500 million users in one very fast year AKA: Getting on the rocket ship

Slide 32

Slide 32 text

2016: Linux Con AU 2016
2017: GopherCon 2017

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Engineering with 400 million users

Slide 35

Slide 35 text

Infra Engineering With 400 Million Users
• Kubernetes
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring
• Security
• Alerts
• Capacity Planning
• Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers
• Load balancing
• Private Networking

Engineering Tools
• Chaos Engineering (Gremlin)
• Git
• GitHub / Phabricator
• CircleCI
• Code Search (LiveGrep)
• Infra Automation (Terraform)

Coding
Choose two to three approved languages for use:
1. Rust - Systems
2. Python - Scripting/Tools
3. Go - Services

Big Data / Analytics
• Hadoop
• Spark, Pig

Caching
• Memcache

Distributed Datastore
• Built in-house

Moar Specific Infra
• Based on your product

Moar tools! (I can’t fit them)

Slide 36

Slide 36 text

You need to be able to zoom out with tools to make quick and important decisions

Slide 37

Slide 37 text

You build simple and useful tools for all engineers and other departments (e.g. self-service analytics dashboards, cloud infra allocation CLI tools)

Slide 38

Slide 38 text

You do performance tuning for your cloud infra because you sweat the details (e.g. Linux performance governor and CPU hyperthreading settings).
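
A minimal sketch of auditing one such setting, assuming a Linux host that exposes the cpufreq interface under /sys/devices/system/cpu. The function names are illustrative and not from the talk. The code only reads the current scaling governor for each CPU and lists the CPUs that are not set to the wanted value; it does not change anything.

```python
from pathlib import Path
from typing import Dict, List


def read_cpu_governors(sysfs_cpu_dir: str = "/sys/devices/system/cpu") -> Dict[str, str]:
    """Return a mapping such as {"cpu0": "performance"} read from sysfs.

    Returns an empty dict if the host does not expose cpufreq files.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(sysfs_cpu_dir).glob(pattern)):
        cpu_name = path.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = path.read_text().strip()
    return governors


def cpus_not_using(governors: Dict[str, str], wanted: str = "performance") -> List[str]:
    """List the CPUs whose governor differs from the wanted one."""
    return [cpu for cpu, gov in governors.items() if gov != wanted]
```

Running a check like this across a fleet is one way to catch hosts that drifted from the intended setting; which governor is appropriate depends on the workload.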

Slide 39

Slide 39 text

• Prioritise capacity planning
• Create org and team roadmaps, but stay flexible
• IQRs are useful (infra quarterly reviews)
• Give teams 20% time to work on KTLO
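
As a rough capacity-planning illustration, the growth figure from this talk (400 million to 500 million users in one year) can be turned into per-day and per-month rates. This is a back-of-the-envelope sketch that assumes smooth growth over the year, which the talk does not claim; the function name is illustrative.

```python
def growth_summary(start_users: int, end_users: int, days: int = 365) -> dict:
    """Summarise growth between two user counts over a period of `days`."""
    added = end_users - start_users
    return {
        # Average number of new users per day over the period.
        "new_users_per_day": added / days,
        # Total growth over the period as a fraction of the starting count.
        "annual_growth": added / start_users,
        # Constant monthly rate that compounds to the same total over 12 months.
        "compound_monthly_growth": (end_users / start_users) ** (1 / 12) - 1,
    }


summary = growth_summary(400_000_000, 500_000_000)
print(summary)
```

Under that assumption, 100 million new users in a year is 25% annual growth, or roughly 274,000 new users per day on average, which is the kind of number a capacity plan has to absorb.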

Slide 40

Slide 40 text

Engineering with 500 million users AKA: Getting on the rocket ship

Slide 41

Slide 41 text

Infra Engineering With 500 Million Users
• Kubernetes
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring
• Security
• Alerts
• Capacity Planning
• Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers
• Load balancing
• Private Networking

Engineering Tools
• Chaos Engineering (Gremlin)
• Git
• GitHub / Phabricator
• CircleCI
• Code Search (LiveGrep)
• Infra Automation (Terraform)

Coding
Choose two to three approved languages for use:
1. Rust - Systems
2. Python - Scripting/Tools
3. Go - Services

Big Data / Analytics
• Hadoop
• Spark, Pig

Caching
• Memcache

Distributed Datastore
• Built in-house

Moar Specific Infra
• Based on your product

Moar tools! (I can’t fit them)

Slide 42

Slide 42 text

Keep a close eye on metrics, like a hawk!

Slide 43

Slide 43 text

Always be migrating! Have 1+ migration in progress at all times. (e.g. data migrations and framework/tool migrations - from Ember to React)

Slide 44

Slide 44 text

Focus on improving engineering happiness and productivity too.

Slide 45

Slide 45 text

Code ownership becomes very important e.g. owners.yaml
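
A minimal sketch of how an ownership file can be used, assuming it maps path prefixes to owning teams. The talk names owners.yaml but does not show its format, so the mapping, team names, and function below are hypothetical; the dict stands in for what a YAML parser would return.

```python
from typing import Dict, Optional

# Hypothetical contents of an owners.yaml file after parsing.
OWNERS: Dict[str, str] = {
    "services/storage/": "team-storage",
    "services/": "team-platform",
    "tools/onboarding/": "team-devtools",
}


def owner_for(path: str, owners: Dict[str, str] = OWNERS) -> Optional[str]:
    """Return the owner of `path`, using the longest matching prefix."""
    matches = [prefix for prefix in owners if path.startswith(prefix)]
    if not matches:
        return None  # No owner recorded for this path.
    return owners[max(matches, key=len)]
```

Picking the longest matching prefix lets a specific directory override a broader one, so a path under services/storage/ resolves to team-storage rather than team-platform.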

Slide 46

Slide 46 text

Engineering Productivity with 500M users
AKA “simplify and automate all the things”

Slide 47

Slide 47 text

How Do You Dramatically Speed Up Engineering Onboarding? You need @etelsverdlov at your Company Hack Week…! (She is the Director of Community at DigitalOcean)

Slide 48

Slide 48 text

Reduced Eng Onboarding from 4 weeks to 30 min.
No people required to support onboarding.
~ Automate all the things ~
Saved 6500+ engineering hours a month

Slide 49

Slide 49 text

Engineering with any number of users
What do you always need?

Slide 50

Slide 50 text

• Prioritise reliability, durability & performance
• Focus on making sure “it just works”
• Your core product is solid
• Infra and Product Engineering work together
• You sweat the details and aim higher each day!

Slide 51

Slide 51 text

Engineering in 2019
What does the future look like when scaling?

Slide 52

Slide 52 text

Good luck on your journey scaling to hundreds of millions of customers
It’s a wild ride

Slide 53

Slide 53 text

Learn more about scaling @ Chaos Conf
One-day, single-track conference in SF on September 28
Topics include building internet-scale systems, container chaos and chaos engineering.
chaosconf.io
@chaosconf

Slide 54

Slide 54 text

Thank You
Tammy Butow
Principal SRE, Gremlin
@tammybutow