Lessons from scaling to hundreds of millions of users

Lessons from scaling to millions of users
Tammy Butow, Empress of Chaos
Principal SRE, Gremlin, @tammybutow

I was previously an SRE Manager at Dropbox, leading Databases, Magic Pocket and Code Workflows (Dev Tools). Prior to that I worked at DigitalOcean, National Australia Bank, Queensland University of Technology and more. I’m now a Principal SRE at Gremlin.

0 - 500M

There is always a beginning: Engineering before launch (0 customers)

Make life easier for future you. “It just works” is
the goal.

Always prioritise reliability, performance and durability. These are achieved through automation, monitoring, tooling and engineers :) Have a clear “Engineering Principles” paper for your company.

What will you build, borrow, buy and break?

If I were to begin today: Engineering Before Launch

Infra Engineering Before Launch
• Kubernetes (3 primaries & 3 nodes)
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring, Security, Alerts, Capacity Planning, Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers, Load balancing, Private networking
• Engineering Tools: Chaos Engineering (Gremlin), Git, GitHub / Phabricator, CircleCI, Code Search (LiveGrep), Infra Automation (Terraform)
• Coding: choose two to three approved languages: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services)
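
The slide lists sharded MySQL without saying how rows are routed to a shard. As a minimal sketch, assuming hash-based sharding on a numeric user ID with a fixed shard count: the shard names, the `SHARDS` list and the `shard_for` function are hypothetical and not taken from the talk.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(user_id: int, shards=SHARDS) -> str:
    """Pick a shard using a stable hash so every service agrees on the mapping."""
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

A fixed modulus keeps the example short, but changing the shard count remaps most users, which is one reason real deployments often add a lookup table or consistent hashing on top.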

Infra Engineering Before Launch https://kubernetes.io/docs/admin/high-availability/building/#replicated-api-servers

Product Engineering Before Launch
• Web: React (built by Facebook). React is used by: everyone :P Most big tech companies in the Bay Area use React or are moving to React.
• Mobile*: Native Development (iOS, Android). Native development will give you better performance.
• Desktop*: Electron JS (built by GitHub). Electron is used by: GitHub, Slack. Most big tech companies in the Bay Area are moving to Electron.
• API: Swagger (built by ). Swagger is used by: Gremlin. Most big tech companies in the Bay Area have an API. Launching with an API makes sense!
* Mobile: only if you are mobile first. Desktop: only if you are desktop first.

Product Engineering Before Launch

Monorepo

Code Search

Then the growth begins: Engineering with 5 Enterprise Customers

You want to retain customers and scale fast!

Continue to prioritise reliability, perf and durability. Use your monitoring,
logging and observability tooling.

Build self-healing systems and automate your infrastructure so you don’t
get paged.

Infra and product engineers should continue to meet and speak
with customers to get feedback.

Infra Engineering With 5 Enterprise Customers
• Kubernetes
• Sharded MySQL (Percona Community with semi-sync replication)
• Monitoring, Security, Alerts, Capacity Planning, Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers, Load balancing, Private networking
• Engineering Tools: Chaos Engineering (Gremlin), Git, GitHub / Phabricator, CircleCI, Code Search (LiveGrep), Infra Automation (Terraform)
• Coding: choose two to three approved languages: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services)
• Small Data: Mixpanel
• Specific Infra: based on your product

NOW LET’S GET TO THE MILLIONS: A DIFFERENT WORLD

Now you have millions of users: Engineering with 3 million
users

You can start to think about big data. You can
do experiments at scale.

You will have started to build out infra specific for
your product features and optimised for your own workload.

Infra Engineering With 3 Million Users
• Alerts, Capacity Planning, Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers, Load balancing, Private networking
• Engineering Tools: Chaos Engineering (Gremlin), Git, GitHub / Phabricator, CircleCI, Code Search (LiveGrep), Infra Automation (Terraform)
• Coding: choose two to three approved languages: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services)
• Big Data / Analytics: Hadoop, Spark, Pig
• Caching: Memcache
• Moar Specific Infra: based on your product

Engineering with 50 million users

Infra Engineering With 50 Million Users
• Alerts, Capacity Planning, Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers, Load balancing, Private networking
• Engineering Tools: Chaos Engineering (Gremlin), Git, GitHub / Phabricator, CircleCI, Code Search (LiveGrep), Infra Automation (Terraform)
• Coding: choose two to three approved languages: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services)
• Big Data / Analytics: Hadoop, Spark, Pig
• Caching: Memcache
• Moar Specific Infra: based on your product

From 400 million to 500 million users in one very
fast year AKA: Getting on the rocket ship

2016: Linux Con AU 2016
2017: GopherCon 2017

Engineering with 400 million users

Infra Engineering With 400 Million Users
• Alerts, Capacity Planning, Support SLAs
• Backups (Percona XtraBackup, short and long)
• Docker Containers, Load balancing, Private Networking
• Engineering Tools: Chaos Engineering (Gremlin), Git, GitHub / Phabricator, CircleCI, Code Search (LiveGrep), Infra Automation (Terraform)
• Coding: choose two to three approved languages: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services)
• Big Data / Analytics: Hadoop, Spark, Pig
• Caching: Memcache
• Distributed Datastore: built in-house
• Moar Specific Infra: based on your product
• Moar tools! (I can’t fit them)

You need to be able to zoom out with tools
to make quick and important decisions

You build simple and useful tools for all engineers and
other departments (e.g. self-service analytics dashboards, cloud infra allocation CLI tools)

You do performance tuning for your cloud infra because you
sweat the details (e.g. Linux performance governor and CPU hyperthreading settings).
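
One way to sweat this particular detail is to audit the CPU frequency governor across a host. The following is a minimal sketch, assuming a Linux machine that exposes cpufreq through sysfs; the function names `read_governors` and `not_in_performance_mode` are hypothetical, and the talk does not say which tooling was actually used.

```python
from pathlib import Path


def read_governors(sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each CPU directory name (e.g. 'cpu0') to its current scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def not_in_performance_mode(governors: dict) -> list:
    """Return the CPUs whose governor is anything other than 'performance'."""
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")


if __name__ == "__main__":
    # On hosts without cpufreq support this simply prints an empty list.
    print(not_in_performance_mode(read_governors()))
```

Taking the sysfs root as a parameter keeps the check easy to exercise against a fake directory tree before running it on a fleet.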

• Prioritise capacity planning
• Create org and team roadmaps, but stay flexible
• IQRs (infra quarterly reviews) are useful
• Give teams 20% time to work on KTLO (keeping the lights on)
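
Capacity planning at its simplest is a runway calculation. Here is a minimal sketch, assuming linear growth; the function name `days_until_full` and the numbers in the usage note are illustrative only and do not come from the talk.

```python
def days_until_full(capacity_tb: float, used_tb: float, daily_growth_tb: float) -> float:
    """Estimate days of headroom left, assuming usage grows linearly."""
    if daily_growth_tb <= 0:
        # Flat or shrinking usage never exhausts capacity under this model.
        return float("inf")
    remaining_tb = max(capacity_tb - used_tb, 0.0)
    return remaining_tb / daily_growth_tb
```

For example, a hypothetical 1000 TB cluster with 700 TB used and 2.5 TB of daily growth has `days_until_full(1000, 700, 2.5)` days of runway, which is 120. Real growth is rarely linear, so a projection like this is a starting point for the conversation rather than a forecast.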

Engineering with 500 million users AKA: Getting on the rocket
ship

Alerts Capacity Planning Support SLAs Backups (Percona xtra backup, short and long) Docker Containers Load balancing Private Networking Engineering Tools Chaos Engineering (Gremlin) GIT GitHub / Phabricator Circle CI Code Search (LiveGrep) Infra Automation (Terraform) Infra Engineering With 500 Million Users Coding Choose two - three approved languages for use: 1. Rust - Systems 2. Python - Scripting/Tools 3. Go - Services Hadoop Spark, Pig Big Data / Analytics Caching Memcache Moar tools! (I can’t fit them) Distributed Datastore Built in-house Moar Specific Infra Based on your product

Keep a close eye on metrics, like a hawk!

Always be migrating! Have 1+ migration in progress at all
times. (e.g. data migrations and framework/tool migrations - from Ember to React)

Focus on improving engineering happiness and productivity too.

Code ownership becomes very important e.g. owners.yaml
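
The talk names owners.yaml but does not show its format. As a minimal sketch, assume the file maps path prefixes to owning teams and has already been parsed into a dict; the `OWNERS` contents, the team names and the `owner_of` function are all hypothetical.

```python
from typing import Optional

# Hypothetical contents of an owners.yaml file, already parsed into a dict.
OWNERS = {
    "services/payments/": "team-payments",
    "services/": "team-platform",
    "tools/": "team-dev-tools",
}


def owner_of(path: str, owners=OWNERS) -> Optional[str]:
    """Return the owner of the longest matching path prefix, or None if unowned."""
    matches = [prefix for prefix in owners if path.startswith(prefix)]
    if not matches:
        return None
    # The most specific (longest) prefix wins over broader ones.
    return owners[max(matches, key=len)]
```

Returning None for unowned paths makes gaps in ownership easy to list, which matters when a monorepo is shared by many teams.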

Engineering Productivity with 500m users Aka “simplify and automate all
the things”

How Do You Dramatically Speed Up Engineering Onboarding? You need
@etelsverdlov at your Company Hack Week…! (She is the Director of Community at DigitalOcean)

Reduced Eng Onboarding from 4 weeks to 30 min. No people
required to support onboarding. ~ Automate all the things ~ Saved 6500+ engineering hours a month.

Engineering with any number of users What do you always
need?

• Prioritise reliability, durability & performance
• Focus on making sure “it just works”
• Your core product is solid
• Infra and Product Engineering work together
• You sweat the details and aim higher each day!

Engineering in 2019 What does the future look like when
scaling?

Good luck on your journey scaling to hundreds of millions of customers.
It’s a wild ride.

Learn more about scaling @ Chaos Conf: a one-day, single-track conference in SF on September 28. Topics include building internet-scale systems, container chaos and chaos engineering. chaosconf.io @chaosconf

Thank You Tammy Butow  Principal SRE, Gremlin  @tammybutow
