Lessons from scaling to hundreds of millions of users

Lessons from scaling to hundreds of millions of users is a talk I presented at DigitalOcean TIDE 2018 in NYC.

Tammy Bütow

April 25, 2018

Transcript

  1. Lessons from scaling to millions of users. Tammy Butow, Empress of Chaos, Principal SRE, Gremlin, @tammybutow
  2. Lessons from scaling to hundreds of millions of users. Tammy Butow, Empress of Chaos, Principal SRE, Gremlin, @tammybutow
  3. I was previously an SRE Manager @ Dropbox, leading Databases, Magic Pocket and Code Workflows (Dev Tools). Prior to that I worked @ DigitalOcean, National Australia Bank, Queensland University of Technology + more. I’m now a Principal SRE @ Gremlin.
  4. 0 - 500M

  5. None
  6. There is always a beginning: Engineering before launch (0 customers)

  7. Make life easier for future you. “It just works” is the goal.
  8. Always prioritise reliability, performance and durability. Achieved through automation, monitoring, tooling and engineers : ) Have a clear “Engineering Principles” paper for your company.
  9. What will you: Build, borrow, buy and break?

  10. If I was to begin today: Engineering Before Launch

  11. Infra Engineering Before Launch: Kubernetes (3 primaries & 3 nodes); Sharded MySQL (Percona Community with semi-sync replication); Monitoring; Security; Alerts; Capacity Planning; Support SLAs; Backups (Percona XtraBackup, short and long); Docker Containers; Load balancing; Private networking. Engineering Tools: Chaos Engineering (Gremlin); Git; GitHub / Phabricator; CircleCI; Code Search (LiveGrep); Infra Automation (Terraform). Coding: choose two to three approved languages for use: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services).
  12. Infra Engineering Before Launch https://kubernetes.io/docs/admin/high-availability/building/#replicated-api-servers

  13. Infra Engineering Before Launch https://kubernetes.io/docs/admin/high-availability/building/#replicated-api-servers

  14. Product Engineering Before Launch. Web: React (built by Facebook). React is used by: Everyone :P Most big tech companies in the Bay Area use React or are moving to React. Mobile*: Native Development (iOS, Android). Native Development will give you better performance. Desktop*: Electron JS (built by GitHub). Electron is used by: GitHub, Slack. Most big tech companies in the Bay Area are moving to Electron. API: Swagger (built by ). Swagger is used by: Gremlin. Most big tech companies in the Bay Area have an API. Launching with an API makes sense! * Mobile: only if you are mobile first. * Desktop: only if you are desktop first.
  15. Product Engineering Before Launch

  16. Monorepo

  17. Code Search

  18. Then the growth begins: Engineering with 5 Enterprise Customers

  19. You want to retain customers and scale fast!

  20. Continue to prioritise reliability, perf and durability. Use your monitoring, logging and observability tooling.
  21. Build self-healing systems and automate your infrastructure so you don’t get paged.
  22. Infra and product engineers should continue to meet and speak with customers to get feedback.
  23. Infra Engineering With 5 Enterprise Customers: Kubernetes; Sharded MySQL (Percona Community with semi-sync replication); Monitoring; Security; Alerts; Capacity Planning; Support SLAs; Backups (Percona XtraBackup, short and long); Docker Containers; Load balancing; Private networking. Engineering Tools: Chaos Engineering (Gremlin); Git; GitHub / Phabricator; CircleCI; Code Search (LiveGrep); Infra Automation (Terraform). Coding: choose two to three approved languages for use: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services). Small Data: Mixpanel. Specific Infra: based on your product.
  24. NOW LET’S GET TO THE MILLIONS: A DIFFERENT WORLD

  25. Now you have millions of users: Engineering with 3 million users
  26. You can start to think about big data. You can do experiments at scale.
  27. You will have started to build out infra specific for your product features and optimised for your own workload.
  28. Infra Engineering With 3 Million Users: Kubernetes; Sharded MySQL (Percona Community with semi-sync replication); Monitoring; Security; Alerts; Capacity Planning; Support SLAs; Backups (Percona XtraBackup, short and long); Docker Containers; Load balancing; Private networking. Engineering Tools: Chaos Engineering (Gremlin); Git; GitHub / Phabricator; CircleCI; Code Search (LiveGrep); Infra Automation (Terraform). Coding: choose two to three approved languages for use: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services). Big Data / Analytics: Hadoop, Spark, Pig. Caching: Memcache. Moar Specific Infra: based on your product.
  29. Engineering with 50 million users

  30. Infra Engineering With 50 Million Users: Kubernetes; Sharded MySQL (Percona Community with semi-sync replication); Monitoring; Security; Alerts; Capacity Planning; Support SLAs; Backups (Percona XtraBackup, short and long); Docker Containers; Load balancing; Private networking. Engineering Tools: Chaos Engineering (Gremlin); Git; GitHub / Phabricator; CircleCI; Code Search (LiveGrep); Infra Automation (Terraform). Coding: choose two to three approved languages for use: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services). Big Data / Analytics: Hadoop, Spark, Pig. Caching: Memcache. Moar Specific Infra: based on your product.
  31. From 400 million to 500 million users in one very fast year. AKA: Getting on the rocket ship
  32. 2016 2017 Linux Con AU 2016 GopherCon 2017

  33. None
  34. Engineering with 400 million users

  35. Infra Engineering With 400 Million Users: Kubernetes; Sharded MySQL (Percona Community with semi-sync replication); Monitoring; Security; Alerts; Capacity Planning; Support SLAs; Backups (Percona XtraBackup, short and long); Docker Containers; Load balancing; Private Networking. Engineering Tools: Chaos Engineering (Gremlin); Git; GitHub / Phabricator; CircleCI; Code Search (LiveGrep); Infra Automation (Terraform). Coding: choose two to three approved languages for use: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services). Big Data / Analytics: Hadoop, Spark, Pig. Caching: Memcache. Distributed Datastore: built in-house. Moar Specific Infra: based on your product. Moar tools! (I can’t fit them)
  36. You need to be able to zoom out with tools to make quick and important decisions
  37. You build simple and useful tools for all engineers and other departments (e.g. self-service analytics dashboards, cloud infra allocation CLI tools)
  38. You do performance tuning for your cloud infra because you sweat the details. (e.g. Linux performance governor and CPU hyperthreading settings)
  39. • Prioritise capacity planning • Create org and team roadmaps, but stay flexible • IQRs are useful (infra quarterly reviews) • Give teams 20% time to work on KTLO (keep the lights on)
  40. Engineering with 500 million users. AKA: Getting on the rocket ship
  41. Infra Engineering With 500 Million Users: Kubernetes; Sharded MySQL (Percona Community with semi-sync replication); Monitoring; Security; Alerts; Capacity Planning; Support SLAs; Backups (Percona XtraBackup, short and long); Docker Containers; Load balancing; Private Networking. Engineering Tools: Chaos Engineering (Gremlin); Git; GitHub / Phabricator; CircleCI; Code Search (LiveGrep); Infra Automation (Terraform). Coding: choose two to three approved languages for use: 1. Rust (Systems), 2. Python (Scripting/Tools), 3. Go (Services). Big Data / Analytics: Hadoop, Spark, Pig. Caching: Memcache. Distributed Datastore: built in-house. Moar Specific Infra: based on your product. Moar tools! (I can’t fit them)
  42. Keep a close eye on metrics, like a hawk!

  43. Always be migrating! Have 1+ migration in progress at all times. (e.g. data migrations and framework/tool migrations - from Ember to React)
  44. Focus on improving engineering happiness and productivity too.

  45. Code ownership becomes very important e.g. owners.yaml

  46. Engineering Productivity with 500m users. AKA “simplify and automate all the things”
  47. How Do You Dramatically Speed Up Engineering Onboarding? You need @etelsverdlov at your Company Hack Week…! (She is the Director of Community at DigitalOcean)
  48. Reduced Eng Onboarding from 4 weeks to 30 min. No people required to support onboarding. ~ Automate all the things ~ Saved 6500+ engineering hours a month.
  49. Engineering with any number of users: What do you always need?
  50. • Prioritise reliability, durability & performance • Focus on making sure “it just works” • Your core product is solid • Infra and Product Engineering work together • You sweat the details and aim higher each day!
  51. Engineering in 2019: What does the future look like when scaling?
  52. Good luck on your journey scaling to hundreds of millions of customers. It’s a wild ride
  53. Learn more about scaling @ Chaos Conf. One-day, single-track conference in SF on September 28. Topics include building internet-scale systems, container chaos and chaos engineering. chaosconf.io @chaosconf
  54. Thank You. Tammy Butow, Principal SRE, Gremlin, @tammybutow
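
The infra slides list sharded MySQL at every stage of growth but do not say how rows are assigned to shards. As a minimal sketch, assuming a fixed shard count and hash-based routing on a user ID (both are assumptions for illustration, not details from the talk), a routing function might look like the following. `NUM_SHARDS` and `shard_for` are hypothetical names.

```python
import hashlib

# Assumed shard count, chosen only for illustration.
NUM_SHARDS = 8


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index in the range [0, num_shards)."""
    # Use a stable hash so every process and every restart agrees on the mapping.
    # Python's built-in hash() is randomised per process for strings, so it is
    # not suitable here.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it
    # to a shard index with a modulo.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, "-> shard", shard_for(uid))
```

Plain modulo routing is simple, but changing `num_shards` remaps most keys. Systems that expect to reshard often route through a lookup table or consistent hashing instead.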
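
Slide 45 names owners.yaml as an example of code ownership. The talk does not show the file's schema, so the sketch below assumes a simple mapping from directory to owning team, already parsed into a dict, and resolves a file to the owner of its deepest matching directory. The directory names, team names, and the `find_owner` helper are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical ownership map, as it might look after parsing an owners.yaml
# file. Directory and team names are made up for illustration.
OWNERS = {
    "services/payments": "team-payments",
    "services": "team-platform",
    "infra/terraform": "team-infra",
}


def find_owner(path, owners=OWNERS, default="unowned"):
    """Return the owner of the deepest directory prefix that matches `path`."""
    current = PurePosixPath(path)
    # Check the path itself, then each parent directory from deepest to
    # shallowest, so more specific entries win over general ones.
    for candidate in [current, *current.parents]:
        owner = owners.get(str(candidate))
        if owner is not None:
            return owner
    return default


if __name__ == "__main__":
    print(find_owner("services/payments/api/charge.py"))
    print(find_owner("services/search/index.py"))
    print(find_owner("docs/readme.md"))
```

A lookup like this could be used to route code review requests or alerts to the owning team, with the `default` value flagging files that nobody owns yet.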