Move to the Cloud, Double in Size or Automated MySQL Scaling: Pick Three

The slides from [my talk at 2018's Datadog Dash conference](https://www.youtube.com/watch?v=6t8ae1Jgf_s), about moving Shopify's data to the cloud _while_ growing and without taking downtime.

Aaron Brady

July 31, 2018

Transcript

  1. Move to the Cloud, Double in Size, or Automate MySQL Scaling: Pick Three
  2. What even is Shopify? What is Shopify?

  3. What is Shopify?

  4. What is Shopify?

  5. What is Shopify? Hi

  6. What is Shopify? 600K Active Merchants; 80K RPS Peak; $26B GMV (2017); 800 Developers; 40 Deploys/Day
  7. What is Shopify?

  8. What is Shopify?

  9. What is Shopify? RW RO RO RO RO RO Active

    Location Passive Location
  10. What is Shopify? RW RO RO RO RO RO Active

    Location Passive Location
  11. What is Shopify? Shard 1 Shard 3 Shard 2 Shop

    Shop Shop Shop Shop Master Catalog
  12. What is Shopify? Pod 1 Pod 3 Pod 2 MySQL

    Shared Redis Cache Cron MySQL Redis Cache Cron MySQL Redis Cache Cron Web Search
  13. What is Shopify? ≠

  14. What is Shopify? Pod 1 Pod 3 Pod 2 MySQL

    Shared Redis Cache Cron MySQL Redis Cache Cron MySQL Redis Cache Cron Web Search
  15. 2016

  16. CC BY SA Victor Grigas

  17. Virtual IPs (Layer 3) 2016

  18. (Thousands of) Direct Connections 2016

  19. None
  20. None
  21. None
  22. Move to the Cloud

  23. None
  24. Move RW RO RO RO RO RO Shops 1, 2,

    3 and 4
  25. Move RW RO RO RO RO RO RO RO RO

    RO RO RO Shops 1, 2, 3 and 4 Shops 1, 2, 3 and 4
  26. Move RW RO RO RO RO RO RW RO RO

    RO RO RO Shops 1 and 3 (and old 2 and 4) Shops 2 and 4 (and old 1 and 3)
  27. 103 Items

  28. Move. Columns: Affected Merchants / Length of Outage. Shard Split: Many / Short. Shop Mover: (not yet filled in)
  29. OLD SHARD NEW SHARD COPY CUTOVER CUSTOMERS LOCK Move

  30. OLD SHARD NEW SHARD COPY CUTOVER CUSTOMERS LOCK Move SELECT

    * FROM orders INSERT INTO orders ...
  31. OLD SHARD NEW SHARD COPY CUTOVER CUSTOMERS LOCK UNLOCK Shop

    is read only for hours or days.
  32. Real Data Doesn't Stay Still Move

  33. Move. Columns: Affected Merchants / Length of Outage. Shard Split: Many / Short. Shop Mover: Few / Long. ???: Few / Short
  34. Ghostferry Formally Verified Open Source (MIT) Move

  35. OLD SHARD NEW SHARD COPY STREAM CUTOVER CUSTOMERS Move

  36. OLD SHARD NEW SHARD COPY STREAM CUTOVER CUSTOMERS Move

  37. OLD SHARD NEW SHARD COPY STREAM CUTOVER CUSTOMERS LOCK UNLOCK

    Shop is read only for seconds.
  38. None
  39. Move. Columns: Affected Merchants / Length of Outage. Shard Split: Many / Short. Shop Mover: Few / Long. Pod Balancer: Few / Short
  40. Background Jobs Single or Multiple Shops 24x7, ~0 down time

    Move
  41. None
  42. Empty Shards Broken Tooling 10x Scaling Move

  43. (another redacted checklist) 23 Items

  44. Moving Targets Every other team was moving Working around maintenance

    Blocking developers and features Move
  45. Maintenance Replacement Automation Move

  46. Automate Anything Move

  47. Move

  48. 10 Items

  49. Resiliency Experiments Move

  50. Double in Size

  51. ~1 Year to Move 50% BFCM happens every year! Both

    locations at peak performance Double
  52. Nothing is deprecated when it's in production Double

  53. Best Before Removal Tried to start over Refactored even EOL

    code Double
  54. Double ~100,000 ~1,000 1

  55. Double

  56. ProxySQL ProxySQL VIP ProxySQL Kubernetes Service ProxySQL ProxySQL Chef managed

    hardware Containers on VMs Double
  57. This is also a people problem Double

  58. Automate

  59. None
  60. Trust, But Verify Automate

  61. Formal verification Distributed systems (Which break) Automate

  62. Include a Human (Until it's boring) Automate

  63. (orchestrator recovery screenshot) Automate

  64. None
  65. None
  66. Cooldown Timers (Screenshot of refusing to failover) Automate

  67. Perform Holistic Checks Automate

  68. Automate RW RO RO RO RO RO Monitoring

  69. Automate RW RO RO RO RO RO Monitoring

  70. Outcome: Service is healthy. Network is unreliable. Automate RW RO

    RO RO RO RO Monitoring
  71. Automate

  72. Sanity Checks Automate

  73. Automate ProxySQL ProxySQL ProxySQL Outcome: Remove the dead database.

  74. Outcome: Do nothing! There's no safe action. Automate ProxySQL ProxySQL

    ProxySQL
  75. Response: Test the supposedly failed database. Outcome: Do nothing. Automate

    ProxySQL ProxySQL ProxySQL
  76. None
  77. You Will Fail (And that's okay) (Sometimes) Automate

  78. Move, Then vs. Now. New Shard Time: ~3 Days vs. 10 in a Day. Downtime: Large, posted in advance vs. Not noticeable. Process: Huge check lists vs. Still check lists (smaller)

  79. Double, Then vs. Now. Technology: Chef vs. k8s (also Chef). MySQL Scaling: 1000's of Connections vs. 10s of Connections. Shards: Dozens vs. 100+

  80. Automate, Then vs. Now. Failovers: CLI driven and complex vs. GUI driven and easy to understand. Alerting: "CRITICAL: Host is down" vs. "Queries per second has fallen 50%". DB Monitoring: Paging vs. Fixing it for us
  81. Too Long; Didn't Listen

  82. None
  83. Related Shopify talks:

    Kir Shatrov: Running Jobs at Scale https://speakerdeck.com/kirs/running-jobs-at-scale

    Camilo Lopez: How Shopify Sharded Rails https://www.youtube.com/watch?v=6njTQdFLz6I

    Emil Stolarsky: Failovers https://www.youtube.com/watch?v=g93wNLPdkFU

    Jordan Wheeler / Sami Ahlroos: Automatic Failovers https://tinyurl.com/percona-talk

    Shuhao Wu: Ghostferry https://tinyurl.com/percona-shuhao

    Image credits:

    Fantasia Copyright 1940 Walt Disney Company

    Tanks Photo by Manny Moss - CC-BY-ND - https://flic.kr/p/aahdGu

    Servers Photo by Victor Grigas - CC-BY-SA

    Earthworm Photo by Greg Goebel - CC-BY-SA - https://flic.kr/p/cJR7Do
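Slides 29 to 37 contrast two ways of moving a shop between shards: lock first and copy (read only for hours or days), or copy while streaming changes and lock only for the final cutover (read only for seconds). The toy sketch below illustrates the second ordering with plain in-memory dictionaries. It is not Ghostferry's code or API: `move_shop`, `live_writes`, and `batch_size` are hypothetical names, and the list used as a change stream is an assumed stand-in for a real replication stream.

```python
def move_shop(old, new, live_writes, batch_size=2):
    """Toy COPY + STREAM + CUTOVER between two dict-based "shards".

    old, new    -- dicts standing in for the old and new shard (hypothetical)
    live_writes -- (key, value) writes that arrive while the copy is running
    Returns the number of changes that had to be applied under the lock.
    """
    stream = []                      # changes seen on `old`, not yet applied to `new`
    writes = iter(live_writes)
    keys = sorted(old)

    # COPY + STREAM: the shop stays writable during this whole phase.
    for start in range(0, len(keys), batch_size):
        # Replay whatever the stream has collected so far, in order.
        for key, value in stream:
            new[key] = value
        stream.clear()

        # Copy the next batch of rows as they exist right now.
        for key in keys[start:start + batch_size]:
            new[key] = old[key]

        # Simulate one customer write landing on the old shard mid-copy.
        write = next(writes, None)
        if write is not None:
            key, value = write
            old[key] = value
            stream.append((key, value))

    # LOCK: no more writes are accepted on `old`; only the stream tail remains.
    applied_under_lock = 0
    for key, value in stream:
        new[key] = value
        applied_under_lock += 1
    stream.clear()

    # Verify before CUTOVER; refuse to continue if the copies differ.
    if new != old:
        raise RuntimeError("shards differ, aborting cutover")

    # UNLOCK would happen here, with traffic now routed to `new`.
    return applied_under_lock


if __name__ == "__main__":
    old_shard = {1: "a", 2: "b", 3: "c", 4: "d"}
    new_shard = {}
    locked = move_shop(old_shard, new_shard, [(2, "B"), (5, "e")])
    print(locked, new_shard == old_shard)
```

In this toy run only the last streamed change is applied while the lock is held, which is the point of the streaming approach: the locked window scales with the stream tail rather than with the size of the shop. Writes left over in `live_writes` after the copy finishes are simply never simulated here.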