The great microservices migration

How did Uber go from a 450,000-line monolithic Python application to more than 1,000 microservices? This short presentation focuses on the technical aspects of this 5-year migration, and concludes with its cultural and management challenges.

Charles-Axel Dein

October 19, 2017

Transcript

  1. Who am I?
     • Charles-Axel Dein - [email protected]
     • Payments Engineering Manager at Uber in Amsterdam
     • Born and raised in Nantes :)
  2. Joined Uber in July 2012
     An incredible growth...

                    July 2012   Oct 2017
     Uber's age     2           7
     Cities         10          600+
     Engineers      20          2,000+
  3. API's scaling difficulties, circa 2015
     • Running out of PostgreSQL master DB connections
     • Running out of memory on machines (≈ 1.5 GB RAM)
     • Translations growing and using ≈ 1 GB RAM
  4. ⚠ Too many plans look like [launching] a rocket ship. [Yet] tiny errors in assumptions can lead to catastrophic outcomes.
     (Eric Ries, Lean Startup)
  5. Prerequisite 1: business monitoring and alerting
     • ❌ CPU utilization
     • ❌ RAM
     • ✅ Number of signups per device
     • ✅ Number of signups per channel
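One way to read this slide is that alerts should fire on business outcomes rather than on machine metrics. The following is a minimal sketch under stated assumptions: the in-memory counter, the function names, and the 0.5 ratio are all hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical in-memory store; a real setup would emit to a metrics system.
signups = Counter()


def record_signup(channel, device):
    """Count one signup, broken down by channel and by device."""
    signups[("channel", channel)] += 1
    signups[("device", device)] += 1


def channels_below_baseline(baseline, min_ratio=0.5):
    """Return the channels whose signup count is below min_ratio * baseline.

    `baseline` maps channel -> expected signups for the same time window.
    The 0.5 default is an arbitrary illustration, not a recommended value.
    """
    return sorted(
        channel
        for channel, expected in baseline.items()
        if signups[("channel", channel)] < min_ratio * expected
    )
```

An alert on `channels_below_baseline` would catch a broken signup flow on one channel even when CPU and RAM look healthy.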
  6. Prerequisite 2: fast config rollout (or feature flags)

     import random

     def get_user(user_uuid):
         # Send a configurable fraction of calls through the new flow
         if random.random() < config.get('use_new_flow_probability'):
             use_new_flow()
         else:
             use_old_flow()
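The slide's snippet draws a new random number on every call, so the same user can alternate between the old and the new flow. A common variant, which is not shown in the talk, hashes the user identifier so that each user stays on one side of the flag. A minimal sketch, assuming a hypothetical `in_rollout` helper:

```python
import hashlib


def in_rollout(user_uuid, flag_name, probability):
    """Return True for a stable fraction `probability` of users.

    Hashing (flag_name, user_uuid) gives each user a fixed 64-bit bucket,
    so raising the probability only ever adds users to the new flow.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_uuid}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 0 <= bucket < 2**64
    return bucket < probability * 2**64
```

Including the flag name in the hash keeps rollouts of different flags independent of one another.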
  7. Prerequisite 3: abstract storage layer

     class UsersSQLRepository():
         def create(...):
             ...

         @staticmethod
         def get(user_uuid):
             # Reads the user from the monolith's SQL database
             user = sql.connect(...).execute("select ...")
             return user

     class UsersServiceRepository():
         @staticmethod
         def get(user_uuid):
             # Same interface, but reads from the new service over HTTP
             user = http.connect(...).get("/users/...")
             return user
  8. Shadowing reads

     # In the monolith
     def get_user(user_uuid):
         monolith_user = UsersSQLRepository.get(user_uuid)
         new_user = UsersServiceRepository.get(user_uuid)
         verify(monolith_user, new_user)  # Verify that they match
         return monolith_user  # ✅ we are returning the "safe" user
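The slide calls `verify` without defining it. Below is a minimal sketch of what such a function might do, assuming users are plain dicts and assuming a mismatch should be logged rather than raised, so that the shadow read can never break the monolith's response. The field-by-field comparison and the logger name are assumptions, not the talk's implementation.

```python
import logging

logger = logging.getLogger("shadow_reads")  # hypothetical logger name


def verify(monolith_user, new_user):
    """Compare two user dicts and log the fields that differ.

    Returns the sorted list of mismatching field names. It never raises on
    a mismatch: the caller keeps serving the monolith's value.
    Note: a missing field and a field set to None compare as equal here.
    """
    fields = set(monolith_user) | set(new_user)
    mismatches = sorted(
        field
        for field in fields
        if monolith_user.get(field) != new_user.get(field)
    )
    if mismatches:
        logger.warning("shadow read mismatch on fields: %s", mismatches)
    return mismatches
```

Tracking the mismatch rate over time is one way to decide when the new service is trustworthy enough to start serving reads.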
  9. Reverse shadowing reads

     # In the monolith
     def get(user_uuid):
         ...  # read from both, verify
         if should_use_new_service():  # feature flag
             return new_user
         else:
             return monolith_user
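The reverse-shadowing slide elides the read-and-verify part. A self-contained sketch of the whole function could look like the following, with the two readers, the feature flag, and the mismatch handler passed in as hypothetical callables. Falling back to the monolith when the new service fails is an added assumption, not something the slide states.

```python
def get_user(user_uuid, read_monolith, read_new_service,
             use_new_service, on_mismatch):
    """Read from both stores, compare, and return the flagged source."""
    monolith_user = read_monolith(user_uuid)
    try:
        new_user = read_new_service(user_uuid)
    except Exception:
        # Assumption: while migrating, a failing new service must not
        # break reads, so fall back to the monolith's data.
        return monolith_user
    if new_user != monolith_user:
        on_mismatch(user_uuid, monolith_user, new_user)
    # The feature flag decides which copy is served to the caller.
    return new_user if use_new_service() else monolith_user
```

Because the flag is evaluated on each call, flipping it back restores the monolith as the source of reads without a deploy.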
  10. This requires productionization
      • Testing the new storage layer
      • Distributed transactions
      • Data analytics
      • ...
  11. Results after step 3
      1. ✅ Data is migrated
      2. ✅ Data is kept up to date
      3. ✅ All reads are going to the new service
      4. ➡ We can delete the old data
  12. Migrating customers is an opportunity to redesign
      • Fix some tech/product debt
      • Bring a fresh viewpoint
      • E.g. move to event sourcing
      • E.g. better separate offline/online queries
      • Make the interface microservices-aware
  13. Results after step 4
      1. ✅ Data is migrated
      2. ✅ Data is kept up to date
      3. ✅ All reads are going to the new service
      4. ✅ All consumers are going to the new service
      5. ➡ We can delete the old code
  14. Summary: a bottom-up approach
      • Step 0: rough plan
      • Step 1: rope bridge
      • Step 2: migrate the data (writes)
      • Step 3: migrate the storage layer (reads)
      • Step 4: migrate consumers
      • Iterate for all services!
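The transcript does not say how the writes of step 2 were migrated. One common approach is dual writes: the monolith stays the source of truth and every write is also sent to the new service on a best-effort basis. The sketch below illustrates that idea under those assumptions; the class names and the `failed_writes` list are hypothetical.

```python
class InMemoryRepository:
    """Stand-in for either storage backend in this sketch."""

    def __init__(self):
        self.rows = {}

    def create(self, user_uuid, data):
        self.rows[user_uuid] = data


class DualWriteUsersRepository:
    """Write to the primary store first, then mirror to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.failed_writes = []  # uuids to reconcile later

    def create(self, user_uuid, data):
        # A failure on the primary (source of truth) propagates to the caller.
        self.primary.create(user_uuid, data)
        try:
            self.secondary.create(user_uuid, data)
        except Exception:
            # Assumption: a failed mirror write is recorded, not raised,
            # so the new service cannot break existing writes.
            self.failed_writes.append(user_uuid)
```

Anything recorded in `failed_writes` would need a separate backfill or reconciliation job, since dual writes alone do not guarantee that the two stores stay consistent.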
  15. There are so many decisions to make...
      1. RPC (transport, interface, sync/async, etc.)
      2. Debugging (logs, tracing, etc.)
      3. Security (authN, authZ, logging sensitive data, etc.)
      4. ... too many topics, so we'll only chat about testing
  16. Uber's testing strategies
      1. Unit, integration, component testing
      2. Staging environment (few, very costly)
      3. End-to-end tests (very few, anti-pattern)
      4. Testing on production: canary deploys
      5. Tenancies on production!
  17. The usual testing on prod method does not work with microservices
      • ❌ Require awareness of side effects
      • ❌ Difficult to share with other teams
  18. Test tenancies example

      import time

      def charge_trip(rider_uuid, trip):
          """Charge a rider for a trip."""
          if trip.tenancy == "test":
              time.sleep(0.5)  # Mimics external call
              return
          ...  # continue charge flow for non-test users
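For a check like `trip.tenancy == "test"` to work in every service, the tenancy has to travel with each request. The talk does not show how this was done; the sketch below assumes it is carried in a request header. The header name `X-Tenancy` and both helper functions are hypothetical.

```python
TENANCY_HEADER = "X-Tenancy"  # hypothetical header name
DEFAULT_TENANCY = "production"


def tenancy_from_headers(headers):
    """Read the tenancy of an incoming request, defaulting to production."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TENANCY_HEADER.lower():
            return value
    return DEFAULT_TENANCY


def outgoing_headers(tenancy, extra=None):
    """Build headers for a downstream call, forwarding the tenancy."""
    headers = dict(extra or {})
    headers[TENANCY_HEADER] = tenancy
    return headers
```

Defaulting to production when the header is absent means a service that forgets to forward the tenancy would treat test traffic as real, so in this sketch forwarding must be done on every downstream call.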
  19. Benefits of using a test tenancy
      • ✅ All the advantages of testing on production
      • ✅ Allow teams to test autonomously
      • ❌ Is not suitable for all testing
  20. What to learn and how to learn it
      • What: speed AND quality
      • What: resilience > intelligence
      • How: standardize!
      • How: schedule learning time
  21. How: standardization speeds up learning
      • Counter analysis paralysis!
      • Example: programming languages
      • Example: RFC process
  22. How: schedule time for learning!
      • Chaos testing
      • Blameless incident reviews
      • External and internal blog
      • Informal "brown bag" lunch & learn
      • ...
  23. Book recommendations
      • Release It!, Michael T. Nygard (lots of great patterns, great discussions)
      • Scalability Rules, Martin Lee Abbott, Michael T. Fisher (super concise)
      • Building Microservices, Sam Newman (quite complete discussion of microservices)
  24. List of references
      • Service-Oriented Architecture: Scaling the Uber Engineering Codebase As We Grow, Uber Engineering Blog
      • Lessons Learned from Scaling Uber to 2,000 Engineers, 1,000 Services, and 8,000 Git repositories, High Scalability
      • MonolithFirst, Martin Fowler
      • Testing Strategies in a Microservice Architecture, Toby Clemson, ThoughtWorks
      • charlax/professional-programming: a collection of full-stack resources for programmers.
  25. Annexes: some topics I did not talk about
      • How to create components within the monolith
      • Infra challenges: how to abstract the architecture away from developers
      • Org: SRE vs. development/operations team
      • Safe deployment: staging, canarying, prod
      • Other ways to keep the data consistent between the two services.
  26. Annexes: some topics I did not talk about (cont.)
      • Resource requirements and capacity planning
      • Service discovery
      • Multiple repos vs. mono repo
      • Managing configuration at scale
      • Hardware efficiency and resource quotas
      • Application platform: build and release, etc.
      • MTBR > MTBF
  27. Credits for images
      • Spaghetti architecture: @benorama
      • Beehive: Beehive | Sarah | Flickr
      • Pangolin: Pangolin | Adam Tusk | Flickr
      • Relaxed: Relax | Relax | Flickr
      • Menhir: Menhirs at Carnac | Anton Schuttelaars | Flickr
      • Flow chart planning: xkcd: Flowchart
  28. Credits for images (cont.)
      • Rope bridge: Carrick-a-rede, Rope Bridge, Ballintoy, Antrim | La salvaje … | Flickr
      • Fischli/Weiss, Installation view, Rock on Top of Another Rock 2013, Serpentine Gallery, London, © Peter Fischli David Weiss, Photo: 2013 Morley von Sternberg
      • Cheetah: File:Sarah (cheetah).jpg - Wikimedia Commons, Gregory Wilson
      • Bent tree: Resilience | Captured at Inks Lake State Park View On Black | Anne Worner | Flickr