The great microservices migration

How did Uber go from a 450,000-line monolithic Python application to more than 1,000 microservices? This short presentation focuses on the technical aspects of this 5-year migration and concludes with its cultural and management challenges.

Charles-Axel Dein

October 19, 2017

Transcript

  1. The great microservices migration • Charles-Axel Dein, Uber • DevFest, Nantes, September 2017
  2. What will you get from this talk?

  3. Who am I?
    • Charles-Axel Dein - charles@uber.com
    • Payments Engineering Manager at Uber in Amsterdam
    • Born and raised in Nantes :)
  4. Joined Uber in July 2012. An incredible growth... (July 2012 → Oct 2017)
    • Uber's age: 2 → 7
    • Cities: 10 → 600+
    • Engineers: 20 → 2,000+
  5. Uber's simple architecture in 2012

  6. Today we'll be focusing on "API"

  7. During this period, Uber grew from 2 to 1,000+ services

  8. What are microservices?

  9. This "great migration" was a 5-year adventure

  10. This talk is:
    • Not exhaustive
    • Not from an expert
  11. Why did we split the monolith?

  12. Reason #1 A large monolithic app slows down developers

  13. Commits per day barely increased

  14. Reason #2 A monolithic app suffers from the tragedy of the commons
  15. Reason #3 A monolithic app is difficult to scale

  16. API's scaling difficulties, circa 2015
    • Running out of PostgreSQL master DB connections
    • Running out of memory on machines (≈ 1.5 GB RAM)
    • Translations growing and using ≈ 1 GB RAM
  17. I. Starting µservices II. Scaling µservices

  18. How to start a µservices migration

  19. Step 0: make a rough plan

  20. You don't want to move from one monolith to a distributed monolith
  21. Any piece of software reflects the organizational structure that produced it. (Conway's law)
  22. Design your architecture Then Design your organization

  23. ⚠ Too many plans look like [launching] a rocket ship. [Yet] tiny errors in assumptions can lead to catastrophic outcomes. (Eric Ries, The Lean Startup)
  24. Three prerequisites
    • Business monitoring
    • Feature flags
    • Repository layer
  25. Prerequisite 1: business monitoring and alerting
    • ❌ CPU utilization
    • ❌ RAM
    • ✅ Number of signups per device
    • ✅ Number of signups per channel
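To make the contrast on this slide concrete, here is a minimal sketch of a business-level check: it counts signups per device and reports the devices that fall below an expected minimum. The event shape, the function names and the thresholds are all assumptions for illustration, not something taken from the talk.

```python
from collections import Counter


def signups_per_device(events):
    """Count signup events per device type (assumed event shape: {"device": ...})."""
    return Counter(event["device"] for event in events)


def devices_below_threshold(events, expected_minimums):
    """Return the devices whose signup count is below the expected minimum.

    expected_minimums is a hypothetical mapping: device -> minimum signups
    expected in the observed time window.
    """
    counts = signups_per_device(events)
    return sorted(
        device
        for device, minimum in expected_minimums.items()
        if counts[device] < minimum  # Counter returns 0 for unseen devices
    )
```

A check like this can fire even when CPU and RAM look healthy, which is the point of monitoring business metrics during a migration.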
  26. Prerequisite 2: fast config rollout (or feature flags)

        import random

        def get_user(user_uuid):
            # Send a configurable fraction of calls to the new flow
            if random.random() < config.get('use_new_flow_probability'):
                use_new_flow()
            else:
                use_old_flow()
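A runnable variant of the slide's snippet might look like the sketch below. The in-memory `config` dict stands in for whatever config service is actually used, the two flow functions are placeholders, and the injectable `rng` parameter is an addition to make the routing easy to test.

```python
import random

# Hypothetical stand-in for a config service that supports fast rollout.
config = {"use_new_flow_probability": 0.0}


def use_old_flow(user_uuid):
    """Placeholder for the existing code path."""
    return ("old", user_uuid)


def use_new_flow(user_uuid):
    """Placeholder for the code path being rolled out."""
    return ("new", user_uuid)


def get_user(user_uuid, rng=random.random):
    """Route a fraction of calls to the new flow, based on the current config."""
    if rng() < config["use_new_flow_probability"]:
        return use_new_flow(user_uuid)
    return use_old_flow(user_uuid)
```

Setting the probability to 0.0 sends every call to the old flow and 1.0 sends every call to the new one, so a rollback is a config change rather than a deploy.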
  27. Prerequisite 3: abstract storage layer

        class UsersSQLRepository():
            def create(...): ...

            def get(user_uuid):
                user = sql.connect(...).execute("select ...")
                return user

        class UsersServiceRepository():
            def get(user_uuid):
                user = http.connect(...).get("/users/...")
                return user
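The value of the repository layer is that callers depend on an interface rather than on a storage backend. The sketch below illustrates this with two in-memory stand-ins (no real SQL or HTTP calls); the `UsersRepository` protocol, the class names and `display_name` are hypothetical.

```python
from typing import Optional, Protocol


class UsersRepository(Protocol):
    """Interface that business logic depends on."""

    def get(self, user_uuid: str) -> Optional[dict]: ...


class InMemorySQLRepository:
    """Stand-in for the SQL-backed repository (no real database here)."""

    def __init__(self, rows):
        self._rows = rows

    def get(self, user_uuid):
        return self._rows.get(user_uuid)


class InMemoryServiceRepository:
    """Stand-in for a repository that would call the new service over HTTP."""

    def __init__(self, remote_users):
        self._remote_users = remote_users

    def get(self, user_uuid):
        return self._remote_users.get(user_uuid)


def display_name(repo: UsersRepository, user_uuid: str) -> str:
    """Business logic that works with any repository implementation."""
    user = repo.get(user_uuid)
    return user["name"] if user else "unknown"
```

Because `display_name` only sees the interface, swapping the SQL-backed repository for the service-backed one does not require touching the calling code.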
  28. Step 1: build a rope bridge

  29. Start with one microservice and one use case

  30. Let's take an example: Our Customer rope bridge

  31. Step 2: migrate the data and keep it up-to-date

  32. Migrate the data in batches and keep it up-to-date

  33. Results after step 2
    1. ✅ Data is migrated
    2. ✅ Data is kept up to date
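The slides do not say how the data was kept up to date. One common option is a batch backfill combined with dual writes from the monolith, sketched below with plain dicts standing in for the two stores. The function names and the dual-write choice are assumptions, not a description of what Uber did.

```python
def backfill(old_store, new_store, batch_size=2):
    """Copy existing rows to the new store in batches; return the rows copied."""
    keys = sorted(old_store)
    copied = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            # Skip rows already written by a dual write, which may be fresher.
            if key not in new_store:
                new_store[key] = old_store[key]
                copied += 1
    return copied


def update_user(old_store, new_store, user_uuid, data):
    """Dual write: the old store stays the source of truth, the new one follows."""
    old_store[user_uuid] = data
    new_store[user_uuid] = data
```

In a real system the two writes are not atomic, so a periodic comparison job (or the read verification in step 3) is still needed to catch drift.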
  34. Step 3: migrate the storage layer to read from the new service
  35. Shadowing reads

        # In the monolith
        def get_user(user_uuid):
            monolith_user = UsersSQLRepository.get(user_uuid)
            new_user = UsersNewServiceRepository.get(user_uuid)
            verify(monolith_user, new_user)  # Verify that they match
            return monolith_user  # ✅ we are returning the "safe" user
  36. Reverse shadowing reads

        # In the monolith
        def get(user_uuid):
            ...  # read from both, verify
            if should_use_new_service():  # feature flag
                return new_user
            else:
                return monolith_user
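Slides 35 and 36 can be combined into one runnable sketch. Here `verify` logs mismatches instead of raising, and a failure of the new service falls back to the monolith's result. Both behaviours are assumptions, as are the parameter names; any object with a `get(user_uuid)` method can act as a repository.

```python
import logging

logger = logging.getLogger("shadow_reads")


def verify(monolith_user, new_user):
    """Log a mismatch instead of raising, so shadowing never breaks a read."""
    if monolith_user != new_user:
        logger.warning("user mismatch: %r != %r", monolith_user, new_user)
        return False
    return True


def get_user(user_uuid, sql_repo, service_repo, use_new_service=False):
    """Read from both stores, compare, and return one result based on a flag."""
    monolith_user = sql_repo.get(user_uuid)
    try:
        new_user = service_repo.get(user_uuid)
    except Exception:
        # Assumption: a failing new service must not affect the caller.
        logger.exception("new service read failed")
        return monolith_user
    verify(monolith_user, new_user)
    # use_new_service=False is shadowing; True is reverse shadowing.
    return new_user if use_new_service else monolith_user
```

The mismatch log is what tells you when it is safe to flip the flag: once it stays empty under production traffic, reads can move to the new service.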
  37. This requires productionization
    • Testing the new storage layer
    • Distributed transactions
    • Data analytics
    • ...
  38. Results after step 3
    1. ✅ Data is migrated
    2. ✅ Data is kept up to date
    3. ✅ All reads are going to the new service
    4. ➡ We can delete the old data
  39. Step 4: migrate the consumers to the new service

  40. Migrating consumers is an opportunity to redesign
    • Fix some tech/product debt
    • Bring a fresh viewpoint
    • E.g. move to event sourcing
    • E.g. better separate offline/online queries
    • Make the interface microservices-aware
  41. Results after step 4
    1. ✅ Data is migrated
    2. ✅ Data is kept up to date
    3. ✅ All reads are going to the new service
    4. ✅ All consumers are going to the new service
    5. ➡ We can delete the old code
  42. Summary: a bottom-up approach
    • Step 0: rough plan
    • Step 1: rope bridge
    • Step 2: migrate the data (writes)
    • Step 3: migrate the storage layer (reads)
    • Step 4: migrate consumers
    • Iterate for all services!
  43. How to scale a µservices architecture

  44. None
  45. There are so many decisions to make...
    1. RPC (transport, interface, sync/async, etc.)
    2. Debugging (logs, tracing, etc.)
    3. Security (authN, authZ, logging sensitive data, etc.)
    4. ... too many topics, so we'll only chat about testing
  46. Uber's testing strategies
    1. Unit, integration, component testing
    2. Staging environment (few, very costly)
    3. End-to-end tests (very few, anti-pattern)
    4. Testing on production: canary deploys
    5. Tenancies on production!
  47. The usual testing-on-prod method does not work with microservices
    • ❌ Requires awareness of side effects
    • ❌ Difficult to share with other teams
  48. A better way: tenancies

  49. Test tenancies example

        def charge_trip(rider_uuid, trip):
            """Charge a rider for a trip."""
            if trip.tenancy == "test":
                time.sleep(0.5)  # Mimics external call
                return
            ...  # continue charge flow for non-test users
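A self-contained version of the tenancy check could look like the sketch below. The `Trip` dataclass, the `call_payment_provider` stub and the shape of the returned receipt are assumptions made so the example runs; the `sleep` parameter is injectable only to keep tests fast.

```python
import time
from dataclasses import dataclass


@dataclass
class Trip:
    uuid: str
    fare: float
    tenancy: str = "production"  # hypothetical default tenancy name


def call_payment_provider(rider_uuid, amount):
    """Hypothetical external call; here it only returns a fake receipt."""
    return {"rider": rider_uuid, "amount": amount, "charged": True}


def charge_trip(rider_uuid, trip, sleep=time.sleep):
    """Charge a rider for a trip, skipping the real charge for test tenancy."""
    if trip.tenancy == "test":
        sleep(0.5)  # mimics the latency of the external call
        return {"rider": rider_uuid, "amount": trip.fare, "charged": False}
    return call_payment_provider(rider_uuid, trip.fare)
```

The key design point is that the tenancy travels with the request data (here, on the trip), so every service on the path can decide locally which side effects to skip.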
  50. Benefits of using a test tenancy
    • ✅ All the advantages of testing on production
    • ✅ Allows teams to test autonomously
    • ❌ Is not suitable for all testing
  51. ... this is just one example of learning!

  52. What to learn and how to learn it
    • What: speed AND quality
    • What: resilience > intelligence
    • How: standardize!
    • How: schedule learning time
  53. What: speed and quality, not speed vs. quality

  54. What: focus your learning on resilience

  55. How: standardization speeds up learning
    • Counter analysis paralysis!
    • Example: programming languages
    • Example: RFC process
  56. How: schedule time for learning!
    • Chaos testing
    • Blameless incident reviews
    • External and internal blog
    • Informal "brown bag" lunch & learn
    • ...
  57. Summary: scaling a microservices architecture means building a learning organization

  58. New services tend to become monoliths, so... this never ends!

  59. Thank you!
    • Feedback welcome at charles@uber.com
    • Slides will be on blog.d3in.org
  60. Annexes & references

  61. Book recommendations
    • Release It!, Michael T. Nygard (lots of great patterns, great discussions)
    • Scalability Rules, Martin L. Abbott and Michael T. Fisher (super concise)
    • Building Microservices, Sam Newman (quite complete discussion of microservices)
  62. List of references
    • Service-Oriented Architecture: Scaling the Uber Engineering Codebase As We Grow, Uber Engineering Blog
    • Lessons Learned from Scaling Uber to 2,000 Engineers, 1,000 Services, and 8,000 Git repositories, High Scalability
    • MonolithFirst, Martin Fowler
    • Testing Strategies in a Microservice Architecture, Toby Clemson, ThoughtWorks
    • charlax/professional-programming: a collection of full-stack resources for programmers.
  63. Annexes: some topics I did not talk about
    • How to create components within the monolith
    • Infra challenges: how to abstract the architecture away from developers
    • Org: SRE vs. development/operations team
    • Safe deployment: staging, canarying, prod
    • Other ways to keep the data consistent between the two services
  64. Annexes: some topics I did not talk about (cont.)
    • Resource requirements and capacity planning
    • Service discovery
    • Multiple repos vs. mono repo
    • Managing configuration at scale
    • Hardware efficiency and resource quotas
    • Application platform: build and release, etc.
    • MTBR > MTBF
  65. Credits for images
    • Spaghetti architecture: @benorama
    • Beehive: Beehive | Sarah | Flickr
    • Pangolin: Pangolin | Adam Tusk | Flickr
    • Relaxed: Relax | Relax | Flickr
    • Menhir: Menhirs at Carnac | Anton Schuttelaars | Flickr
    • Flow chart planning: xkcd: Flowchart
  66. Credits for images (cont.)
    • Rope bridge: Carrick-a-rede, Rope Bridge, Ballintoy, Antrim | La salvaje … | Flickr
    • Fischli/Weiss, Installation view, Rock on Top of Another Rock 2013, Serpentine Gallery, London, © Peter Fischli David Weiss, Photo: 2013 Morley von Sternberg
    • Cheetah: File:Sarah (cheetah).jpg - Wikimedia Commons, Gregory Wilson
    • Bent tree: Resilience | Captured at Inks Lake State Park View On Black | Anne Worner | Flickr
  67. Colophon Slides made with Markdown and Deckset, Titillium theme.