
The great microservices migration

How did Uber go from a 450,000-line monolithic Python application to more than 1,000 microservices? This short presentation focuses on the technical aspects of this 5-year migration and concludes with its cultural and management challenges.

Charles-Axel Dein

October 19, 2017


Transcript

  1. The great microservices
    migration
    Charles-Axel Dein, Uber
    DevFest, Nantes, September 2017


  2. What will you get
    from this talk?


  3. Who am I?
    • Charles-Axel Dein - [email protected]
    • Payments Engineering Manager at Uber in Amsterdam
    • Born and raised in Nantes :)


  4. Joined Uber in July 2012
    An incredible growth...
                 July 2012   Oct 2017
    Uber's age   2           7
    Cities       10          600+
    Engineers    20          2,000+


  5. Uber's simple architecture in 2012


  6. Today we'll be focusing on "API"


  7. During this period, Uber grew
    from 2 to 1,000+ services


  8. What are microservices?


  9. This "great migration"
    was a 5-year adventure


  10. This talk is:
    • Not exhaustive
    • Not from an expert


  11. Why did we split the monolith?


  12. Reason #1
    A large monolithic app
    slows down
    developers


  13. Commits per day barely increased


  14. Reason #2
    A monolithic app suffers from
    tragedy of the commons


  15. Reason #3
    A monolithic app is difficult to scale


  16. API's scaling difficulties, circa 2015
    • Running out of PostgreSQL master DB connections
    • Running out of memory on machines (≈ 1.5 GB RAM)
    • Translations growing and using ≈ 1 GB RAM


  17. I. Starting µservices
    II. Scaling µservices


  18. How to start
    a µservices migration


  19. Step 0: make a rough plan


  20. You don't want to move from
    one monolith
    to a distributed monolith


  21. Any piece of software
    reflects the organizational
    structure that produced it.
    — Conway's law


  22. Design your architecture
    Then
    Design your organization



  23. Too many plans look like
    [launching] a rocket ship.
    [Yet] tiny errors in
    assumptions can lead to
    catastrophic outcomes.
    — Eric Ries, Lean Startup


  24. Three prerequisites
    • Business monitoring
    • Feature flags
    • Repository layer


  25. Prerequisite 1: business monitoring and alerting
    • CPU utilization
    • RAM
    • Number of signups per device
    • Number of signups per channel
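
A minimal sketch of a business-level alert, assuming signups are counted per channel over comparable time windows and checked against a baseline. The function name `signup_alerts`, the 50% threshold, and the sample numbers are hypothetical and only illustrate the idea; they are not Uber's actual alerting.

```python
def signup_alerts(baseline, current, max_drop=0.5):
    """Return the channels whose signups fell by more than max_drop.

    baseline and current map a channel name to a signup count for
    comparable time windows. All names and numbers are illustrative.
    """
    alerts = []
    for channel, expected in sorted(baseline.items()):
        if expected <= 0:
            continue  # no baseline to compare against
        observed = current.get(channel, 0)
        drop = (expected - observed) / expected
        if drop > max_drop:
            alerts.append(channel)
    return alerts


baseline = {"ios": 1000, "android": 800, "web": 200}
current = {"ios": 950, "android": 120, "web": 190}
print(signup_alerts(baseline, current))
```

The appeal of a check like this is that host metrics such as CPU or RAM can look healthy while a migrated flow quietly stops producing signups.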


  26. Prerequisite 2: fast config rollout (or feature flags)
    def get_user(user_uuid):
        if random.random() < config.get('use_new_flow_probability'):
            use_new_flow()
        else:
            use_old_flow()
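
The snippet on this slide can be made runnable roughly as follows. This is a sketch under assumptions: `config` is a plain dict standing in for whatever fast-rollout config system is available, and the bodies of `use_old_flow` and `use_new_flow` are hypothetical placeholders.

```python
import random

# Stand-in for a config system that can be updated quickly in production.
config = {"use_new_flow_probability": 0.0}


def use_old_flow(user_uuid):
    return ("old", user_uuid)


def use_new_flow(user_uuid):
    return ("new", user_uuid)


def get_user(user_uuid, rng=random.random):
    """Route a fraction of calls to the new flow, based on config."""
    if rng() < config["use_new_flow_probability"]:
        return use_new_flow(user_uuid)
    return use_old_flow(user_uuid)
```

Setting the probability to 0.0 sends every call to the old flow and 1.0 sends every call to the new one, so the same knob serves for both gradual rollout and instant rollback.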


  27. Prerequisite 3: abstract storage layer
    class UsersSQLRepository():
        def create(...):
            ...
        def get(user_uuid):
            user = sql.connect(...).execute("select ...")
            return user
    class UsersServiceRepository():
        def get(user_uuid):
            user = http.connect(...).get("/users/...")
            return user
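
One way to read this slide is that both repositories share an interface, so calling code does not care where the data lives. The sketch below assumes that reading; the `UsersRepository` base class, the dict standing in for SQL, the callable standing in for HTTP, and `display_name` are all hypothetical.

```python
from abc import ABC, abstractmethod


class UsersRepository(ABC):
    """Interface the rest of the monolith depends on."""

    @abstractmethod
    def get(self, user_uuid):
        ...


class UsersSQLRepository(UsersRepository):
    """Reads from the monolith's database (a dict stands in for SQL)."""

    def __init__(self, rows):
        self._rows = rows

    def get(self, user_uuid):
        return self._rows.get(user_uuid)


class UsersServiceRepository(UsersRepository):
    """Reads from the new service (a callable stands in for HTTP)."""

    def __init__(self, fetch):
        self._fetch = fetch

    def get(self, user_uuid):
        return self._fetch(f"/users/{user_uuid}")


def display_name(repo, user_uuid):
    """Caller code does not know which backend is in use."""
    user = repo.get(user_uuid)
    return user["name"] if user else None
```

Swapping the backend then becomes a matter of passing a different repository object, which is what makes the later read migration a local change.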


  28. Step 1: build a rope bridge


  29. Start with one microservice
    and one use case


  30. Let's take an
    example:
    Our Customer
    rope bridge


  31. Step 2: migrate the data
    and keep it up-to-date


  32. Migrate the data in
    batch and keep it up-
    to-date


  33. Results after step 2
    1. Data is migrated
    2. Data is kept up to date
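
The slides do not say how the data was kept up to date. As one possible approach, the sketch below pairs a batch backfill with dual writes from the monolith; plain dicts stand in for the two data stores, and `backfill` and `save_user` are hypothetical names.

```python
def backfill(old_store, new_store, batch_size=2):
    """Copy existing rows to the new store in fixed-size batches.

    Returns the number of batches processed.
    """
    keys = sorted(old_store)
    batches = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            # Do not overwrite rows that a dual write already refreshed.
            new_store.setdefault(key, old_store[key])
        batches += 1
    return batches


def save_user(old_store, new_store, user_uuid, user):
    """Dual write: the monolith's store stays the source of truth."""
    old_store[user_uuid] = user
    new_store[user_uuid] = user
```

A real migration would also have to handle a write failing on only one side, which this sketch ignores.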


  34. Step 3: migrate
    the storage layer
    to read from the
    new service


  35. Shadowing reads
    # In the monolith
    def get_user(user_uuid):
        monolith_user = UsersSQLRepository.get(user_uuid)
        new_user = UsersServiceRepository.get(user_uuid)
        verify(monolith_user, new_user)  # Verify that they match
        return monolith_user  # We are returning the "safe" user
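
The slide leaves `verify` undefined. A minimal sketch, assuming users are plain dicts and that a mismatch should be logged rather than raised so that shadowing never breaks the caller; the logger name and the `ignored_fields` parameter are hypothetical.

```python
import logging

logger = logging.getLogger("shadow_reads")


def verify(monolith_user, new_user, ignored_fields=("updated_at",)):
    """Compare two user dicts and log differences without raising.

    Returns the sorted list of mismatching field names.
    """
    fields = set(monolith_user) | set(new_user)
    mismatches = sorted(
        f for f in fields
        if f not in ignored_fields
        and monolith_user.get(f) != new_user.get(f)
    )
    if mismatches:
        logger.warning("shadow read mismatch on fields: %s", mismatches)
    return mismatches
```

Counting these mismatches over time gives a rough signal of when the new service is trustworthy enough to start serving reads.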


  36. Reverse shadowing reads
    # In the monolith
    def get(user_uuid):
        ...  # read from both, verify
        if should_use_new_service():  # feature flag
            return new_user
        else:
            return monolith_user
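
A runnable variant of the snippet above, as a sketch. The two repositories and the flag are passed in as plain callables so the example is self-contained; falling back to the monolith when the new service raises is an added assumption, not something the slide states.

```python
def get_user(user_uuid, sql_repo, service_repo, should_use_new_service):
    """Serve from the new service when the flag is on, else the monolith.

    Falls back to the monolith's value if the new service call fails.
    """
    monolith_user = sql_repo(user_uuid)
    try:
        new_user = service_repo(user_uuid)
    except Exception:
        return monolith_user  # new service unavailable: stay safe
    if should_use_new_service():
        return new_user
    return monolith_user
```

Because the monolith read still happens on every call, turning the flag off restores the old behavior immediately.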


  37. This requires
    productionization
    • Testing the new storage layer
    • Distributed transactions
    • Data analytics
    • ...


  38. Results after step 3
    1. Data is migrated
    2. Data is kept up to date
    3. All reads are going to the new service
    4. We can delete the old data


  39. Step 4: migrate
    the consumers
    to the new
    service


  40. Migrating consumers is an
    opportunity to redesign
    • Fix some tech/product debt
    • Bring a fresh viewpoint
    • E.g. move to event sourcing
    • E.g. better separate offline/online queries
    • Make the interface microservices-aware


  41. Results after step 4
    1. Data is migrated
    2. Data is kept up to date
    3. All reads are going to the new service
    4. All consumers are going to the new service
    5. We can delete the old code


  42. Summary: a bottom-up
    approach
    • Step 0: rough plan
    • Step 1: rope bridge
    • Step 2: migrate the data (writes)
    • Step 3: migrate the storage layer (reads)
    • Step 4: migrate consumers
    • Iterate for all services!


  43. How to scale
    a µservices architecture


  44. There are so many decisions
    to make...
    1. RPC (transport, interface, sync/async, etc.)
    2. Debugging (logs, tracing, etc.)
    3. Security (authN, authZ, logging sensitive data, etc.)
    4. ... too many topics, so we'll only chat about testing


  45. Uber's testing strategies
    1. Unit, integration, component testing
    2. Staging environment (few, very costly)
    3. End-to-end tests (very few, anti-pattern)
    4. Testing on production: canary deploys
    5. Tenancies on production


  46. The usual testing on prod method
    does not work with microservices
    • Requires awareness of side effects
    • Difficult to share with other teams


  47. A better way:
    tenancies


  48. Test tenancies example
    def charge_trip(rider_uuid, trip):
        """Charge a rider for a trip."""
        if trip.tenancy == "test":
            time.sleep(0.5)  # Mimics external call
            return
        ...  # continue charge flow for non-test users
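
A self-contained version of the example above, as a sketch. The `Trip` dataclass, the `charged` list standing in for an external payment provider, the boolean return value, and the `simulated_latency` parameter are all hypothetical additions for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class Trip:
    uuid: str
    fare: float
    tenancy: str = "production"  # or "test"


charged = []  # stands in for calls to an external payment provider


def charge_trip(rider_uuid, trip, simulated_latency=0.0):
    """Charge a rider for a trip, skipping the side effect for test trips.

    Returns True if a real charge was recorded, False otherwise.
    """
    if trip.tenancy == "test":
        time.sleep(simulated_latency)  # mimics the external call's latency
        return False
    charged.append((rider_uuid, trip.uuid, trip.fare))
    return True
```

The tenancy travels with the data (here, on the trip), so every service that handles the trip can decide locally which side effects to skip.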


  49. Benefits of using a test tenancy
    • All the advantages of testing on production
    • Allows teams to test autonomously
    • Is not suitable for all testing


  50. ... this is just one
    example of learning!


  51. What to learn and how to
    learn it
    • What: speed AND quality
    • What: resilience > intelligence
    • How: standardize!
    • How: schedule learning time


  52. What: speed and quality, not speed
    vs. quality


  53. What: focus your learning on
    resilience


  54. How: standardization speeds up
    learning
    • Counter analysis paralysis!
    • Example: programming languages
    • Example: RFC process


  55. How: schedule time for learning!
    • Chaos testing
    • Blameless incident reviews
    • External and internal blog
    • Informal "brown bag" lunch & learn
    • ...


  56. Summary: scaling a microservices
    architecture means building a
    learning organization


  57. New services tend
    to become monoliths, so...
    this never ends!


  58. Thank you!
    • Feedback welcome at [email protected]
    • Slides will be on blog.d3in.org


  59. Annexes & references


  60. Book recommendations
    • Release It!, Michael T. Nygard (lots of great patterns, great
    discussions)
    • Scalability Rules, Martin L. Abbott, Michael T. Fisher (super concise)
    • Building Microservices, Sam Newman (quite complete discussion of
    microservices)


  61. List of references
    • Service-Oriented Architecture: Scaling the Uber Engineering
    Codebase As We Grow, Uber Engineering Blog
    • Lessons Learned from Scaling Uber to 2,000 Engineers, 1,000
    Services, and 8,000 Git repositories, High Scalability
    • MonolithFirst, Martin Fowler
    • Testing Strategies in a Microservice Architecture, Toby Clemson,
    ThoughtWorks
    • charlax/professional-programming: a collection of full-stack
    resources for programmers.


  62. Annexes: some topics I did not talk about
    • How to create components within the monolith
    • Infra challenges: how to abstract the architecture away from
    developers
    • Org: SRE vs. development/operations team
    • Safe deployment: staging, canarying, prod
    • Other ways to keep the data consistent between the two services.


  63. Annexes: some topics I did not talk about (cont.)
    • Resource requirements and capacity planning
    • Service discovery
    • Multiple repos vs. mono repo
    • Managing configuration at scale
    • Hardware efficiency and resource quotas
    • Application platform: build and release, etc.
    • MTBR > MTBF


  64. Credits for images
    • Spaghetti architecture: @benorama
    • Beehive: Beehive | Sarah | Flickr
    • Pangolin: Pangolin | Adam Tusk | Flickr
    • Relaxed: Relax | Relax | Flickr
    • Menhir: Menhirs at Carnac | Anton Schuttelaars | Flickr
    • Flow chart planning: xkcd: Flowchart


  65. Credits for images (cont.)
    • Rope bridge: Carrick-a-rede, Rope Bridge, Ballintoy, Antrim | La
    salvaje … | Flickr
    • Fischli/Weiss, Installation view, Rock on Top of Another Rock 2013,
    Serpentine Gallery, London, © Peter Fischli David Weiss, Photo:
    2013 Morley von Sternberg
    • Cheetah: File:Sarah (cheetah).jpg - Wikimedia Commons,
    Gregory Wilson
    • Bent tree: Resilience | Captured at Inks Lake State Park View On
    Black | Anne Worner | Flickr


  66. Colophon
    Slides made with Markdown and Deckset, Titillium theme.
