The great microservices migration

Slide 1

Slide 1 text

The great microservices migration
Charles-Axel Dein, Uber
DevFest, Nantes, September 2017

Slide 2

Slide 2 text

What will you get from this talk?

Slide 3

Slide 3 text

Who am I?
• Charles-Axel Dein - [email protected]
• Payments Engineering Manager at Uber in Amsterdam
• Born and raised in Nantes :)

Slide 4

Slide 4 text

Joined Uber in July 2012
An incredible growth...

              July 2012   Oct 2017
Uber's age    2           7
Cities        10          600+
Engineers     20          2,000+

Slide 5

Slide 5 text

Uber's simple architecture in 2012

Slide 6

Slide 6 text

Today we'll be focusing on "API"

Slide 7

Slide 7 text

During this period, Uber grew from 2 to 1,000+ services

Slide 8

Slide 8 text

What are microservices?

Slide 9

Slide 9 text

This "great migration" was a 5-year adventure

Slide 10

Slide 10 text

This talk is:
• Not exhaustive
• Not from an expert

Slide 11

Slide 11 text

Why did we split the monolith?

Slide 12

Slide 12 text

Reason #1: A large monolithic app slows down developers

Slide 13

Slide 13 text

Commits per day barely increased

Slide 14

Slide 14 text

Reason #2: A monolithic app suffers from the tragedy of the commons

Slide 15

Slide 15 text

Reason #3: A monolithic app is difficult to scale

Slide 16

Slide 16 text

API's scaling difficulties, circa 2015
• Running out of PostgreSQL master DB connections
• Running out of memory on machines (≈ 1.5 GB RAM)
• Translations growing and using ≈ 1 GB RAM

Slide 17

Slide 17 text

I. Starting µservices
II. Scaling µservices

Slide 18

Slide 18 text

How to start a µservices migration

Slide 19

Slide 19 text

Step 0: make a rough plan

Slide 20

Slide 20 text

You don't want to move from one monolith to a distributed monolith

Slide 21

Slide 21 text

Any piece of software reflects the organizational structure that produced it. — Conway's law

Slide 22

Slide 22 text

Design your architecture, then design your organization

Slide 23

Slide 23 text

⚠ Too many plans look like [launching] a rocket ship. [Yet] tiny errors in assumptions can lead to catastrophic outcomes. — Eric Ries, Lean Startup

Slide 24

Slide 24 text

Three prerequisites
• Business monitoring
• Feature flags
• Repository layer

Slide 25

Slide 25 text

Prerequisite 1: business monitoring and alerting
• ❌ CPU utilization
• ❌ RAM
• ✅ Number of signups per device
• ✅ Number of signups per channel
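To make the business-level checks above concrete, here is a minimal sketch that counts signups per device and reports devices whose count falls below a floor. The names `SignupMonitor`, `record_signup`, and `devices_below`, and the threshold, are hypothetical illustrations, not Uber's actual tooling.

```python
from collections import Counter


class SignupMonitor:
    """Hypothetical in-memory business metric: signups per device in one window."""

    def __init__(self, expected_devices):
        # Devices we expect to see signups from; a silent device is suspicious.
        self.expected_devices = list(expected_devices)
        self.counts = Counter()

    def record_signup(self, device):
        self.counts[device] += 1

    def devices_below(self, minimum):
        """Return expected devices with fewer than `minimum` signups."""
        return [d for d in self.expected_devices if self.counts[d] < minimum]


monitor = SignupMonitor(["ios", "android", "web"])
for device in ["ios", "ios", "android", "ios"]:
    monitor.record_signup(device)

# Devices that would trigger an alert in this window.
alerts = monitor.devices_below(minimum=1)
print(alerts)
```

A healthy CPU graph would not reveal that one device stopped signing up; a per-device business counter does.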

Slide 26

Slide 26 text

Prerequisite 2: fast config rollout (or feature flags)

    import random

    def get_user(user_uuid):
        # Send a configurable fraction of calls to the new flow.
        if random.random() < config.get('use_new_flow_probability'):
            return use_new_flow()
        else:
            return use_old_flow()

Slide 27

Slide 27 text

Prerequisite 3: abstract storage layer

    class UsersSQLRepository():
        def create(...):
            ...

        def get(user_uuid):
            user = sql.connect(...).execute("select ...")
            return user

    class UsersServiceRepository():
        def get(user_uuid):
            user = http.connect(...).get("/users/...")
            return user
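The point of the repository layer is that callers depend only on `get` and `create`, so the backing store can be swapped without touching business logic. Below is a small runnable sketch of that idea; `InMemoryUsersRepository` and `greet` are hypothetical stand-ins used for illustration, not code from the talk.

```python
class InMemoryUsersRepository:
    """Hypothetical stand-in for a SQL-backed or service-backed repository."""

    def __init__(self):
        self._users = {}

    def create(self, user_uuid, name):
        self._users[user_uuid] = {"uuid": user_uuid, "name": name}
        return self._users[user_uuid]

    def get(self, user_uuid):
        # Returns None when the user does not exist.
        return self._users.get(user_uuid)


def greet(repository, user_uuid):
    """Business logic that only relies on the repository's `get` method."""
    user = repository.get(user_uuid)
    if user is None:
        return "Hello, stranger"
    return "Hello, " + user["name"]


repo = InMemoryUsersRepository()
repo.create("u-1", "Ada")
print(greet(repo, "u-1"))
```

Any object exposing the same `get` method could be passed to `greet`, which is what makes the later read migration a configuration change rather than a rewrite.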

Slide 28

Slide 28 text

Step 1: build a rope bridge

Slide 29

Slide 29 text

Start with one microservice and one use case

Slide 30

Slide 30 text

Let's take an example: Our Customer rope bridge

Slide 31

Slide 31 text

Step 2: migrate the data and keep it up-to-date

Slide 32

Slide 32 text

Migrate the data in batch and keep it up-to-date
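The slides do not say how the data was kept up to date. One common option, assumed here purely for illustration, is a batch backfill followed by dual writes from the monolith. In this sketch plain dictionaries stand in for the two stores, and `backfill` and `update_user` are hypothetical names.

```python
def backfill(old_store, new_store, batch_size=2):
    """Copy every record from old_store to new_store in fixed-size batches."""
    keys = sorted(old_store)
    copied = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            new_store[key] = dict(old_store[key])
            copied += 1
    return copied


def update_user(old_store, new_store, user_uuid, fields):
    """Dual write (assumed approach): update the old store, then mirror the record."""
    old_store.setdefault(user_uuid, {}).update(fields)
    new_store[user_uuid] = dict(old_store[user_uuid])


old = {"u-1": {"name": "Ada"}, "u-2": {"name": "Grace"}, "u-3": {"name": "Linus"}}
new = {}

copied = backfill(old, new)
update_user(old, new, "u-2", {"name": "Grace H."})
print(copied, new == old)
```

In a real system the two writes are not atomic, which is one reason the later slides list distributed transactions among the productionization concerns.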

Slide 33

Slide 33 text

Results after step 2
1. ✅ Data is migrated
2. ✅ Data is kept up to date

Slide 34

Slide 34 text

Step 3: migrate the storage layer to read from the new service

Slide 35

Slide 35 text

Shadowing reads

    # In the monolith
    def get_user(user_uuid):
        monolith_user = UsersSQLRepository.get(user_uuid)
        new_user = UsersNewServiceRepository.get(user_uuid)
        verify(monolith_user, new_user)  # Verify that they match
        return monolith_user  # ✅ we are returning the "safe" user
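The slide leaves `verify` undefined. A plausible version, sketched below under the assumption that users are plain dictionaries, compares the two records field by field and logs differences without ever raising, so a mismatch cannot break the production read path. The `ignored_fields` parameter and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("shadow_reads")


def verify(monolith_user, new_user, ignored_fields=("updated_at",)):
    """Return the sorted list of mismatching fields and log them; never raise."""
    if monolith_user is None or new_user is None:
        # One side is missing entirely (or both are, which counts as a match).
        mismatches = [] if monolith_user == new_user else ["<missing record>"]
    else:
        fields = (set(monolith_user) | set(new_user)) - set(ignored_fields)
        mismatches = sorted(
            f for f in fields if monolith_user.get(f) != new_user.get(f)
        )
    if mismatches:
        logger.warning("shadow read mismatch: %s", mismatches)
    return mismatches


diff = verify(
    {"uuid": "u-1", "email": "a@example.com", "updated_at": 1},
    {"uuid": "u-1", "email": "b@example.com", "updated_at": 2},
)
print(diff)
```

Counting these warnings over time gives a mismatch rate that can gate the switch to reading from the new service.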

Slide 36

Slide 36 text

Reverse shadowing reads

    # In the monolith
    def get(user_uuid):
        ...  # read from both, verify
        if should_use_new_service():  # feature flag
            return new_user
        else:
            return monolith_user
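Filling in the elided parts, a self-contained sketch of reverse shadowing might look like the following. It assumes dictionaries as the two stores, a probability-style flag named `rollout`, and an injectable random source `rng`; all of these names are hypothetical.

```python
import random


def get_user(user_uuid, monolith_repo, service_repo, rollout, mismatches,
             rng=random.random):
    """Read from both stores, record any mismatch, and pick the source by flag."""
    monolith_user = monolith_repo.get(user_uuid)
    new_user = service_repo.get(user_uuid)
    if monolith_user != new_user:
        mismatches.append(user_uuid)
    # rollout=0.0 always returns the monolith record, 1.0 always the new one.
    if rng() < rollout:
        return new_user
    return monolith_user


monolith = {"u-1": {"name": "Ada"}}
service = {"u-1": {"name": "Ada L."}}
seen_mismatches = []

safe = get_user("u-1", monolith, service, rollout=0.0, mismatches=seen_mismatches)
migrated = get_user("u-1", monolith, service, rollout=1.0, mismatches=seen_mismatches)
print(safe, migrated, seen_mismatches)
```

Raising `rollout` gradually from 0.0 to 1.0 moves reads to the new service while the verification keeps running, and setting it back to 0.0 is the rollback.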

Slide 37

Slide 37 text

This requires productionization
• Testing the new storage layer
• Distributed transactions
• Data analytics
• ...

Slide 38

Slide 38 text

Results after step 3
1. ✅ Data is migrated
2. ✅ Data is kept up to date
3. ✅ All reads are going to the new service
4. ➡ We can delete the old data

Slide 39

Slide 39 text

Step 4: migrate the consumers to the new service

Slide 40

Slide 40 text

Migrating consumers is an opportunity to redesign
• Fix some tech/product debt
• Bring a fresh viewpoint
• E.g. move to event sourcing
• E.g. better separate offline/online queries
• Make the interface microservices-aware

Slide 41

Slide 41 text

Results after step 4
1. ✅ Data is migrated
2. ✅ Data is kept up to date
3. ✅ All reads are going to the new service
4. ✅ All consumers are going to the new service
5. ➡ We can delete the old code

Slide 42

Slide 42 text

Summary: a bottom-up approach
• Step 0: rough plan
• Step 1: rope bridge
• Step 2: migrate the data (writes)
• Step 3: migrate the storage layer (reads)
• Step 4: migrate consumers
• Iterate for all services!

Slide 43

Slide 43 text

How to scale a µservices architecture

Slide 45

Slide 45 text

There are so many decisions to make...
1. RPC (transport, interface, sync/async, etc.)
2. Debugging (logs, tracing, etc.)
3. Security (authN, authZ, logging sensitive data, etc.)
4. ... too many topics, so we'll only chat about testing

Slide 46

Slide 46 text

Uber's testing strategies
1. Unit, integration, component testing
2. Staging environment (few, very costly)
3. End-to-end tests (very few, anti-pattern)
4. Testing on production: canary deploys
5. Tenancies on production!

Slide 47

Slide 47 text

The usual testing-on-prod method does not work with microservices
• ❌ Requires awareness of side effects
• ❌ Difficult to share with other teams

Slide 48

Slide 48 text

A better way: tenancies

Slide 49

Slide 49 text

Test tenancies example

    def charge_trip(rider_uuid, trip):
        """Charge a rider for a trip."""
        if trip.tenancy == "test":
            time.sleep(0.5)  # Mimics external call
            return
        ...  # continue charge flow for non-test users
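For a tenancy check like the one above to work across services, the tenancy has to travel with every request. The slides do not show how this was done; the sketch below assumes a request header, and the header name `x-tenancy` and all function names are hypothetical.

```python
from dataclasses import dataclass

TENANCY_HEADER = "x-tenancy"  # hypothetical header name


@dataclass(frozen=True)
class RequestContext:
    tenancy: str = "production"


def context_from_headers(headers):
    """Build the context for an incoming request; default to production."""
    return RequestContext(tenancy=headers.get(TENANCY_HEADER, "production"))


def outgoing_headers(context, extra=None):
    """Copy the tenancy onto every downstream call so it is never lost."""
    headers = dict(extra or {})
    headers[TENANCY_HEADER] = context.tenancy
    return headers


def charge(context, amount, ledger):
    """Use a sandbox provider for test traffic, but still record the charge."""
    provider = "sandbox" if context.tenancy == "test" else "real"
    ledger.append((context.tenancy, provider, amount))
    return provider


ctx = context_from_headers({"x-tenancy": "test"})
ledger = []
print(charge(ctx, 10, ledger), outgoing_headers(ctx, {"x-request-id": "r-1"}))
```

Defaulting to production when the header is absent is a deliberate choice in this sketch: a service that drops the header fails towards real behavior being visible rather than silently treating real users as test traffic.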

Slide 50

Slide 50 text

Benefits of using a test tenancy
• ✅ All the advantages of testing on production
• ✅ Allow teams to test autonomously
• ❌ Is not suitable for all testing

Slide 51

Slide 51 text

... this is just one example of learning!

Slide 52

Slide 52 text

What to learn and how to learn it
• What: speed AND quality
• What: resilience > intelligence
• How: standardize!
• How: schedule learning time

Slide 53

Slide 53 text

What: speed and quality, not speed vs. quality

Slide 54

Slide 54 text

What: focus your learning on resilience

Slide 55

Slide 55 text

How: standardization speeds up learning
• Counter analysis paralysis!
• Example: programming languages
• Example: RFC process

Slide 56

Slide 56 text

How: schedule time for learning!
• Chaos testing
• Blameless incident reviews
• External and internal blog
• Informal "brown bag" lunch & learn
• ...

Slide 57

Slide 57 text

Summary: scaling a microservices architecture means building a learning organization

Slide 58

Slide 58 text

New services tend to become monoliths, so... this never ends!

Slide 59

Slide 59 text

Thank you!
• Feedback welcome at [email protected]
• Slides will be on blog.d3in.org

Slide 60

Slide 60 text

Annexes & references

Slide 61

Slide 61 text

Book recommendations
• Release It!, Michael T. Nygard (lots of great patterns, great discussions)
• Scalability Rules, Martin L. Abbott, Michael T. Fisher (super concise)
• Building Microservices, Sam Newman (quite complete discussion of microservices)

Slide 62

Slide 62 text

List of references
• Service-Oriented Architecture: Scaling the Uber Engineering Codebase As We Grow, Uber Engineering Blog
• Lessons Learned from Scaling Uber to 2,000 Engineers, 1,000 Services, and 8,000 Git repositories, High Scalability
• MonolithFirst, Martin Fowler
• Testing Strategies in a Microservice Architecture, Toby Clemson, ThoughtWorks
• charlax/professional-programming: a collection of full-stack resources for programmers.

Slide 63

Slide 63 text

Annexes: some topics I did not talk about
• How to create components within the monolith
• Infra challenges: how to abstract the architecture away from developers
• Org: SRE vs. development/operations team
• Safe deployment: staging, canarying, prod
• Other ways to keep the data consistent between the two services.

Slide 64

Slide 64 text

Annexes: some topics I did not talk about (cont.)
• Resource requirements and capacity planning
• Service discovery
• Multiple repos vs. mono repo
• Managing configuration at scale
• Hardware efficiency and resource quotas
• Application platform: build and release, etc.
• MTBR > MTBF

Slide 66

Slide 66 text

Credits for image (cont.)
• Rope bridge: Carrick-a-rede, Rope Bridge, Ballintoy, Antrim | La salvaje … | Flickr
• Fischli/Weiss, Installation view, Rock on Top of Another Rock 2013, Serpentine Gallery, London, © Peter Fischli David Weiss, Photo: 2013 Morley von Sternberg
• Cheetah: Sarah (cheetah), Wikimedia Commons, Gregory Wilson
• Bent tree: Resilience | Captured at Inks Lake State Park View On Black | Anne Worner | Flickr

Slide 67

Slide 67 text

Colophon Slides made with Markdown and Deckset, Titillium theme.