Building a best-effort distributed fallback

Slide 1

Slide 1 text

Building a best-effort distributed fallback
Sushant Bhadkamkar, Staff Software Engineer
Craft Conference 2021

Slide 2

Slide 2 text

Lyft
Lyft is a single, smart app to get around your city: everything from Shared rides and XL vehicles to bikes, scooters, and transit information.

Slide 3

Slide 3 text

The Purchase Flow domain
The Purchase Flow team at Lyft is responsible for showing users all the relevant transportation options for their trip. This domain starts with the user entering their destination and ends with them booking a trip in the Lyft app. We call these options “Offers”, and this screen the “Offer Selector”.

Slide 4

Slide 4 text

The Purchase Flow domain

Slide 5

Slide 5 text

Panic!
About a year ago, we had 2 outages that lasted a few minutes.
If our domain is unavailable:
• Users cannot request rides
• Drivers cannot make money on the platform
• Several days are spent responding to support tickets, identifying affected users and compensating them for the bad experience

Slide 6

Slide 6 text

Build a fallback?

Slide 7

Slide 7 text

Fallback
An alternate mode of operation used in an emergency.

Slide 8

Slide 8 text

Problems with a distributed fallback
• Hard to simulate and test
• Itself may fail or have latent bugs
• May make the outage worse
Recommended reading: Avoiding fallback in distributed systems, by Jacob Gabrielson

Slide 9

Slide 9 text

Building Resiliency
• First, invest time in improving the reliability of the normal mode of operation
• Pick the simplest fallback design
• Try to reuse code that is executed regularly in the normal mode
• Continuously exercise the fallback mode (in staging AND production environments)
• Thoroughly test the fallback mode, end to end
• Support a quick and easy way to disable the fallback: a killswitch

Slide 10

Slide 10 text

Improve reliability of the normal mode

Slide 11

Slide 11 text

Sidebar: Fault injection testing
Chaos Experimentation with Envoy, on the Lyft Engineering Blog

Slide 12

Slide 12 text

Sidebar: Envoy service mesh

Slide 13

Slide 13 text

Pick the simplest fallback design

Slide 14

Slide 14 text

Pick the simplest fallback design

// shouldRunFallback returns true if we failed to fetch offers from the upstream.
func (h *Handler) shouldRunFallback() bool {
    if h.upstreamErr != nil {
        // Only server-side (5xx) failures trigger the fallback.
        if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
            return false
        }
        return true
    }
    return false
}

Slide 15

Slide 15 text

Pick the simplest fallback design

Slide 16

Slide 16 text

Sidebar: Static Stability
Recommended reading: Static stability using Availability Zones, by Becky Weiss and Mike Furr

Slide 17

Slide 17 text

Reuse code from the normal mode

import (
    "github.com/lyft/purchaseflowlib"
)

// buildFallbackResponse builds the API offer representation that can be rendered on clients.
// It reuses the same library calls that run in the normal mode.
func (h *Handler) buildFallbackResponse(fallbackProducts []*Products) (*OffersResp, error) {
    fallbackOffers := purchaseflowlib.BuildOffers(fallbackProducts)
    rankedOffers := purchaseflowlib.Rank(fallbackOffers)
    resp, err := purchaseflowlib.ToAPIResponse(h.user, rankedOffers)
    if err != nil {
        return nil, err
    }
    return resp, nil
}

Slide 18

Slide 18 text

Continuously exercise the fallback

Slide 19

Slide 19 text

Sidebar: Feature Toggles
Recommended reading: Feature Toggles (aka Feature Flags), by Pete Hodgson
Allow modifying system behavior without changing code.
At Lyft:
• Recommended for any new (critical/complex) feature or integration
• Implemented as a git repository with an independent release pipeline
• Supports hierarchical configuration of toggles, with environment and geographical overrides

Slide 20

Slide 20 text

Thoroughly test the fallback, end to end
• Exhaustive automated test suite for fallback triggering and response generation
• An HTTP 5xx response from the Personalization service “triggers” the fallback; this can be simulated by injecting failures in our service mesh (Envoy proxy)
• Fallback response is always served (and can be manually tested) in specific staging environments
• Client (app) UI and regression tests include the fallback mode and are run for every new build
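The triggering path described above can be exercised in-process without a real service mesh. The sketch below is one possible shape, using only Go's standard `net/http` and `net/http/httptest` packages: a wrapper stands in for the fault Envoy would inject, and the offers handler serves a fallback body when the upstream answers 5xx. The handler names, the `X-Offers-Mode` header, and the response bodies are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// injectFault wraps a handler and, when enabled, answers 503 instead of
// calling it. It stands in for a fault injected by the service mesh.
func injectFault(enabled bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if enabled {
			http.Error(w, "injected fault", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// offersHandler calls the upstream in-process and serves the fallback
// response whenever the upstream status is 5xx.
func offersHandler(upstream http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := httptest.NewRecorder()
		upstream.ServeHTTP(rec, r)
		if rec.Code >= http.StatusInternalServerError {
			w.Header().Set("X-Offers-Mode", "fallback") // hypothetical header
			fmt.Fprint(w, "fallback offers")
			return
		}
		w.Header().Set("X-Offers-Mode", "normal")
		w.WriteHeader(rec.Code)
		w.Write(rec.Body.Bytes())
	})
}

// serve sends one request through the handler chain and reports which
// mode answered and what body was returned.
func serve(faultEnabled bool) (mode, body string) {
	personalization := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "personalized offers")
	})
	h := offersHandler(injectFault(faultEnabled, personalization))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/offers", nil))
	return rec.Header().Get("X-Offers-Mode"), rec.Body.String()
}

func main() {
	for _, fault := range []bool{false, true} {
		mode, body := serve(fault)
		fmt.Printf("fault=%v mode=%s body=%q\n", fault, mode, body)
	}
}
```

The same `serve` helper can back a table-driven test, so the trigger condition is checked on every build rather than only during an incident.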

Slide 21

Slide 21 text

Support disabling the fallback

func (h *Handler) shouldRunFallback() bool {
    if h.upstreamErr != nil {
        if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
            return false
        }
        if h.config.isFallbackEnabled { // killswitch
            return true
        }
    }
    return false
}
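For readers who want to run the check, the following is a self-contained sketch of the same decision. The `upstreamError` and `config` types are hypothetical stand-ins, since the slide does not show how `Handler` is defined; only the logic of `shouldRunFallback` follows the slide.

```go
package main

import (
	"fmt"
	"net/http"
)

// upstreamError is a hypothetical stand-in for the error type on the slide.
type upstreamError struct{ code int }

func (e *upstreamError) StatusCode() int { return e.code }

// config is a hypothetical holder for the killswitch value.
type config struct{ isFallbackEnabled bool }

// Handler carries only the fields the check needs in this sketch.
type Handler struct {
	upstreamErr *upstreamError
	config      config
}

// shouldRunFallback returns true only when the upstream failed with a 5xx
// status and the killswitch leaves the fallback enabled.
func (h *Handler) shouldRunFallback() bool {
	if h.upstreamErr == nil {
		return false
	}
	if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
		return false
	}
	return h.config.isFallbackEnabled
}

func main() {
	cases := []struct {
		name    string
		err     *upstreamError
		enabled bool
	}{
		{"no upstream error", nil, true},
		{"client error (404)", &upstreamError{http.StatusNotFound}, true},
		{"server error (503), fallback enabled", &upstreamError{http.StatusServiceUnavailable}, true},
		{"server error (503), killswitch off", &upstreamError{http.StatusServiceUnavailable}, false},
	}
	for _, c := range cases {
		h := &Handler{upstreamErr: c.err, config: config{isFallbackEnabled: c.enabled}}
		fmt.Printf("%-40s fallback=%v\n", c.name, h.shouldRunFallback())
	}
}
```

With the killswitch off, a 5xx from the upstream is surfaced as a failure instead of being masked, which is the behavior you want if the fallback itself turns out to be the problem.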

Slide 22

Slide 22 text

The fallback mode
A somewhat degraded experience:
• Subset of transportation options available
• No price information or time estimates available
• Offers not personalized

Slide 23

Slide 23 text

The fallback mode

Slide 24

Slide 24 text

To recap...
• First, invest time in improving the reliability of the normal mode of operation
• Pick the simplest fallback design
• Try to reuse code that is executed regularly in the normal mode
• Continuously exercise the fallback mode (in staging AND production environments)
• Thoroughly test the fallback mode, end to end
• Support a quick and easy way to disable the fallback: a killswitch

Slide 25

Slide 25 text

Thank you! Send questions and feedback to @tnahsus on Twitter