Building a best-effort distributed fallback

Sushant Bhadkamkar, Staff Software Engineer
Craft Conference 2021

Lyft
Lyft is a single, smart app to get around your city: everything from Shared rides and XL vehicles to bikes, scooters, and transit information.

The Purchase Flow domain
The Purchase Flow team at Lyft is responsible for showing users all the relevant transportation options for their trip. This domain starts with the user entering their destination and ends with them booking a trip in the Lyft app. We call these options “Offers”, and this screen the “Offer Selector”.


Panic!
About a year ago, we had 2 outages that lasted a few minutes.
If our domain is unavailable:
• Users cannot request rides
• Drivers cannot make money on the platform
• Several days are spent responding to support tickets, identifying affected users, and compensating them for the bad experience

Build a fallback?

Fallback: an alternate mode of operation used in an emergency

Problems with a distributed fallback
• Hard to simulate and test
• May itself fail or have latent bugs
• May make the outage worse
Recommended reading: Avoiding fallback in distributed systems by Jacob Gabrielson

Building Resiliency
• First, invest time in improving the reliability of the normal mode of operation
• Pick the simplest fallback design
• Try to reuse code that is executed regularly in the normal mode
• Continuously exercise the fallback mode (in staging AND production environments)
• Thoroughly test the fallback mode, end to end
• Support a quick and easy way to disable the fallback, with a killswitch

Improve reliability of the normal mode

Sidebar: Fault injection testing
Chaos Experimentation with Envoy, on the Lyft Engineering Blog

Sidebar: Envoy service mesh


Pick the simplest fallback design

    // shouldRunFallback will return true if we failed to fetch offers from the upstream
    func (h *Handler) shouldRunFallback() bool {
        if h.upstreamErr != nil {
            // Only server errors (5xx) from the upstream trigger the fallback.
            if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
                return false
            }
            return true
        }
        return false
    }


Sidebar: Static Stability
Recommended reading: Static stability using Availability Zones by Becky Weiss and Mike Furr

Reuse code from the normal mode

    import (
        "github.com/lyft/purchaseflowlib"
    )

    // buildFallbackResponse builds the API offer representation that can be rendered on clients
    func (h *Handler) buildFallbackResponse(fallbackProducts []*Products) (*OffersResp, error) {
        fallbackOffers := purchaseflowlib.BuildOffers(fallbackProducts)
        rankedOffers := purchaseflowlib.Rank(fallbackOffers)
        resp, err := purchaseflowlib.ToAPIResponse(h.user, rankedOffers)
        if err != nil {
            return nil, err
        }
        return resp, nil
    }

Continuously exercise the fallback

Sidebar: Feature Toggles
Recommended reading: Feature Toggles (aka Feature Flags) by Pete Hodgson
Allow modifying system behavior without changing code
At Lyft:
• Recommended for any new (critical/complex) feature or integration
• Implemented as a git repository with an independent release pipeline
• Supports hierarchical configuration of toggles, with environment and geographical overrides
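To make "hierarchical configuration with environment and geographical overrides" concrete, here is a minimal sketch of how such a toggle might be resolved. The `Toggle` type, its field names, the region codes, and the precedence order (region, then environment, then default) are all assumptions for illustration; the slides do not describe Lyft's actual toggle format or resolution rules.

```go
package main

import "fmt"

// Toggle is a hypothetical representation of one feature toggle: a default
// value plus optional overrides keyed by environment and by region.
type Toggle struct {
	Default      bool
	EnvOverrides map[string]bool // e.g. "staging", "production"
	GeoOverrides map[string]bool // e.g. "SFO", "NYC" (illustrative codes)
}

// Enabled resolves the toggle from most specific to least specific:
// region override, then environment override, then the default.
// This precedence is an assumption, not something the talk specifies.
func (t Toggle) Enabled(env, region string) bool {
	if v, ok := t.GeoOverrides[region]; ok {
		return v
	}
	if v, ok := t.EnvOverrides[env]; ok {
		return v
	}
	return t.Default
}

func main() {
	fallbackEnabled := Toggle{
		Default:      false,
		EnvOverrides: map[string]bool{"staging": true},
		GeoOverrides: map[string]bool{"SFO": true},
	}
	fmt.Println(fallbackEnabled.Enabled("staging", "NYC"))    // true: environment override
	fmt.Println(fallbackEnabled.Enabled("production", "SFO")) // true: region override
	fmt.Println(fallbackEnabled.Enabled("production", "NYC")) // false: default
}
```

A layout like this lets a feature be switched on everywhere in staging while being rolled out one region at a time in production.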

Thoroughly test the fallback, end to end
• Exhaustive automated test suite for fallback triggering and response generation
• An HTTP 5xx response from the Personalization service “triggers” the fallback; this can be simulated by injecting failures in our service mesh (Envoy proxy)
• Fallback response is always served (and can be manually tested) in specific staging environments
• Client (app) UI and regression tests include the fallback mode and are run for every new build

Support disabling the fallback

    func (h *Handler) shouldRunFallback() bool {
        if h.upstreamErr != nil {
            if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
                return false
            }
            if h.config.isFallbackEnabled { // killswitch
                return true
            }
        }
        return false
    }

The fallback mode
A somewhat degraded experience
• Subset of transportation options available
• No price information or time estimates available
• Offers not personalized


To recap...
• First, invest time in improving the reliability of the normal mode of operation
• Pick the simplest fallback design
• Try to reuse code that is executed regularly in the normal mode
• Continuously exercise the fallback mode (in staging AND production environments)
• Thoroughly test the fallback mode, end to end
• Support a quick and easy way to disable the fallback, with a killswitch

Thank you!
Send questions and feedback to @tnahsus on Twitter
