
Building a best-effort distributed fallback

This talk will dive into the details of a best-effort fallback mechanism deployed at Lyft which helped us serve 8000 rides during a recent cloud provider outage, and continues to protect us from transient failures. We’ll talk about why we decided to pursue a fallback strategy despite its pitfalls, what the fallback experience looks like and how we overcame some of the challenges by building the automated mechanism in an API Gateway service. Our simple fallback design, and the techniques for simulating and testing it, will hopefully inspire the audience to start thinking about a best-effort fallback strategy for protecting their users.

Sushant Bhadkamkar

June 30, 2021


Transcript

  1. Lyft
    Lyft is a single, smart app to get around your city: everything from Shared rides and XL vehicles, to bikes, scooters, and transit information.
  2. The Purchase Flow domain
    The Purchase Flow team at Lyft is responsible for showing users all the relevant transportation options for their trip. This domain starts with the user entering their destination and ends with them booking a trip in the Lyft app. We call these options “Offers”, and this screen the “Offer Selector”.
  3. Panic!
    About a year ago, we had 2 outages that lasted a few minutes.
    If our domain is unavailable:
    • Users cannot request rides
    • Drivers cannot make money on the platform
    • Several days are spent responding to support tickets, identifying affected users and compensating them for the bad experience
  4. Problems with a distributed fallback
    • Hard to simulate and test
    • Itself may fail or have latent bugs
    • May make the outage worse
    Recommended reading: Avoiding fallback in distributed systems by Jacob Gabrielson
  5. Building Resiliency
    • First, invest time in improving the reliability of the normal mode of operation
    • Pick the simplest fallback design
    • Try to reuse code that is executed regularly in the normal mode
    • Continuously exercise the fallback mode (in staging AND production environments)
    • Thoroughly test the fallback mode, end to end
    • Support a quick and easy way to disable the fallback: a killswitch
  6. Pick the simplest fallback design
    // shouldRunFallback returns true only if fetching offers from the
    // upstream failed with a server-side (HTTP 5xx) error.
    func (h *Handler) shouldRunFallback() bool {
        if h.upstreamErr != nil {
            // Client errors (below 500) are not an outage; do not fall back.
            if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
                return false
            }
            return true
        }
        return false
    }
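The trigger rule on this slide can be exercised on its own. The following is a minimal sketch, not the code from the talk: `statusError` and the single-field `Handler` are hypothetical stand-ins, and the only thing assumed about the real upstream error is that it exposes `StatusCode()`.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for the upstream error type.
// The only assumption is that it exposes StatusCode().
type statusError struct{ code int }

func (e *statusError) StatusCode() int { return e.code }

// Handler mirrors the slide's receiver; its real fields are not shown there.
type Handler struct {
	upstreamErr *statusError // nil means the upstream call succeeded
}

// shouldRunFallback reports whether the upstream failed with a 5xx error.
func (h *Handler) shouldRunFallback() bool {
	if h.upstreamErr == nil {
		return false
	}
	return h.upstreamErr.StatusCode() >= http.StatusInternalServerError
}

func main() {
	cases := []*statusError{
		nil,                                    // upstream succeeded
		{code: http.StatusNotFound},            // client error
		{code: http.StatusServiceUnavailable},  // server error
	}
	for _, c := range cases {
		h := &Handler{upstreamErr: c}
		fmt.Println(h.shouldRunFallback())
	}
}
```

Only the last case, a 5xx error, triggers the fallback. Keeping the decision a pure function of the upstream error is what makes the design easy to test exhaustively.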
  7. Reuse code from the normal mode
    import (
        "github.com/lyft/purchaseflowlib"
    )

    // buildFallbackResponse builds the API offer representation that can be rendered on clients
    func (h *Handler) buildFallbackResponse(fallbackProducts []*Products) (*OffersResp, error) {
        fallbackOffers := purchaseflowlib.BuildOffers(fallbackProducts)
        rankedOffers := purchaseflowlib.Rank(fallbackOffers)
        resp, err := purchaseflowlib.ToAPIResponse(h.user, rankedOffers)
        if err != nil {
            return nil, err
        }
        return resp, nil
    }
  8. Sidebar: Feature Toggles
    Recommended reading: Feature Toggles (aka Feature Flags) by Pete Hodgson
    Allow modifying system behavior without changing code
    At Lyft,
    • Recommended for any new (critical/complex) feature or integration
    • Implemented as a git repository with an independent release pipeline
    • Supports hierarchical configuration of toggles, with environment and geographical overrides
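To make "hierarchical configuration with environment and geographical overrides" concrete, here is a minimal sketch of how such a lookup could resolve. The `Toggle` type, its fields, and the environment and region names are all hypothetical; the talk does not show Lyft's toggle format or API.

```go
package main

import "fmt"

// Toggle is a hypothetical boolean feature toggle. More specific
// overrides win over less specific ones:
// environment+region > environment > default.
type Toggle struct {
	Default  bool
	ByEnv    map[string]bool // keyed by environment, e.g. "staging"
	ByEnvGeo map[string]bool // keyed by environment + "/" + region
}

// Enabled resolves the toggle for one environment and region.
func (t Toggle) Enabled(env, region string) bool {
	if v, ok := t.ByEnvGeo[env+"/"+region]; ok {
		return v
	}
	if v, ok := t.ByEnv[env]; ok {
		return v
	}
	return t.Default
}

func main() {
	fallback := Toggle{
		Default:  false,
		ByEnv:    map[string]bool{"production": true},
		ByEnvGeo: map[string]bool{"production/region-b": false},
	}
	fmt.Println(fallback.Enabled("staging", "region-a"))    // default applies
	fmt.Println(fallback.Enabled("production", "region-a")) // environment override
	fmt.Println(fallback.Enabled("production", "region-b")) // region override
}
```

With a layout like this, a feature can be enabled broadly and still be switched off for a single region without touching the service's code.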
  9. Thoroughly test the fallback, end to end
    • Exhaustive automated test suite for fallback triggering and response generation
    • An HTTP 5xx response from the Personalization service “triggers” the fallback; this can be simulated by injecting failures in our service mesh (Envoy proxy)
    • Fallback response is always served (and can be manually tested) in specific staging environments
    • Client (app) UI and regression tests include the fallback mode and are run for every new build
  10. Support disabling the fallback
    func (h *Handler) shouldRunFallback() bool {
        if h.upstreamErr != nil {
            if h.upstreamErr.StatusCode() < http.StatusInternalServerError {
                return false
            }
            if h.config.isFallbackEnabled { // killswitch
                return true
            }
        }
        return false
    }
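A killswitch is only quick if flipping it takes effect without a redeploy. The sketch below shows one way to read such a flag safely while requests are in flight, using `atomic.Bool` (Go 1.19 or later). How the flag is actually refreshed at Lyft is not described in the talk; the `Handler` shape and the `upstreamStatus` parameter here are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// Handler holds a hypothetical killswitch that a config watcher could
// flip at runtime. Use it through a pointer; atomic values must not be copied.
type Handler struct {
	fallbackEnabled atomic.Bool
}

// shouldRunFallback triggers only on upstream 5xx responses, and only
// while the killswitch leaves the fallback enabled.
func (h *Handler) shouldRunFallback(upstreamStatus int) bool {
	if upstreamStatus < http.StatusInternalServerError {
		return false
	}
	return h.fallbackEnabled.Load()
}

func main() {
	h := &Handler{}
	h.fallbackEnabled.Store(true)
	fmt.Println(h.shouldRunFallback(http.StatusServiceUnavailable))

	// Operator disables the fallback; no restart is needed.
	h.fallbackEnabled.Store(false)
	fmt.Println(h.shouldRunFallback(http.StatusServiceUnavailable))
}
```

The zero value of `atomic.Bool` is false, so in this sketch the fallback stays off until configuration explicitly enables it.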
  11. The fallback mode
    A somewhat degraded experience:
    • Subset of transportation options available
    • No price information or time estimates available
    • Offers not personalized
  12. To recap...
    • First, invest time in improving the reliability of the normal mode of operation
    • Pick the simplest fallback design
    • Try to reuse code that is executed regularly in the normal mode
    • Continuously exercise the fallback mode (in staging AND production environments)
    • Thoroughly test the fallback mode, end to end
    • Support a quick and easy way to disable the fallback: a killswitch