
Lessons from Mobile Networking at Scale

JP Simard
September 16, 2022


In the last few years, Lyft has invested heavily in the observability of our apps as we've scaled. At first we mostly treated the networking stack as a black box, hoping that the system would make a best effort to execute network requests and assuming that influencing it was mostly out of our hands.

However, over time we discovered a lot more about how and when our network requests were failing. We developed ways to instrument the health of our apps in real time over billions of API requests a day, even triggering alarms when performance regressed. We also discovered how to tune the networking layer to perform best with various usage patterns for our different apps.

Ultimately, this resulted in us creating our own open source cross-platform mobile networking library, which has given us unprecedented visibility into and control over the performance of our apps at scale. You can find out more about it here:
https://envoymobile.io


Transcript

  1. About Me
    ▸ Family: Wife & two kids, living in Montréal, Canada
    ▸ OSS: SwiftLint, Jazzy, Yams, SourceKitten, Realm
    ▸ Podcast: Swift Unwrapped
    ▸ Lyft: Transit, Bikes & Scooters
    JP Simard • @simjp
  2. LESSON 1: Server-side observability only shows you half the picture
    ▸ We found that our server-side stats weren’t accurately reflecting our mobile failure rates
    ▸ Server-side stats don’t capture if the clients were able to successfully receive all the data that was sent
    ▸ Large payloads in particular have higher failure rates on the receipt end
  3. LESSON 2: Measuring performance affects performance
    ▸ If you need to make networking requests to send telemetry about your networking, that can compete with bandwidth and other system resources if you’re not careful
    ▸ This can not only affect the user experience, but the measurements you were trying to take in the first place
    ▸ Similarly, measuring reliability influences reliability
    ▸ It’s critical to optimize your measurement pipeline
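One common way to keep a measurement pipeline from competing with real traffic is to buffer telemetry events and upload them in batches. The following is a minimal sketch of that idea, not a description of Lyft's pipeline: `TelemetryBatcher`, its thresholds, and the injected `send` callable are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Optional


class TelemetryBatcher:
    """Buffers events in memory and uploads them in batches (hypothetical helper)."""

    def __init__(self, send: Callable[[List[Dict]], None], max_batch: int = 50,
                 max_age_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send              # assumed uploader, e.g. one POST per batch
        self._max_batch = max_batch    # flush once this many events are buffered
        self._max_age_s = max_age_s    # or once the oldest event is this old
        self._clock = clock            # injectable so the logic can be tested
        self._buffer: List[Dict] = []
        self._oldest: Optional[float] = None

    def record(self, event: Dict) -> None:
        now = self._clock()
        if not self._buffer:
            self._oldest = now
        self._buffer.append(event)
        too_many = len(self._buffer) >= self._max_batch
        too_old = now - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        # Hand the whole buffer to the uploader in a single call.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._oldest = None
            self._send(batch)
```

With `max_batch=3`, recording seven events triggers two uploads of three events each and leaves one event buffered until the next `flush()`. The age check only runs when a new event arrives, so a real app would also call `flush()` on a timer or when moving to the background.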
  4. LESSON 3: Small shifts to networking performance matter at scale
    ▸ What you think is a micro-optimization may end up making a tangible impact on the long tail
    ▸ Conversely, if you think you’re adding a totally innocuous amount of overhead, you may be surprised to see it measurably hurting users at scale
    ▸ At Lyft, we found that what looked at first glance like lightweight runtime reflection to get the name of a type was actually responsible for 4-8% of our total app hangs!
  5. LESSON 4: Not all cell carriers are equal
    ▸ The success rate can vary by up to 10% across carriers with the same OS/library configuration
    ▸ The success rate can vary by up to 10% across libraries using the same carrier
    ▸ The best performing networking library varies across carriers
    ▸ This is even more true with a global user base
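Comparisons like these depend on slicing request outcomes by both carrier and networking library. As a rough sketch of that aggregation, assuming each sample is a hypothetical `(carrier, library, succeeded)` tuple reported by the client:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (carrier, library); both labels are illustrative


def success_rates(samples: Iterable[Tuple[str, str, bool]]) -> Dict[Key, float]:
    """Return the fraction of successful requests per (carrier, library) pair."""
    totals: Dict[Key, int] = defaultdict(int)
    successes: Dict[Key, int] = defaultdict(int)
    for carrier, library, ok in samples:
        key = (carrier, library)
        totals[key] += 1
        successes[key] += int(ok)
    return {key: successes[key] / totals[key] for key in totals}


def best_library_per_carrier(rates: Dict[Key, float]) -> Dict[str, str]:
    """Pick the library with the highest success rate for each carrier."""
    best: Dict[str, Tuple[str, float]] = {}
    for (carrier, library), rate in rates.items():
        if carrier not in best or rate > best[carrier][1]:
            best[carrier] = (library, rate)
    return {carrier: library for carrier, (library, _) in best.items()}
```

Grouping on the pair rather than on either dimension alone is what makes it possible to see that the best library can differ from one carrier to the next.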
  6. LESSON 5: The networking engine can influence overall app performance
    ▸ We were somewhat surprised to see that switching from URLSession to an in-process networking engine actually decreased app hangs and OOM crashes
    ▸ On Android, we saw similar shifts to ANRs
  7. LESSON 6: Caching is the solution to, and cause of, many problems
    ▸ Planning to experiment with DNS caching soon to see if it improves the time to first connect on the long tail
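To make the trade-off concrete, here is a minimal sketch of a time-to-live DNS cache. It is a generic illustration rather than Envoy Mobile's implementation; `DNSCache`, the injected `resolve` callable, and the 60-second default are assumptions.

```python
import time
from typing import Callable, Dict, List, Tuple


class DNSCache:
    """Caches resolved addresses per host for a fixed TTL (hypothetical helper)."""

    def __init__(self, resolve: Callable[[str], List[str]], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._resolve = resolve   # assumed resolver, e.g. a wrapper over getaddrinfo
        self._ttl_s = ttl_s
        self._clock = clock       # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now < entry[0]:
            # Fresh hit: skips a DNS round trip before the first connect.
            return entry[1]
        # Miss or expired entry: resolve again and remember the result.
        addresses = self._resolve(host)
        self._entries[host] = (now + self._ttl_s, addresses)
        return addresses
```

A hit removes a round trip from the connection path, which is the potential win. The same mechanism is also the risk: until an entry expires, the client keeps using addresses that may no longer be correct.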
  8. LESSON 7: Debugging this stuff is hard
    ▸ This led us and others at Lyft to build some truly advanced remote debugging tools, but more on that some other time 😉🤫
  9. Rich Observability
    ▸ Dozens of metrics data and types out of the box
    ▸ Point to a statsd or gRPC endpoint to ingest with little configuration
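For readers unfamiliar with statsd, its wire format is a plain-text line of the form `name:value|type`, optionally followed by `|@sample_rate`, typically sent over UDP. The sketch below shows that format only; it is not Envoy Mobile's API, and the function names, metric names, and default host and port are assumptions.

```python
import socket
from typing import Optional, Union


def format_metric(name: str, value: Union[int, float], kind: str,
                  sample_rate: Optional[float] = None) -> str:
    """Build one statsd line; kind is e.g. 'c' (counter), 'ms' (timer), 'g' (gauge)."""
    line = f"{name}:{value}|{kind}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a single metric line over UDP (assumed local statsd agent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `format_metric("api.request.latency", 12.5, "ms")` produces `api.request.latency:12.5|ms`.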
  10. Modern Technologies
    ▸ QUIC/HTTP3
    ▸ TLS 1.3
    ▸ gRPC streaming
    ▸ brotli compression
    ▸ Happy Eyeballs
  11. Cross-Platform
    ▸ Our iOS, Android and server codebases can all use the same networking engine
    ▸ Deploys back to iOS 13
  12. Reasons not to use Envoy Mobile
    ▸ Most apps should use URLSession
    ▸ For now we don’t recommend Envoy Mobile for general use
    ▸ Support for proxies on iOS & Android in progress
    ▸ iOS background requests are enforced by the OS to only work with URLSession
    ▸ Binary size increase (~5MB)
  13. Contributors
    Mike, José, Michael, Keith, Rafał, Alyssa and dozens more on Envoy Mobile directly, not to mention over a hundred contributors to the main Envoy project
  14. Lessons Learned
    1. Mobile observability is necessary to understand the true health of your overall system
    2. Measuring performance affects performance
    3. Small shifts to networking performance matter at scale
    4. Not all cell carriers are equal
    5. The networking engine can influence overall app performance
    6. Caching is a double-edged sword, use it carefully
    7. Debugging this stuff is hard