Slide 1

Slide 1 text

Mobile Networking At Scale
LESSONS FROM
JP Simard • [email protected] • @simjp

Slide 2

Slide 2 text

About Me
▸ Family: Wife & two kids, living in Montréal, Canada
▸ OSS: SwiftLint, Jazzy, Yams, SourceKitten, Realm
▸ Podcast: Swift Unwrapped
▸ Lyft: Transit, Bikes & Scooters

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

2012 2022

Slide 5

Slide 5 text

Billions of requests a day
100+ mobile engineers
Millions of daily active users

Slide 6

Slide 6 text

Our Platform Networking connects our riders and drivers to

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Server: Stable, Fast, Single Provider, Control
Mobile: Unreliable, Inconsistent, Many Carriers, User Configured

Slide 9

Slide 9 text

App Feature API Client Networking Engine Code you write Black box

Slide 10

Slide 10 text

What have we learned?

Slide 11

Slide 11 text

LESSON 1
Server-side observability only shows you half the picture
▸ We found that our server-side stats weren’t accurately reflecting our mobile failure rates
▸ Server-side stats don’t capture whether the clients were able to successfully receive all the data that was sent
▸ Large payloads in particular have higher failure rates on the receiving end
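A minimal sketch of why the two views diverge, assuming a hypothetical `RequestRecord` that joins what the server logged with what the client actually observed. The field names and the "status below 500 counts as success" rule are illustrative assumptions, not something taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestRecord:
    # Hypothetical fields, for illustration only.
    server_status: Optional[int]  # status the server logged; None if the request never reached it
    client_received: bool         # True only if the client read the complete response body


def server_side_success_rate(records: List[RequestRecord]) -> float:
    """Success rate as the server sees it: only requests that reached it are counted."""
    seen = [r for r in records if r.server_status is not None]
    if not seen:
        return 0.0
    ok = [r for r in seen if r.server_status < 500]
    return len(ok) / len(seen)


def client_side_success_rate(records: List[RequestRecord]) -> float:
    """Success rate as the user experiences it: every attempt counts,
    and a response only succeeds if the client received all of it."""
    if not records:
        return 0.0
    ok = [
        r for r in records
        if r.server_status is not None and r.server_status < 500 and r.client_received
    ]
    return len(ok) / len(records)


records = [
    RequestRecord(200, True),    # fully successful
    RequestRecord(200, False),   # server sent it, client never finished receiving it
    RequestRecord(None, False),  # never reached the server
    RequestRecord(500, False),   # server error
]
print(server_side_success_rate(records), client_side_success_rate(records))
```

With these made-up records the server reports two successes out of the three requests it saw, while the client succeeded on only one of four attempts.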

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

LESSON 2
Measuring performance affects performance
▸ If you need to make networking requests to send telemetry about your networking, that can compete for bandwidth and other system resources if you’re not careful
▸ This can affect not only the user experience, but also the measurements you were trying to take in the first place
▸ Similarly, measuring reliability influences reliability
▸ It’s critical to optimize your measurement pipeline
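One common way to reduce telemetry overhead is to batch events so that many measurements share a single upload. The sketch below is an assumption about what "optimizing the measurement pipeline" could look like, not the pipeline described in the talk. `TelemetryBatcher`, its `send` callback, and the batch size are hypothetical names chosen for illustration.

```python
from typing import Callable, List


class TelemetryBatcher:
    """Buffers telemetry events and hands them to `send` in batches,
    so N events cost roughly N / max_batch_size uploads instead of N."""

    def __init__(self, send: Callable[[List[dict]], None], max_batch_size: int = 50):
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        self._send = send            # hypothetical uploader, e.g. one POST per batch
        self._max = max_batch_size
        self._pending: List[dict] = []

    def record(self, event: dict) -> None:
        """Queue an event; upload only when the batch is full."""
        self._pending.append(event)
        if len(self._pending) >= self._max:
            self.flush()

    def flush(self) -> None:
        """Upload whatever is pending, e.g. when the app moves to the background."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._send(batch)


uploads: List[List[dict]] = []
batcher = TelemetryBatcher(uploads.append, max_batch_size=3)
for i in range(7):
    batcher.record({"request_id": i, "latency_ms": 120})
batcher.flush()
print([len(batch) for batch in uploads])
```

A real pipeline would also need to bound memory use and decide what to drop when uploads fail. Those choices are left out here.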

Slide 14

Slide 14 text

LESSON 3
Small shifts to networking performance matter at scale
▸ What you think is a micro-optimization may end up making a tangible impact on the long tail
▸ Conversely, if you think you’re adding a totally innocuous amount of overhead, you may be surprised to see it measurably hurting users at scale
▸ At Lyft, we caught a case where what looked at first glance like lightweight runtime reflection to get the name of a type was actually responsible for 4-8% of our total app hangs!

Slide 15

Slide 15 text

LESSON 4
Not all cell carriers are equal
▸ The success rate can vary by up to 10% across carriers with the same OS/library configuration
▸ The success rate can vary by up to 10% across libraries using the same carrier
▸ The best performing networking library varies across carriers
▸ This is even more true with a global user base
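A comparison like this depends on slicing client-side results by both carrier and networking library. The sketch below shows one way to do that aggregation. The sample tuples, carrier names, and library names are made up for illustration and are not data from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str, bool]  # (carrier, library, succeeded)


def success_rates(samples: Iterable[Sample]) -> Dict[Tuple[str, str], float]:
    """Success rate for every (carrier, library) pair seen in the samples."""
    totals = defaultdict(lambda: [0, 0])  # (carrier, library) -> [successes, attempts]
    for carrier, library, succeeded in samples:
        counts = totals[(carrier, library)]
        counts[0] += 1 if succeeded else 0
        counts[1] += 1
    return {key: ok / attempts for key, (ok, attempts) in totals.items()}


def best_library_per_carrier(rates: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """For each carrier, the library with the highest observed success rate."""
    best: Dict[str, Tuple[str, float]] = {}
    for (carrier, library), rate in rates.items():
        if carrier not in best or rate > best[carrier][1]:
            best[carrier] = (library, rate)
    return {carrier: library for carrier, (library, _) in best.items()}


# Made-up samples: "carrier-a"/"carrier-b" and "lib-x"/"lib-y" are placeholders.
samples = [
    ("carrier-a", "lib-x", True), ("carrier-a", "lib-x", True),
    ("carrier-a", "lib-y", True), ("carrier-a", "lib-y", False),
    ("carrier-b", "lib-x", False), ("carrier-b", "lib-x", True),
    ("carrier-b", "lib-y", True),
]
rates = success_rates(samples)
print(best_library_per_carrier(rates))
```

In practice each cell needs enough samples for the difference to be meaningful. With only a handful of requests per pair, as in this toy input, the ranking is noise.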

Slide 16

Slide 16 text

Success Rate by Carrier & Networking Library

Slide 17

Slide 17 text

LESSON 5
The networking engine can influence overall app performance
▸ We were somewhat surprised to see that switching from URLSession to an in-process networking engine actually decreased app hangs and OOM crashes
▸ On Android, we saw similar shifts to ANRs

Slide 18

Slide 18 text

LESSON 6
Caching is the solution to, and cause of, many problems
▸ Planning to experiment with DNS caching soon to see if it improves the time to first connect on the long tail
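To make the trade-off concrete, here is a minimal sketch of a TTL-based DNS cache. It is not the design Lyft or Envoy Mobile uses (the slide only says an experiment is planned). `DNSCache`, the injected `resolve` function, and the 60-second default TTL are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class DNSCache:
    """Remembers resolved addresses for `ttl_seconds` so repeat connections
    can skip the lookup. The cost: a cached entry may be stale if the
    host's addresses change before it expires."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],          # hypothetical resolver function
        ttl_seconds: float = 60.0,                    # illustrative default
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}  # host -> (expiry, addresses)

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]                 # cache hit: no resolver call
        addresses = self._resolve(host)     # cache miss or expired entry
        self._entries[host] = (now + self._ttl, addresses)
        return addresses


resolver_calls: List[str] = []


def fake_resolve(host: str) -> List[str]:
    """Stand-in resolver that records each call and returns a documentation address."""
    resolver_calls.append(host)
    return ["192.0.2.1"]


fake_now = [0.0]
cache = DNSCache(fake_resolve, ttl_seconds=10.0, clock=lambda: fake_now[0])
cache.lookup("example.com")
cache.lookup("example.com")   # served from the cache
fake_now[0] = 11.0
cache.lookup("example.com")   # entry expired, resolver is called again
print(len(resolver_calls))
```

The TTL is where both halves of the lesson show up: a longer TTL saves more lookups on slow networks, and it also keeps handing out old addresses for longer after a change.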

Slide 19

Slide 19 text

LESSON 7
Debugging this stuff is hard
▸ This led us and others at Lyft to build some truly advanced remote debugging tools, but more on that some other time 😉🤫

Slide 20

Slide 20 text

We wanted to solve this for Lyft*
*and for the open source community

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

envoymobile.io

Slide 23

Slide 23 text

eng.lyft.com

Slide 24

Slide 24 text

Rich Observability
▸ Dozens of metrics data and types out of the box
▸ Point to a statsd or gRPC endpoint to ingest with little configuration

Slide 25

Slide 25 text

Rich Observability

Slide 26

Slide 26 text

Full Control
▸ Open source
▸ Modern codebase
▸ C++ / Swift / Kotlin
▸ Debuggable

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Modern Technologies
▸ QUIC/HTTP3
▸ TLS 1.3
▸ gRPC streaming
▸ brotli compression
▸ Happy Eyeballs

Slide 30

Slide 30 text

Cross-Platform
▸ Our iOS, Android and server codebases can all use the same networking engine
▸ Deploys back to iOS 13

Slide 31

Slide 31 text

Reasons not to use Envoy Mobile
▸ Most apps should use URLSession
▸ For now we don’t recommend Envoy Mobile for general use
▸ Support for proxies on iOS & Android is in progress
▸ On iOS, the OS only allows background requests to go through URLSession
▸ Binary size increase (~5MB)

Slide 32

Slide 32 text

Contributors
Mike, José, Michael, Keith, Rafał, Alyssa and dozens more on Envoy Mobile directly, not to mention over a hundred contributors to the main Envoy project

Slide 33

Slide 33 text

Lessons Learned
1. Mobile observability is necessary to understand the true health of your overall system
2. Measuring performance affects performance
3. Small shifts to networking performance matter at scale
4. Not all cell carriers are equal
5. The networking engine can influence overall app performance
6. Caching is a double-edged sword, use it carefully
7. Debugging this stuff is hard

Slide 34

Slide 34 text

Thanks for the ride
