
Balancing Speed and Reliability: The Double-Edged Sword of Third-Party Libraries

Ty Smith
September 06, 2023

Using third-party libraries in your apps can be a great way to save engineering time and move faster, but it can also bring significant risk. If a library malfunctions and causes an outage, it may take days or weeks to resolve for all your users. Apps have long update cycles and don’t have the luxury of hotfixes when something goes wrong. Uber is an app that people rely on to earn their income, get to the doctor, or commute to work, so reliability in our app is the top priority. Learn how Uber decides when mobile libraries are safe to include and when they should be avoided.

We’ll review how Uber analyzes external libraries to reduce risk, walk through some horror stories from when things went wrong, and cover techniques that help maintain reliability for your users when the worst does happen. You’ll walk away with a tactical framework for evaluating libraries in your own apps.

Transcript

  1. Balancing Speed and Reliability
    The Double-Edged Sword of Third-Party Libraries
    Ty Smith
    tysmith.me
    Uber

  2. April 23, 2020
    11:22am PST

  5. issuetracker.google.com/issues/154855417

  6. Outage Timeline
    4 day outage
    - Several rotating incident commanders
    - Teams from every org
    Thursday
    11:30 - Incident detected
    14:40 - Google rollback started
    Friday
    04:00 - Google releases Android fix #1
    10:00 - Enable Uber Maps in US, CA, MX
    14:00 - Release Android hotfix #1
    19:40 - Google releases iOS fix
    22:30 - Release iOS hotfix
    Saturday
    11:30 - Google releases Android fix #2
    14:30 - Release Android hotfix #2
    Sunday
    07:30 - Enable Uber Maps in remaining areas

  10. issuetracker.google.com/issues/154855417#comment515

  11. Impact
    ● Largest mobile outage in Uber’s history
    ● Millions of users blocked
    ● Millions of $ lost
    ● Thousands of hours of lost employee productivity

  12. Aftermath
    ● Executive review of postmortem
    ● New Intercompany legal agreements
    ● Improved library governance process
    ● Improved crash protection
    ● Improved crash recovery

  13. Third Party Code

  14. Third Party Code
    ✅ Modern platform
    ✅ Available Features
    ✅ Faster development
    ✅ Free maintenance and updates

  15. [Diagram: App Code alongside its third-party dependencies: Jetpack Compose, Okio, OkHttp, Kotlin stdlib, Store, Coroutine, Coil, Room, Google Maps, Google Pay]

  16. xkcd.com/2347

  17. Third Party Code
    ✅ Faster development
    ✅ Available features
    ✅ Modern platform
    ✅ Free maintenance and updates
    🆇 Crashes
    🆇 Security Vulnerabilities
    🆇 Government Compliance
    🆇 Legal Risk
    🆇 Implicit Permissioning
    🆇 Performance Degradation
    🆇 Memory Leaks
    🆇 Transitive Dependency Conflicts
    🆇 Less control

  18. [Diagram: the dependency diagram from slide 15]

  19. [Diagram: the same layout with each of the ten dependencies around App Code shown as 💣]

  20. [Diagram: the dependency diagram from slide 15]

  21. → Library Governance
    → Reliability Defense
    → Crash Recovery

  22. Library Governance

  23. Library Governance
    “The process of managing and controlling the use of software libraries, including acquisition, deployment, use, and maintenance.”
    - Bard

  24. Seed startup: No policy. Use what’s the fastest.
    Small scale-up: Tech lead or sr eng best judgement. Bias towards speed.
    Medium Sized Co: Bespoke. “If you want to add a new library, come talk to Mobile Platform”
    Large Enterprise: Well defined set of criteria and a responsible team for approval.

  25. Setting up Library Governance
    ● Define business priorities
    ● Define library requirements
    ● Define governance body
    ● Define review process
    ● Define exception process
    ● Define upgrade process

  26. Business priorities
    ● Speed to market
    ● Developer velocity & staffing
    ● App quality and reliability
    ● Long term foundation & scale

  27. Transportation as reliable as running water

  28. Uber’s priorities and acceptable risk
    1. App quality and reliability
    2. Long term foundation/scale
    3. Speed to market
    4. Developer velocity & staffing

  29. Third Party Library Requirements
    A non-exhaustive list
    ● License
    ● Secure
    ● Private
    ● Stable
    ● Mature
    ● Maintained
    ● Small
    ● Industry Standard
    ● Testable
    ● High Quality
    ● Owned internally
    ● Category (Platform/Feature)

  30. Governance Body

  31. Review Process

  33. Upgrades
    ● Greenkeeping
    ● Similar risk as new libraries
    ● Intentional Updates
    ● Organizational Cost

  34. Examples

  35. Coil ✅
    ✅ Appropriate license (Apache 2.0)
    ✅ Compelling Business Use-case
    ✅ No additional permissions needed
    ✅ Low binary size impact < 50 KB
    ✅ Low method count < 200
    ✅ Transitive Deps all in use or reasonable
    ✅ Standard for Compose image loading
    ✅ Reasonable API that can be flagged
    ✅ No known vulnerabilities
    ✅ Highly used by peer companies
    ✅ Good tests
    ✅ Stable
    ✅ No outside servers or dynamic behavior
    ✅ Regularly maintained
    ✅ No unexpected network or battery effect
    ✅ Reasonable memory profile
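A checklist like the Coil one can be made partly mechanical. The following is a minimal sketch, not Uber's actual review tooling: `LibraryReport`, `failedCriteria`, and the license allowlist are hypothetical, and the size and method-count limits simply reuse the numbers from the Coil slide.

```kotlin
// Hypothetical summary of facts gathered about a candidate library.
data class LibraryReport(
    val name: String,
    val license: String,
    val binarySizeKb: Int,
    val methodCount: Int,
    val extraPermissions: List<String>,
    val knownVulnerabilities: Int
)

// Returns the names of the criteria the library fails; an empty list means it passes.
// The limits (50 KB, 200 methods) echo the Coil slide; the license allowlist is an assumption.
fun failedCriteria(
    report: LibraryReport,
    allowedLicenses: Set<String> = setOf("Apache-2.0", "MIT"),
    maxBinarySizeKb: Int = 50,
    maxMethodCount: Int = 200
): List<String> = listOfNotNull(
    "license".takeIf { report.license !in allowedLicenses },
    "binary size".takeIf { report.binarySizeKb >= maxBinarySizeKb },
    "method count".takeIf { report.methodCount >= maxMethodCount },
    "permissions".takeIf { report.extraPermissions.isNotEmpty() },
    "vulnerabilities".takeIf { report.knownVulnerabilities > 0 }
)
```

A non-empty result does not have to mean rejection. Each failed item can instead become a mitigation to work through with the library owner, which is how the Twilio and Google Ads examples in this deck end up approved.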

  36. Facebook Auth SDK
    ✅ Compelling Business Use-case
    ✅ Security Checks Pass
    ✅ Well Tested
    🆇 Proprietary License
    🆇 Outside infrastructure and APIs
    🆇 Complex Client Side Code
    🆇 Web alternative is feasible

  38. Twilio Video SDK
    ✅ Compelling Business Use-case
    🆇 Closed source
    🆇 High Binary Size > 5 MB
    🆇 Alternative costly

  39. Twilio Video SDK ✅
    ◻ Closed source
    ◻ High Binary Size > 5 MB
    ✅ → Met with Twilio & Organized clean-room analysis
    ✅ → Dynamic Feature Module + Feature Flag

  40. Google Ads SDK
    ✅ Very compelling Business Use-case
    🆇 Closed source
    🆇 Uncatchable startup code
    🆇 High Binary Size > 1 MB
    🆇 Dynamic updates & Internal XP
    🆇 No feasible alternative

  41. Google Ads SDK ✅
    ◻ Closed source
    ◻ Uncatchable startup code
    ◻ High Binary Size > 1 MB
    ◻ Dynamic updates & Internal XP
    ✅ → Met with Google Ads team to understand internals
    ✅ → Disable startup code with manifest flag
    ✅ → Dynamic Feature Module + Feature Flag
    ✅ → Runtime fallback via API check

  42. Defense

  43. Life of a commit
    Week 1: Active Development
    Week 2: 1. Build Train Release, 2. Release Testing, 3. Employee rollout 0 → 100%, 4. Beta rollout 0 → 100%
    Week 3: Prod rollout 0 → 100%, 40% adoption
    Week 4: 65% adoption
    Week 5: 80% adoption
    Week 6: 90% adoption

  44. Preventing Bugs

  45. Defense Gates
    Stages: Design (PRD/ERD/Figma), Develop (Build), Review, Deploy (app update), Production Rollout
    Gates: Library Governance, Repackaging, Integration Testing, Soak Testing, E2E Testing, Library Abstractions, Monitoring, Feature Flags, Delayed Initialization, Dependency Scanning, Dynamic Features, Employee Testing, Linters

  46. Feature Flags
    class MainActivity {
        fun useSdk() {
            Sdk.doSomething()
        }
    }

  47. Feature Flags
    class MainActivity {
        fun useSdk() {
            val useSdk = FeatureFlags.get("UseSdk")
            if (useSdk) {
                Sdk.doSomething()
            } else {
                // Fallback Experience
            }
        }
    }
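The flag check on the slide can be pulled into a small reusable helper. This is a sketch under assumptions rather than the API used in the talk: `FlagSource` and `guarded` are hypothetical names, and catching exceptions from the SDK path goes one step beyond what the slide shows.

```kotlin
// Hypothetical flag source; a real app would back this with its experimentation service.
fun interface FlagSource {
    fun isEnabled(name: String): Boolean
}

// Runs the SDK-backed path only when the flag is on. As an extra assumption
// beyond the slide, an exception thrown by the SDK path is caught and routed
// to the fallback, so a library bug degrades the feature instead of crashing.
fun <T> guarded(
    flags: FlagSource,
    flag: String,
    sdkPath: () -> T,
    fallback: () -> T
): T {
    if (!flags.isEnabled(flag)) return fallback()
    return try {
        sdkPath()
    } catch (e: Exception) {
        fallback()
    }
}
```

A call site would look like `guarded(flags, "UseSdk", { Sdk.doSomething() }, { showFallback() })`, where `showFallback` stands in for whatever the fallback experience is.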

  48. Delayed Initialization
    class MyApp : Application() {
        override fun onCreate() {
            super.onCreate()
            Sdk.init()
            // Continue App setup...
        }
    }

  49. Delayed Initialization
    class MyApp : Application() {
        override fun onCreate() {
            super.onCreate()
            // Blocking flag lookup on the main thread
            val useSdk = FeatureFlags.get("UseSDK")
            if (useSdk) {
                Sdk.init()
            }
            // Continue App setup...
        }
    }

  50. Delayed Initialization
    E/UncaughtException: android.os.NetworkOnMainThreadException
        at android.os.StrictMode$AndroidBlockGuardPolicy.onNetwork(StrictMode.java:1303)
        at com.android.org.conscrypt.Platform.blockGuardOnNetwork(Platform.java:300)
        at com.myapp.FeatureFlags.get(FeatureFlags.kt:35)
        at com.myapp.MyApp.onCreate(MyApp.kt:10)

  51. Delayed Initialization
    class MyApp : Application() {
        override fun onCreate() {
            super.onCreate()
            // Fetch the flag off the main thread, then initialize in the callback
            FeatureFlags.get("UseSDK", Dispatchers.IO) { useSdk ->
                if (useSdk) {
                    Sdk.init()
                }
            }
            // Continue App setup...
        }
    }

  52. Delayed Initialization
    class SdkFeatureActivity : Activity() {
        override fun onCreate(savedInstanceState: Bundle?) {
            super.onCreate(savedInstanceState)
            // Initialize only when the feature that needs the SDK is opened
            FeatureFlags.get("UseSDK", Dispatchers.IO) { useSdk ->
                if (useSdk) {
                    Sdk.init()
                }
            }
            // Continue Activity setup...
        }
    }
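The progression on these slides (read the flag off the main thread, then initialize only if it is on) can be sketched without any Android types. Everything here is illustrative: `DeferredSdkInit` is a hypothetical wrapper, and the at-most-once guard is an added assumption for the case where several screens request the same SDK.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.Executor
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical wrapper: defers a third-party SDK's init until a flag has been
// read on a background executor, and makes sure the work starts at most once.
class DeferredSdkInit(
    private val background: Executor,
    private val readFlag: () -> Boolean, // may block on network or disk
    private val initSdk: () -> Unit
) {
    private val started = AtomicBoolean(false)

    // Completes with true if this call read the flag as enabled and ran initSdk;
    // completes with false if the flag was off or start() was already called.
    fun start(): CompletableFuture<Boolean> {
        if (!started.compareAndSet(false, true)) {
            return CompletableFuture.completedFuture(false)
        }
        return CompletableFuture.supplyAsync({
            val enabled = readFlag()
            if (enabled) initSdk()
            enabled
        }, background)
    }
}
```

In an app, `background` would be an I/O executor or dispatcher, and `start()` would be called from the first screen that needs the SDK rather than from `Application.onCreate()`.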

  53. Delayed Initialization
    dependencies {
        implementation 'com.google.android.gms:play-services-ads:X.Y.Z'
    }

  54. Delayed Initialization
    <provider
        android:name="com.google.android.gms.ads.MobileAdsInitProvider"
        android:authorities="${applicationId}.mobileadsinitprovider"
        android:exported="false"
        tools:node="merge" />

  56. Delayed Initialization
    Application Started → ContentProvider onCreate() → Application onCreate()
    Google Ads Initialization

  57. Delayed Initialization
    Application Started → ContentProvider onCreate() → Application onCreate()
    💥

  58. Delayed Initialization
    <provider
        android:name="com.google.android.gms.ads.MobileAdsInitProvider"
        android:authorities="${applicationId}.mobileadsinitprovider"
        tools:node="remove" />

  59. Bundled Code
    ● Broadcast Receivers
    ● Intent Filters
    ● Content Providers
    ● Native Callbacks
    ● AIDLs

  60. Play Services
    ● Opaque
    ● System level permissions
    ● Dynamic behavior outside app’s release cadence
    ● XP and feature flags in your app

  61. Play Services
    MyApp.apk: Ads, MLKit, Pay, Recaptcha
    Play Services:
    - Business Logic
    - IPC
    - Updates
    - Feature Flag
    - Experimentation
    - Dynamic Loading

  62. Play Services + Dynamic Features
    MyApp.aab: Onboarding, ML Feature, ML Feature, Ads Feature
    Ads, MLKit, Pay, Recaptcha
    Play Services:
    - Business Logic
    - IPC
    - Updates
    - Feature Flag
    - Experimentation
    - Dynamic Loading

  63. Dynamic Features
    val installSdk = FeatureFlags.get("InstallSDK")
    val initSdk = FeatureFlags.get("InitSDK")
    if (installSdk) {
        SplitInstallManagerFactory.create(context)
            .startInstall(request)
            .addOnSuccessListener {
                if (initSdk) {
                    Sdk.init()
                }
            }
            .addOnFailureListener { exception -> ... }
    }

  64. Transitive Dependency Conflicts
    [Diagram: App depends on Foo and Bar; Foo depends on RxJava 2.1, Bar depends on RxJava 2.2]

  67. Transitive Dependency Conflicts

  68. Transitive Dependency Conflicts
    [Diagram: App depends on "Foo + RxJava 2.1" as a single unit and on Bar; Bar depends on RxJava 2.2]
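The Foo/Bar situation can be detected mechanically before it reaches a build. This is only a sketch of the idea over a hand-written dependency list; `Library` and `findVersionConflicts` are hypothetical names, and a real check would read the resolved dependency graph from the build system.

```kotlin
// A library and the versions of other artifacts it was built against.
data class Library(val name: String, val requires: Map<String, String>)

// Returns every artifact that is requested at more than one version,
// mapped to version -> names of the libraries asking for that version.
fun findVersionConflicts(libraries: List<Library>): Map<String, Map<String, List<String>>> {
    val requested = mutableMapOf<String, MutableMap<String, MutableList<String>>>()
    for (lib in libraries) {
        for ((artifact, version) in lib.requires) {
            requested.getOrPut(artifact) { mutableMapOf() }
                .getOrPut(version) { mutableListOf() }
                .add(lib.name)
        }
    }
    // Keep only artifacts with two or more distinct requested versions.
    return requested.filterValues { it.size > 1 }
}
```

Fed the example from the slides (Foo on RxJava 2.1, Bar on RxJava 2.2), the result has a single entry for RxJava listing which library wants which version.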

  69. Jar Shading
    dependencies {
        compile jarjar.repackage {
            from 'io.reactivex.rxjava2:rxjava:2.1.0'
            classRename "io.reactivex.rxjava2.**" "com.uber.internal.rxjava2.@1"
        }
    }

  70. Jar Shading
    🆇 Increased App Size
    🆇 Nested Dep Complexity
    🆇 Maintenance
    ✅ Dependency Stability
    ✅ Support multiple versions
    🚨 Use as a last resort; prioritize updating all code to a single version first!

  71. github.com/uber-research/java-dependency-validator

  72. Library Abstractions
    ● Local Abstractions
    ○ Useful for local utilities with unstable APIs
    ○ Can enable better testability and feature flagging
    ○ Replace heavy SDKs with small client REST APIs
    ● Server Abstractions
    ○ Use server side integration instead of client side
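A local abstraction in the sense of this slide might look like the sketch below. All names are hypothetical (`ImageLoader`, `VendorImageLoader`, `FlaggedImageLoader`, `FakeImageLoader`), and the vendor SDK is represented by a plain function so the example stays self-contained.

```kotlin
// App-owned interface: feature code depends on this, never on the vendor SDK directly.
interface ImageLoader {
    fun load(url: String): ByteArray?
}

// Adapter around the vendor call. Exceptions from the vendor are contained here
// and reported as "no image" (null) to the rest of the app.
class VendorImageLoader(private val vendorFetch: (String) -> ByteArray) : ImageLoader {
    override fun load(url: String): ByteArray? =
        try {
            vendorFetch(url)
        } catch (e: Exception) {
            null
        }
}

// Chooses between the vendor-backed loader and a fallback behind a flag,
// and also falls back when the vendor path yields nothing.
class FlaggedImageLoader(
    private val useVendor: () -> Boolean,
    private val vendor: ImageLoader,
    private val fallback: ImageLoader
) : ImageLoader {
    override fun load(url: String): ByteArray? =
        if (useVendor()) vendor.load(url) ?: fallback.load(url) else fallback.load(url)
}

// In-memory fake for tests, with no SDK on the classpath.
class FakeImageLoader(private val images: Map<String, ByteArray>) : ImageLoader {
    override fun load(url: String): ByteArray? = images[url]
}
```

Because callers only see `ImageLoader`, swapping the vendor, flagging it off, or testing with the fake needs no changes in feature code.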

  73. Linters
    ● Ban known dangerous APIs
    ● Shift runtime exceptions left into build time exceptions

  74. Linters
    val image = service.getCoolPromoImage()
    Picasso.load(image).into(view)
    E/UncaughtException: java.lang.IllegalArgumentException Path must not be empty.

  75. Linters
    class Picasso {
        fun load(path: String?): RequestCreator {
            ...
            require(path.isNotBlank()) { "Path must not be empty." }
            return load(Uri.parse(path))
        }
    }

  76. Linters
    fun Picasso.loadSafely(url: String?): RequestCreator {
        // Picasso rejects blank paths, so check isBlank() rather than isEmpty()
        if (url != null && url.isBlank()) {
            Lumber.monitor("picasso").e("empty strings are not allowed by picasso")
            return this.load(null as String?)
        }
        return this.load(url)
    }

  77. Linters
    /**
     * Methods that should not be used at all.
     */
    @JvmStatic
    val methods =
        mapOf(
            "com.squareup.picasso.Picasso.load(kotlin.String?)" to
                "Empty strings can trigger crashes, use the loadSafely extension.",
        )
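The banned-method map is the data half of a lint rule; the other half is matching it against call sites. The sketch below shows only that matching step. `CallSite`, `Violation`, and `findViolations` are hypothetical, and a real rule would get resolved call signatures from a lint framework's syntax tree rather than a hand-built list.

```kotlin
// Banned method signature -> guidance shown to the developer (entry taken from the slide).
val bannedMethods: Map<String, String> = mapOf(
    "com.squareup.picasso.Picasso.load(kotlin.String?)" to
        "Empty strings can trigger crashes, use the loadSafely extension."
)

// Hypothetical stand-ins for what a lint framework would provide.
data class CallSite(val file: String, val line: Int, val signature: String)
data class Violation(val site: CallSite, val message: String)

// Reports every call whose resolved signature appears in the banned map.
fun findViolations(
    calls: List<CallSite>,
    banned: Map<String, String> = bannedMethods
): List<Violation> =
    calls.mapNotNull { call ->
        banned[call.signature]?.let { message -> Violation(call, message) }
    }
```

Failing the build on any returned `Violation` is what moves the crash from runtime to build time.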

  78. Crash Recovery

  79. Incident Golden Path
    ● On-call alert
    ● Triage bug
    ● Rollback feature flag
    ● Monitor
    ● Post-mortem

  80. What if that doesn’t work…
    ● Automated Crash Recovery
    ● Push Based Recovery
    ● Multiprocess Agent
    ● Hotfixes and Force Upgrade

  81. Automated Crash Recovery
    App Start → Boot file present?
    No → Create Boot File → Startup Steps (Step 1, Step 2, ... Step N) → Remove Boot File
    Yes → Blackswan → Recovery 1, Recovery 2, ... Recovery N

  82. Automated Crash Recovery
    Blackswan
    1: Retry
    2: Clear Cache
    3: Clear XPs
    4: Clear Data
    5: Webview Fallback
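The boot-file flow can be sketched with nothing but a file on disk. This is an illustration, not Blackswan itself: `BootGuard` is a hypothetical name, and storing a failure count in the boot file to pick an escalating recovery level is an assumption layered on top of the two slides.

```kotlin
import java.io.File

// Recovery levels in the order listed on the slide.
enum class Recovery { RETRY, CLEAR_CACHE, CLEAR_XPS, CLEAR_DATA, WEBVIEW_FALLBACK }

class BootGuard(private val bootFile: File) {

    // Call first thing on app start. Returns null when the previous launch
    // finished startup, otherwise the recovery level to apply. The boot file
    // holds the number of consecutive launches that never finished startup
    // (an assumption of this sketch).
    fun onAppStart(): Recovery? {
        val failures =
            if (bootFile.exists()) (bootFile.readText().trim().toIntOrNull() ?: 0) + 1 else 0
        bootFile.writeText(failures.toString()) // marker stays until startup completes
        if (failures == 0) return null
        val levels = Recovery.values()
        return levels[minOf(failures, levels.size) - 1]
    }

    // Call after the last startup step succeeds.
    fun onStartupComplete() {
        bootFile.delete()
    }
}
```

If startup crashes, `onStartupComplete()` never runs, so the next launch finds the file and escalates one level; a successful launch deletes it and resets the sequence.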

  83. Server Based Rules
    ● Pushed Feature Flags
    ● Blackswan Custom Recovery Actions
    ● DNS + Firebase Remote Config

  84. Multiprocess Agent
    *Future opportunity
    Uber App
    App Process: Startup Steps, App Runtime
    Recovery Process: Blackswan, Feature Flags, DNS + Remote Config
    IPC

  85. Hotfixes and Force Upgrades
    ● Realtime mitigations are much faster
    ● Hotfix introduces additional risk
    ● Force upgrades cause user attrition

  86. → Library Governance
    → Reliability Defense
    → Crash Recovery

  87. Balancing Speed and Reliability
    The Double-Edged Sword of Third-Party Libraries
    Ty Smith
    tysmith.me
    Uber