Chasing Performance Issues Methodically

Slide 1

Slide 1 text

Chasing Performance Issues Methodically
Amanjeet Singh (@droid_singh, @amanjeetsingh150)

Slide 2

Slide 2 text

What is Performance?

Slide 3

Slide 3 text

What is Performance? Story 📘

Slide 4

Slide 4 text


Slide 5

Slide 5 text


Slide 6

Slide 6 text


Slide 7

Slide 7 text

What is Performance? Story 📘
Pattern 1: Scalable ≠ High-Performance Codebase

Slide 8

Slide 8 text

What is Performance? Story 📘
Pattern 1: Scalable ≠ High-Performance Codebase
• Not upgrading libraries

Slide 9

Slide 9 text

What is Performance? Story 📘
Pattern 1: Scalable ≠ High-Performance Codebase
• Not upgrading libraries
• Partial quality gates

Slide 10

Slide 10 text

What is Performance? Story 📘
Pattern 1: Scalable ≠ High-Performance Codebase
• Not upgrading libraries
• Partial quality gates
• Not scoping Dagger dependencies properly: I ❤ Singleton

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

What is Performance? Story 📘
Pattern 1: Scalable ≠ High-Performance Codebase
• Not upgrading libraries
• Partial quality gates
• Not scoping Dagger dependencies properly: I ❤ Singleton
• Selecting configs based only on business metrics
Pattern 2: We find out about performance issues from customer tickets or Android Vitals
Pattern 3: Prioritising performance issues alongside feature work is hard

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Platform Team of Company X

Slide 21

Slide 21 text

Mission I: Reducing Cold Start
• Drafting OKRs for reducing cold start

Slide 22

Slide 22 text

Mission I: Reducing Cold Start
• Drafting OKRs for reducing cold start
• Draft I
  • Objective: Uplift the app quality of the X consumer app

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Mission I: Reducing Cold Start
• Drafting OKRs for reducing cold start
• Draft I
  • Objective: Uplift the app quality of the X consumer app
  • Key Result I: Reduce cold start by 60%
  • Key Result II: Reduce wake locks by 30%
  • Key Result III: Reduce frame drops by 50%
• No observability
• No quality gate

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

First few Attempts to fix cold start ⏰

Slide 30

Slide 30 text

First few Attempts to fix cold start ⏰
1. Looking online for blogs like "Reducing cold starts at an X company by 80%"

Slide 31

Slide 31 text

First few Attempts to fix cold start ⏰
1. Looking online for blogs like "Reducing cold starts at an X company by 80%"
2. Playing with tools like profilers locally to identify bottlenecks on app start and fix them

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

First few Attempts to fix cold start ⏰
Anti-Methodologies for Performance Analysis

Slide 37

Slide 37 text

First few Attempts to fix cold start ⏰
Anti-Methodologies for Performance Analysis
• Blame-Someone-Else Anti-Method

Slide 38

Slide 38 text

First few Attempts to fix cold start ⏰
Anti-Methodologies for Performance Analysis
• Blame-Someone-Else Anti-Method
• Streetlight Anti-Method

Slide 39

Slide 39 text

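First few Attempts to fix cold start ⏰
Anti-Methodologies for Performance Analysis
• Blame-Someone-Else Anti-Method: assume the problem lies in a component owned by another team or a third party
• Streetlight Anti-Method: analyse with whatever tools are familiar or at hand, rather than the ones that fit the problem
• Random Change Anti-Method: change things at random until the problem appears to go away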

Slide 40

Slide 40 text

What are we doing wrong?

Slide 41

Slide 41 text

What are we doing wrong?
• Are we chasing the right metric for cold start?

Slide 42

Slide 42 text

What are we doing wrong?
• Are we chasing the right metric for cold start?
• Is there a cold start case we are not considering?

Slide 43

Slide 43 text

What are we doing wrong?
• Are we chasing the right metric for cold start?
• Is there a cold start case we are not considering?
• Maybe there were improvements, but they were neutralised by changes made by other developers 🤔

Slide 44

Slide 44 text

Sit back and Strategize! 🤔

Slide 45

Slide 45 text

Sit back and Strategize! 🤔 Step 1: Identify proper metrics and create observability

Slide 46

Slide 46 text

Sit back and Strategize! 🤔 Step 1: Identify proper metrics and create observability • Cold Start App Launch

Slide 47

Slide 47 text

Sit back and Strategize! 🤔 Step 1: Identify proper metrics and create observability • Cold Start App Launch • Different spans for app launch

Slide 48

Slide 48 text

Sit back and Strategize! 🤔
Step 1: Identify proper metrics and create observability
• Cold Start App Launch
• Different spans for app launch
  • Google
[Diagram labels: Content Provider, First Screen Drawn]

Slide 49

Slide 49 text

Sit back and Strategize! 🤔
Step 1: Identify proper metrics and create observability
• Cold Start App Launch
• Different spans for app launch
  • Google
  • User experienced
[Diagram labels: Content Provider, First Screen Drawn, onCreate end, onStart]

Slide 50

Slide 50 text

Sit back and Strategize! 🤔
Step 1: Identify proper metrics and create observability
• Cold Start App Launch
• Different spans for app launch
  • Google
  • User experienced
• Send app launch events with the following attributes:

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Slide 53

Slide 53 text

Slide 54

Slide 54 text

Slide 55

Slide 55 text

Sit back and Strategize! 🤔
// Start
AppLaunchTracker.startTracking()

Slide 56

Slide 56 text

Sit back and Strategize! 🤔
// Start
AppLaunchTracker.startTracking()
// End
AppLaunchTracker.stopTracking(firstScreenName)

Slide 57

Slide 57 text

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Sit back and Strategize! 🤔
import android.os.Bundle
import com.google.firebase.analytics.FirebaseAnalytics

// Start
AppLaunchTracker.startTracking()
// End
AppLaunchTracker.stopTracking(firstScreenName)

// Track event (Kotlin throughout)
val firebaseAnalytics = FirebaseAnalytics.getInstance(this)
firebaseAnalytics.setUserId(id)
// Custom event and parameter names: FirebaseAnalytics has no predefined
// constants for these, so plain strings are used
val bundle = Bundle().apply {
    putString("first_screen_drawn", firstScreenName)
    putLong("total_duration", totalTime) // totalTime: launch duration as a Long
}
firebaseAnalytics.logEvent("app_launch", bundle)
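The start/stop calls and the app_launch event above can be modelled without Android or Firebase dependencies. This sketch is illustrative only: `LaunchEvent`, `LaunchSpanRecorder` and the attribute keys are hypothetical names, and the injected clock stands in for a monotonic Android clock such as `SystemClock.uptimeMillis()`. It records one span per launch and ignores any later stop, so navigating to a second screen does not emit a second launch event.

```kotlin
// Hypothetical launch event carrying the attributes sent with app_launch.
data class LaunchEvent(
    val userId: String,
    val firstScreenDrawn: String,
    val totalDurationMs: Long,
) {
    // Flat key/value view, e.g. for copying into an analytics Bundle.
    fun attributes(): Map<String, Any> = mapOf(
        "user_id" to userId,
        "first_screen_drawn" to firstScreenDrawn,
        "total_duration_ms" to totalDurationMs,
    )
}

// Hypothetical recorder for a single app-launch span.
class LaunchSpanRecorder(
    private val userId: String,
    // Monotonic clock in milliseconds (assumption: injected for testability).
    private val nowMs: () -> Long,
) {
    private var startMs: Long? = null
    private var finished = false

    // Only the first start() of a launch is recorded.
    fun start() {
        if (startMs == null) startMs = nowMs()
    }

    // Returns an event for the first stop after a start; otherwise null.
    fun stop(firstScreenName: String): LaunchEvent? {
        val start = startMs ?: return null
        if (finished) return null
        finished = true
        return LaunchEvent(userId, firstScreenName, nowMs() - start)
    }
}
```

The attributes map could then be copied into the analytics `Bundle` when the event is logged.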

Slide 60

Slide 60 text

Slide 61

Slide 61 text

Slide 62

Slide 62 text

Tracker

Slide 63

Slide 63 text


Slide 64

Slide 64 text

Tracker Attributes for app launch ✅

Slide 65

Slide 65 text

Tracker Attributes for app launch ✅ Insights from data: ✅

Slide 66

Slide 66 text

Tracker
Attributes for app launch ✅
Insights from data: ✅
• Splash screen is not the only contributor to first_screen_drawn
[Chart: percentage distribution of app launch (0 to 100%) by first screen name: Survey Screen, Splash, Home Screen]

Slide 67

Slide 67 text

Tracker
Attributes for app launch ✅
Insights from data: ✅
• Splash screen is not the only contributor to first_screen_drawn
• App icon click is not the only trigger
[Chart: percentage distribution of app launch (0 to 100%) by first screen name: Survey Screen, Splash, Home Screen]

Slide 68

Slide 68 text

Tracker
Attributes for app launch ✅
Insights from data: ✅
• Splash screen is not the only contributor to first_screen_drawn
• App icon click is not the only trigger
• Campaigns
[Chart: percentage distribution of app launch (0 to 100%) by first screen name: Survey Screen, Splash, Home Screen]
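The percentage distribution in the chart can be derived from the `first_screen_drawn` attribute of the launch events. A minimal sketch, assuming the events have already been exported as a list of first-screen names (the function name is hypothetical):

```kotlin
// Share of launches per first_screen_drawn value, in percent,
// ordered from the most to the least common first screen.
fun firstScreenDistribution(firstScreens: List<String>): Map<String, Double> {
    if (firstScreens.isEmpty()) return emptyMap()
    return firstScreens
        .groupingBy { it }
        .eachCount()
        .mapValues { (_, count) -> 100.0 * count / firstScreens.size }
        .toList()
        .sortedByDescending { it.second }
        .toMap() // LinkedHashMap keeps the sorted order
}
```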

Slide 69

Slide 69 text

Sit back and Strategize! 🤔 Step 2:

Slide 70

Slide 70 text

Sit back and Strategize! 🤔 Step 2: Stop the bleeding by creating a baseline

Slide 71

Slide 71 text

Sit back and Strategize! 🤔 Step 2: Stop the bleeding by creating a baseline?

Slide 72

Slide 72 text

Sit back and Strategize! 🤔
Step 2: Stop the bleeding by creating a baseline
• Creating observability on the PRs

Slide 73

Slide 73 text

Sit back and Strategize! 🤔
Step 2: Stop the bleeding by creating a baseline
• Creating observability on the PRs
• Ideal tool for this:

Slide 74

Slide 74 text

Sit back and Strategize! 🤔
Step 2: Stop the bleeding by creating a baseline
• Creating observability on the PRs
• Ideal tool for this:
  • Surfacing tentative regressions

Slide 75

Slide 75 text

Sit back and Strategize! 🤔
Step 2: Stop the bleeding by creating a baseline
• Creating observability on the PRs
• Ideal tool for this:
  • Surfacing tentative regressions
  • Detecting outliers and noise
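For detecting outliers and noise, one common option (not necessarily what the team used) is Tukey's fences: drop samples that fall more than `k` interquartile ranges below the first quartile or above the third. A sketch with hypothetical function names and the conventional `k = 1.5`:

```kotlin
// Percentile of an already sorted list, using linear interpolation between ranks.
fun percentile(sorted: List<Double>, p: Double): Double {
    require(sorted.isNotEmpty()) { "need at least one sample" }
    val rank = p * (sorted.size - 1)
    val lo = rank.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    val frac = rank - lo
    return sorted[lo] + (sorted[hi] - sorted[lo]) * frac
}

// Keeps samples inside [Q1 - k*IQR, Q3 + k*IQR]; too few samples are returned as-is.
fun removeOutliers(samplesMs: List<Double>, k: Double = 1.5): List<Double> {
    if (samplesMs.size < 4) return samplesMs
    val sorted = samplesMs.sorted()
    val q1 = percentile(sorted, 0.25)
    val q3 = percentile(sorted, 0.75)
    val iqr = q3 - q1
    val lower = q1 - k * iqr
    val upper = q3 + k * iqr
    return samplesMs.filter { it in lower..upper }
}
```

A single slow run (for example one hit by a GC pause) is dropped, while the cluster of normal runs survives.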

Slide 76

Slide 76 text

Slide 77

Slide 77 text

Slide 78

Slide 78 text

Sit back and Strategize! 🤔
Step 2: Stop the bleeding by creating a baseline
• Creating observability on the PRs
• Ideal tool for this:
  • Observability on tentative regressions
  • Detecting outliers and noise
  • Surfacing regressions mapped to developers and teams
  • Infrastructure to run the tests
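Surfacing a regression on a PR and mapping it to an owning team could look roughly like the sketch below. It is an assumption-laden illustration: the 5% threshold, the path-prefix ownership map and all names are hypothetical. It compares medians because a median is less sensitive to a few noisy runs than a mean.

```kotlin
data class RegressionReport(
    val baselineMedianMs: Double,
    val candidateMedianMs: Double,
    val deltaPercent: Double,
    val isRegression: Boolean,
    val owners: Set<String>,
)

fun median(values: List<Double>): Double {
    require(values.isNotEmpty()) { "need at least one sample" }
    val s = values.sorted()
    val mid = s.size / 2
    return if (s.size % 2 == 1) s[mid] else (s[mid - 1] + s[mid]) / 2.0
}

// Compares launch-time samples of a PR build against the baseline branch and
// lists the owners of the changed files (ownership by path prefix is an assumption).
fun compareToBaseline(
    baselineMs: List<Double>,
    candidateMs: List<Double>,
    changedFiles: List<String>,
    ownersByPathPrefix: Map<String, String>,
    thresholdPercent: Double = 5.0,
): RegressionReport {
    val base = median(baselineMs)
    require(base > 0) { "baseline median must be positive" }
    val cand = median(candidateMs)
    val delta = 100.0 * (cand - base) / base
    val owners = changedFiles.flatMap { file ->
        ownersByPathPrefix.filterKeys { prefix -> file.startsWith(prefix) }.values
    }.toSet()
    return RegressionReport(base, cand, delta, delta > thresholdPercent, owners)
}
```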

Slide 79

Slide 79 text

Sit back and Strategize! 🤔 Common factors of noise and outliers • Network

Slide 80

Slide 80 text

Sit back and Strategize! 🤔 Common factors of noise and outliers
• Network: 🎹 orchestration of tests with a mitm (man-in-the-middle) proxy

Slide 81

Slide 81 text

Sit back and Strategize! 🤔 Common factors of noise and outliers
• Network
• Remote Configuration: /data/data/com.app.id/files/frc__firebase_activate.json

Slide 82

Slide 82 text

Sit back and Strategize! 🤔 Common factors of noise and outliers • Network • Remote Configuration • Debug Builds

Slide 83

Slide 83 text

Sit back and Strategize! 🤔 Why not debug builds?

Slide 84

Slide 84 text

Sit back and Strategize! 🤔
Why not debug builds?
• Account for ProGuard and DexGuard effects

Slide 85

Slide 85 text

Sit back and Strategize! 🤔
Why not debug builds?
• Account for ProGuard and DexGuard effects
• Exclude the overhead of debug-only artifacts

Slide 86

Slide 86 text

Sit back and Strategize! 🤔
Why not debug builds?
• Account for ProGuard and DexGuard effects
• Exclude the overhead of debug-only artifacts
• Use a build configuration as close as possible to release builds
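One way to get a build configuration as close as possible to release builds is a dedicated build type that inherits from `release` but is signed with the debug key so it can be installed on CI devices. This is a hedged Gradle Kotlin DSL fragment following the pattern Android's benchmarking documentation recommends; the `benchmark` name is an assumption, and if DexGuard is configured per build type it may need a matching entry for the new type.

```kotlin
// app/build.gradle.kts (fragment)
android {
    buildTypes {
        create("benchmark") {
            // Inherit the release setup, including minification and ProGuard/R8 rules.
            initWith(buildTypes.getByName("release"))
            // Debug signing so CI can install the build on test devices.
            signingConfig = signingConfigs.getByName("debug")
            // Library modules without a "benchmark" build type resolve to "release".
            matchingFallbacks += listOf("release")
            // Not debuggable, as in release builds.
            isDebuggable = false
        }
    }
}
```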

Slide 87

Slide 87 text

Sit back and Strategize! 🤔 Common factors of noise and outliers • Network • Remote Configuration • Debug Builds • App/Device based noise

Slide 88

Slide 88 text

Sit back and Strategize! 🤔 App/Device based noise

Slide 89

Slide 89 text

Sit back and Strategize! 🤔
App/Device based noise
• Random GC triggers

Slide 90

Slide 90 text

Sit back and Strategize! 🤔 App/Device based noise • Random GC triggers • System Dialogs from Crashes/ANRs

Slide 91

Slide 91 text

Sit back and Strategize! 🤔 App/Device based noise • Random GC triggers • System Dialogs from Crashes/ANRs • Device configuration like CPU frequency

Slide 92

Slide 92 text

Sit back and Strategize! 🤔 App/Device based noise • Random GC triggers • System Dialogs from Crashes/ANRs • Device configuration like CPU frequency • CPU/Runtime optimisations in general

Slide 93

Slide 93 text

Sit back and Strategize! 🤔 App/Device based noise • Random GC triggers • System Dialogs from Crashes/ANRs • Device configuration like CPU frequency • CPU/Runtime optimisations in general • Never ending.

Slide 94

Slide 94 text

Slide 95

Slide 95 text

Slide 96

Slide 96 text

Tracker
Attributes for app launch ✅
Insights from data: ✅
• Splash screen is not the only contributor to first_screen_drawn
• App icon click is not the only trigger
• Campaigns

Slide 97

Slide 97 text

Tracker
Attributes for app launch ✅
Insights from data: ✅
• Splash screen is not the only contributor to first_screen_drawn
• App icon click is not the only trigger
• Campaigns
Eyes on the changes coming in while you fix ✅

Slide 98

Slide 98 text

Slide 99

Slide 99 text

Slide 100

Slide 100 text


Slide 101

Slide 101 text

Sit back and Strategize! 🤔
Production analytics ✅
Quality gate on baseline branch ✅

Slide 102

Slide 102 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users

Slide 103

Slide 103 text


Slide 104

Slide 104 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users
• Definition of an impacted session for app launch

Slide 105

Slide 105 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users
• Definition of an impacted session for app launch
  • Google Android Vitals: >= 5 seconds

Slide 106

Slide 106 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users
• Definition of an impacted session for app launch
  • Google Android Vitals: >= 5 seconds
  • Discussion with product stakeholders

Slide 107

Slide 107 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users
• Definition of an impacted session for app launch
  • Google Android Vitals: >= 5 seconds
  • Discussion with product stakeholders
• Query the users having more than X percent of their sessions marked as impacted

Slide 108

Slide 108 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users
• Definition of an impacted session for app launch
  • Google Android Vitals: >= 5 seconds
  • Discussion with product stakeholders
• Query the users having more than X percent of their sessions marked as impacted
• Tool selection: Firebase Realtime Database, Firebase User Properties and the Debug API to the rescue
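The impacted-session definition and the "more than X percent" user query can be expressed directly in code. This sketch assumes launch events have been exported as `(userId, totalDurationMs)` records; `LaunchSession` and the function names are hypothetical, the 5 s default mirrors the Android Vitals threshold on the slide, and X stays a parameter because the talk does not give its value.

```kotlin
// Hypothetical record of one app launch, built from the app_launch events.
data class LaunchSession(val userId: String, val totalDurationMs: Long)

// A session is impacted when the launch took 5 s or more (default threshold).
fun isImpacted(session: LaunchSession, thresholdMs: Long = 5_000L): Boolean =
    session.totalDurationMs >= thresholdMs

// Users whose share of impacted sessions is strictly above minImpactedPercent (the "X").
fun impactedUsers(
    sessions: List<LaunchSession>,
    minImpactedPercent: Double,
    thresholdMs: Long = 5_000L,
): Set<String> =
    sessions.groupBy { it.userId }
        .filterValues { userSessions ->
            val impacted = userSessions.count { isImpacted(it, thresholdMs) }
            100.0 * impacted / userSessions.size > minImpactedPercent
        }
        .keys
```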

Slide 109

Slide 109 text

Sit back and Strategize! 🤔
Step 3: Extract impacted sessions and collect production traces from the impacted users
[Diagram: Segment impacted users with Firebase → Upload traces from impacted users]

Slide 110

Slide 110 text

Slide 111

Slide 111 text

Sit back and Strategize! 🤔 1. Create Firebase User Properties on Console

Slide 112

Slide 112 text

Sit back and Strategize! 🤔
1. Create Firebase User Properties on the Console: profile_app_startup

Slide 113

Slide 113 text

Sit back and Strategize! 🤔
2. Firebase Realtime Database
users
  …

Slide 114

Slide 114 text

Sit back and Strategize! 🤔
2. Firebase Realtime Database
users
  user-id-1
  user-id-2

Slide 115

Slide 115 text

Sit back and Strategize! 🤔
2. Firebase Realtime Database
users
  user-id-1
    user_property_a: true
  user-id-2
    user_property_a: true

Slide 116

Slide 116 text

Sit back and Strategize! 🤔
2. Firebase Realtime Database
users
  user-id-1
    user_property_a: true
  user-id-2
    user_property_a: true
    profile_app_startup: true
    user_property_b: true
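Flagging the selected users under `users/<user-id>/profile_app_startup` can be done as one multi-location update. The helper below (a hypothetical name) only builds the update map; on Android such a map could be passed to `DatabaseReference.updateChildren(...)` on the database root. It also rejects IDs containing characters Realtime Database does not allow in keys. How the app then turns the flag into the `profile_app_startup` user property, for example via `FirebaseAnalytics.setUserProperty`, is an assumption not spelled out in the talk.

```kotlin
// Builds a multi-path update such as "users/user-id-1/profile_app_startup" -> true.
fun flagUsersForProfiling(
    userIds: Collection<String>,
    property: String = "profile_app_startup",
): Map<String, Any> =
    userIds.associate { id ->
        // Realtime Database keys may not contain . # $ [ ] or /
        require(id.isNotEmpty() && id.none { it in ".#\$[]/" }) { "invalid key: $id" }
        "users/$id/$property" to true
    }
```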

Slide 117

Slide 117 text

Slide 118

Slide 118 text

Sit back and Strategize! 🤔 3. Enable via remote config

Slide 119

Slide 119 text

Sit back and Strategize! 🤔 3. Enable via remote config • Create parameter enable_trace_app_launch

Slide 120

Slide 120 text

Sit back and Strategize! 🤔 3. Enable via remote config • Create parameter enable_trace_app_launch • Create condition with user property profile_app_startup

Slide 121

Slide 121 text

Sit back and Strategize! 🤔 3. Enable via remote config • Create parameter enable_trace_app_launch • Create condition with user property profile_app_startup • Enable the parameter for the condition and default to false

Slide 122

Slide 122 text

Slide 123

Slide 123 text

Sit back and Strategize! 🤔
4. Debug API
• Fetch performance traces for the impacted users from production

Slide 124

Slide 124 text

Sit back and Strategize! 🤔
4. Debug API
• Fetch performance traces for the impacted users from production
class AppLaunchTracker(
    private val context: Context,
    private val remoteConfig: RemoteConfig
)

Slide 125

Slide 125 text

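Sit back and Strategize! 🤔
4. Debug API
• Fetch performance traces for the impacted users from production
class AppLaunchTracker(
    private val context: Context,
    private val remoteConfig: RemoteConfig
) {
    fun startTracking() {
        ...
        if (remoteConfig.isEnabled("enable_trace_app_launch")) {
            // startMethodTracingSampling takes a trace file path (String), not a File
            val tracePath = File(context.cacheDir, "app_launch.trace").absolutePath
            Debug.startMethodTracingSampling(tracePath, maxBufferSize, samplingIntervalUs)
        }
    }
}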

Slide 126

Slide 126 text

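Sit back and Strategize! 🤔
4. Debug API
• Fetch performance traces for the impacted users from production
class AppLaunchTracker(
    private val context: Context,
    private val remoteConfig: RemoteConfig
) {
    fun stopTracking(firstScreenName: String) {
        ...
        if (remoteConfig.isEnabled("enable_trace_app_launch")) {
            // Ends the sampling trace started in startTracking() and writes it to disk
            Debug.stopMethodTracing()
        }
    }
}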
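In the tracker above, `startTracking()` and `stopTracking()` each read `enable_trace_app_launch` separately, so a Remote Config activation between the two calls could leave a trace started but never stopped, or call stop without a start. A framework-free sketch of one way to avoid that: read the flag once and remember the decision. `MethodTracer`, `GatedLaunchTracer`, the trace file name, the 8 MB buffer and the 1 ms sampling interval are all hypothetical; on a device, `start`/`stop` would delegate to `Debug.startMethodTracingSampling` and `Debug.stopMethodTracing`.

```kotlin
// Abstraction over the platform tracer so the gating logic can run off-device.
interface MethodTracer {
    fun start(tracePath: String, bufferSize: Int, samplingIntervalUs: Int)
    fun stop()
}

class GatedLaunchTracer(
    private val tracer: MethodTracer,
    // e.g. backed by the enable_trace_app_launch flag, which defaults to false
    private val isTracingEnabled: () -> Boolean,
    private val traceDir: String,
) {
    private var tracing = false

    fun startTracking(launchId: String) {
        // The flag is read once here; stopTracking() relies on this decision.
        if (tracing || !isTracingEnabled()) return
        tracer.start("$traceDir/app_launch_${launchId}.trace", 8 * 1024 * 1024, 1_000)
        tracing = true
    }

    fun stopTracking() {
        // Only stop a trace this tracker actually started.
        if (!tracing) return
        tracer.stop()
        tracing = false
    }
}
```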

Slide 127

Slide 127 text

Sit back and Strategize! 🤔
4. Debug API
• Fetch performance traces for the impacted users from production
• Upload traces from a periodic WorkManager job

Slide 128

Slide 128 text

Sit back and Strategize! 🤔
4. Debug API
• Fetch performance traces for the impacted users from production
• Upload traces from a periodic WorkManager job
  • Firebase Storage
  • Multipart API upload
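Trace upload from a periodic background job can be kept retry-friendly: upload each pending trace file and delete it only after a successful upload, so failures are picked up on the next run. A minimal plain-Kotlin sketch; the function name is hypothetical, the `upload` lambda stands in for a Firebase Storage or multipart HTTP call, and the `.trace` extension is an assumption. In an app, this could be called from a `CoroutineWorker`'s `doWork()` scheduled with a `PeriodicWorkRequest`.

```kotlin
import java.io.File

// Uploads every *.trace file in traceDir and deletes the ones uploaded successfully.
// Returns how many files were uploaded and removed.
fun uploadPendingTraces(traceDir: File, upload: (File) -> Boolean): Int {
    val pending = traceDir.listFiles()
        ?.filter { it.isFile && it.extension == "trace" }
        ?: return 0
    var uploaded = 0
    for (file in pending.sortedBy { it.name }) {
        // A thrown exception counts as a failed upload; the file stays for the next run.
        val ok = runCatching { upload(file) }.getOrDefault(false)
        if (ok && file.delete()) uploaded++
    }
    return uploaded
}
```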

Slide 129

Slide 129 text

Slide 130

Slide 130 text

Slide 131

Slide 131 text

Sit back and Strategize! 🤔 Flamegraphs ✅

Slide 132

Slide 132 text

Sit back and Strategize! 🤔 What did we achieve with flame graphs?

Slide 133

Slide 133 text

Sit back and Strategize! 🤔
What did we achieve with flame graphs?
• Study issues that show up consistently across the impacted flame graphs

Slide 134

Slide 134 text

Slide 135

Slide 135 text

Sit back and Strategize! 🤔
What did we achieve with flame graphs?
• Study issues that show up consistently across the impacted flame graphs
• Create a priority list of the issues that need to be fixed
  • Performance gain each fix will provide
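Finding issues that are consistent across impacted flame graphs, and ranking them, can be approximated by aggregating per-method times across traces. A sketch under stated assumptions: each trace has already been reduced to a map from method name to time in milliseconds, a method counts in a trace when its time there is at least `minMs`, and the ranking (consistency first, then mean time) is one reasonable choice rather than the team's documented method. All names are hypothetical.

```kotlin
data class HotFrame(val frame: String, val tracesSeen: Int, val meanMs: Double)

// Keeps frames that are hot (>= minMs) in at least minShareOfTraces of the traces,
// ranked by how many traces they appear in, then by mean time in those traces.
fun rankConsistentHotFrames(
    traces: List<Map<String, Long>>,
    minMs: Long,
    minShareOfTraces: Double = 0.5,
): List<HotFrame> {
    if (traces.isEmpty()) return emptyList()
    val frames = traces.flatMap { it.keys }.toSet()
    return frames.mapNotNull { frame ->
        val hits = traces.mapNotNull { it[frame] }.filter { it >= minMs }
        if (hits.size.toDouble() / traces.size < minShareOfTraces) null
        else HotFrame(frame, hits.size, hits.average())
    }.sortedWith(compareByDescending<HotFrame> { it.tracesSeen }.thenByDescending { it.meanMs })
}
```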

Slide 136

Slide 136 text

Slide 137

Slide 137 text

Step 4: Let's ship the fixes

Slide 138

Slide 138 text

Step 4: Let's ship the fixes
Some of the issues the team found:
• DexGuard class encryption increasing initialisation time

Slide 139

Slide 139 text

Step 4: Let's ship the fixes
Some of the issues the team found:
• DexGuard class encryption increasing initialisation time
• Consistently expensive initialisation of some SDKs

Slide 140

Slide 140 text

Step 4: Let's ship the fixes
Some of the issues the team found:
• DexGuard class encryption increasing initialisation time
• Consistently expensive initialisation of some SDKs
• Unwanted dependencies getting injected through Dagger
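The SDK-initialisation and unwanted-injection findings can both involve paying construction cost on the launch path for objects the first screen does not need. One general remedy is deferring construction until first use; Dagger supports this by injecting `dagger.Lazy<T>` or `javax.inject.Provider<T>` instead of `T`. The plain-Kotlin sketch below shows the same idea with `lazy`; the class names are hypothetical and this is not the team's actual fix.

```kotlin
// Stand-in for an SDK client that is expensive to construct (hypothetical).
class ExpensiveSdkClient {
    fun track(event: String): String = "tracked:$event"
}

// The screen receives a provider instead of a ready-made client, so nothing is
// constructed while the dependency graph is built during app start.
class FeatureScreen(clientProvider: () -> ExpensiveSdkClient) {
    // Constructed on first access, then cached (lazy is synchronized by default).
    private val client by lazy(clientProvider)

    fun onUserAction(): String = client.track("action")
}
```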

Slide 141

Slide 141 text

Partayyy!! 🎉
Some of the issues the team found:
• DexGuard class encryption increasing initialisation time
• Consistently expensive initialisation of some SDKs
• Unwanted dependencies getting injected through Dagger

Slide 142

Slide 142 text

Conclusion
• Creating observability is the most important part; the fixes themselves can be one-line changes

Slide 143

Slide 143 text

Conclusion
• Creating observability is the most important part; the fixes themselves can be one-line changes
• Let's revisit our patterns:

Slide 144

Slide 144 text

Conclusion
• Creating observability is the most important part; the fixes themselves can be one-line changes
• Let's revisit our patterns:
  • Pattern 1: Scalable ≠ High-Performance Codebase ✅

Slide 145

Slide 145 text

Slide 146

Slide 146 text

Slide 147

Slide 147 text

Slide 148

Slide 148 text

Slide 149

Slide 149 text

Conclusion
• Creating observability is the most important part; the fixes themselves can be one-line changes
• Let's revisit our patterns:
  • Pattern 1: Scalable ≠ High-Performance Codebase ✅
  • Pattern 2: We find out about performance issues from customer tickets or Android Vitals ✅
  • Pattern 3: Prioritising performance issues alongside feature work is hard ✅
  • Pattern 4: Drafting OKRs for platform teams is difficult
    Proposal:
    • KR 1: Bringing observability to metrics x, y, z, a, b, c (first screen, app launch time, user id)