Slide 1

Slide 1 text

What Gets Measured Gets Fixed: Observability for Android at Scale
Jitin Sharma
Software Architect, Groww

Slide 2

Slide 2 text

Agenda
What is observability
Issues faced by mobile apps
What should be measured
How to measure
Interpreting measurements
Fixing issues

Slide 3

Slide 3 text

Understanding Observability in Software Systems
Observability is the ability to understand the internal state of a system by examining its outputs. In software, it means having enough signals to diagnose issues without deploying new code or adding instrumentation after the fact. Unlike monitoring, which tells you when something breaks, observability helps you understand why it broke and how to fix it.

Slide 4

Slide 4 text

Mobile Observability?
Crashes: The obvious metrics everyone tracks first
ANRs: Application Not Responding events that frustrate users

Slide 5

Slide 5 text

Crashes

Slide 6

Slide 6 text

Crashes
kotlin.UninitializedPropertyAccessException: lateinit property listener has not been initialized
java.util.ConcurrentModificationException
    at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
java.lang.RuntimeException: java.lang.Throwable: A WebView method was called on thread 'DefaultDispatcher-worker-1'. All WebView methods must be called on the same thread. (Expected Looper Looper (main, tid 1) {730c4d78} called on Looper (null))

Slide 7

Slide 7 text

Crashes
kotlin.UninitializedPropertyAccessException: lateinit property listener has not been initialized
Fix: var listener: Listener? = null
java.util.ConcurrentModificationException
    at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
Fix: CopyOnWriteArrayList
java.lang.RuntimeException: java.lang.Throwable: A WebView method was called on thread 'DefaultDispatcher-worker-1'. All WebView methods must be called on the same thread. (Expected Looper Looper (main, tid 1) {730c4d78} called on Looper (null))
Fix: Handler(Looper.getMainLooper()).post { }
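
The CopyOnWriteArrayList suggestion on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: `Listener` and `ListenerRegistry` are hypothetical names. It assumes the crash came from a listener list being modified while it was being iterated, for example a listener unregistering itself during dispatch.

```kotlin
import java.util.concurrent.CopyOnWriteArrayList

// Hypothetical listener type and registry, for illustration only.
fun interface Listener {
    fun onEvent(event: String)
}

class ListenerRegistry {
    // CopyOnWriteArrayList iterates over a snapshot of the list, so a listener
    // can remove itself during dispatch without a ConcurrentModificationException.
    private val listeners = CopyOnWriteArrayList<Listener>()

    fun add(listener: Listener) { listeners.add(listener) }

    fun remove(listener: Listener) { listeners.remove(listener) }

    fun dispatch(event: String) {
        for (listener in listeners) listener.onEvent(event)
    }
}
```

Every write copies the backing array, so this trade-off suits lists that are read far more often than they are changed, which is typical for listener registries.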

Slide 8

Slide 8 text

Crashes
try {
    // stuff
} catch (e: Exception) {
    // pray
}

Slide 9

Slide 9 text

ANR

Slide 10

Slide 10 text

What does your app look like?
99.5% crash-free rate | >4.0 rating
99.5% crash-free rate | >4.5 rating
99.8% crash-free rate | >4.8 rating
100% crash-free rate | 5.0 rating

Slide 11

Slide 11 text

99.99% crash-free ≠ 5-star app

Slide 12

Slide 12 text

Real User Monitoring (RUM)

Slide 13

Slide 13 text

The Hidden Problems Users Face
Slow Page Loads: Screens that take too long to render content, causing users to tap repeatedly or abandon the app
Silent Errors: Failed API calls that show empty states instead of useful content, without any error message
Janky Animations: Stuttering scrolls and frame drops that make the app feel unpolished
Incomplete Flows: Users getting stuck mid-journey through critical features like checkout or registration

Slide 14

Slide 14 text

Measuring What Matters
Page Load: Measure initial and full load times
Error Rates: Track page failures
Transactions: Monitor successful completions
App Smoothness: Observe frame drops and jank

Slide 15

Slide 15 text

Page Load Latency
1. Fragment/Activity load: Latency between onAttach, onCreate, and onViewCreated. Captures setup latencies.
2. ViewModel success state: Latency from the loading state to the success state, covering I/O calls, serialisation, etc.
3. UI render latency: From the success state to UI render. Captures latencies around adapters, ViewGroups, etc.
4. Pagination etc.: Any latency in subsequent steps.
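
One way to capture the phases above is to record a timestamp at each checkpoint and report the gaps between consecutive checkpoints. The sketch below is an assumption about how such a helper might look, not the speaker's implementation: `PageLoadTracker` and the checkpoint names are hypothetical, and the clock is injectable so the logic can be exercised off-device.

```kotlin
// Hypothetical helper: checkpoint names are chosen by the caller, e.g.
// "created", "viewCreated", "success", "rendered".
class PageLoadTracker(
    // Monotonic clock in milliseconds; injectable so it can be faked in tests.
    private val clock: () -> Long = { System.nanoTime() / 1_000_000 }
) {
    // LinkedHashMap keeps checkpoints in the order they were first marked.
    private val marks = LinkedHashMap<String, Long>()

    /** Records the time at which a named checkpoint was reached. */
    fun mark(checkpoint: String) {
        marks[checkpoint] = clock()
    }

    /** Milliseconds between consecutive checkpoints, keyed as "from->to". */
    fun phaseDurations(): Map<String, Long> =
        marks.entries
            .zipWithNext { a, b -> "${a.key}->${b.key}" to (b.value - a.value) }
            .toMap()
}
```

In an app, `mark` would be called from the lifecycle callbacks, from the ViewModel when it emits its success state, and after the first frame with content is drawn; each duration can then be reported as a span or a histogram sample.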

Slide 16

Slide 16 text

Page Error Rates
Failures to display full page
Failure to display partial sections
Errors due to network fluctuations

Slide 17

Slide 17 text

Transaction Completion
Monitoring critical user flows from start to finish
1. Search
2. Buy
3. Checkout
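
A completion funnel over these three steps could be computed roughly as follows. This is a sketch under stated assumptions: each session is reduced to the set of steps it reached, and a step only counts if every earlier step was also seen. `Step` and `funnel` are illustrative names, not from the talk.

```kotlin
// Steps of the flow, in order. Enum order defines funnel order.
enum class Step { SEARCH, BUY, CHECKOUT }

/**
 * For each step, the fraction of started sessions (those that reached SEARCH)
 * that also reached that step and every step before it.
 */
fun funnel(sessions: Collection<Set<Step>>): Map<Step, Double> {
    val started = sessions.count { Step.SEARCH in it }
    return Step.values().associateWith { step ->
        val reached = sessions.count { seen ->
            Step.values().filter { it <= step }.all { it in seen }
        }
        if (started == 0) 0.0 else reached.toDouble() / started
    }
}
```

A drop between two adjacent steps points at the screen where users get stuck, which is where latency and error metrics should be inspected first.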

Slide 18

Slide 18 text

Frame Drops & Jank
FPS
Slow frames
Frozen frames
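
As a rough illustration of the last two metrics, the sketch below classifies a list of frame durations. It assumes the thresholds Android vitals uses (a slow frame takes longer than 16 ms to render, a frozen frame longer than 700 ms); `FrameStats` and `summarizeFrames` are hypothetical names. On a device, the durations would come from a source such as FrameMetrics or the JankStats library.

```kotlin
data class FrameStats(val total: Int, val slow: Int, val frozen: Int) {
    /** Share of frames that were slow, as a percentage of all frames. */
    val slowPercent: Double
        get() = if (total == 0) 0.0 else slow * 100.0 / total
}

fun summarizeFrames(
    frameDurationsMs: List<Long>,
    slowThresholdMs: Long = 16,    // assumed threshold for a slow frame
    frozenThresholdMs: Long = 700  // assumed threshold for a frozen frame
): FrameStats {
    // Every frozen frame also counts as a slow frame.
    val slow = frameDurationsMs.count { it > slowThresholdMs }
    val frozen = frameDurationsMs.count { it > frozenThresholdMs }
    return FrameStats(frameDurationsMs.size, slow, frozen)
}
```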

Slide 19

Slide 19 text

The Reproducibility Challenge
Reproducing mobile issues is notoriously difficult. The root causes are often elusive, hidden in the complex interplay of user sessions and environmental conditions: device specifications, network quality, OS versions, and countless other variables that differ from your development setup.

Slide 20

Slide 20 text

User Journey
Left the app in the background
Went into a lift
Got a push notification after an hour
Opened an entirely new page

Slide 21

Slide 21 text

Network Observability: The Missing Piece
Network conditions dramatically impact mobile app performance, yet they're often overlooked.
Network Type: WiFi, 4G, 5G, 3G. Each behaves differently
Signal Strength: Weak signals cause retries and timeouts
Network Changes: Transitions between networks disrupt connections

Slide 22

Slide 22 text

What is observability
Issues faced by mobile apps
What should be measured
How to measure
Interpreting measurements
Fixing issues

Slide 23

Slide 23 text

Network Signals

Slide 24

Slide 24 text

Network Signals

Slide 25

Slide 25 text

Lifecycle signals

Slide 26

Slide 26 text

Lifecycle signals

Slide 27

Slide 27 text

User Session Timeline
To truly understand the user experience and debug complex issues, it's crucial to visualize a user's entire session as a timeline of events. This includes not just app interactions, but also underlying system and network conditions.
1. App Start & Initial Load: User launches the app, home screen displays. Essential data fetched from api.example.com/init.
2. Network Fluctuation & API Errors: User navigates to Product List. Network switches from Wi-Fi to cellular (low signal). Subsequent API call to api.example.com/products times out.
3. High Resource Consumption: User scrolls rapidly through images. CPU spikes to 90%, memory usage increases significantly, leading to UI jank.
4. Background Process & Crash: A background sync operation starts, consuming more CPU. User taps on an item, triggering a NullPointerException and app crash.
This detailed timeline approach helps pinpoint the exact sequence of events that led to a problem, revealing hidden correlations between app behavior, device state, and user actions.
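
A session timeline like this can be modelled as an append-only list of timestamped events from different sources, queried for whatever happened shortly before a failure. The sketch below is illustrative only: `TimelineEvent`, `SessionTimeline`, and the category strings are assumptions, not part of the talk.

```kotlin
// One entry on the timeline; category might be "lifecycle", "network", "api", "crash".
data class TimelineEvent(val timestampMs: Long, val category: String, val message: String)

class SessionTimeline {
    private val events = mutableListOf<TimelineEvent>()

    fun record(timestampMs: Long, category: String, message: String) {
        events += TimelineEvent(timestampMs, category, message)
    }

    /** Events within windowMs before (and including) momentMs, oldest first. */
    fun leadingUpTo(momentMs: Long, windowMs: Long): List<TimelineEvent> =
        events
            .filter { it.timestampMs in (momentMs - windowMs)..momentMs }
            .sortedBy { it.timestampMs }
}
```

Asking for the few seconds before a crash returns the network switch and the API timeout alongside the crash itself, which is the kind of correlation the slide describes.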

Slide 28

Slide 28 text

Data-Driven Insights
APM: Code
External Systems: Backend Infrastructure
App State: Lifecycle, Background work
Network: Type, strength & changes
Device: Manufacturer, OS

Slide 29

Slide 29 text

Observability Across System Boundaries
Mobile App: User-facing frontend
API Gateway: Request routing
Backend Services: Business logic
Database: Data persistence

Slide 30

Slide 30 text

Detecting issues faster
Why Mobile Observability Matters
The mobile app is a catch-all for all types of backend issues: it's the user-facing window into your entire system. Additionally, mobile apps are themselves becoming more dynamic with feature flags and server-driven UI.

Slide 31

Slide 31 text

What is observability
Issues faced by mobile apps
What should be measured
How to measure
Interpreting measurements
Fixing issues

Slide 32

Slide 32 text

Standardisation with OpenTelemetry
OpenTelemetry provides a vendor-neutral, standardised way to collect telemetry (traces, metrics, and logs) that works with multiple backend providers.
Benefits
    Works with multiple vendors
    Consistent instrumentation
    Rich ecosystem support
    Unified SDK across platforms
    Automatic instrumentation for common libraries

Slide 33

Slide 33 text

Concepts of Observability
Traces: Debugging data that's sampled but rich with metadata. Shows the journey of a request across services.
    Distributed tracing
    Request flow visualization
    Span-level details
Metrics: Unsampled, low-size data ideal for alerting. Aggregated numbers that trend over time.
    Counters and gauges
    Histograms
    Real-time dashboards
Logs: Unsampled text with large context about user sessions. Detailed event records.
    Structured logging
    Session context
    Error details

Slide 34

Slide 34 text

Distributed tracing OpenTelemetry App Backend Database

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

Distributed tracing

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Distributed tracing

Slide 39

Slide 39 text

Distributed tracing

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

Distributed tracing

Slide 42

Slide 42 text

Distributed tracing

Slide 43

Slide 43 text

Trace Code

Slide 44

Slide 44 text

Tracing across systems
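
For a trace to continue from the app into the backend, the trace identity has to travel with each request. The W3C Trace Context format carries it in a `traceparent` HTTP header of the form `version-traceid-parentid-flags`. The sketch below only illustrates that wire format; `TraceContext` and `newTraceContext` are hypothetical names, and in practice an OpenTelemetry propagator would normally generate and inject the header.

```kotlin
import kotlin.random.Random

// Lowercase hex encoding of `bytes` random bytes.
private fun randomHex(bytes: Int, random: Random): String =
    random.nextBytes(bytes).joinToString("") {
        (it.toInt() and 0xff).toString(16).padStart(2, '0')
    }

data class TraceContext(val traceId: String, val spanId: String, val sampled: Boolean) {
    /** Header value: version "00", 32-hex trace id, 16-hex span id, 2-hex flags. */
    fun toTraceparent(): String =
        "00-$traceId-$spanId-${if (sampled) "01" else "00"}"
}

// All-zero ids are invalid in the spec; a random draw makes them vanishingly
// unlikely, so this sketch does not check for them.
fun newTraceContext(random: Random = Random.Default, sampled: Boolean = true): TraceContext =
    TraceContext(
        traceId = randomHex(16, random),
        spanId = randomHex(8, random),
        sampled = sampled
    )
```

A backend that understands the same header creates its spans under the incoming trace id, which is what lets one trace cover the app, the backend, and the database.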

Slide 45

Slide 45 text

Instrumentations
Auto
    Screen navigations
    Network calls
    Frame drops
    App launch
    Crash/ANR
Manual
    Screen metrics: latency, errors
    Custom tracing

Slide 46

Slide 46 text

Telemetry Data Handling Trace Metric Time series DB Logs File in bucket trigger Manual/Notification In memory collection Disk collection Alerts

Slide 47

Slide 47 text

What is observability
Issues faced by mobile apps
What should be measured
How to measure
Interpreting measurements
Fixing issues

Slide 48

Slide 48 text

Observability vs Monitoring

Slide 49

Slide 49 text

Observability
    Debugging: Understanding the internal state of a system to pinpoint issues.
    Metrics, Traces, Logs
Monitoring
    Collecting and analyzing data about the system's performance and health.
    Metrics

Slide 50

Slide 50 text

Monitoring

Slide 51

Slide 51 text

Traces

Slide 52

Slide 52 text

Logs

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

What is observability
Issues faced by mobile apps
What should be measured
How to measure
Interpreting measurements
Fixing issues

Slide 55

Slide 55 text

How many parallel requests does OkHttp make per host?
Infinite
64
10
5
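
For reference on this question: OkHttp's Dispatcher exposes two limits for asynchronous calls, `maxRequests` (default 64 across all hosts) and `maxRequestsPerHost` (default 5). The configuration sketch below assumes the okhttp3 dependency is available and simply restates those defaults explicitly; whether to raise them is a judgment call that tracing data should inform.

```kotlin
import okhttp3.Dispatcher
import okhttp3.OkHttpClient

// Limits apply to calls made with enqueue(); further calls wait in the
// dispatcher's queue until a slot is free.
val dispatcher = Dispatcher().apply {
    maxRequests = 64        // total concurrent requests (library default)
    maxRequestsPerHost = 5  // concurrent requests per host (library default)
}

val client: OkHttpClient = OkHttpClient.Builder()
    .dispatcher(dispatcher)
    .build()
```

If a screen fires more than five requests at the same host, the extra ones queue behind the first five, and that waiting time shows up as latency that only network tracing makes visible.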

Slide 56

Slide 56 text

Network Tracing

Slide 57

Slide 57 text

Optimise Caching and Retries
Smart Headers
    Persistent cache with max-age
    Cache + invalidation with ETags
Smart Retries
    Exponential backoff
    Circuit breakers
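
Exponential backoff can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speaker's implementation: the base delay, the cap, and the function names are hypothetical, and the sleep function is injectable so the logic can be tested without waiting. Production code would usually add random jitter, retry only on transient failures, and on Android suspend with a coroutine delay rather than block a thread.

```kotlin
/** Delays before each retry: base, 2x base, 4x base, ... capped at maxDelayMs. */
fun backoffDelaysMs(maxRetries: Int, baseDelayMs: Long = 500, maxDelayMs: Long = 8_000): List<Long> =
    (0 until maxRetries).map { attempt -> minOf(baseDelayMs shl attempt, maxDelayMs) }

/**
 * Runs block once, then retries up to maxRetries times, sleeping for an
 * exponentially growing delay between attempts. Rethrows the last failure.
 */
fun <T> retryWithBackoff(
    maxRetries: Int,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: () -> T
): T {
    val delays = backoffDelaysMs(maxRetries)
    var lastError: Exception? = null
    for (attempt in 0..maxRetries) {
        try {
            return block()
        } catch (e: Exception) {
            lastError = e
            // Only wait if another attempt is coming.
            if (attempt < maxRetries) sleep(delays[attempt])
        }
    }
    throw lastError!!
}
```

A circuit breaker complements this by refusing calls outright for a cool-down period once failures pass a threshold, so a struggling backend is not hit with retries from every client at once.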

Slide 58

Slide 58 text

Trace and Optimise
Database
WebViews
Render layers: RecyclerView
Serialisation

Slide 59

Slide 59 text

Building an Observability Mindset
01 Central Measurement Framework: Create shared infrastructure for consistent metric tracking across all screens
02 Dashboard Culture: Establish the practice of building and monitoring dashboards for every feature launch
03 Review Rituals: Regular team reviews of metrics, anomalies, and trends to surface insights
04 Blameless Postmortems: Use incidents as learning opportunities to improve observability and resilience

Slide 60

Slide 60 text

The Strategic Value of Observability
Contribute During Downtimes: When backend incidents occur, mobile observability helps you contribute meaningfully to debugging and resolution, even if the root cause isn't in your code.
Increase Your Visibility: Comprehensive observability data improves your team's visibility across engineering leadership and demonstrates impact on business metrics.

Slide 61

Slide 61 text

Thank you
jitinsharma.com
x: _jitinsharma
linkedin: jitinsharma7