ANR overview at Uber + Leveraging ApplicationExitInfo API

App freezes (ANRs) can be a very frustrating experience for users

Agenda
• Overview of ANR
• Real-world examples
• Different methods of detecting ANR at Uber
  ◦ Leveraging ApplicationExitInfo API
• Managing your ANR
  ◦ Prevention during development and testing
  ◦ Monitoring for regression in production
• Available tools and open source solutions

What is an ANR?
• Stands for "Application Not Responding" (a.k.a. app freeze)
What causes an ANR?
• Occurs when the main thread (UI thread) is blocked for too long (at least 5 seconds)
• See the official Android documentation

Common reasons for the main UI thread being blocked
• Expensive (long-running) operations on the UI thread
  ◦ e.g. nested loops, expensive calculations, massive object creation, etc.
• Blocking I/O operations on the main UI thread
  ◦ e.g. RxJava blockingGet/blockingFirst, coroutine runBlocking, etc.
• Incorrect multi-threading handling
  ◦ Deadlocks involving the UI thread
  ◦ Incorrect synchronization locks holding the UI thread indefinitely
  ◦ Thread starvation in background operations that the UI thread waits on

Demo app
Source code here
• Examples of different types of ANR
• Shows the last reason for app termination via the ApplicationExitInfo API

Case #1 - Long running method
Problem
• Expensive method called on the UI thread
Real-world example
• Heatmap processing in the Uber Driver app
Recommendation
• Identify the bottleneck and move it to a background thread (Scheduler/Dispatcher)
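The recommendation for Case #1 can be illustrated on a plain JVM. This is a minimal sketch, not Uber's implementation: the two executors are stand-ins for a background Scheduler/Dispatcher and for the UI thread, and `BackgroundWorkSketch`, `expensiveSum`, and `computeOffUiThread` are hypothetical names introduced only for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

final class BackgroundWorkSketch {
    // Stand-in for an expensive computation (nested loops), such as heatmap processing.
    static long expensiveSum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += (long) i * j;
            }
        }
        return total;
    }

    // Runs the computation on `background`, then delivers the result on `ui`.
    // The caller (the "UI thread") returns immediately and is never blocked.
    static CompletableFuture<Void> computeOffUiThread(
            int n, Executor background, Executor ui, Consumer<Long> onResult) {
        return CompletableFuture
                .supplyAsync(() -> expensiveSum(n), background)
                .thenAcceptAsync(onResult, ui);
    }
}
```

Only the short result callback runs on the `ui` executor; the nested loops stay on the background executor.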

Case #2 - RxBlocking API
Problem
• Rx blocking APIs immediately block the calling thread until the source emits
Real-world scenario
• Getting hostName via blockingGet in our NetworkInterceptor
Recommendation
• Avoid blocking APIs (use the subscribe function instead), use lazy initialization, or use other approaches

Case #3 - BroadcastReceiver
Problem
• onReceive runs on the UI thread by default, so a long-running method called there is likely to cause an ANR
Real-world scenario
• Loading a data cache from file
Recommendation
• Move expensive logic to a background thread, or
• Use WorkManager (see next slide for caveats)

Case #4 - WorkManager - RxWorker
Problem
• createWork is called on the main thread
Real-world scenario
• Cleaning up a big audio file
Recommendation
• Avoid eager RxJava APIs like Single.just: the value is computed on the calling thread, so subscribeOn(Schedulers.io()) won’t move that work off the main thread
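The eager-versus-deferred distinction behind Case #4 does not depend on RxJava. The sketch below uses `CompletableFuture` from the JDK as a stand-in (an assumption made so the example runs without extra libraries): `completedFuture(...)` plays the role of `Single.just(...)`, and `supplyAsync(...)` plays the role of a deferred source such as `Single.fromCallable(...)` with `subscribeOn`. `EagerVsDeferredSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class EagerVsDeferredSketch {
    // Pretend this is the expensive work; it reports which thread executed it.
    static String workThreadName() {
        return Thread.currentThread().getName();
    }

    // Eager: the argument is evaluated by the caller before the future exists,
    // so the work always runs on the calling thread.
    static CompletableFuture<String> eager() {
        return CompletableFuture.completedFuture(workThreadName());
    }

    // Deferred: the work is wrapped in a function and runs on the given executor.
    static CompletableFuture<String> deferred(Executor background) {
        return CompletableFuture.supplyAsync(EagerVsDeferredSketch::workThreadName, background);
    }
}
```

Calling `eager()` from the main thread reports the main thread's name, however the result is consumed later; `deferred(io)` reports a thread owned by `io`.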

Case #5 - Deadlock
Problem
• Incorrect synchronization freezes the main UI thread while it waits for a resource held by another thread
Real-world example
• A network operation needs the token, the app needs the user name, and both the user name and the token use the disk cache
Recommendation
• Fix the deadlock: find where it happens by checking which thread id holds the lock in the ANR trace
• Be mindful of synchronized keyword usage

Case #5 - Deadlock - Identification
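To show what "held by thread id" information looks like, here is a minimal sketch for a plain JVM. It is an illustration only: `java.lang.management` is not part of the Android SDK, and on a device the same information comes from the ANR trace. The class name, thread names, and lock names are hypothetical and loosely mirror the token / user name example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class DeadlockReportSketch {
    // Starts two daemon threads that take the same two locks in opposite order.
    static void startDeadlock() {
        Object tokenLock = new Object();
        Object userNameLock = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        spawn("token-loader", tokenLock, userNameLock, bothHoldFirstLock);
        spawn("user-name-loader", userNameLock, tokenLock, bothHoldFirstLock);
    }

    private static void spawn(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: the other thread holds `second` and waits for `first`.
                }
            }
        }, name);
        t.setDaemon(true); // do not keep the JVM alive
        t.start();
    }

    // Polls for monitor deadlocks and describes which thread id holds each contended lock.
    static List<String> describeDeadlock(long timeoutMillis) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> report = new ArrayList<>();
        while (System.currentTimeMillis() < deadline) {
            long[] ids = bean.findMonitorDeadlockedThreads();
            if (ids != null) {
                for (ThreadInfo info : bean.getThreadInfo(ids)) {
                    report.add(info.getThreadName() + " is waiting for a lock held by thread id="
                            + info.getLockOwnerId() + " (" + info.getLockOwnerName() + ")");
                }
                return report;
            }
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return report;
            }
        }
        return report; // empty: no deadlock found within the timeout
    }
}
```

Each report line names the waiting thread and the id of the thread that owns the lock, which is the pair of facts needed to find the cycle.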

ANR at Uber

Play console ANR
Cons
• Reports are hard to classify/identify and sometimes lack stack traces
• Latest ANR rate is delayed (up to 2 days to get the latest information)
• Historical ANR rates only go back 30 days
• Reports don’t contain critical information (session id, user uuid) required for investigation
• Access management in a large organization is not easy
Pros
• Best source of truth for the ANR rate, because it directly reflects what users experience
• Allows relative comparison with peer apps
• Integration with other “vitals” like app launch, slow frames, wake locks, etc.

Uber custom ANR detector
• Inspired by ANR-WatchDog
• Periodically posts a task to the main thread looper and checks whether that task executes within 5 seconds
• Captures the stack trace of the main thread before and after the task to identify the offender
• Integrated into our in-house app health reporting sites
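The polling idea can be sketched on a plain JVM as follows. This is not Uber's detector or ANR-WatchDog itself: a single-thread `Executor` stands in for the main looper (on Android the ping would be posted through a `Handler` bound to `Looper.getMainLooper()`), and `AnrWatchdogSketch` and its method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

final class AnrWatchdogSketch {
    // Posts a no-op "ping" to the monitored executor and reports whether it ran
    // within the timeout. A false result means the monitored thread looks blocked.
    static boolean respondsWithin(Executor monitored, long timeoutMillis) {
        CountDownLatch ping = new CountDownLatch(1);
        monitored.execute(ping::countDown);
        try {
            return ping.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Captures the current stack of the monitored thread so a report can show
    // what it was doing when the ping timed out.
    static StackTraceElement[] captureStack(Thread monitoredThread) {
        return monitoredThread.getStackTrace();
    }
}
```

A watchdog thread would call `respondsWithin(mainExecutor, 5000)` in a loop and, on a `false` result, attach `captureStack(mainThread)` to the report. Because the check only observes one thread, it shares the limitation that deadlocks involving other threads are hard to diagnose from it.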

Uber custom ANR detector
Cons
• Hard to distinguish false positives/negatives
• Misses some ANRs that show up on Play console (e.g. BroadcastReceivers, SystemService, etc.)
• Only captures the main thread stack trace, making it hard to detect deadlocks
• Critical thread id info is not captured
  ◦ Example: “held by thread id =2” from ApplicationExitInfo
Pros
• Additional information available for investigation (session id, user uuid, etc.)
• Almost real-time reporting on internal tools
• Available to all Android engineers at Uber
• Allows easy iteration and improvement
  ◦ Added all thread information in the app

Internal reporting using ApplicationExitInfo API
• Leverages ApplicationExitInfo, available on Android 11 (API 30) and up
• ANR information is sent on the next cold launch
• Stack trace is processed on device to keep the most relevant information
• Integrated into our in-house app health reporting sites

ApplicationExitInfo ANR
Cons
• Only available on Android 11 and up; we still need custom ANR detectors below Android 11
• Some ANR stack traces are too generic (e.g. nativePollOnce), but we can use metadata added internally to investigate those cases
Pros
• Most accurate ANR detection, coming directly from the OS
• Able to detect ANRs missed by internal detectors (e.g. BroadcastReceiver, SystemService, etc.)
• Full stack trace information available (including thread id for deadlock identification)
• Additional methods like getDescription() and getTimestamp() are useful during investigation/triaging

Leveraging ApplicationExitInfo API

Leveraging ApplicationExitInfo API
• In API 30 (Android 11), Google added the ApplicationExitInfo API
• Reports the exact reason for app termination on the next app launch:
  ◦ REASON_CRASH
  ◦ REASON_ANR
  ◦ REASON_LOW_MEMORY
  ◦ REASON_USER_STOPPED
  ◦ …
• Opens up a big set of possibilities

Leveraging ApplicationExitInfo API
• Added monitoring for all termination reasons (crash, low memory, user requested, etc.)
• Only capturing the latest exit reason
• Detecting/triaging ANRs internally using REASON_ANR (stack trace and other useful metadata available)
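"Only capturing the latest exit reason" implies picking one record out of the exit history on each launch. The sketch below models that selection in plain Java so it can run anywhere. On Android the history would come from `ActivityManager.getHistoricalProcessExitReasons(...)` and the timestamp from `ApplicationExitInfo.getTimestamp()`; here `ExitRecord` is a hypothetical stand-in type, and skipping exits that were already reported (by remembering the last reported timestamp) is an assumption of this example, not something the slides state.

```java
import java.util.List;

final class ExitReasonSketch {
    // Hypothetical stand-in for android.app.ApplicationExitInfo.
    static final class ExitRecord {
        final String reason;        // e.g. "ANR", "CRASH", "LOW_MEMORY"
        final long timestampMillis; // when the process exited

        ExitRecord(String reason, long timestampMillis) {
            this.reason = reason;
            this.timestampMillis = timestampMillis;
        }
    }

    // Returns the most recent exit if it is newer than the last one reported,
    // or null when there is nothing new to report.
    static ExitRecord latestUnreported(List<ExitRecord> history, long lastReportedTimestampMillis) {
        ExitRecord latest = null;
        for (ExitRecord record : history) {
            if (latest == null || record.timestampMillis > latest.timestampMillis) {
                latest = record;
            }
        }
        if (latest == null || latest.timestampMillis <= lastReportedTimestampMillis) {
            return null;
        }
        return latest;
    }
}
```

On launch, the app would report the returned record (if any) and persist its timestamp, so that a later launch without a new exit does not send a duplicate.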

Leveraging ApplicationExitInfo API
• Processed ANR stack trace is sent to the backend
• Reports all existing threads (including thread id, thread state, etc.)
• Other useful metadata is appended to each report (session id, user uuid, etc.)
• Custom ANR logic on the backend generates a title for each report:
  ◦ Original: ANR at java.lang.Thread.sleep
  ◦ Processed: ANR utils.AnrGenerator.sleep(24)
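One plausible way to get from the original title to the processed one is to skip framework frames and name the first application frame. The sketch below is a guess at that idea, not Uber's backend logic: the prefix list, the class name `AnrTitleSketch`, and the fallback strings are all assumptions.

```java
final class AnrTitleSketch {
    // Assumed list of runtime/framework package prefixes that rarely identify the offender.
    private static final String[] FRAMEWORK_PREFIXES = {
        "java.", "javax.", "kotlin.", "android.", "androidx.", "com.android.", "dalvik.", "sun."
    };

    static boolean isFrameworkFrame(StackTraceElement frame) {
        for (String prefix : FRAMEWORK_PREFIXES) {
            if (frame.getClassName().startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Builds a report title from the main thread stack, top frame first.
    static String title(StackTraceElement[] mainThreadStack) {
        for (StackTraceElement frame : mainThreadStack) {
            if (!isFrameworkFrame(frame)) {
                // First application frame: include the line number for precision.
                return "ANR " + frame.getClassName() + "." + frame.getMethodName()
                        + "(" + frame.getLineNumber() + ")";
            }
        }
        if (mainThreadStack.length > 0) {
            // No application frame found: fall back to the generic top frame.
            StackTraceElement top = mainThreadStack[0];
            return "ANR at " + top.getClassName() + "." + top.getMethodName();
        }
        return "ANR (no stack trace)";
    }
}
```

With a stack of `java.lang.Thread.sleep` followed by `utils.AnrGenerator.sleep` at line 24, the title names the application frame rather than `Thread.sleep`, so reports with the same root cause group together.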

History of ANR detectors at Uber

Managing your ANR

Why does ANR still happen?
• By default most Android components run on the UI thread, including “background” operations, for example:
  ◦ WorkManager - RxWorker.createWork
  ◦ BroadcastReceiver.onReceive
  ◦ (Uber RIB framework) Worker.onStart
• Chains of operations obscure thread usage, for example:
  ◦ RxJava: Single.just runs on the current thread, not on the Scheduler specified by subscribeOn
  ◦ Coroutines: Dispatchers.Default is capped at the number of CPU cores (minimum 2), so it may have only 2 threads on low-end devices

Why does ANR still happen?
• Incorrect synchronization locks
  ◦ A synchronized block may lock the entire class
  ◦ Semaphores and double-checked locking are hard to get right
• Unclear or misunderstood APIs
  ◦ SharedPreferences.apply vs commit
  ◦ RxJava Schedulers.single literally uses only 1 thread for all subscriptions

Prevention during development and testing
• Static code analysis to avoid/warn about unsafe patterns, e.g. lint checks, detekt, Error Prone
• Monitor long operations in RxJava
  ◦ Flipper plugin to monitor Rx thread duration (not open sourced yet)
  ◦ Crash on debug builds
• Debug build ApplicationExitInfo warning (Toaster + Logs)
• Enable different StrictMode options

Monitoring for regression in production
• Establish a baseline/goal. Google Play defines the bad behavior threshold as:
  ◦ Exhibits at least one ANR in at least 0.47% of its daily sessions
  ◦ Exhibits 2 or more ANRs in at least 0.24% of its daily sessions
• Apps above this recommended threshold can have reduced visibility on Google Play. Source here
• Treat ANRs the same as crashes
  ◦ Set a goal for your team
  ◦ Fix as early as possible to avoid accumulating tech debt
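The two thresholds translate into a simple check that a team could run against its own daily numbers. This is a minimal sketch using the percentages quoted on the slide; the class name, method name, and the choice to treat zero sessions as "not exceeding" are assumptions.

```java
final class AnrThresholdSketch {
    static final double ANY_ANR_THRESHOLD = 0.0047;   // 0.47% of daily sessions
    static final double MULTI_ANR_THRESHOLD = 0.0024; // 0.24% of daily sessions

    // True when either threshold quoted on the slide is reached for one day of data.
    static boolean exceedsBadBehavior(long dailySessions,
                                      long sessionsWithAtLeastOneAnr,
                                      long sessionsWithTwoOrMoreAnrs) {
        if (dailySessions <= 0) {
            return false; // no data for the day, nothing to evaluate
        }
        double anyRate = (double) sessionsWithAtLeastOneAnr / dailySessions;
        double multiRate = (double) sessionsWithTwoOrMoreAnrs / dailySessions;
        return anyRate >= ANY_ANR_THRESHOLD || multiRate >= MULTI_ANR_THRESHOLD;
    }
}
```

For example, with 1,000,000 daily sessions, 5,000 sessions with at least one ANR is a rate of 0.5% and trips the first threshold, while 4,000 such sessions (0.4%) with 2,000 repeat-ANR sessions (0.2%) stays under both.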

Monitoring for regression in production
• ApplicationExitInfo-based detection gives more immediate results than Play console
• Every new feature rollout runs an A/B analysis to check for regressions on core metrics (crash, ANR, OOM, and other key business metrics)
• If a regression occurs, we can determine the top offender causing ANRs

Available tools and open source solutions

Available open source solutions
• ANR-WatchDog
  ◦ Uber ANR detector was inspired by this
• Bugsnag ANR
  ◦ Using SIGQUIT signal from the app process
• Firebase console ANR reporting
  ◦ Leverages ApplicationExitInfo
• Infer by Facebook
  ◦ Deadlock detector

Questions?
• Come see us at the Uber booth
• Contact us via a GitHub issue on the demo app repository (Source code here)
Thank you
Fran Aguilera, Yohan Hartanto (and everyone who helped us)
