ANR overview at Uber + Leveraging ApplicationExitInfo API

Basic patterns of how ANRs occur, how we detect ANRs at Uber, and how you can manage your app's ANRs

Presented at Droidcon San Francisco 2022 by Fran Aguilera and Yohan Hartanto

Yohan Hartanto

June 02, 2022

Transcript

  1. Agenda
    • Overview of ANR
    • Real world examples
    • Different methods of detecting ANR at Uber
      ◦ Leveraging ApplicationExitInfo API
    • Managing your ANR
      ◦ Prevention during development and testing
      ◦ Monitoring for regression in production
    • Available tools and open source solutions
  2. ANR

  3. What is an ANR?
    • Stands for "Application Not Responding" (a.k.a. app freeze)
    What causes an ANR?
    • Occurs when the main thread (UI thread) is blocked for too long (at least 5 seconds)
    • Official documentation here
  4. Common reasons for the main UI thread being blocked
    • Expensive (long-running) operations on the UI thread
      ◦ e.g. nested loops, expensive calculations, massive object creation, etc.
    • Blocking I/O operations on the main UI thread
      ◦ e.g. RxJava blockingGet/blockingFirst, coroutine runBlocking, etc.
    • Incorrect multi-threading handling
      ◦ Deadlocks involving the UI thread
      ◦ Incorrect synchronization locks holding the UI thread indefinitely
      ◦ Thread starvation in background operations blocking the UI thread
  5. Demo app
    Source code here
    • Examples of different types of ANR
    • Shows the last reason for app termination via the ApplicationExitInfo API
  6. Case #1 - Long running method
    Problem
    • Expensive method called on the UI thread
    Real world example
    • Heatmap processing in the Uber Driver app
    Recommendation
    • Identify the bottleneck and move it to a background thread (Scheduler/Dispatcher)
  7. Case #2 - RxBlocking API
    Problem
    • An Rx blocking API immediately blocks the calling thread until a value is emitted
    Real world scenario
    • Getting hostName via blockingGet in our NetworkInterceptor
    Recommendation
    • Avoid the blocking API (use the subscribe function instead), use lazy initialization, or take another approach
  8. Case #3 - BroadcastReceiver
    Problem
    • onReceive runs on the UI thread by default, so an ANR is likely if a long-running method is called there
    Real world scenario
    • Loading a data cache from a file
    Recommendation
    • Move expensive logic to a background thread, or
    • Use WorkManager (see next slide for caveats)
  9. Case #4 - WorkManager - RxWorker
    Problem
    • createWork is called on the main thread
    Real world scenario
    • Cleaning up a big audio file
    Recommendation
    • Avoid eager RxJava factories like Single.just: the value is computed when the Single is created, so subscribeOn(Schedulers.io()) won’t move that work off the main thread
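
The eager-versus-deferred distinction behind the Single.just pitfall can be sketched without RxJava. The example below is only an analogue using the JDK's CompletableFuture, and names such as `expensiveWork` and `io-thread` are hypothetical: `completedFuture(expensiveWork())` evaluates its argument on the calling thread, much like `Single.just(expensiveWork())`, while `supplyAsync` defers the call to the executor, which is roughly the role `Single.fromCallable` plays in RxJava.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class EagerVsDeferred {
    // Hypothetical stand-in for expensive work; returns the name of the thread that ran it.
    static String expensiveWork() {
        return Thread.currentThread().getName();
    }

    // Eager: expensiveWork() is evaluated by the caller before the future even exists.
    static String eager() {
        return CompletableFuture.completedFuture(expensiveWork()).join();
    }

    // Deferred: the method reference is only invoked on the executor's thread.
    static String deferred(ExecutorService io) {
        return CompletableFuture.supplyAsync(EagerVsDeferred::expensiveWork, io).join();
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        try {
            System.out.println("eager ran on:    " + eager());
            System.out.println("deferred ran on: " + deferred(io));
        } finally {
            io.shutdown();
        }
    }
}
```

Run from a main thread, the eager variant reports the caller's thread name and the deferred variant reports `io-thread`, which is the difference that matters inside createWork.
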
  10. Case #5 - Deadlock
    Problem
    • Incorrect synchronization freezes the main UI thread while it waits for a resource held by another thread
    Real world example
    • A network operation needs the token, the app needs the user name, and both the user name and the token use the disk cache
    Recommendation
    • Fix the deadlock; locate it by checking which thread id holds the lock in the ANR trace
    • Be mindful of synchronized keyword usage
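
A minimal JVM sketch of this deadlock shape, under stated assumptions: two threads take two locks in opposite order, and the JDK's ThreadMXBean is used to confirm the cycle. The thread names and lock roles (`tokenLock`, `userNameLock`) are hypothetical stand-ins for the scenario above. The `java.lang.management` API is a desktop/server JVM facility and is not part of Android, where the equivalent clue is the "held by thread" information in the ANR trace.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockSketch {
    // Starts a daemon thread that takes `first`, waits for its peer, then tries to take `second`.
    static Thread lockInOrder(String name, Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // make sure the other thread already holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Not reached in this demo: the peer holds `second` and waits for `first`.
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit even though these threads never finish
        t.start();
        return t;
    }

    static boolean contains(long[] ids, long id) {
        for (long candidate : ids) {
            if (candidate == id) return true;
        }
        return false;
    }

    // Reproduces the opposite-order locking and polls the JVM until it reports both threads.
    static boolean reproduceAndDetect() {
        Object tokenLock = new Object();
        Object userNameLock = new Object();
        CountDownLatch latch = new CountDownLatch(2);
        Thread network = lockInOrder("network", tokenLock, userNameLock, latch);
        Thread ui = lockInOrder("ui", userNameLock, tokenLock, latch);
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) { // poll for up to ~5 seconds
            long[] ids = bean.findDeadlockedThreads(); // null when no deadlock is found
            if (ids != null && contains(ids, network.getId()) && contains(ids, ui.getId())) {
                return true;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceAndDetect());
    }
}
```

Acquiring both locks in one agreed order in every thread removes the cycle, which is one common way to act on the "fix deadlock" recommendation.
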
  11. Play Console ANR
    Cons
    • Reports are hard to classify/identify and sometimes lack stack traces
    • The latest ANR rate is delayed (up to 2 days to get the latest information)
    • Historical ANR rates only go back 30 days
    • Reports don’t contain critical information (session id, user uuid) required for investigation
    • Access management in a large organization is not easy
    Pros
    • Best source of truth for the ANR rate, because it directly reflects what users experience
    • Allows relative comparison with peer apps
    • Integration with other “vitals” like app launch, slow frames, wake locks, etc.
  12. Uber custom ANR detector
    • Inspired by ANR-WatchDog
    • Periodically posts a task to the main thread's Looper and checks whether that task is executed within 5 seconds
    • Captures the stack trace of the main thread before and after the task to identify the offender
    • Integrated into our in-house app health reporting sites
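
The watchdog idea can be sketched on a plain JVM. This is not Uber's detector or the ANR-WatchDog library: a single-thread executor stands in for the Android main Looper (an assumption made so the example runs off-device), and `isResponsive` and `probe` are hypothetical names. On Android the ping would be posted with a Handler bound to the main Looper instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class WatchdogSketch {
    // Posts a no-op "ping" to the main executor and reports whether it ran within timeoutMs.
    static boolean isResponsive(ExecutorService mainThread, long timeoutMs) {
        CountDownLatch ping = new CountDownLatch(1);
        mainThread.execute(ping::countDown);
        try {
            return ping.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs one scenario: optionally block the stand-in main thread, then ping it.
    static boolean probe(long blockMs, long timeoutMs) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            if (blockMs > 0) {
                mainThread.execute(() -> sleepQuietly(blockMs)); // simulated long-running UI work
            }
            return isResponsive(mainThread, timeoutMs);
        } finally {
            mainThread.shutdownNow(); // interrupts the simulated work so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("idle main thread responsive:    " + probe(0, 1000));
        System.out.println("blocked main thread responsive: " + probe(2000, 200));
    }
}
```

A production detector would use the 5 second window from the slide and capture the main thread's stack trace when the ping times out; the short timeouts here only keep the demo fast.
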
  13. Uber custom ANR detector
    Cons
    • Hard to distinguish false positives/negatives
    • Doesn’t detect some ANRs that show up in Play Console (e.g. BroadcastReceivers, SystemService, etc.)
    • Only captures the main thread's stack trace, making it hard to detect deadlocks
    • Critical thread id info is not captured
      ◦ Example: “held by thread id =2” from AppExitInfo
    Pros
    • Additional information available for investigation (session id, user uuid, etc.)
    • Almost real-time reporting on internal tools
    • Available to all Android engineers at Uber
    • Easy to iterate on and improve
      ◦ Added all thread information in the app
  14. Internal reporting using ApplicationExitInfo API
    • Leverages ApplicationExitInfo, available on Android 11 and up
    • ANR information is sent on the next cold launch
    • The stack trace is processed on device to keep the most relevant information
    • Integrated into our in-house app health reporting sites
  15. ApplicationExitInfo ANR
    Cons
    • Only available on Android 11 and up, so we still need custom ANR detectors below Android 11
    • Some ANR stack traces are too generic (e.g. nativePollOnce), but we can use metadata added internally to investigate those cases
    Pros
    • Most accurate ANR detection, coming directly from the OS
    • Able to detect ANRs missed by internal detectors (e.g. BroadcastReceiver, SystemService, etc.)
    • Full stack trace information available (including thread id for deadlock identification)
    • Additional methods like getDescription() and getTimestamp() are useful during investigation/triaging
  16. Leveraging ApplicationExitInfo API
    • In API 30 (Android 11), Google added the ApplicationExitInfo API
    • Reports the exact reason for app termination on the next app launch:
      ◦ REASON_CRASH
      ◦ REASON_ANR
      ◦ REASON_LOW_MEMORY
      ◦ REASON_USER_STOPPED
      ◦ …
    • Opens up a big set of possibilities
  17. Leveraging ApplicationExitInfo API
    • Added monitoring for all termination reasons (crash, low memory, user requested, etc.)
    • Only the latest exit reason is captured
    • ANRs are detected and triaged internally using REASON_ANR (stack trace and other useful metadata are available)
  18. Leveraging ApplicationExitInfo API
    • Processed ANR stack trace is sent to the backend
    • Reports all existing threads (including thread id, thread state, etc.)
    • Other useful metadata is appended to each report (session id, user uuid, etc.)
    • Custom ANR logic on the backend generates a title for each report:
      ◦ Original: ANR at java.lang.Thread.sleep
      ◦ Processed: ANR utils.AnrGenerator.sleep(24)
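
The talk does not describe the backend's title rules, so the following is only a guess at one plausible approach: skip platform frames and name the report after the first application frame. The prefix list and the `titleFor` helper are assumptions made for illustration; the two titles reproduced are the ones shown on the slide.

```java
import java.util.List;

class AnrTitle {
    // Assumed platform prefixes to skip; the real backend rules are not given in the talk.
    static final List<String> PLATFORM_PREFIXES =
            List.of("java.", "javax.", "android.", "androidx.", "kotlin.", "com.android.", "dalvik.");

    static boolean isPlatformFrame(StackTraceElement frame) {
        String className = frame.getClassName();
        return PLATFORM_PREFIXES.stream().anyMatch(className::startsWith);
    }

    // Builds a report title from the main thread's frames, topmost frame first.
    static String titleFor(List<StackTraceElement> mainThreadFrames) {
        if (mainThreadFrames.isEmpty()) {
            return "ANR (no stack trace)";
        }
        for (StackTraceElement frame : mainThreadFrames) {
            if (!isPlatformFrame(frame)) {
                // First application frame: class, method, and line number.
                return "ANR " + frame.getClassName() + "." + frame.getMethodName()
                        + "(" + frame.getLineNumber() + ")";
            }
        }
        // Only platform frames: fall back to the generic top-of-stack title.
        StackTraceElement top = mainThreadFrames.get(0);
        return "ANR at " + top.getClassName() + "." + top.getMethodName();
    }

    public static void main(String[] args) {
        List<StackTraceElement> frames = List.of(
                new StackTraceElement("java.lang.Thread", "sleep", "Thread.java", -2),
                new StackTraceElement("utils.AnrGenerator", "sleep", "AnrGenerator.kt", 24));
        System.out.println(titleFor(frames));
    }
}
```

Grouping by the first application frame keeps unrelated ANRs from collapsing into a single "Thread.sleep" bucket, which seems to be the motivation for the processed title.
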
  19. Why does ANR still happen?
    • By default most Android components run on the UI thread, including “background” operations, for example:
      ◦ WorkManager - RxWorker.createWork
      ◦ BroadcastReceiver.onReceive
      ◦ (Uber RIB framework) Worker.onStart
    • Chains of operations obscure thread usage, for example:
      ◦ RxJava: Single.just runs on the current thread, not on the Scheduler specified by subscribeOn
      ◦ Coroutines: Dispatchers.Default uses only 2 threads on many devices
  20. Why does ANR still happen?
    • Incorrect synchronization locks
      ◦ A synchronized block may lock the entire class
      ◦ Semaphores and double-checked locking are hard to get right
    • Unclear or misunderstood APIs
      ◦ SharedPreferences.apply vs commit
      ◦ RxJava Schedulers.single literally uses only 1 thread for all subscriptions
  21. Prevention during development and testing
    • Static code analysis to avoid/warn about unsafe patterns, e.g. lint checks, detekt, Error Prone
    • Monitor long operations in RxJava
      ◦ Flipper plugin to monitor Rx thread duration (not open sourced yet)
      ◦ Crash on debug builds
    • Debug build ApplicationExitInfo warning (Toaster + Logs)
    • Enable different StrictMode options
  22. Monitoring for regression in production
    • Establish a baseline/goal. Google Play defines the bad behavior threshold as:
      ◦ Exhibits at least one ANR in at least 0.47% of its daily sessions.
      ◦ Exhibits 2 or more ANRs in at least 0.24% of its daily sessions.
    • Apps above this recommended threshold can have reduced visibility on Play Console. Source here
    • Treat an ANR as equal to a crash
      ◦ Set a goal for your team
      ◦ Fix as early as possible to avoid accumulating tech debt
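
As a worked illustration of the two thresholds quoted above, the check reduces to two ratios over daily sessions. The class and method names are hypothetical, and the session counts are made-up inputs rather than real data.

```java
class AnrThreshold {
    static final double ONE_ANR_LIMIT = 0.0047;   // at least one ANR in 0.47% of daily sessions
    static final double MULTI_ANR_LIMIT = 0.0024; // two or more ANRs in 0.24% of daily sessions

    // True when either threshold from the slide is met or exceeded.
    static boolean exceedsBadBehavior(long dailySessions, long sessionsWithAnr, long sessionsWithTwoOrMore) {
        if (dailySessions <= 0) {
            return false; // no sessions, nothing to rate
        }
        double oneRate = (double) sessionsWithAnr / dailySessions;
        double multiRate = (double) sessionsWithTwoOrMore / dailySessions;
        return oneRate >= ONE_ANR_LIMIT || multiRate >= MULTI_ANR_LIMIT;
    }

    public static void main(String[] args) {
        // 500 of 100,000 sessions had an ANR: 0.50%, above the 0.47% threshold.
        System.out.println(exceedsBadBehavior(100_000, 500, 100));
        // 100 of 100,000 sessions had an ANR: 0.10%, below both thresholds.
        System.out.println(exceedsBadBehavior(100_000, 100, 50));
    }
}
```

The same function can back an internal alert, so a team sees a regression against its own goal before Play Console data catches up.
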
  23. Monitoring for regression in production
    • ApplicationExitInfo-based detection gives more immediate results than Play Console
    • Run A/B analysis on every new feature rollout and check for regressions in core metrics (crash, ANR, OOM, and other key business metrics)
    • If a regression occurs, we can determine the top offender causing the ANR
  24. Available open source solutions
    • ANR-WatchDog
      ◦ Uber's ANR detector was inspired by this
    • Bugsnag ANR
      ◦ Uses the SIGQUIT signal received by the app process
    • Firebase console ANR reporting
      ◦ Leverages ApplicationExitInfo
    • Infer by Facebook
      ◦ Deadlock detector
  25. Questions?
    • Come see us at the Uber booth
    • Contact us via a GitHub issue in the source code repository (Source code here)
    Thank you
    Fran Aguilera, Yohan Hartanto (and everyone who helped us)