Crash reporting for large Android teams

As part of this presentation we:
- Review how crash-reporting tools work, by intercepting uncaught exceptions
- Review how the Babylon team managed crash reporting as it scaled from 5 engineers to about 30, spread across numerous cross-functional teams
- Review the New Relic mobile SDK and the New Relic Query Language (NRQL)

Sakis Kaliakoudas

June 25, 2020

Transcript

  3. My name is Sakis Kaliakoudas

    - I am an Android engineering manager for a team of 25 Android engineers
    - Currently working at Babylon Health
    - On a mission to provide affordable and accessible healthcare to everyone on the planet!
  4. How does Android terminate an app?

    - Android has a default uncaught exception handler
    - It is called com.android.internal.os.RuntimeInit.KillApplicationHandler
  5. How does Android terminate an app?

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            try {
                // Log the crash
                // If profiling, stop profiling
                // Show the crash dialog
            } catch (Throwable t2) {
                // More logging
            } finally {
                // Try everything to make sure this process goes away.
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
        }
  8. Overriding the default exception handler

        /**
         * Set the default handler invoked when a thread abruptly terminates
         * due to an uncaught exception, and no other handler has been defined
         * for that thread.
         */
        Thread.setDefaultUncaughtExceptionHandler()
  14. Overriding the default exception handler

        private fun changeUncaughtExceptionHandler() {
            val handler = Thread.UncaughtExceptionHandler { thread, throwable ->
                reportException(throwable)
            }
            Thread.setDefaultUncaughtExceptionHandler(handler)
        }

    - This does not call Android's default uncaught exception handler
    - Nothing is left to terminate our application process, so it will get stuck
  16. Overriding the default exception handler

        private fun changeUncaughtExceptionHandler() {
            // Keep a reference to the handler that was installed before ours
            val originalHandler = Thread.getDefaultUncaughtExceptionHandler()
            val handler = Thread.UncaughtExceptionHandler { thread, throwable ->
                reportException(throwable)
                // Delegate so Android can still terminate the process
                originalHandler?.uncaughtException(thread, throwable)
            }
            Thread.setDefaultUncaughtExceptionHandler(handler)
        }
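The chaining order from slide 16 can be tried outside Android with a minimal JVM sketch. It assumes a stand-in `reportException` that records into a list, and a fake "original" handler in place of Android's KillApplicationHandler. The names `reported` and `installReportingHandler` are hypothetical and not from the talk.

```kotlin
// Hypothetical stand-in for a crash reporter: it just records what it was given.
val reported = mutableListOf<String>()

fun reportException(throwable: Throwable) {
    reported += "reported: ${throwable.message}"
}

// Wraps whichever default handler is currently installed:
// report first, then delegate so the previous handler still runs.
fun installReportingHandler() {
    val originalHandler = Thread.getDefaultUncaughtExceptionHandler()
    Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
        reportException(throwable)
        originalHandler?.uncaughtException(thread, throwable)
    }
}

fun main() {
    // Assumption: this lambda plays the role of the platform handler.
    // On Android the real one would kill the process instead of recording.
    Thread.setDefaultUncaughtExceptionHandler { _, e -> reported += "original: ${e.message}" }
    installReportingHandler()

    val worker = Thread { throw IllegalStateException("boom") }
    worker.start()
    worker.join()

    println(reported) // [reported: boom, original: boom]
}
```

The safe call on `originalHandler` is a defensive choice for plain JVM code, where no default handler may be installed.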
  19. Babylon – the start

    - Initially using Fabric Crashlytics
    - That kind of made sense
    - Team was pretty small
    - Product was relatively small
    - The whole company was structured around departments
    - Every engineer would monitor the crash-rate in Fabric and try to fix the issues
    - We were aiming to have about 99.5% crash-free sessions
  22. Babylon – growing

    - Team gradually growing
    - We moved to the Spotify organizational model – structured around product teams
    - The app crash-rate became less meaningful
    - Engineers didn’t have easy access to crashes from their areas
  26. Babylon – the dream

    - Every Android engineer should be able to see just the crashes for their area
    - Every product team can have their own crash rate KPI
    - Every product team can get alerted for the crashes in their area
    - As many breadcrumbs as possible
  27. What is a breadcrumb?

    A breadcrumb is an event that helps create a digital trail that allows engineers to diagnose a problem more easily
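One way to picture a breadcrumb trail is as a bounded buffer of recent events that gets attached to a crash report. The sketch below is purely illustrative: `Breadcrumb` and `BreadcrumbTrail` are hypothetical types, and nothing here describes how New Relic or any other SDK stores breadcrumbs internally.

```kotlin
// Hypothetical breadcrumb model: a name plus optional attributes.
data class Breadcrumb(val name: String, val attributes: Map<String, Any> = emptyMap())

// Keeps only the most recent `capacity` breadcrumbs, oldest first.
class BreadcrumbTrail(private val capacity: Int) {
    private val crumbs = ArrayDeque<Breadcrumb>()

    init {
        require(capacity > 0) { "capacity must be positive" }
    }

    fun record(crumb: Breadcrumb) {
        if (crumbs.size == capacity) crumbs.removeFirst()
        crumbs.addLast(crumb)
    }

    // Immutable copy that could be attached to a crash report.
    fun snapshot(): List<Breadcrumb> = crumbs.toList()
}

fun main() {
    val trail = BreadcrumbTrail(capacity = 2)
    trail.record(Breadcrumb("ScreenView: 'Home'"))
    trail.record(Breadcrumb("tap", mapOf("button" to "book")))
    trail.record(Breadcrumb("ScreenView: 'BookAppointment'"))
    println(trail.snapshot().map { it.name }) // [tap, ScreenView: 'BookAppointment']
}
```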
  29. Investigating other crash reporting tools

    - Main requirement was around having flexibility with the crash data
    - We decided to do a Proof Of Concept (POC) for:
        - Firebase with BigQuery support
        - New Relic
  33. POCs – Firebase with BigQuery

    - Ability to export your crash data into BigQuery
    - We tried it out during a hackathon
    - Looked pretty promising!
    - While promising, it would involve a lot of effort
    - We decided not to proceed further with this
  36. POCs – New Relic

    - Looking at what the tool offered, it looked like a good candidate
    - We decided to spend a bit of time to set it up in the codebase
    - We reached out to New Relic to start a trial
  38. POCs – New Relic

    - Send more breadcrumbs
        - Analytic events
        - Page views
        - Product team names
    - Create some dashboards, with a focus on metrics around product team crash rates
    - Assess New Relic alerts
  42. Sending analytic events

    - This was pretty straightforward because of the layers of abstraction in place
    - All analytic frameworks are wrapped
    - All wrappers implement the same interface, TrackingGateway
    - Each wrapper class is added to a java.util.Set through Dagger
    - That Set is injected into a use case that deals with forwarding events to all trackers
  43. Sending analytic events

        interface TrackingGateway {
            fun track(action: Action)
        }

        class NewRelicTrackingGateway : TrackingGateway {
            override fun track(action: Action) {
                NewRelic.recordBreadcrumb(action.name, action.data)
            }
        }
  44. Sending analytic events

        @Module
        internal abstract class AnalyticsModule {
            @Binds
            @IntoSet
            abstract fun bind(tracker: NewRelicTrackingGateway): TrackingGateway
        }
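The slides do not show the use case that receives the Set of gateways, so here is a minimal sketch of what such a fan-out could look like. `TrackActionUseCase`, `RecordingGateway` and the shape of `Action` are assumptions made for illustration. In the app the set would come from Dagger multibindings rather than being built by hand.

```kotlin
// Assumed shape of the event type; the talk only shows `name` and `data` being read.
data class Action(val name: String, val data: Map<String, Any> = emptyMap())

interface TrackingGateway {
    fun track(action: Action)
}

// Hypothetical use case: forwards each event to every registered gateway.
class TrackActionUseCase(private val gateways: Set<TrackingGateway>) {
    fun execute(action: Action) {
        gateways.forEach { it.track(action) }
    }
}

// In-memory gateway used only to demonstrate the fan-out.
class RecordingGateway : TrackingGateway {
    val received = mutableListOf<String>()

    override fun track(action: Action) {
        received += action.name
    }
}

fun main() {
    val first = RecordingGateway()
    val second = RecordingGateway()

    TrackActionUseCase(setOf(first, second)).execute(Action("appointment_booked"))

    println(first.received)  // [appointment_booked]
    println(second.received) // [appointment_booked]
}
```

With this shape, adding a new analytics tool only means adding one more gateway to the set. Callers of the use case stay unchanged.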
  47. Sending page views

    - Page views are mostly tracked automatically
    - Main implementation mechanism uses the Application.ActivityLifecycleCallbacks
    - Mechanism for sending the page views to New Relic is similar to the analytic events
  48. Sending page views

        interface TrackingGateway {
            fun track(action: Action)
            fun trackScreenView(event: ScreenViewTrackingEvent)
        }

        data class ScreenViewTrackingEvent(
            val name: String,
            val screen: Any
        )
  49. Sending page views

        class NewRelicTrackingGateway : TrackingGateway {
            override fun track(action: Action) {
                NewRelic.recordBreadcrumb(action.name, action.data)
            }

            override fun trackScreenView(event: ScreenViewTrackingEvent) {
                val breadcrumbName = "ScreenView: '${event.name}'"
                NewRelic.recordBreadcrumb(breadcrumbName)
            }
        }
  52. Sending product team names

    The idea:
    - Associate in the codebase each screen with the product team that owns it
    - As users navigate in the app, send that team name for each screen to New Relic
    - When a crash occurs, use New Relic to associate the crash with the last team name reported
  53. Sending product team names

    - Book appointment screen (product team: Appointments)
    - Monitor screen (product team: Monitor)
    - Home screen (product team: Core experience)
  55. Connecting screens with teams

    - Our app has one activity per screen at the moment
    - Based on this we decided to add an annotation on every activity
  56. Connecting screens with teams

        @Retention(AnnotationRetention.RUNTIME)
        @Target(AnnotationTarget.CLASS)
        annotation class OwnedByTeams(val teams: Array<Team>)

        enum class Team(val teamName: String) {
            TRIAGE("Triage"),
            CHAT_PLATFORM("Chat platform"),
            APPOINTMENTS("Appointments"),
            MONITOR("Monitor"),
            HEALTHCHECK("Healthcheck"),
            MAPLE("Maple"),
            PAYMENTS_AND_ELIGIBILITY("Payments and Eligibility")
        }
  58. Sending product team names

        override fun trackScreenView(event: ScreenViewTrackingEvent) {
            val breadcrumbName = "ScreenView: '${event.name}'"
            NewRelic.recordBreadcrumb(breadcrumbName)
            if (event.screen is Activity) {
                event.screen::class.java.getAnnotation(OwnedByTeams::class.java)?.let {
                    NewRelic.setAttribute("Team", it.teams.joinToString { team -> team.teamName })
                }
            }
        }
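The annotation lookup can be exercised without Android or the New Relic SDK. In the sketch below the screens are plain classes standing in for activities, the enum is cut down to two teams, and `teamAttributeFor` is a hypothetical helper that returns the string which would be sent as the "Team" attribute.

```kotlin
enum class Team(val teamName: String) {
    APPOINTMENTS("Appointments"),
    MONITOR("Monitor")
}

@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS)
annotation class OwnedByTeams(val teams: Array<Team>)

// Hypothetical screens standing in for Activities.
@OwnedByTeams([Team.APPOINTMENTS, Team.MONITOR])
class BookAppointmentScreen

class UnownedScreen

// Returns the value for the "Team" attribute, or null when the screen is not annotated.
fun teamAttributeFor(screen: Any): String? =
    screen::class.java.getAnnotation(OwnedByTeams::class.java)
        ?.teams
        ?.joinToString { team -> team.teamName }

fun main() {
    println(teamAttributeFor(BookAppointmentScreen())) // Appointments, Monitor
    println(teamAttributeFor(UnownedScreen()))         // null
}
```

The RUNTIME retention is what makes the annotation visible to `getAnnotation`. With a weaker retention the lookup would return null for every screen.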
  59. Enforcing the @OwnedByTeams annotation

    - Using a tool called ArchUnit (https://www.archunit.org/)
    - Lets you write unit tests for your architecture with a nice API
  61. Enforcing the @OwnedByTeams annotation

        @Test
        fun `all activities should be annotated with OwnedByTeams annotation`() {
            val classes = ClassFileImporter().importPackages("com.babylon")
            val classesToCheck = classes().that().areAssignableTo(AppCompatActivity::class.java)
            classesToCheck.should().beAnnotatedWith(OwnedByTeams::class.java)
                .because("You should always assign an owner to a screen")
                .check(classes)
        }
  68. Getting crash-rates per team

    - At this point we had all the relevant data in New Relic
    - We just had to create queries
    - New Relic has its own query language called NRQL
    - Similar to SQL, but not as powerful
  69. Data captured by New Relic

    - MobileSession: App version, Country, Device model, OS version, Session duration
    - MobileCrash: Exception, Available size on disk, Orientation, Architecture
    - MobileHandledException (non-fatal exceptions)
    - MobileRequest: Bytes sent & received, Connection type (2G, 3G etc.), Request URL, Response time, REST status code
    - MobileRequestError: Similar to “MobileRequest”, Error type (cellular issue or REST error), Response body
    - MobileBreadcrumb
  70. Crash-rates per product team

        SELECT (filter(count(sessionId), WHERE category = 'Crash' AND Team = 'Appointments')
            / count(sessionId)) * 100
        FROM MobileCrash, MobileSession
        SINCE 1 week ago
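To make the arithmetic of the query concrete, here is a small in-memory sketch of the same ratio: crash events attributed to one team, divided by all events from both sources, times 100. `MobileEvent` and `crashRatePercent` are hypothetical, and the "Session" category value for session events is an assumption. Real New Relic events carry many more attributes.

```kotlin
// Hypothetical, simplified event; real New Relic events carry many more attributes.
data class MobileEvent(val sessionId: String, val category: String, val team: String? = null)

// Same shape as the query: matching crash events / all events * 100.
fun crashRatePercent(events: List<MobileEvent>, team: String): Double {
    if (events.isEmpty()) return 0.0
    val crashes = events.count { it.category == "Crash" && it.team == team }
    return crashes * 100.0 / events.size
}

fun main() {
    // Six session events plus two crash events owned by different teams.
    val sessions = (1..6).map { MobileEvent(sessionId = "s$it", category = "Session") }
    val crashes = listOf(
        MobileEvent("s1", "Crash", team = "Appointments"),
        MobileEvent("s2", "Crash", team = "Monitor")
    )

    println(crashRatePercent(sessions + crashes, "Appointments")) // 12.5
}
```

As in the query, the denominator counts rows from both event types, so crash events are included in the total.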
  76. Breadcrumbs in New Relic

    - HTTP Response: 200 to request in event 92, 612 ms, https://prod.babylonpartners.com/api/v2/video_sessions/1233
    - eventType: MobileCrash
    - ScreenView ‘VideoConsultationsActivity’
    - Incoming videocall/notification_accepted
  81. The future

    - KPIs per team
    - Using New Relic alerts
    - Using this for iOS as well
    - Defining higher level organizational structures of ownership
  82. Q: How would the ownership mechanism work with a single activity app?

    With 1 activity and multiple fragments the only change would be moving the ownership annotation from each activity to each fragment
  83. Q: Is your app modularized, and how would that affect this mechanism?

    We are in the process of modularizing the app. This mechanism can stay as is even with feature modules
  87. Q: What’s a good crash-free rate?

    - There’s no one universal answer. NASA would probably not tolerate a single crash!
    - For the Babylon Android team, the number started from about 99.5%
    - Currently sitting at about 99.9%
    - Seems like it is easier to increase it with our MVI architecture
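The gap between 99.5% and 99.9% is easier to judge in absolute terms. The sketch below converts a crash-free percentage into an expected number of crashing sessions. The volume of 10,000 sessions is an assumed figure for illustration, and `crashingSessions` is a hypothetical helper.

```kotlin
import kotlin.math.roundToLong

// Expected number of crashing sessions for a given crash-free percentage.
fun crashingSessions(totalSessions: Long, crashFreePercent: Double): Long =
    (totalSessions * (100.0 - crashFreePercent) / 100.0).roundToLong()

fun main() {
    val sessions = 10_000L // assumed volume, for illustration only

    println(crashingSessions(sessions, 99.5)) // 50
    println(crashingSessions(sessions, 99.9)) // 10
}
```

At that volume the move from 99.5% to 99.9% corresponds to five times fewer crashing sessions.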