Crash reporting for large Android teams

As part of this presentation we:
- Review how crash-reporting tools work by intercepting uncaught exceptions
- Review how the Babylon team managed crash reporting as it scaled from 5 engineers to about 30 engineers sitting in numerous cross-functional teams
- Review the New Relic mobile SDK and the New Relic Query Language (NRQL)

Sakis Kaliakoudas

June 25, 2020

Transcript

  1. Crash reporting for large Android teams @skaliakoudas

  5. My name is Sakis Kaliakoudas

    - I am an Android engineering manager for a team of 25 Android engineers
    - Currently working at Babylon Health
    - On a mission to provide affordable and accessible healthcare to everyone on the planet!
  8. What is a crash?

    private fun likelyToCrash() {
        throw NullPointerException()
    }
  9. How does crash reporting work?

  17. How does Android terminate an app

    - Android has a default uncaught exception handler
    - It is called com.android.internal.os.RuntimeInit.KillApplicationHandler
  20. How does Android terminate an app

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        try {
            // Log the crash
            // Stop profiling
            // Show the crash dialog
        } catch (Throwable t2) {
            // more logging
        } finally {
            // Try everything to make sure this process goes away.
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
    }
  21. Overriding the default exception handler

    /**
     * Set the default handler invoked when a thread abruptly terminates
     * due to an uncaught exception, and no other handler has been defined
     * for that thread.
     */
    Thread.setDefaultUncaughtExceptionHandler()
  27. Overriding the default exception handler

    private fun changeUncaughtExceptionHandler() {
        val handler = Thread.UncaughtExceptionHandler { thread, throwable ->
            reportException(throwable)
        }
        Thread.setDefaultUncaughtExceptionHandler(handler)
    }

    - This version does not call the default uncaught exception handler for Android
    - Nothing is there to terminate our application process, so it will get stuck
  29. Overriding the default exception handler

    private fun changeUncaughtExceptionHandler() {
        // Keep a reference to the handler that was installed before ours
        val originalHandler = Thread.getDefaultUncaughtExceptionHandler()
        val handler = Thread.UncaughtExceptionHandler { thread, throwable ->
            reportException(throwable)
            // Delegate to the original handler so the process is still terminated
            originalHandler?.uncaughtException(thread, throwable)
        }
        Thread.setDefaultUncaughtExceptionHandler(handler)
    }
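The chaining idea from the slide can be tried on a plain JVM, without Android. The following is a minimal sketch under stated assumptions: `installChainedHandler`, `runDemo` and the "platform handler" lambda are hypothetical names made up for illustration, and the lambda merely stands in for Android's KillApplicationHandler (it does not kill the process).

```kotlin
import java.util.concurrent.CopyOnWriteArrayList

// Installs a reporting handler that delegates to whatever handler was set before it.
fun installChainedHandler(report: (Throwable) -> Unit) {
    // May be null on a plain JVM if nobody installed a default handler.
    val originalHandler = Thread.getDefaultUncaughtExceptionHandler()
    Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
        report(throwable)
        originalHandler?.uncaughtException(thread, throwable)
    }
}

fun runDemo(): List<String> {
    val calls = CopyOnWriteArrayList<String>()
    // Hypothetical stand-in for the platform's own handler.
    Thread.setDefaultUncaughtExceptionHandler { _, _ -> calls += "platform handler" }
    installChainedHandler { calls += "reported: ${it.message}" }

    // Crash on a worker thread and wait for it to die.
    val worker = Thread { throw IllegalStateException("boom") }
    worker.start()
    worker.join()
    return calls
}

fun main() {
    println(runDemo()) // prints [reported: boom, platform handler]
}
```

The reporting step runs first and the previously installed handler runs second, which is the ordering a crash-reporting SDK relies on so that the platform can still terminate the process.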
  35. Babylon – the start

    - Initially using Fabric Crashlytics
    - That kind of made sense:
      - Team was pretty small
      - Product was relatively small
      - The whole company was structured around departments
    - Every engineer would monitor the crash rate in Fabric and try to fix the issues
    - We were aiming to have about 99.5% crash-free sessions
  40. Babylon – growing

    - Team gradually growing
    - We moved to the Spotify organizational model, structured around product teams
    - The app crash rate became less meaningful
    - Engineers didn’t have easy access to crashes from their areas
  45. Babylon – the dream

    - Every Android engineer should be able to see just the crashes for their area
    - Every product team can have their own crash rate KPI
    - Every product team can get alerted for the crashes in their area
    - As many breadcrumbs as possible
  47. What is a breadcrumb?

    A breadcrumb is an event that helps create a digital trail that allows engineers to diagnose a problem more easily
  56. Investigating other crash reporting tools

    - Main requirement was around having flexibility with the crash data
    - We decided to do a Proof of Concept (POC) for:
      - Firebase with BigQuery support
      - New Relic
  62. POCs – Firebase with BigQuery

    - Ability to export your crash data into BigQuery
    - We tried it out during a hackathon
    - Looked pretty promising!
    - While promising, it would involve a lot of effort
    - We decided not to proceed further with this
  66. POCs – New Relic

    - Looking at what the tool offered, it looked like a good candidate
    - We decided to spend a bit of time setting it up in the codebase
    - We reached out to New Relic to start a trial
  73. POCs – New Relic

    - Send more breadcrumbs:
      - Analytic events
      - Page views
      - Product team names
    - Create some dashboards, with a focus on metrics around product team crash rates
    - Assess New Relic alerts
  79. Sending analytic events

    - This was pretty straightforward because of the layers of abstraction in place
    - All analytic frameworks are wrapped
    - All wrappers implement the same interface, TrackingGateway
    - Each wrapper class is added to a java.util.Set through Dagger
    - That Set is injected into a use case that forwards events to all trackers
  81. Sending analytic events

    interface TrackingGateway {
        fun track(action: Action)
    }

    class NewRelicTrackingGateway : TrackingGateway {
        override fun track(action: Action) {
            NewRelic.recordBreadcrumb(action.name, action.data)
        }
    }
  83. Sending analytic events

    @Module
    internal abstract class AnalyticsModule {
        @Binds
        @IntoSet
        abstract fun bind(tracker: NewRelicTrackingGateway): TrackingGateway
    }
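The slides describe a use case that receives the Set of gateways and forwards every event to each of them, but do not show it. A minimal sketch of that fan-out, assuming hypothetical names (`TrackActionUseCase`, `RecordingGateway`, and a simplified `Action` type) and building the Set by hand where the app would get it from Dagger multibindings:

```kotlin
// Hypothetical, simplified stand-in for the Action type used on the slides.
data class Action(val name: String, val data: Map<String, Any> = emptyMap())

interface TrackingGateway {
    fun track(action: Action)
}

// Fake gateway that records what it receives instead of calling a real SDK.
class RecordingGateway(private val label: String) : TrackingGateway {
    val received = mutableListOf<String>()

    override fun track(action: Action) {
        received += "$label:${action.name}"
    }
}

// Hypothetical use case: forwards each action to every registered gateway.
class TrackActionUseCase(private val gateways: Set<TrackingGateway>) {
    fun execute(action: Action) = gateways.forEach { it.track(action) }
}

fun runDemo(): List<String> {
    val newRelic = RecordingGateway("newrelic")
    val other = RecordingGateway("other")
    TrackActionUseCase(setOf(newRelic, other)).execute(Action("appointment_booked"))
    return newRelic.received + other.received
}

fun main() {
    println(runDemo()) // prints [newrelic:appointment_booked, other:appointment_booked]
}
```

With this shape, adding another analytics or crash-reporting tool only requires one more TrackingGateway implementation and one more binding into the Set; call sites stay unchanged.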
  91. Sending page views

    - Page views are mostly tracked automatically
    - The main implementation mechanism uses Application.ActivityLifecycleCallbacks
    - The mechanism for sending page views to New Relic is similar to the one for analytic events
  94. Sending page views

    interface TrackingGateway {
        fun track(action: Action)
        fun trackScreenView(event: ScreenViewTrackingEvent)
    }

    data class ScreenViewTrackingEvent(
        val name: String,
        val screen: Any
    )
  95. Sending page views

    class NewRelicTrackingGateway : TrackingGateway {
        override fun track(action: Action) {
            NewRelic.recordBreadcrumb(action.name, action.data)
        }

        override fun trackScreenView(event: ScreenViewTrackingEvent) {
            val breadcrumbName = "ScreenView: '${event.name}'"
            NewRelic.recordBreadcrumb(breadcrumbName)
        }
    }
  100. Sending product team names

    The idea:
    - Associate each screen in the codebase with the product team that owns it
    - As users navigate in the app, send that team name for each screen to New Relic
    - When a crash occurs, use New Relic to associate the crash with the last team name reported
  103. Sending product team names

    - Book appointment screen, product team: Appointments
    - Monitor screen, product team: Monitor
    - Home screen, product team: Core experience
  111. Sending product team names

    Each call overwrites the Team attribute held by the New Relic backend:
    - Initially: Team has no value
    - NewRelic.setAttribute("Team", "Appointments"): Team becomes Appointments
    - NewRelic.setAttribute("Team", "Monitor"): Team becomes Monitor
    - NewRelic.setAttribute("Team", "Core Experience"): Team becomes Core Experience
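The "last team reported wins" behaviour can be modelled in a few lines. This is only an illustrative sketch: `SessionAttributes`, `crashReport` and `lastTeamAtCrash` are hypothetical names, and the in-memory map stands in for the attributes that the New Relic backend keeps for the session.

```kotlin
// Hypothetical in-memory stand-in for the session attributes held by the backend.
class SessionAttributes {
    private val values = mutableMapOf<String, String>()

    // Setting an attribute replaces any previous value stored under the same name.
    fun setAttribute(name: String, value: String) {
        values[name] = value
    }

    // A crash report carries a snapshot of the attributes set at that moment.
    fun crashReport(message: String): Map<String, String> = values + ("crash" to message)
}

fun lastTeamAtCrash(): String? {
    val attributes = SessionAttributes()
    attributes.setAttribute("Team", "Appointments") // user opens the book appointment screen
    attributes.setAttribute("Team", "Monitor")      // user navigates to the monitor screen
    return attributes.crashReport("NullPointerException")["Team"]
}

fun main() {
    println(lastTeamAtCrash()) // prints Monitor
}
```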
  114. Connecting screens with teams

    - Our app is one activity per screen at the moment
    - Based on this, we decided to add an annotation on every activity
  116. Connecting screens with teams

    @Retention(AnnotationRetention.RUNTIME)
    @Target(AnnotationTarget.CLASS)
    annotation class OwnedByTeams(val teams: Array<Team>)

    enum class Team(val teamName: String) {
        TRIAGE("Triage"),
        CHAT_PLATFORM("Chat platform"),
        APPOINTMENTS("Appointments"),
        MONITOR("Monitor"),
        HEALTHCHECK("Healthcheck"),
        MAPLE("Maple"),
        PAYMENTS_AND_ELIGIBILITY("Payments and Eligibility")
    }
  117. Connecting screens with teams

    @OwnedByTeams(teams = [Team.APPOINTMENTS])
    class AppointmentDetailsActivity : AppCompatActivity() {
        ...
    }
  118. Connecting screens with teams

    @OwnedByTeams(teams = [Team.APPOINTMENTS, Team.ANOTHER_TEAM])
    class AppointmentDetailsActivity : AppCompatActivity() {
        ...
    }
  121. Sending product team names

    override fun trackScreenView(event: ScreenViewTrackingEvent) {
        val breadcrumbName = "ScreenView: '${event.name}'"
        NewRelic.recordBreadcrumb(breadcrumbName)
        if (event.screen is Activity) {
            // Read the owning teams declared on the Activity class, if any
            event.screen::class.java.getAnnotation(OwnedByTeams::class.java)?.let {
                NewRelic.setAttribute("Team", it.teams.joinToString { team -> team.teamName })
            }
        }
    }
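The annotation lookup itself does not depend on Android, so it can be exercised on a plain JVM. A minimal sketch, assuming hypothetical screen classes (`AppointmentDetailsScreen`, `UnownedScreen`) in place of real activities, a shortened `Team` enum, and a hypothetical helper `teamAttributeFor` that returns the string that would be sent as the Team attribute:

```kotlin
enum class Team(val teamName: String) {
    APPOINTMENTS("Appointments"),
    MONITOR("Monitor")
}

// RUNTIME retention is what makes the annotation visible to reflection.
@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS)
annotation class OwnedByTeams(val teams: Array<Team>)

// Hypothetical screens; in the app these would be Activity subclasses.
@OwnedByTeams(teams = [Team.APPOINTMENTS, Team.MONITOR])
class AppointmentDetailsScreen

class UnownedScreen

// Returns the comma-separated team names for a screen, or null if it declares no owner.
fun teamAttributeFor(screen: Any): String? =
    screen::class.java.getAnnotation(OwnedByTeams::class.java)
        ?.teams
        ?.joinToString { it.teamName }

fun main() {
    println(teamAttributeFor(AppointmentDetailsScreen())) // prints Appointments, Monitor
    println(teamAttributeFor(UnownedScreen()))            // prints null
}
```

Returning null for an unannotated screen mirrors the `?.let` on the slide: when no owner is declared, the attribute is simply left as it was.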
  125. Enforcing the @OwnedByTeams annotation

    - Using a tool called ArchUnit (https://www.archunit.org/)
    - It lets you write unit tests for your architecture with a nice API
  129. Enforcing the @OwnedByTeams annotation

    @Test
    fun `all activities should be annotated with OwnedByTeams annotation`() {
        // Import the compiled classes of the app so that the rule can inspect them
        val importedClasses = ClassFileImporter().importPackages("com.babylon")
        val classesToCheck = classes().that().areAssignableTo(AppCompatActivity::class.java)
        classesToCheck.should().beAnnotatedWith(OwnedByTeams::class.java)
            .because("You should always assign an owner to a screen")
            .check(importedClasses)
    }
  134. Getting crash-rates per team

    - At this point we had all the relevant data in New Relic
    - We just had to create queries
    - New Relic has its own query language called NRQL
    - Similar to SQL, but not as powerful
  135. Data captured by New Relic

    - MobileSession: App version, Country, Device model, OS version, Session duration
    - MobileCrash: Exception, Available size on disk, Orientation, Architecture
    - MobileHandledException (non-fatal exceptions)
    - MobileRequest: Bytes sent & received, Connection type (2G, 3G etc.), Request URL, Response time, REST status code
    - MobileRequestError: Similar to “MobileRequest”, Error type (cellular issue or REST error), Response body
    - MobileBreadcrumb
  136. NRQL examples

    SELECT * FROM MobileCrash SINCE last week
  137. NRQL examples

    SELECT count(*) FROM MobileRequest
    WHERE requestPath LIKE '%patient%'
    SINCE last week TIMESERIES 5 hours
  138. NRQL examples

    SELECT average(bytesReceived) FROM MobileRequest
    WHERE requestMethod = 'PATCH' AND requestPath LIKE '%appointment%'
    SINCE last month
  144. Crash-rates per product team

    SELECT (filter(count(sessionId), WHERE category = 'Crash' AND Team = 'Appointments')
        / count(sessionId)) * 100
    FROM MobileCrash, MobileSession
    SINCE 1 week ago
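To make the arithmetic of the per-team query concrete, here is a small sketch that applies the same shape of calculation to an in-memory list. Everything in it is an assumption for illustration: `MobileEvent` is a hypothetical stand-in for rows from MobileCrash and MobileSession, the "Session" category value is assumed, and, like the query on the slide, the denominator counts every event from both sources.

```kotlin
// Hypothetical in-memory stand-in for a MobileCrash or MobileSession event.
data class MobileEvent(val sessionId: String, val category: String, val team: String? = null)

// Same shape as: filter(count(sessionId), WHERE ...) / count(sessionId) * 100
fun crashRatePercent(events: List<MobileEvent>, team: String): Double {
    if (events.isEmpty()) return 0.0
    val teamCrashes = events.count { it.category == "Crash" && it.team == team }
    return teamCrashes * 100.0 / events.size
}

fun main() {
    val events = listOf(
        MobileEvent("s1", "Session"),
        MobileEvent("s2", "Session"),
        MobileEvent("s3", "Session"),
        MobileEvent("s3", "Crash", team = "Appointments"),
        MobileEvent("s4", "Session")
    )
    // One matching crash out of five events in total.
    println(crashRatePercent(events, "Appointments")) // prints 20.0
}
```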

  150. Breadcrumbs in New Relic

    - HTTP Response: 200 to request in event 92
    - 612 ms
    - https://prod.babylonpartners.com/api/v2/video_sessions/1233
    - eventType: MobileCrash
    - ScreenView ‘VideoConsultationsActivity’
    - Incoming videocall/notification_accepted
  152. Alerts

    New Relic provides a very flexible alerting framework that can integrate with many tools
  154. Going for New Relic

    We went for it

  155. Bonus: Ownership in debug builds

  160. Bonus: Ownership in other places

    - Feature flags
    - Notifications on Slack grouped by teams
    - Some UI tests
  165. The future

    - KPIs per team
    - Using New Relic alerts
    - Using this for iOS as well
    - Defining higher-level organizational structures of ownership
  166. Thanks!

  167. Q: How would the ownership mechanism work with a single-activity app?

    With one activity and multiple fragments, the only change would be moving the ownership annotation from each activity to each fragment
  168. Q: Is your app modularized, and how would that affect this mechanism?

    We are in the process of modularizing the app. This mechanism can stay as is even with feature modules
  174. Q: What’s a good crash-free rate?

    - There’s no one universal answer. NASA would probably not tolerate a single crash!
    - For the Babylon Android team, the number started from about 99.5%
    - Currently sitting at about 99.9%
    - Seems like it is easier to increase it with our MVI architecture
  175. Q: What’s a good crash-free rate?

    https://github.com/babylonhealth/orbit-mvi

  176. Other questions?