Lessons from automating crash handling at Unity

Igor
November 10, 2017

Challenges we faced and solutions we made in order to automate handling of crash reports at Unity

https://www.youtube.com/watch?v=CUCPUqwmlS4 (BuildStuffLT, EN)
https://www.youtube.com/watch?v=LCfm70wY31w (XPDays Ukraine, RU)

Transcript

  1. What dog are you?

    ❏ Building client-server and distributed solutions since 2007
    ❏ C# .NET, Python developer
    ❏ Religious about good code, software design, TDD, SOLID
    ❏ Love to learn new stuff
    ❏ Toolsmith for Unity Technologies
    ❏ Fun Microsoft booth at NDC Oslo 2016
  2. How complex is Unity?

    ❏ Editor is available for all desktop platforms (Win, Mac, Linux)
    ❏ Almost 30 target platforms (Desktop, Mobile, Consoles, Smart TVs, VR)
  3. Toolsmiths make sure things run smoothly

    ❏ Testing frameworks (native, runtime, graphics, performance, etc.)
    ❏ Unified Test Runner and Smart Tests Selection
    ❏ Bug reporting infrastructure (bug reporting tool and backend, bug tracking system, crash analyzer)

    @k04a
  4. Reporting bugs in Unity

    ❏ Ships with an internal tool called Bug Reporter
    ❏ Lets you send editor logs and the project along with a description of the problem
    ❏ Uses FogBugz as a bug tracking system

    Crashes are no different: crash data is written to the editor log
  5. When the old tricks stopped working

    ❏ With over a million registered users, QA was unable to keep up with the number of reports coming in (~6k monthly)
    ❏ Every report has to be reproduced before it is handed to devs for resolution, but only 15% turn out to be real bugs
    ❏ Crashes cause high user pain and should be addressed first
  6. The problems to address

    ❏ Can’t tell a crash from other bugs
    ❏ Can’t tell if two crashes are the same or different
    ❏ No metrics on the frequency of particular crashes, hence no way to prioritize
    ❏ Lots of duplicate work
  7. Socorro: Mozilla’s Backend plus Reporting UI

    Pros:
    ❏ Rich functionality
    ❏ Open source

    Cons:
    ❏ Hard to run and modify, highly focused on Mozilla
    ❏ Expects Breakpad as client lib
  8. Breakpad: Google’s cross-platform crash collecting

    Pros:
    ❏ Cross-platform
    ❏ Open source
    ❏ Symbols are separated from the app

    Cons:
    ❏ Set of libraries to produce callstacks but no processing server
    ❏ Cannot resolve Mono (C#) frames
  9. But we already have crash collecting!

    ❏ It is cross-platform and it works (including Mono)
    ❏ We have to deal with reports already sitting in our bug tracking system
    ❏ Maybe we are better off just building our own processing tool (in the words of Bert Lance: “if it ain’t broke, don’t fix it”)?!
  10. Grouping callstacks into buckets

    ❏ Inspired by Windows Error Reporting (WER)
    ❏ Goal: one root cause per bucket
    ❏ Divide and conquer: find (and fix) the biggest ones first
  11. Full stack matching vs top frame matching

    ❏ Strict matching: 20k crash reports produced 13k buckets (⅔ turned out to be unique)
    ❏ Top frame only matching: if it crashed in the same function, maybe it is the same problem
  12. Middle ground - the notion of GIST

    ❏ Stripped-down version of the callstack
    ❏ Recursion and system calls removed
    ❏ Allows for better grouping
  13. 2 levels of bucketing

    ❏ Report - user-reported incident (case from FogBugz)
    ❏ Crash - lower-level container, combines reports with (nearly) identical callstacks (GIST)
    ❏ Bucket - top-level container, combines reports with identical crashed function (same top frame)

    Recent numbers: 80k reports in 22k crashes across 4.5k buckets (4.5k reports in the biggest bucket)
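
The GIST and the two bucketing levels from slides 12 and 13 can be sketched roughly as follows. This is a minimal illustration, not the actual tool: the `SYSTEM_PREFIXES` list, the frame string format, and the names `gist`, `bucketize` and `rank_buckets` are assumptions made for the example. Recursion is handled only by collapsing consecutive repeated frames, and the bucket key is taken to be the first frame left in the GIST.

```python
from collections import defaultdict

# Hypothetical prefixes marking "system" frames; the real rules are not given in the talk.
SYSTEM_PREFIXES = ("ntdll!", "kernel32!", "libc.")


def gist(callstack):
    """Return a stripped-down callstack: system frames dropped, direct recursion collapsed."""
    result = []
    for frame in callstack:
        if frame.startswith(SYSTEM_PREFIXES):
            continue  # skip system calls
        if result and result[-1] == frame:
            continue  # collapse a frame that repeats the previous one
        result.append(frame)
    return tuple(result)


def bucketize(reports):
    """Group (case_id, callstack) pairs as {top_frame: {gist: [case_id, ...]}}.

    The outer key is the bucket (same top frame), the inner key is the
    crash (same GIST), and the list holds the individual reports.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for case_id, callstack in reports:
        g = gist(callstack)
        top_frame = g[0] if g else "<unknown>"
        buckets[top_frame][g].append(case_id)
    return buckets


def rank_buckets(buckets):
    """Return (top_frame, report_count) pairs, biggest bucket first."""
    totals = {
        top_frame: sum(len(case_ids) for case_ids in crashes.values())
        for top_frame, crashes in buckets.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Ranking buckets by report count is one simple way to "find (and fix) the biggest ones first"; a real system would likely also weigh Unity version and recency.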
  14. Easy navigation to / from FogBugz

    ❏ Open all the cases from the same bucket or crash in FogBugz
    ❏ If you have a case at hand to investigate, easily find out which bucket it belongs to
  15. Use case: monthly report on top crashes

    ❏ Fixed
    ❏ Fix in progress or needs backporting
    ❏ Not fixed or not reproduced yet
    ❏ Won’t fix - external
  16. Finding out about a new crash is even more promising

    ❏ Slack channel & integration to let everyone interested know
    ❏ Along with a brief description (version, callstack), so if it is your area as the developer, you can try to work on it before it becomes a massive problem for everyone
  17. The notion of the duplicate and repro

    ❏ Do we have a report with a similar callstack that was turned into a bug (reproduced)?
    ❏ Is the bug fixed?
    ❏ Is the fix available (released)?
  18. As a new report comes in, we know if it’s a duplicate

    Message contains:
    • Unity version
    • Bucket name (top frame, crashed function)
    • Ids of the case and possible repro

    Still requires a human to make the final call
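
The duplicate check from slides 17 and 18 could look something like the sketch below. It is only an illustration under stated assumptions: the `KnownCrash` record, the `describe_report` function, and the message layout are hypothetical, and the lookup assumes known crashes are keyed by their GIST. The output is meant as a hint for a human, who still makes the final call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownCrash:
    """Hypothetical record of what is already known about one crash (one GIST)."""
    repro_case_id: Optional[int] = None  # case that was reproduced and turned into a bug
    fixed: bool = False                  # is the bug fixed?
    fix_released: bool = False           # is the fix available in a release?


def describe_report(case_id, unity_version, gist, known_crashes):
    """Build the notification text for a new report.

    known_crashes maps a GIST (tuple of frames) to a KnownCrash.
    """
    gist = tuple(gist)
    bucket = gist[0] if gist else "<unknown>"
    lines = [
        f"Unity version: {unity_version}",
        f"Bucket: {bucket}",
        f"Case: {case_id}",
    ]
    crash = known_crashes.get(gist)
    if crash is None or crash.repro_case_id is None:
        lines.append("Possible repro: none known")
    else:
        if crash.fix_released:
            status = "fix released"
        elif crash.fixed:
            status = "fixed, not released"
        else:
            status = "not fixed"
        lines.append(f"Possible repro: {crash.repro_case_id} ({status})")
    return "\n".join(lines)
```

The returned text could then be posted to a Slack channel or attached to the case; that integration is left out here.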
  19. Other things to consider or memo to ourselves

    ❏ Online crash processing - detect duplicates on the client
    ❏ Out-of-process crash collecting and automated minidump sending
    ❏ Bucketing is still hard - allow buckets to be split or combined manually, maybe apply machine learning to learn more
    ❏ Map buckets to areas in the code (Physics, Graphics, etc.)
  20. Conclusions & takeaways

    ❏ If it’s not your core business, try to find out-of-the-box solutions instead of building your own
    ❏ Consider using Breakpad if you need cross-platform crash collecting
    ❏ Focus on one problem at a time, instead of building a starship
    ❏ At large scale try to automate and bucket things up, prioritize work
    ❏ Split crash reporting from bug reporting
  21. Useful links

    ❏ Microsoft WER: https://docs.microsoft.com/en-us/windows-hardware/drivers/dashboard/how-wer-collects-and-classifies-error-reports
    ❏ WER - 10 years of debugging in the large (article): http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf
    ❏ Breakpad: https://chromium.googlesource.com/breakpad/breakpad/
    ❏ Socorro: https://wiki.mozilla.org/Socorro