Debugging in the (Very) Large

raven
November 18, 2019

Debugging in the (Very) Large: Ten Years of Implementation and Experience, by Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt

Transcript

  1. Debugging in the (Very) Large: Ten Years of Implementation and Experience
     Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt
     Presented by raven, 2019/7/1
     These slides are based on the authors' slides.
  2. Basic Data
     • A report on Windows Error Reporting (WER), which supports debugging in large systems.
     • Authors: Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt
     • Authors' institution: Microsoft Corporation
     • Appeared in: Symposium on Operating Systems Principles (SOSP) 2009
  3. Key Quotes
     • Debugging in the large is harder. When the number of software components in a single system grows to the hundreds and the number of deployed systems grows to the millions, strategies that worked in the small, like asking programmers to triage individual error reports, fail. With hundreds of components, it becomes much harder to isolate the root cause of an error. With millions of systems, the sheer volume of error reports for even obscure bugs can become overwhelming. Worse still, prioritizing error reports from millions of users becomes arbitrary and ad hoc.
  4. Overview
     • Microsoft deployed an error reporting system to support debugging in large systems.
     • Windows Error Reporting (WER)
       - generates error reports.
       - collects error reports and other data.
       - identifies important bugs.
  5. Two Definitions
     • Bug: a flaw in program logic
     • Error: a failure in execution caused by a bug
       - Run a buggy program 5,000 times and you will get 5,000 errors.
       - One bug may cause many errors.
  6. Goals
     • Microsoft ships software to 1 billion users.
       - How do we find out when things go wrong?
     • They want to
       ◦ fix bugs on every Windows system.
       ◦ collect every error.
       ◦ prioritize bugs that affect the most users.
       ◦ generalize the solution so that any programmer can use it.
  13. !analyze
     • Engine for WER bucketing heuristics
     • Extension to the Debugging Tools for Windows
       - input is an error report, output is a bucket ID
       - runs on WER servers (and programmers' desktops)
     • 500 heuristics
       - grows by ~1 heuristic/week
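
The "error report in, bucket ID out" contract on this slide can be sketched as an ordered chain of heuristics. This is only an illustration: the `ErrorReport` fields, the two heuristics, the module name `runtime.dll`, and the label format are all hypothetical and are not taken from !analyze itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ErrorReport:
    """Hypothetical, heavily simplified error report."""
    app: str
    module: str              # module containing the faulting instruction
    offset: int              # offset of the fault inside that module
    exception: str
    stack: Tuple[str, ...]   # "module!function" frames, innermost first


Heuristic = Callable[[ErrorReport], Optional[str]]

# Assumed set of shared libraries that should not take the blame themselves.
RUNTIME_MODULES = {"runtime.dll"}


def blame_first_non_runtime_frame(report: ErrorReport) -> Optional[str]:
    """If the fault is inside a shared runtime, blame the nearest caller outside it."""
    if report.module not in RUNTIME_MODULES:
        return None
    for frame in report.stack:
        if frame.split("!")[0] not in RUNTIME_MODULES:
            return f"{report.app}:{report.exception}:{frame}"
    return None


def default_label(report: ErrorReport) -> str:
    """Fallback: label the bucket by faulting module and offset."""
    return f"{report.app}:{report.exception}:{report.module}+0x{report.offset:x}"


# Heuristics are tried in order; the first one that produces a label wins.
HEURISTICS: List[Heuristic] = [blame_first_non_runtime_frame, default_label]


def bucket_id(report: ErrorReport) -> str:
    for heuristic in HEURISTICS:
        label = heuristic(report)
        if label is not None:
            return label
    raise AssertionError("default_label always returns a label")
```

Keeping the heuristics in a plain list means a new one can be added without touching `bucket_id`, which suits a rule set that grows every week.
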
  14. Flow of WER
     • client automatically collects an error report
     • sends report to servers
     • !analyze buckets the error with similar reports
     • increments the bucket count
     • programmers prioritize buckets with highest count
     • Problems
       - only upload first few hits on a bucket.
       - programmers request additional data as needed
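
The counting and throttling steps in this flow can be sketched as follows. This is a minimal illustration under stated assumptions: the class `BucketServer`, its methods, and the limit of full reports per bucket are hypothetical, and the sketch collapses the client-server exchange into a single call.

```python
from collections import Counter
from typing import Dict, List, Tuple


class BucketServer:
    """Hypothetical server: counts every hit, keeps only the first few full reports."""

    def __init__(self, max_full_reports: int = 3) -> None:
        self.max_full_reports = max_full_reports
        self.counts: Counter = Counter()
        self.stored: Dict[str, List[str]] = {}

    def report(self, bucket: str, full_report: str) -> bool:
        """Record one error; return True if the full report was stored."""
        self.counts[bucket] += 1          # every hit increments the bucket count
        kept = self.stored.setdefault(bucket, [])
        if len(kept) < self.max_full_reports:
            kept.append(full_report)      # only the first few hits are uploaded
            return True
        return False

    def triage_order(self) -> List[Tuple[str, int]]:
        """Buckets sorted by hit count, highest first."""
        return self.counts.most_common()
```

For example, five hits on one bucket with `max_full_reports=2` store two full reports but still yield a count of five, so the bucket keeps its priority while storage stays bounded.
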
  16. Bucketing Mostly Works
     • One bug can hit multiple buckets
       - up to 40% of error reports
       - extra server load
       - duplicate buckets must be triaged
     • Multiple bugs can hit one bucket
       - up to 4% of error reports
       - harder to isolate each bug
  17. Evaluation
     • Scalability
       - the number of error reports processed by WER grew by a factor of 30 (2003-2009)
     • Finding Bugs
       - The Windows Vista programmers fixed 5,000 bugs.
     • Bucketing Effectiveness
       - The top 500 buckets account for 65% of all error reports for Vista.
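
The bucketing-effectiveness figure is a "share of reports covered by the top k buckets" statistic. A minimal sketch of how such a number could be computed from bucket counts is shown here; the function name and the sample counts are made up for illustration and are not WER data.

```python
from typing import Mapping


def top_k_share(bucket_counts: Mapping[str, int], k: int) -> float:
    """Fraction of all error reports that fall into the k largest buckets."""
    total = sum(bucket_counts.values())
    if total == 0:
        return 0.0
    largest = sorted(bucket_counts.values(), reverse=True)[:k]
    return sum(largest) / total


# Made-up counts: the two largest buckets hold 80 of 100 reports.
example_counts = {"A": 50, "B": 30, "C": 15, "D": 5}
example_share = top_k_share(example_counts, 2)
```

A skewed distribution like this is what makes prioritizing by bucket count worthwhile: fixing a small number of buckets addresses most reported errors.
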
  18. Summary
     • Windows Error Reporting (WER)
       - the first modern reporting system with automatic diagnosis
       - the largest client-server system in the world (by installs)
       - helped 700 companies fix 1000s of bugs and billions of errors
       - fundamentally changed software development at MS