License Performance Debugging Status Quo • Prevention – Functional Performance testing • testutils.org/multi-mechanize – Before production • You do that, right?!
License With Reproduction Instructions • “statprof” – Sampling of tracebacks → statistics – More relevant info than “cProfile” – And lighter – Still doesn't “tell the story”
License Without reproduction instructions • “profile/cProfile” – Way too slow for production • Sprinkle timing calls – log(time.time() - start_time) – Time consuming... – On the production environment?! • DeadlockDebugger
License DeadlockDebugger for thread_id, frame in sys._current_frames().items(): output = StringIO() traceback.print_stack(frame, file=output) res.append("Thread %s:\n%s" % (thread_id, output.getvalue())) res.append("End of dump") return '\n'.join(res) • Directly from the Listening thread – At special URL – Works even if system stuck
License sys._current_frames() • “threadframe” (Python 2.4) • Python dict – Key: thread id (int) • Same as thread.get_ident() for each thread – Value: Stack Frame • Currently executing frame on this thread • Tip of the thread's stack
License Proactive Monitoring • Parsing of HTTP server logs – Reports slow URLs – But many “false” positives • Maatkit (now Percona) – Reports slow MySQL Queries • Still not enough info – Just sprinkling timing calls
License LongRequestLogger • Intercepts and Monitores each request • Logfile: – Request X, thread Y, running for T seconds • Request Info • Stack trace – Request X, thread Y, running for T+2 seconds • Request Info • Stack trace – [...]
License LongRequestLogger shows • Which requests were slow – Not the collateral damage • Why exactly they were slow • The right spot to add optimizations
License ERP5 Case Study • “T-Shirt Custom Printing company” • ERP5: Zope based ERP from Nexedi • MRP, CRM, Accounting, ... • Random peaks of horrible performance • Never twice on the same URL • Always at moments of high volume
License ERP5 Case Study • Culprit: Sale Order Line Price • Calculated on demand – Only the first time – Usually by background activities • Stored on the Sale Order Line • Death by 1000 cuts – Solved in 15 LOCs
License Plone Case Study • Large government portal • LongRequestLogger: – Slowdowns at specific times – Culprit: database network access – Slowed down by scheduled backup
License What to optimize: Limit between Hot and Cold • Stack Trace at T – A • calls B – B • calls C – C • calls D – D • calls E – E • calls F • Stack Trace at T+1 – A • calls B – B • calls C – C • calls X – X • calls Y – Y • calls Z
License Why it works • Statistic Sampling Profiling • Pure probability – What is slow, shows • Stack traces “tell the story” – Enriched by additional info • Captures even the most un-reproduceable • User Independence – Diagnose/Optimize before complaints
License Open Questions • True performance impact – How short can the recurring interval be? – GIL contention if too short? – Single thread to monitor all others?