Riga Dev Day
March 13, 2016
Avoiding software fails. Few metrics to improve application reliability by Slawomir Michalik

Transcript

  1. … if it fails to (deliver) reach the finish line
  2. Performance issues increase costs
    63% of IT organizations spend 20%+ of their time working on performance issues.
    Inability to innovate: 40% of developers' time is wasted in triage, stealing focus from activities that innovate.
  3. Web Site: this shouldn't happen
    Some ad company during the American Super Bowl
    Total size: ~20 MB
    434 resources on that page
  4. Web Site: this could be easily eliminated
    Obama Care
    16 individual jQuery-related files that should be merged
    Most JavaScript files contain dev documentation, which makes up to 80% of the file size
  5. Lessons Learned – No Excuses for …
    • Developers not using the browser's built-in diagnostics tools
    • Testers not doing sanity checks with the same tools
    • Some tools for you
      • Built-in inspectors via Ctrl-Shift-I in Chrome and Firefox
      • YSlow, PageSpeed
      • Dynatrace Ajax Edition
    • Level-Up: Automate Testing & Diagnostics Check
  6. Project: Online Room Reservation System
    • Symptoms
      • HTML takes 60-120s to render
      • High GC time
    • Developer assumptions
      • Bad GC tuning
      • Probably bad DB performance, as rendering was simple
    • Resulted in: months of finger-pointing between Dev & DBA
  7. Developer-built monitoring

        void roomreservationReport(int officeId) {
            // Time only the data-loading step
            long startTime = System.currentTimeMillis();
            Object data = loadDataForOffice(officeId);
            long dataLoadTime = System.currentTimeMillis() - startTime;
            generateReport(data, officeId);
        }

    Result: Avg. data load time: 41s!
    DB tool says: Avg. SQL query: <1ms!
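The gap on this slide (41s measured around the whole load step versus under 1ms per query in the DB tool) is easier to reason about when the call count is recorded next to the elapsed time. The following is a minimal sketch of that idea, not the monitoring code from the talk; `CallStats` and its methods are hypothetical names introduced here for illustration.

```java
import java.util.function.Supplier;

// Hypothetical helper (not from the talk): tracks how often a data-access
// call runs and how much wall-clock time those calls take in total.
final class CallStats {
    private long calls;
    private long totalNanos;

    // Runs the given call, counts it, and adds its elapsed time to the total.
    <T> T record(Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            calls++;
            totalNanos += System.nanoTime() - start;
        }
    }

    long calls() {
        return calls;
    }

    double totalMillis() {
        return totalNanos / 1_000_000.0;
    }

    // Average time per call; 0 when nothing has been recorded yet.
    double avgMillis() {
        return calls == 0 ? 0.0 : totalMillis() / calls;
    }
}
```

Wrapping each database call made inside `loadDataForOffice` in `record(...)` would report the number of calls together with the average, so a fast per-query average can no longer hide a very large number of queries.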
  8. #1: Loading too much data
    24889! calls to the DB API
    High CPU & high memory usage to keep all data in memory
  9. #3: Putting all data in temp Hashtable
    Lots of time spent in Hashtable.get
    Called from their Entity Objects
  10. Lessons Learned – Don't Assume …
    • … you know what the code is doing
    • Challenge the developers
    • Don't use Hashtables as a workaround, use O/R mappers
    • Explore tools that "might seem" out of your league!
      • Built-in database analysis tools
      • "Logging" options of frameworks such as Hibernate, …
      • JMX, Perf Counters, … of your application servers
      • APM (performance tracing) tools: Dynatrace, Ruxit, …
  11. Production deployment leads to log SYNC issues
    Log message
    Time in sync
    Two calls coming from customer-coded methods
  12. Test Environment vs. Production Environment
    Test Environment: That's normal: having I/O for Web Requests as main contributor
    Production Environment: Hibernate, Classloading, XML – the key hotspots. I/O for Web Requests doesn't even show up!
  13. These calls all originate from thousands of calls to find item by code
    Top contributor: Class.getInterfaces
    Called from Hibernate's FieldInterceptionHelper
  14. Lessons Learned
    • Plan enough time for proper testing
    • Anticipate changed user behavior during peak load
    • Only test what really ends up in Production
  15. #1 Time really spent in IIS?
    Tip: Elapsed Time tells us WHEN a method was executed!
    Tip: Thread# gives us insight into thread queues / switches
    Finding: Thread 32 in IIS waited 87s to pass control to Thread 30 in ASP.NET
  16. #2 What about these SQL Executions?
    Finding: EVERY SQL statement is executed on ITS OWN Connection!
    Tip: Look at "GetConnection"
  17. #2 SQL Executions! continued …
    #1: Same SQL is executed 67! times
    #2: NO PREPARATION because everything is executed on a new Connection
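The finding "same SQL is executed 67 times" only becomes visible once executions are counted per statement text. Below is a minimal sketch of such a counter, under the assumption that something (for example a data-access wrapper or a framework logging hook) reports each executed statement to it; `SqlExecutionCounter` and its methods are hypothetical names, not an API of any tool mentioned in the talk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: counts total SQL executions and how often the
// same statement text was executed.
final class SqlExecutionCounter {
    private final Map<String, Integer> countsBySql = new LinkedHashMap<>();
    private int total;

    // Call once per executed statement (how statements reach this method
    // is an assumption and out of scope here).
    void onExecute(String sql) {
        total++;
        countsBySql.merge(normalize(sql), 1, Integer::sum);
    }

    int totalExecutions() {
        return total;
    }

    // Statements that were executed more than once, with their counts.
    Map<String, Integer> repeatedStatements() {
        Map<String, Integer> repeated = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : countsBySql.entrySet()) {
            if (entry.getValue() > 1) {
                repeated.put(entry.getKey(), entry.getValue());
            }
        }
        return repeated;
    }

    // Collapses whitespace so differently formatted copies of the same
    // statement are counted together.
    private static String normalize(String sql) {
        return sql.trim().replaceAll("\\s+", " ");
    }
}
```

A statement that shows up in `repeatedStatements()` with a high count is a candidate for a single batched query, or for a prepared statement reused on one connection.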
  18. Natural (data) context
    Coherence: "Theoretically I'm super fast, but it's difficult to tweak me up."
  19. New Generation CRM: Angular.js / Coherence
    23s for one click
    22s
    $3-5M worth data grid
  20. Remember: New Metrics When Testing Apps
    • # Images
    • # Redirects
    • # and Size of Resources
    • # SQL Executions
    • # of SAME SQLs
    • # Items per Page
    • # AJAX per Page
    • Time Spent in API
    • # Calls into API
    • # Functional Errors
    • # 3rd Party calls
    • # of Domains
    • Total Size
    • Resource (W3C) Timings: PLT, DOM Processing/Ready, Page Interactive
  21. Putting it into a Test Automation

    Test Framework Results             | Architectural Data
    Build #    Test Case      Status   | # SQL   # Excep   CPU
    Build 17   testPurchase   OK       | 12      0         120ms
               testSearch     OK       | 3       1         68ms
    Build 18   testPurchase   FAILED   | 12      5         60ms
               testSearch     OK       | 3       1         68ms
    Build 19   testPurchase   OK       | 75      0         230ms
               testSearch     OK       | 3       1         68ms
    Build 20   testPurchase   OK       | 12      0         120ms
               testSearch     OK       | 3       1         68ms

    • We identified a regression
    • Problem solved
    • Exceptions probably reason for failed tests
    • Problem fixed but now we have an architectural regression
    • Now we have the functional and architectural confidence
    • Let's look behind the scenes
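The build-to-build comparison on this slide can be automated by checking architectural metrics next to the functional test status. The sketch below assumes the simplest possible rule (any increase in # SQL or # exceptions over a baseline build is flagged); `ArchitecturalCheck`, `TestMetrics`, and `findRegressions` are hypothetical names, and a real setup would likely want tolerances rather than a strict comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flags functional failures and architectural
// regressions by comparing a test's metrics against a baseline build.
final class ArchitecturalCheck {

    // One test case result plus two of the slide's architectural metrics.
    static final class TestMetrics {
        final String testCase;
        final boolean passed;
        final int sqlCount;
        final int exceptionCount;

        TestMetrics(String testCase, boolean passed, int sqlCount, int exceptionCount) {
            this.testCase = testCase;
            this.passed = passed;
            this.sqlCount = sqlCount;
            this.exceptionCount = exceptionCount;
        }
    }

    // Assumption: any increase over the baseline counts as a regression.
    static List<String> findRegressions(TestMetrics baseline, TestMetrics current) {
        List<String> findings = new ArrayList<>();
        if (!current.passed) {
            findings.add(current.testCase + ": functional failure");
        }
        if (current.sqlCount > baseline.sqlCount) {
            findings.add(current.testCase + ": # SQL rose from "
                    + baseline.sqlCount + " to " + current.sqlCount);
        }
        if (current.exceptionCount > baseline.exceptionCount) {
            findings.add(current.testCase + ": # exceptions rose from "
                    + baseline.exceptionCount + " to " + current.exceptionCount);
        }
        return findings;
    }
}
```

With the `testPurchase` values from the table and Build 17 as the baseline, Build 19 passes functionally but is still flagged because # SQL goes from 12 to 75, which is the architectural regression the slide points at.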