Artur Speth - Microsoft Developer Division's Path into the Next Agile Era - DevDay 2016

Transcript

  1. Visual Studio & TFS Update 1; Visual Studio & TFS Update 2; Visual Studio & TFS Update n; VS Team Services
  2. Planning. Customer feedback – we should change the way a feature works. We didn’t get it quite right… … but we’re booked solid already. 2 years
  3. Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months), on a Spring/Fall timeline. Aspirational: 60%
  4. Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months), on a Spring/Fall timeline. Hopeful: 80%. What epics are we lighting up?
  5. Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months), on a Spring/Fall timeline. Thoughtful: 90%. What features are planned?
  6. Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months), Scenario (18 months), on a Spring/Fall timeline. Confident: 95%. Which stories are complete? What features are shipping?
  7. [Timeline: Sprints 97, 98 and 99, each divided into Week 1 to Week 3.] The sprint plan. What we accomplished.
  8. • Updates were large • Months apart • Lots of problems!
    [Timeline, 4/1/2010 to 4/23/2012: 5/3/2010 TFS 2010 RTM; 4/23/2011 Service Deployment; 8/5/2011 Service Update; 9/26/2011 //BUILD 2011; 12/7/2011 Service Update; 1/30/2012 Service Update; 2/20/2012 Service Update; 3/12/2012 Service Update; 4/2/2012 Service Update]
  9. [Timeline: Sprints 97, 98 and 99, each divided into Week 1 to Week 3.] Deployment. Sprint Planning. Done.
  10. ONE

  11. VSO SU1 (Chicago); VSO SU0 (San Antonio); VSO SU4 (Amsterdam); Shared Platform Services (San Antonio)
  12. Getting the availability model right
    [Chart: Sept 25th 2013 LSI, 2:24 PM to 10:48 PM. Series: FailedExecutionCount, SlowExecutionCount, Availability (ID4 - Activity Only), Availability (Current); Start and End markers. Availability axis from 0.8 to 1; count axis from -200 to 1600.]
  13. Alerting is key to fast detection. Every alert must be actionable and represent a real issue with the system. Alerts should create a sense of urgency – false alerts dilute that. Redundant alerts for the same issue. Needed to set the right thresholds and tune often. Stateless alerts contributed to further noise.
  14. Health model in action • 3 errors for memory and performance • All 3 related to the same code defect • APM component mapped to feature team • Auto-dialer engaged Global DRI
    Eliminated alert noise: from ~928 alerts per week to ~22, and reduced DRI escalations by ~56%.
  15. Service Availability & Health Metrics (DRAFT)
    [Dashboard panels: Time to Mitigate; Time to Detect; Error By Source; Incidents by Severity; User Impact Minutes During Incidents [TFS Only]. Axes: % of Incidents, Incident Count, User Minutes. Callouts 1 to 4 refer to the notes below.]
    1. TFS availability is on an improving trend. No Sev0/Sev1 LSIs for July.
    2. App Insights switched from synthetic availability to real-user experience in the Ibiza portal. A high volume of SEV-2 LSIs (72) contributed to customer impact, in addition to intermittent UX errors. (UX fixes applied on 8/11 improve availability.)
    3. App Insights was impacted by 3 long-running LSIs related to ES maintenance, Ibiza updates and an Azure Storage outage.
    4. TFS Service attainment (SLO) improved significantly MoM, with a focus on minimizing failed/slow commands and reviewing them in weekly LiveSite reviews.
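
The availability chart on slide 12 plots FailedExecutionCount and SlowExecutionCount against availability, and slide 15 mentions minimizing failed/slow commands. The slides do not state the formula, so the following is only a minimal sketch under one assumption: a command counts against availability if it either failed or was slow. The `Window` type and its field names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Command counts for one measurement window (hypothetical shape)."""
    total: int   # all commands executed in the window
    failed: int  # cf. FailedExecutionCount on the chart
    slow: int    # cf. SlowExecutionCount on the chart


def availability(w: Window) -> float:
    """Fraction of commands that were neither failed nor slow.

    Assumption: failed and slow commands both count as "bad";
    the slides do not give the exact definition.
    """
    if w.total == 0:
        return 1.0  # no activity: treat the window as fully available
    bad = min(w.failed + w.slow, w.total)  # guard against double counting
    return 1.0 - bad / w.total


windows = [Window(1000, 0, 0), Window(1000, 50, 150), Window(0, 0, 0)]
print([round(availability(w), 2) for w in windows])  # [1.0, 0.8, 1.0]
```

Measuring activity in this way ties the number to what users actually ran, which is the direction slide 15 describes when it contrasts synthetic availability with real-user experience.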
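
Slides 13 and 14 blame redundant and stateless alerts for noise, and describe three errors from one code defect being routed to a single feature team. One way to picture the fix is an alert manager that remembers which issues are already open and pages only once per root cause. This is a hedged sketch, not the team's actual health model: `AlertManager`, the root-cause key and the team name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlertManager:
    """Tracks open issues so a repeated signal does not page again."""
    open_alerts: dict = field(default_factory=dict)  # root cause -> signal count
    pages_sent: list = field(default_factory=list)   # (team, root cause) pairs

    def report(self, root_cause: str, team: str) -> bool:
        """Record a failing signal; return True only if a page was sent."""
        if root_cause in self.open_alerts:
            # Already open: count the signal but do not page again.
            self.open_alerts[root_cause] += 1
            return False
        self.open_alerts[root_cause] = 1
        self.pages_sent.append((team, root_cause))
        return True

    def resolve(self, root_cause: str) -> None:
        """Close the issue so that a later recurrence pages again."""
        self.open_alerts.pop(root_cause, None)


mgr = AlertManager()
# Three errors (memory and performance) that share one code defect.
signals = [("defect-123", "feature-team-a")] * 3
paged = [mgr.report(cause, team) for cause, team in signals]
print(paged)           # [True, False, False]
print(mgr.pages_sent)  # [('feature-team-a', 'defect-123')]
```

Keeping state per root cause is what separates this from a stateless alert, which would fire on each of the three errors independently.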