Slide 1

Slide 1 text

ddegrandis.com @dominicad DevOps Metrics 101: What really matters when measuring performance from a DevOps angle

Slide 2

Slide 2 text

DevOps is the outcome of applying the most trusted principles from manufacturing & leadership to the IT value stream.
Gene Kim, Jez Humble, Patrick Debois, John Willis
“DevOps relies on bodies of knowledge from Lean, Theory of Constraints, the Toyota Production System, resilience engineering, learning organizations, and human factors.”

Slide 3

Slide 3 text

“The result is world-class quality, reliability, stability, security at lower costs/effort; and accelerated flow & reliability throughout the tech value stream, including Product Management, Dev, QA, ITOps, & Infosec.”

Slide 4

Slide 4 text

Utilization LT, CFR, TP WIP

Slide 5

Slide 5 text

Learning Outcomes
1. Define the types of metrics used for DevOps transformations.
2. Show how these metrics are measured and interpreted.
3. Identify top three ways to begin capturing and using DevOps metrics.

Slide 6

Slide 6 text

Literature on DevOps performance metrics
1. Delivery lead time (speed)
2. Deploy frequency (batch size)
3. Mean time to recover (adapt)
4. Change failure rate (quality)

Slide 7

Slide 7 text

1. Delivery Lead Time
• From code commit to code running successfully in prod
• From customer request to code running successfully in prod
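
The two starting points above give two different lead times for the same change. A minimal sketch of both, assuming you can pull three timestamps per change from your tools (the function name and the timestamps are hypothetical, not from the deck):

```python
from datetime import datetime


def delivery_lead_time_hours(start: datetime, running_in_prod: datetime) -> float:
    """Elapsed hours from a start event to the change running in prod."""
    if running_in_prod < start:
        raise ValueError("prod timestamp precedes start timestamp")
    return (running_in_prod - start).total_seconds() / 3600


# Hypothetical timestamps for a single change.
requested = datetime(2018, 12, 3, 9, 0)    # customer request accepted
committed = datetime(2018, 12, 10, 9, 0)   # code committed
deployed = datetime(2018, 12, 11, 15, 0)   # running successfully in prod

commit_to_prod = delivery_lead_time_hours(committed, deployed)
request_to_prod = delivery_lead_time_hours(requested, deployed)
print(commit_to_prod, request_to_prod)
```

Tracking both numbers side by side shows how much of the wait happens before any code is written.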

Slide 8

Slide 8 text

Dev wants change, Ops wants stability
Damon Edwards @damonedwards http://www.rundeck.comvops/
Rework during build/test/deploy can increase when technical debt is not addressed.
Measuring delivery lead time (DLT) over time helps us see trends and discover what needs to improve in the build/test/deploy/secure part of the value stream.

Slide 9

Slide 9 text

Why Delivery Lead Time matters
DevOps Handbook experiments in accelerating delivery at Nationwide
Jim Grafmeyer & Cindy Payne https://www.youtube.com/watch?v=9WAiFAgkO5g

Slide 10

Slide 10 text

2. Deploy Frequency
Code spoils quickly if not integrated into production.

Slide 11

Slide 11 text

Why Deployment Frequency matters
The more frequent deployments are, the smaller the batch size is. Small batches accelerate feedback and reduce WIP, which improves lead times, quality, & efficiency.

Slide 12

Slide 12 text

Dominica DeGrandis Thief Unplanned Work
Knowledge work is perishable
Transaction costs: low for a one-time 6-month supply, high for a one-day supply.

Slide 13

Slide 13 text

Dominica DeGrandis Thief Unplanned Work
Unplanned Work: Interruptions that prevent you from finishing something or from stopping at a better breaking point.
Unplanned Work is a time thief b/c unplanned work usurps planned work.
While economies of scale can reduce costs in manufacturing, software is a different story. Two things to consider:
• Transaction cost
• Holding cost

Slide 14

Slide 14 text

3. Mean Time to Recover (MTTR)
How fast can we respond to change? MTTR is a measure of adaptivity.
MTTR = downtime / # of incidents
Ex: 2 incidents in Dec had combined downtime of 120 min. Dec MTTR is 60 min.
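
The December example can be reproduced in a few lines. This is a sketch only: the slide gives the combined downtime (120 min) but not the per-incident split, so the 45/75 split and the function name are assumptions.

```python
def mttr_minutes(incident_downtimes_min):
    """MTTR = total downtime / number of incidents."""
    if not incident_downtimes_min:
        raise ValueError("MTTR is undefined when there are no incidents")
    return sum(incident_downtimes_min) / len(incident_downtimes_min)


# Two December incidents totalling 120 minutes (hypothetical split).
december_incidents = [45, 75]
december_mttr = mttr_minutes(december_incidents)
print(december_mttr)
```

Because MTTR is an average, one long outage can hide behind several short ones, so it helps to look at the individual downtimes too.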

Slide 15

Slide 15 text

Why MTTR matters
Hardware & software are going to fail. Hope is not a strategy.
DevOps outcomes rely on resilience engineering
Velocity 2012: Dr. Richard Cook, "How Complex Systems Fail" https://www.youtube.com/watch?v=2S0k12uZR14

Slide 16

Slide 16 text

Systems fail
Working at the Center of the Cyclone - Dr. Richard Cook - https://www.youtube.com/watch?v=3ZP98stDUf0

Slide 17

Slide 17 text

Systems fail
Failures are inevitable
Working at the Center of the Cyclone - Dr. Richard Cook - https://www.youtube.com/watch?v=3ZP98stDUf0

Slide 18

Slide 18 text

50+ companies that failed to stay relevant
Burroughs - Univac - Honeywell - Control Data - MSA - McCormack & Dodge - Cullinet - Cincom - ADR - CA - DEC - Data General - Wang - Prime - Tandem - Daisy - Calma - Valid - Apollo - Silicon Graphics - Sun - Atari - Osborne - Commodore - Sacio - Palm - Sega - WordPerfect - Lotus - Ashton Tate - Borland - Informix - Ingres - Sybase - BEA - Siebel - Powersoft - Nortel - Pacific Bell - Qwest - America West - Nynex - Bell South - Netscape - MySpace - Inktomi - Ask Jeeves - AOL - Blackberry - Motorola - Nokia - Sony - General Electric?
Geoffrey Moore - Zone To Win - https://www.amazon.com/Zone-Win-Organizing-Compete-Disruption-ebook/dp/B016R3G2GY

Slide 19

Slide 19 text

4. Change Failure Rate (CFR)
Answers the Q: What % of changes to prod fail?
CFR = # of failed items / total # of work items completed
Ex: 60 items completed in Dec, 20 of them resulted in a failure. Dec CFR is about 33%.
Failure: a change resulting in an outage or degraded service where a hotfix, rollback, or patch is required.
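
Applying the formula to the December numbers (20 failed items out of 60 completed) gives one third, or about 33%. A minimal sketch, with a hypothetical function name:

```python
def change_failure_rate(failed_items: int, completed_items: int) -> float:
    """CFR = # of failed items / total # of work items completed."""
    if completed_items <= 0:
        raise ValueError("need at least one completed work item")
    if not 0 <= failed_items <= completed_items:
        raise ValueError("failed items must be between 0 and completed items")
    return failed_items / completed_items


# December example from the slide: 60 completed, 20 resulted in a failure.
december_cfr = change_failure_rate(failed_items=20, completed_items=60)
print(f"{december_cfr:.0%}")
```

The result depends entirely on how "failure" is counted, so agree on the definition (outage or degraded service needing a hotfix, rollback, or patch) before comparing months.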

Slide 20

Slide 20 text

Why Change Failure Rate (CFR) matters
DevOps outcomes include “world-class quality”. CFR provides an effective way to identify opportunities to improve quality.

Slide 21

Slide 21 text

“When you focus solely on shallow data you give up the return on investments that can be realized by deeper and more elaborate analysis.” ~John Allspaw
http://www.adaptivecapacitylabs.com/blog/2018/03/23/moving-past-shallow-incident-data/
Ex: Instead of blame, ask, “why did it make sense for someone to do that at that time?”
Learning from incidents requires psychological safety.

Slide 22

Slide 22 text

5. A Culture metric to gauge team safety
Examples:
• On my team, failure causes inquiry and not blame.
• Our leadership is open to hearing bad news.
• In my org, failures are learning opportunities and messengers are not punished.
@nicolefv https://www.youtube.com/watch?v=avauW5FAWCw
promoters passives detractors

Slide 23

Slide 23 text


Slide 24

Slide 24 text

Typology of Organizational Culture (Westrum)
https://puppet.com/resources/white-paper/2015-state-devops-report

Slide 25

Slide 25 text

Adding a culture metric to previous 4 metrics
• Delivery Lead Time (speed)
• Deploy Frequency (batch size)
• MTTR (capability to adapt quickly)
• Change Failure Rate (quality)
and you are off to a good start on your DevOps journey.
But… The reason DevOps conversations began in 2009 was to address problems with local optimization & siloes.

Slide 26

Slide 26 text

It doesn’t matter how fast one piece of the value stream moves when other parts of the system lag.
We are so freaking AGILE, Yay!
@jonsmart The PMO is Dead, Long Live the PMO – Barclays https://www.youtube.com/watch?v=R-fol1vkPlM

Slide 27

Slide 27 text

Improve your decision making even more with the five best metrics you’ve never met.
1. Flow time
2. Flow efficiency
3. The WIP report
4. The Aging report
5. Work type distribution

Slide 28

Slide 28 text

1. Flow Time
Yes, let’s do this!

Slide 29

Slide 29 text

Why Flow time matters
Understanding the elapsed time it takes a request to go from “Yes, let’s do this” to working in production helps you be more predictable.

Slide 30

Slide 30 text

Upstream Discovery
Transparency included in team space
Specialists supporting multiple teams are pulled in different directions, resulting in conflicting priorities.
Dependencies on specialists mean that people aren’t available when needed.
https://techbeacon.com/lesson-agile-how-one-team-ended-dependency-delays?utm_content=buffera8491&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Slide 31

Slide 31 text

2. Flow Efficiency
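
The deck does not spell out a formula here. Flow efficiency is commonly defined as the share of total flow time in which an item is actively being worked on rather than waiting. A minimal sketch under that assumption (function name and numbers are hypothetical):

```python
def flow_efficiency(active_days: float, waiting_days: float) -> float:
    """Active time divided by total flow time (active + waiting)."""
    if active_days < 0 or waiting_days < 0:
        raise ValueError("durations cannot be negative")
    total = active_days + waiting_days
    if total == 0:
        raise ValueError("total flow time must be greater than zero")
    return active_days / total


# Hypothetical item: 3 days of hands-on work, 12 days waiting in queues.
example_efficiency = flow_efficiency(active_days=3, waiting_days=12)
print(f"{example_efficiency:.0%}")
```

A low value points at wait states (handoffs, approvals, dependencies) rather than at how fast people work.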

Slide 32

Slide 32 text

Why Flow Efficiency matters
IT needs Product Leadership to conquer tech debt, especially when doing so the 1st time.
Dev & Ops are more reliant upon Product Owners/Product Mgrs, who prioritize the work tech does. We need their help to ensure that non-functional requirements get prioritized.

Slide 33

Slide 33 text

Why Flow Efficiency matters
IT needs Biz Leadership to conquer tech debt, especially for the 1st time
When the Business Partners with Tech and they do a Dojo https://www.youtube.com/watch?v=WEJVE6PITJE
Whole team learning model pioneered by Target. Dojo: place of the way

Slide 34

Slide 34 text

IT needs Biz Leadership to conquer tech debt, especially for the 1st time
Speed of iterating went from months to hours
https://www.youtube.com/watch?v=WEJVE6PITJE

Slide 35

Slide 35 text

Why Flow Efficiency matters
IT needs Biz Leadership to conquer tech debt, especially for the 1st time
When the Business Partners with Tech and they do a Dojo https://www.youtube.com/watch?v=WEJVE6PITJE
Test time reduced from three weeks to three hours

Slide 36

Slide 36 text

3. The WIP Report

Slide 37

Slide 37 text

Why WIP matters
People have a finite amount of capacity
https://itrevolution.com/book/the-cornerstone-for-winning/
https://www.youtube.com/watch?v=qav1y7G15JQ

Slide 38

Slide 38 text

Why WIP matters
People have a finite amount of capacity
https://itrevolution.com/book/the-cornerstone-for-winning/
https://www.youtube.com/watch?v=qav1y7G15JQ

Slide 39

Slide 39 text

Dominica DeGrandis Thief Too much Work-in-progress (WIP)
High WIP means that other items sit waiting for service longer.
The single most important factor that affects queue size is capacity utilization.

Slide 40

Slide 40 text

Dominica DeGrandis Thief Too much Work-in-progress (WIP)
Queuing Theory: Applied statistics that studies waiting lines
Queuing Theory allows us to quantify the relationship between wait times and capacity utilization. Wait times increase sharply (non-linearly) as utilization approaches 100%.
If the goal is speed, consider managing work by queues.
http://reinertsenassociates.com/books/
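
To make the utilization effect concrete, here is a sketch using the textbook M/M/1 queue (one server, random arrivals, exponentially distributed service times). The deck does not name a queueing model, so M/M/1 is an assumption chosen for its simple closed form: mean wait in queue = utilization / (1 - utilization) × mean service time.

```python
def mean_wait_in_queue(utilization: float, mean_service_time: float = 1.0) -> float:
    """Mean time an item waits before service starts in an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return utilization / (1 - utilization) * mean_service_time


# Wait time, in multiples of the mean service time, at several utilization levels.
for u in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {u:.0%}: wait is {mean_wait_in_queue(u):.1f}x service time")
```

Under this model, moving from 50% to 90% utilization multiplies the wait roughly ninefold, and the last few points before 100% cost far more than the first fifty.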

Slide 41

Slide 41 text

Dominica DeGrandis
WIP is a leading indicator

Slide 42

Slide 42 text

The WIP Report

Slide 43

Slide 43 text

4. The Aging Report

Slide 44

Slide 44 text

Why Age of work items matters
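
An aging report lists unfinished items by how long they have been in progress. A minimal sketch, assuming each in-progress item has an id and a start date (the ids, dates, and function name are hypothetical):

```python
from datetime import date


def aging_report(in_progress, today):
    """Return (item id, age in days) pairs for unfinished items, oldest first."""
    ages = [(item_id, (today - started).days) for item_id, started in in_progress.items()]
    return sorted(ages, key=lambda pair: pair[1], reverse=True)


# Hypothetical in-progress items mapped to the date work started.
in_progress_items = {
    "A-1": date(2018, 11, 1),
    "A-2": date(2018, 12, 10),
    "A-3": date(2018, 11, 20),
}
report = aging_report(in_progress_items, today=date(2018, 12, 15))
for item_id, age_days in report:
    print(item_id, age_days)
```

Sorting oldest first puts the items most at risk of being neglected at the top of the conversation.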

Slide 45

Slide 45 text

5. Work Type Distribution

Slide 46

Slide 46 text

How to capture Work Type Distribution
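
One simple way to capture it, assuming every completed work item carries a type label in your workflow tool, is to count items per type and convert the counts to shares. The labels below are hypothetical examples:

```python
from collections import Counter


def work_type_distribution(work_types):
    """Return each work type's share of all items as a fraction of 1."""
    counts = Counter(work_types)
    total = sum(counts.values())
    return {work_type: count / total for work_type, count in counts.items()}


# Hypothetical type labels for items completed in one period.
completed_types = ["feature", "feature", "defect", "debt"]
distribution = work_type_distribution(completed_types)
for work_type, share in distribution.items():
    print(f"{work_type}: {share:.0%}")
```

Comparing the distribution across periods shows whether unplanned or defect work is crowding out planned work.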

Slide 47

Slide 47 text

Beware the Red Yellow Green (RYG) Report
Not a DevOps metric
Think about when you visit a badly designed website and how little you trust it.
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” ~ Jim Barksdale

Slide 48

Slide 48 text

Three ways to begin capturing and using DevOps metrics
1. Safe to fail experiments
2. Make work visible
3. Automatically capture data with tools

Slide 49

Slide 49 text

1. Safe to fail experiments
A complex system has no repeating relationships between cause and effect. When dealing with complex systems there is the need for experimentation.
Dave Snowden: http://cognitive-edge.com/methods/safe-to-fail-probes/

Slide 50

Slide 50 text

2. Make work & metrics visible

Slide 51

Slide 51 text

3. Automate – let your workflow mgmt tools automatically capture flow data.
ServiceNow – Jira – HPE ALM

Slide 52

Slide 52 text

3. Automate – let your workflow mgmt tools automatically capture flow data.
Microsoft Project – VSTS

Slide 53

Slide 53 text

A metrics learning experiment
1 metric trend in 4 areas:
• Speed
• Productivity
• Quality
• Predictability
See impacts of change in 1 metric by showing all 4 metrics
Inspired by Troy Magennis & Larry Maccherone, “Doing Team Metrics Right,” http://focusedobjective.com/team-metrics-right/

Slide 54

Slide 54 text

1/4 Look at Flow time: How fast?
Chart: Flow Time by date
Unplanned work delays Planned work
Influence others using the power of visualization

Slide 55

Slide 55 text

2/4 Look at Throughput: How productive?
Chart: Throughput by date
Question: Does throughput (TP) improve when there are fewer conflicting priorities (less WIP)?

Slide 56

Slide 56 text

3/4 How good? Quality
Chart: Change Failure Rate = # FD done items / # of total done items, by date
Oh - ok – I see what you mean!!!
What we measure impacts people b/c people value what is measured.

Slide 57

Slide 57 text

4/4 Balanced Flow chart exercise: How predictable?
Chart: 90th percentile by date, filtered on business requests
When people complain that things take too long, measure actuals. It’s useful to test opinions against data.
Percentiles answer the Q: “What’s the probability of completing work in x days?”
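
A percentile can be read straight off historical flow times. The sketch below uses the nearest-rank method, which is one of several percentile definitions and is an assumption here; the flow times and function name are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct: int):
    """Smallest observed value with at least pct% of observations at or below it."""
    if not values:
        raise ValueError("need at least one observation")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical flow times, in days, for ten completed business requests.
flow_times_days = [2, 3, 3, 4, 5, 6, 8, 9, 12, 30]
p90 = percentile_nearest_rank(flow_times_days, 90)
p50 = percentile_nearest_rank(flow_times_days, 50)
print(p50, p90)
```

With this sample, 90% of requests finished within `p90` days, which is a more honest forecast than the average because the 30-day outlier does not distort it.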

Slide 58

Slide 58 text

Three Takeaways
1. Capture & present metrics to help others see the problems & risks in order to provoke necessary conversations for change.
2. Implement change using experiments and a humble approach to get the buy-in you need for change.
3. Shift left – visualize upstream work along with your work to see the value stream to optimize the whole vs. individual teams/siloes.

Slide 59

Slide 59 text

Email: [email protected]
Subject: flow
To receive:
• copy of this presentation deck
• excerpts of Making Work Visible
• Tasktop video on TFS/SN tool integration
• Forrester article: Agile-Plus-DevOps With Value Stream Management