
People Metrics

Amin Astaneh (@aastaneh) from Acquia shares his experience with using metrics to track unplanned work, delivery effectiveness and team happiness.

Presented at the July 2016 Boston DevOps meetup.

http://www.meetup.com/Boston-Devops/events/232251543/

Boston DevOps

July 20, 2016

Transcript

  1. Who am I? Senior Manager, Infrastructure Services at Acquia •

    Was in Operations Team from 2011-2015 • Formalized incident response and ticketing process • Wrote automation tools to manage a rapidly-growing fleet (now ~15000) • Implemented Kanban process in 2015 to track work-in-progress • Currently tech lead for Tools Team, people manager for Tier2 Operations
  2. So.. Metrics. What do we usually think of when it comes

    to “metrics” for a product or service? • Application Response Times • Application Uptime • Application Error Rates • CPU/Memory/Disk/Network
  3. So.. Metrics. We use this information to drive decisions around:

    • Should a person get paged? • Do we need to scale our infrastructure? • Should we revert a code change? • ..etc
  4. It’s not the whole picture! Humans build and operate software.

    They are a key piece of the mechanism that keeps a service up and customers happy. Therefore: It stands to reason that we should be measuring them too!
  5. What can ‘People Metrics’ accomplish? If you are a manager

    trying to keep your team engaged, happy, and retained, such metrics can enable you to: • be proactive about quality-of-life issues (alert fatigue, toil, etc) • make team status transparent to the rest of the company • justify funding for more staffing/resources • identify opportunities for process improvement
  6. What can ‘People Metrics’ accomplish? If you are trying to

    raise awareness and urgency around an opportunity/problem in your organization, ‘people metrics’ can: • convert anecdotal experience into empirical data • reveal the operational cost of current conditions to leadership • identify constraints in key business functions • win members of leadership to your cause http://www.kotterinternational.com/the-8-step-process-for-leading-change https://en.wikipedia.org/wiki/Theory_of_constraints
  7. What can ‘People Metrics’ accomplish? Simply complaining about a problem

    isn’t going to work. From Eli Goldratt’s The Goal: “The goal of an organization is to increase throughput while reducing both the inventory and operating expense.” YOU HAVE TO COMMUNICATE WITH LEADERSHIP IN THOSE TERMS!
  8. What can ‘People Metrics’ accomplish? What will influence decision makers

    more effectively? “Working on Team X stinks. We’re always firefighting and doing tickets.” OR “40% of Team X’s time is spent on incident response, and 30% is spent on manual tasks that the business needs. That is 70% of their time not spent on making improvements to the product or streamlining current processes.”
  9. Metric: Time/Effort Spent in ‘4 Types of Work’ The Phoenix

    Project posits that there are four types of work in IT Operations. I argue the same is true for development teams too! • Business Projects: new features • Internal Projects: cleaning up tech debt, investment in CI/CD • Operational Change: releasing, provisioning, configuring • Unplanned Work: outages, firefighting, etc. If we somehow measure the quantity and percentage of each type of work over time, the business can know where their money is being spent and ensure the max return-on-investment.
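A minimal sketch of how the percentage per work type could be computed, assuming time is exported as one "<work_type> <hours>" pair per line. The input format, the function name `work_type_percentages`, and the label `ops_change` are illustrative assumptions, not something from the talk.

```shell
#!/bin/bash
# Summarize logged hours per work type as a percentage of all logged time.
# Input (stdin): one "<work_type> <hours>" pair per line (hypothetical format).
work_type_percentages() {
  awk '
    { hours[$1] += $2; total += $2 }
    END {
      if (total == 0) exit 1            # nothing logged, nothing to report
      for (type in hours)
        printf "%s %.1f\n", type, 100 * hours[type] / total
    }
  ' | sort                              # awk iteration order is unspecified
}
```

For example, `printf 'business 12\nunplanned 14\ninternal 8\nops_change 6\n' | work_type_percentages` prints `business 30.0`, `internal 20.0`, `ops_change 15.0`, and `unplanned 35.0`, one per line.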
  10. Metric: Time/Effort Spent in ‘4 Types of Work’ So what

    do you do with such data? Of course you want to keep unplanned work to a minimum. For a Dev Team, you want maximum time spent on business projects (writing features). Business > internal > ops change > unplanned. For an Ops Team, you want maximum time spent on internal projects (to keep the service online and issues resolved as efficiently as possible). Internal > business > ops change > unplanned.
  11. Unplanned Work is WASTE “If more than 25% of a

    team needs to be dedicated to ticket duty and oncall, there is a serious problem with firefighting and a lack of automation.” - Tom Limoncelli, The Practice of Cloud System Administration, Volume 2
  12. A Simpler Metric: Operational Load Operational load is the percentage

    of time spent on the upkeep of your service. It’s time not spent writing code or making improvements. Google’s SRE teams cap this metric at 50%. When it is exceeded, the excess operational work overflows to the product development team.
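As a rough sketch, operational load can be computed from two numbers: hours spent on upkeep and total hours logged. The function names `operational_load` and `exceeds_cap` are hypothetical, and the 50% default simply mirrors the cap quoted above.

```shell
#!/bin/bash
# operational_load <ops_hours> <total_hours>
# Prints the operational load as a percentage with one decimal place.
operational_load() {
  awk -v ops="$1" -v total="$2" 'BEGIN {
    if (total <= 0) exit 1
    printf "%.1f\n", 100 * ops / total
  }'
}

# exceeds_cap <load_percent> [cap_percent]
# Succeeds (exit status 0) when the load is above the cap; cap defaults to 50.
exceeds_cap() {
  awk -v load="$1" -v cap="${2:-50}" 'BEGIN { exit !(load > cap) }'
}
```

With 28 of 40 hours spent on upkeep, `operational_load 28 40` prints `70.0`, and `exceeds_cap 70.0 && echo "over the cap"` would fire.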
  13. A Simpler Metric: Operational Load Why 50%? Remember the ‘wait

    time’ graph from The Phoenix Project? For one thing, once you exceed 50%, customers will start to wait longer for work to get done. As you approach 80% and beyond, it really gets out of hand.
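The shape of that graph can be reproduced with a small sketch, assuming it follows the rule of thumb given in The Phoenix Project: wait time is the percentage of time busy divided by the percentage of time idle. The function name `wait_factor` is made up for illustration.

```shell
#!/bin/bash
# wait_factor <percent_busy>
# Prints the relative wait time, busy / idle, for a resource that is
# <percent_busy> percent utilized. Fails for values outside 0 to 99.999...
wait_factor() {
  awk -v busy="$1" 'BEGIN {
    if (busy < 0 || busy >= 100) exit 1   # 100% busy means unbounded wait
    printf "%.1f\n", busy / (100 - busy)
  }'
}
```

At 50% busy the factor is `1.0`; at 80% it is `4.0`; at 90% it is `9.0`; at 95% it is `19.0`. That steep climb past 80% is the "really gets out of hand" part.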
  14. SLACK is your FRIEND I don’t mean the corporate chat

    service. I mean ‘idle time’. Slack means that your team can be responsive to bursts of unplanned work without a business impact. Slack means opportunities to improve skillsets and team morale. The 20th-century management style of keeping slack lean or nonexistent doesn’t work (that creates constraints!). Flow of work can be inconsistent. Be prepared!
  15. Measure Happiness! Every $INTERVAL, ask your team these questions: On

    a scale of 1-5: • How happy are you doing your job? • How happy are you working at your company? Also: • What makes you the most happy? • What makes you the least happy? • What, if changed, would most improve your happiness?
  16. Measure Happiness! What does this do for you? • Identify

    common morale trends across the team (effects of toil, crisis, etc). • Identify common improvement opportunities • Prevent burnout, employee churn, etc.
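One way the 1-5 answers might be rolled up each interval, as a sketch only. It assumes the scores arrive one per line on stdin; the function name `happiness_summary` and the choice to flag scores of 2 or lower are assumptions for illustration.

```shell
#!/bin/bash
# Summarize 1-5 happiness scores read from stdin, one score per line.
# Lines outside the 1-5 range are ignored.
happiness_summary() {
  awk '
    $1 >= 1 && $1 <= 5 { sum += $1; n++; if ($1 <= 2) low++ }
    END {
      if (n == 0) exit 1
      printf "responses=%d mean=%.2f low_scores=%d\n", n, sum / n, low
    }
  '
}
```

`printf '4\n5\n2\n3\n4\n' | happiness_summary` prints `responses=5 mean=3.60 low_scores=1`. Tracking the mean over time shows the trend, while the low-score count points at people worth a one-on-one.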
  17. Other Metrics • Cycle Time: How long will someone wait

    on a given ticket? • Throughput: Issues closed per day/week/month • Frequency by request type: what should be automated first? • Frequency by root cause: what bug is causing the most pain? • Per-Person time tracking should be 5.5-6h per day ◦ If too little, you’re not getting good enough metrics ◦ If too much, the person is working too much and will burn out eventually • Reopened issues/bugs: defects going downstream
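Cycle time and throughput can both fall out of the same ticket export. The sketch below assumes each closed ticket is exported as an "<opened_epoch> <closed_epoch>" pair in Unix seconds, covering whatever period you are reporting on; that format and the name `cycle_time_stats` are hypothetical.

```shell
#!/bin/bash
# Input (stdin): one "<opened_epoch> <closed_epoch>" pair per closed ticket.
# Prints the number of tickets closed (throughput for the exported period)
# and the mean cycle time in days.
cycle_time_stats() {
  awk '
    $2 >= $1 { sum += ($2 - $1) / 86400; n++ }   # 86400 seconds per day
    END {
      if (n == 0) exit 1
      printf "closed=%d mean_cycle_days=%.1f\n", n, sum / n
    }
  '
}
```

Two tickets that took one and three days, `printf '0 86400\n0 259200\n' | cycle_time_stats`, give `closed=2 mean_cycle_days=2.0`.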
  18. Other Metrics If you’re struggling to figure out what to

    measure: Brendan Gregg (performance guru at Netflix) talks about the USE Method: “For every resource: check utilization, saturation, and errors.” Why not do the same for people? (http://www.brendangregg.com/usemethod.html)
  19. All This Sounds Good. How Do I Get Started? 1.

    Track your work using a project management or ticket system (Jira, Redmine) 2. Seriously, you should be doing this anyway 3. Log time spent on each issue a. Ops Teams should log all their time b. Dev/SRE Teams should log time on tasks that are toil 4. Track non-issue data using custom tools 5. Create Reports/Dashboards 6. Use the data to make/influence decisions
  20. Ok, let’s say I got this data. What next? Make

    dashboards and put them in a prominent place in the office. ◦ Document them so people know what they mean! ◦ This creates empathy for your team and current conditions
  21. Ok, let’s say I got this data. What next? Review

    them daily/weekly as part of your standups. ◦ It isn’t useful if you don’t look at them! ◦ Ask questions and dig into the ticket system to find why graphs look the way they do. ◦ Create action items to address common themes
  22. Ok, let’s say I got this data. What next? Share

    with management. Again, it’s all about operational cost, inventory and throughput, so speak in terms of TIME and MONEY. ◦ $5000 of Team X’s time was spent rebooting servers due to Bug Y. ◦ Customers are waiting up to 2 weeks for Team X to respond to requests. ◦ It takes one hour on average to perform Task X. ◦ We need double our current EC2 spend while Bug X is unresolved.
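Turning logged time into a money figure is simple arithmetic: hours multiplied by a loaded hourly rate. The sketch below assumes you have such a rate; the function name `toil_cost` and the numbers in the example are invented for illustration.

```shell
#!/bin/bash
# toil_cost <hours> <hourly_rate>
# Prints the cost of the logged hours, rounded to two decimal places.
toil_cost() {
  awk -v hours="$1" -v rate="$2" 'BEGIN {
    if (hours < 0 || rate < 0) exit 1
    printf "%.2f\n", hours * rate
  }'
}
```

For instance, 62.5 hours of server reboots at an assumed rate of 80 per hour: `toil_cost 62.5 80` prints `5000.00`.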
  23. Ok, let’s say I got this data. What next? Define

    a ‘target condition’ and set goals to achieve it, e.g.: ◦ Reduce 90th percentile cycle time for tickets from two weeks to one week in 3 months ◦ Reduce operational load to < 50% in 6 months (https://www.amazon.com/Toyota-Kata-Managing-Improvement-Adaptiveness/dp/0071635238?ie=UTF8&*Version*=1&*entries*=0)
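To check progress against a percentile target, a nearest-rank percentile is usually enough. This is a sketch under assumptions: cycle times arrive as one number (in days) per line, the percentile is a whole number from 1 to 100, and the name `percentile` is made up here.

```shell
#!/bin/bash
# percentile <p>
# Reads one number per line from stdin and prints the p-th percentile using
# the nearest-rank method. <p> must be a whole number from 1 to 100.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceiling of p * NR / 100
      if (rank < 1) rank = 1
      print v[rank]
    }
  '
}
```

With ten tickets that took 14, 3, 7, 10, 5, 2, 21, 6, 4 and 9 days, `percentile 90` prints `14`, so the "one week" target from the example goal is not met yet.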
  24. All This Sounds Good. How Do I Get Started?

    Quick and Dirty Happiness Metrics! (Example of using gauges in statsd)

    #!/bin/bash
    # Read a score from stdin and report it to statsd as a gauge.
    read HAPPINESS
    echo "team.$(whoami).happiness:$HAPPINESS|g" \
      | nc -w 1 -u statsd.server.tld 8125
  25. All This Sounds Good. How Do I Get Started?

    Quick and Dirty Interruption Tracking! (Example of using counters in statsd)

    #!/bin/bash
    # Increment a per-user interruption counter in statsd by one.
    echo "team.$(whoami).interruptions:1|c" \
      | nc -w 1 -u statsd.server.tld 8125
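The same counter pattern could be stretched to cover the four types of work. The sketch below is an assumption-laden extension, not something shown in the talk: the metric path `team.<user>.work.<type>`, the work type labels, and both function names are hypothetical. Only the `name:value|c` counter syntax and the `nc` invocation come from the slides.

```shell
#!/bin/bash
STATSD_HOST="${STATSD_HOST:-statsd.server.tld}"   # placeholder host from the slides
STATSD_PORT="${STATSD_PORT:-8125}"

# work_metric <work_type> <minutes>
# Prints a statsd counter line that adds <minutes> to the given work type.
work_metric() {
  case "$1" in
    business|internal|ops_change|unplanned) ;;
    *) echo "unknown work type: $1" >&2; return 1 ;;
  esac
  printf 'team.%s.work.%s:%s|c\n' "$(whoami)" "$1" "$2"
}

# send_metric: forward a metric line from stdin to statsd over UDP.
send_metric() {
  nc -w 1 -u "$STATSD_HOST" "$STATSD_PORT"
}
```

After a 30-minute firefight, `work_metric unplanned 30 | send_metric` would record it. Keeping the formatting separate from the sending makes the line easy to inspect before anything goes over the wire.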
  26. But What if I Can’t Code? • Jira has MANY

    reporting capabilities (built-in and in plugins) • Business Intelligence Tools (Domo, Amazon Quicksight) • Google Forms!
  27. If you are using Jira and need to track time

    spent.. • Create a custom field called ‘Work Type’ with values ‘Business’, ‘Internal’, ‘Ops Change’, and ‘Unplanned’. • The tables that you care about are: ◦ ‘worklog’: time tracking events ◦ ‘customfieldvalue’: maps custom field values to issues • If you join the work log entries where worklog.issueid is equal to customfieldvalue.ISSUE and filter on a specific customfieldvalue.CUSTOMFIELD, you can sum up the results for a given work type and generate your own metrics.. hint hint nudge nudge. (https://developer.atlassian.com/jiradev/jira-platform/jira-architecture/database-schema/database-custom-fields)
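Taking the hint, the join might look roughly like the query this sketch prints. The tables and the `issueid`, `ISSUE`, and `CUSTOMFIELD` columns are the ones named above; the `timeworked` (seconds) and `STRINGVALUE` columns are assumptions about the Jira schema that you should verify against your own database, and the function name is hypothetical. For select-list fields, the stored value may be an option id rather than the label.

```shell
#!/bin/bash
# work_type_hours_sql <custom_field_id>
# Prints a SQL query that sums logged hours per 'Work Type' value.
# Assumed columns (check your schema): worklog.timeworked in seconds,
# customfieldvalue.STRINGVALUE holding the selected value or option id.
work_type_hours_sql() {
  case "$1" in
    ''|*[!0-9]*) echo "custom field id must be numeric" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT cfv.STRINGVALUE AS work_type,
       SUM(w.timeworked) / 3600.0 AS hours
FROM worklog w
JOIN customfieldvalue cfv ON w.issueid = cfv.ISSUE
WHERE cfv.CUSTOMFIELD = $1
GROUP BY cfv.STRINGVALUE;
SQL
}
```

The output can be piped into whatever read-only database client you use, for example `work_type_hours_sql 10100`, where `10100` stands in for the id of your own ‘Work Type’ field.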
  28. Quick Google Forms Demo Let’s create a quick happiness metric

    survey in Google Forms to demonstrate how easy it is!