The Sociotechnical Path to High-Performing Teams

How to measure and build high-performing engineering teams, and why it starts with observability.

Charity Majors

April 17, 2020

Transcript

  2. 6.

    Bad jobs can be bad in so, so many different ways…
    • harmful product
    • glorified the results of poor planning
    • alienated from coworkers
    • long commute
    • indifferent manager
    • cargo-culted the worst of Silicon Valley startup culture
    • aging, obsolete tech
    • high operational toil
    • fragile, flappy systems
    • complacency
    • low eng skill level
    • command-and-control leadership
  3. 7.

    autonomy, learning, high-achieving, learned from our mistakes, curious, responsibility, ownership, inspiring, camaraderie, pride, collaboration, career growth, rewarding, motivating

    manual labor, sacred cows, wasted effort, stale tech, ass-covering, fear, fiefdoms, excessive toil, command-and-control, cargo culting, enervating, discouraging, lethargy, indifference
  4. 8.

    sociotechnical (n): “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology.” (Wikipedia)

    If you change the tools people use, you can change how they behave and even who they are.
  5. 11.

    A high-performing team isn’t just fun to be on. Nice coworkers who mean well and work/life balance are a good start, but they perform.
  6. 13.

    1. How frequently do you deploy?
    2. How long does it take for code to go live?
    3. How many of your deploys fail?
    4. How long does it take to recover from an outage?
    5. How often are you paged outside work hours?
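The five questions above can be answered from data most teams already have. The following is a minimal sketch, not something taken from the talk: the record types `Deploy` and `Outage`, the function `team_metrics`, and the 9:00 to 18:00 weekday definition of "work hours" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deploy:  # hypothetical record shape
    merged_at: datetime  # when the change was merged
    live_at: datetime    # when it was serving production traffic
    failed: bool         # True if it needed a rollback or hotfix


@dataclass
class Outage:  # hypothetical record shape
    started_at: datetime
    resolved_at: datetime


def team_metrics(deploys, outages, pages, days, work_start=9, work_end=18):
    """Answer the five questions for a window of `days` days.

    `pages` is a list of datetimes at which someone was paged.
    Work hours are assumed to be weekdays, work_start <= hour < work_end.
    """
    weeks = days / 7
    lead_times = [d.live_at - d.merged_at for d in deploys]
    recoveries = [o.resolved_at - o.started_at for o in outages]
    off_hours = [
        p for p in pages
        if p.weekday() >= 5 or not (work_start <= p.hour < work_end)
    ]
    return {
        "deploys_per_week": len(deploys) / weeks,
        "median_lead_time": median(lead_times) if lead_times else None,
        "deploy_failure_rate":
            sum(d.failed for d in deploys) / len(deploys) if deploys else None,
        "median_time_to_recover": median(recoveries) if recoveries else None,
        "off_hours_pages_per_week": len(off_hours) / weeks,
    }
```

Medians are used rather than means so that a single very long outage or very slow release does not dominate the picture; that is a design choice of this sketch, not a requirement.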
  7. 16.

    It really, really, really, really, really pays off to be on a high-performing team. Like REALLY.
  8. 17.

    Q: What happens when an engineer from the elite yellow bubble joins a team in the blue bubble?
    A: Your productivity tends to rise (or fall) to the level of the team you join.
  9. 18.

    Who is going to be the better engineer two years from now?
    • 3000 deploys/year, 9 outages/year, 6 hours firefighting
    • 5 deploys/year, 65 outages/year, firefighting: constant
  10. 19.

    So how do we build high-performing teams?
    “Just hire the best engineers, and you’ll get the best team.”
    Hire people who share your values and have the needed skills, and then the work of building a team can begin.
  11. 20.

    High-performing teams are continuously iterating towards production excellence. The work consists of cultivating sociotechnical feedback loops, but it begins with observability. Happier customers, happier teams.
  12. 21.

    observability (n): “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” (Wikipedia)
  13. 22.

    Observability is not the same as monitoring. Monitor your known-unknowns; instrument for observability into unknown-unknowns.
  14. 23.

    o11y for software engineers:
    • Can you understand what’s happening inside your systems, just by asking questions from the outside?
    • Can you debug your code and its behavior using its output?
    • Can you answer new questions without shipping new code?
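One way to read "answer new questions without shipping new code": if raw events are kept with all their fields, a new question is just a new grouping over data you already have. The sketch below is illustrative only; the helper `group_by` and the field names (`endpoint`, `user_id`, `build_id`, `duration_ms`) are assumptions, not part of the talk.

```python
from collections import defaultdict


def group_by(events, key, value):
    """Mean of `value` per distinct `key`, computed over raw events."""
    buckets = defaultdict(list)
    for event in events:
        if key in event and value in event:
            buckets[event[key]].append(event[value])
    return {k: sum(v) / len(v) for k, v in buckets.items()}


# Hypothetical raw events, one per request, no pre-aggregation.
events = [
    {"endpoint": "/export", "user_id": "u1", "build_id": "b41", "duration_ms": 120},
    {"endpoint": "/export", "user_id": "u2", "build_id": "b42", "duration_ms": 2300},
    {"endpoint": "/home", "user_id": "u2", "build_id": "b42", "duration_ms": 2100},
    {"endpoint": "/home", "user_id": "u3", "build_id": "b41", "duration_ms": 80},
]

# The question you planned for: latency per endpoint.
print(group_by(events, "endpoint", "duration_ms"))
# Questions nobody planned for, asked after the fact:
print(group_by(events, "user_id", "duration_ms"))
print(group_by(events, "build_id", "duration_ms"))
```

In this made-up data the per-endpoint averages look alike, while grouping by `user_id` or `build_id` isolates the slow requests. A metric pre-aggregated by endpoint could not have answered the second and third questions.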
  15. 24.

    You have an observable system when your team can quickly and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
  16. 25.

    Observability requirements…
    https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/
    • High cardinality
    • High dimensionality
    • Exploratory, open-ended investigation based on raw events
    • Service Level Objectives. No preaggregation.
    • Based on arbitrarily-wide structured events with span support
    • No indexes, schemas, or predefined structure
    • Bundling the full context of the request across network hops
    • Metrics != observability. Unstructured logs != observability.
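To make "arbitrarily-wide structured events" concrete, here is a rough sketch of emitting one event per request, accumulating context as the request is handled. It uses only the Python standard library; the class `WideEvent`, the function `handle_request`, the `checkout` service, and every field name are hypothetical and do not represent any vendor's API.

```python
import json
import time
import uuid


class WideEvent:
    """Collects fields for one unit of work and emits them as one JSON line."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Any field can be added at any point; there is no fixed schema.
        self.fields.update(fields)

    def emit(self, write=print):
        elapsed = time.monotonic() - self._start
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        line = json.dumps(self.fields, sort_keys=True)
        write(line)
        return line


def handle_request(user_id, cart_items, trace_id=None, write=print):
    """Hypothetical handler: one wide event per request, emitted on exit."""
    event = WideEvent(
        service="checkout",
        trace_id=trace_id or uuid.uuid4().hex,  # shared across network hops
        span_id=uuid.uuid4().hex[:16],          # unique to this hop
        user_id=user_id,                        # high-cardinality field
    )
    event.add(cart_size=len(cart_items))
    try:
        total = sum(price for _, price in cart_items)
        event.add(total=total, status=200)
        return total
    except Exception as exc:
        event.add(status=500, error=repr(exc))
        raise
    finally:
        event.emit(write)


handle_request("u42", [("book", 12.5), ("pen", 2.5)])
```

Passing the incoming `trace_id` through to downstream calls is what lets events from different services be stitched back into one request; the sketch only shows the receiving side.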
  17. 26.

    Observability Maturity Model
    https://www.honeycomb.io/wp-content/uploads/2019/06/Framework-for-an-Observability-Maturity-Model.pdf
    1. Resiliency to failure
    2. High-quality code
    3. Manage complexity and technical debt
    4. Predictable releases
    5. Understand user behavior
    … find your weakest category, and tackle that first
  23. 32.

    Why are computers hard?
    Because we don't understand them.
    And we keep shipping things anyway.
    Our tools have rewarded guessing over debugging.
    And vendors have happily misled you for $$$$.
    It’s time to fix this problem.
  24. 33.

    Complexity is soaring
    • Ephemeral and dynamic
    • Far-flung and loosely coupled
    • Partitioned, sharded
    • Distributed and replicated
    • Containers, schedulers
    • Service registries
    • Polyglot persistence strategies
    • Autoscaled, multiple failover
    • Emergent behaviors
    • ... etc
  25. 34.

    We don’t *know* what the questions are; all we have are unreliable symptoms or reports.
    Complexity is exploding everywhere, but our tools were designed for predictable worlds.
    As soon as we know the question, we usually know the answer too.
  26. 35.

    We used to be able to reason about our architecture. Not anymore.
    [Images labeled 2003 and 2013]
    Now we have to instrument for observability, or we are screwed.
  27. 36.

    Observability is the key to making the leap from known-unknowns to unknown-unknowns.
    [Diagram labels: monitoring, known-unknowns; observability, unknown-unknowns]
  28. 37.

    Kick-start the virtuous cycle of "you build it, you own it":
    • Instrument two steps in front of you as you build.
    • Never accept a PR unless you can explain it if it breaks.
    • Watch your code go out as it deploys: is it working as intended? Does anything look weird?
    • Look through the lens of your instrumentation.
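"Watch your code go out as it deploys" can be as simple as comparing the new build against the old one using events that carry a build identifier. The following sketch assumes each event has `build_id` and `status` fields; the function names and the 1% tolerance are illustrative choices, not a recommendation from the talk.

```python
from collections import Counter


def error_rate_by_build(events):
    """Fraction of events with a 5xx status, per build_id."""
    totals, errors = Counter(), Counter()
    for event in events:
        totals[event["build_id"]] += 1
        if event["status"] >= 500:
            errors[event["build_id"]] += 1
    return {build: errors[build] / totals[build] for build in totals}


def looks_weird(events, new_build, old_build, tolerance=0.01):
    """True if the new build's error rate exceeds the old one's by more
    than `tolerance` (an arbitrary threshold chosen for this sketch)."""
    rates = error_rate_by_build(events)
    return rates.get(new_build, 0.0) > rates.get(old_build, 0.0) + tolerance


# Hypothetical traffic during a rollout: b41 is the old build, b42 the new one.
events = (
    [{"build_id": "b41", "status": 200}] * 99
    + [{"build_id": "b41", "status": 500}] * 1
    + [{"build_id": "b42", "status": 200}] * 45
    + [{"build_id": "b42", "status": 503}] * 5
)

print(error_rate_by_build(events))
print(looks_weird(events, new_build="b42", old_build="b41"))
```

A check like this only covers the failure mode you thought of in advance (5xx responses); the point of the slide is that the person who wrote the change should also look at the raw events while the deploy is in flight.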
  29. 38.

    For extra fun … let’s examine the sociotechnical implications of the predominant architecture models of the past two decades: monoliths and microservices.
  30. 39.

    Monolith: sociotechnical causes & effects
    • THE database
    • THE application
    • Known-unknowns and mostly predictable failures
    • Many monitoring checks/paging alerts
    • "Flip a switch" to deploy, changes are big bang and binary (all on/all off)
    • Failures to be prevented
    • Production is to be feared
    • Debug by intuition and scar tissue of past outages
    • Canned dashboards, runbooks, playbooks
    • Deploys are scary
    • Masochistic on-call culture
  31. 40.

    Monolith
    • We built our systems like glass castles: a fragile, forbidding edifice that we could tightly control access to.
    • Very hostile to exploration or experimentation
  32. 41.

    Microservices: sociotechnical causes & effects
    • Many storage systems, many services, many polyglot technologies
    • Unknown-unknowns dominate
    • Every alert is a novel question
    • Rich, flexible instrumentation
    • Few paging alerts, tied to SLOs and keying off user pain
    • A deploy is just the beginning of gaining confidence in your code
    • Failures are your friend
    • Production is where your users live; you should be in there too, watching them every day
    • Debug methodically by examining the evidence and following the clues
    • Inspect the full context of the event
    • Deploys are opportunities
    • On-call must be sustainable, humane
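"Few paging alerts, tied to SLOs and keying off user pain" is often implemented as a burn-rate check: page only when the error budget implied by the SLO is being consumed much faster than planned. The talk does not prescribe this method; the sketch below uses the burn-rate formulation common in SRE practice, and the 99.9% target and 10x paging threshold are assumptions picked for the example.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error budget (1 - slo_target).

    1.0 means the budget is being spent exactly as fast as the SLO allows;
    higher values mean users are hurting faster than planned.
    """
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(bad_events, total_events, slo_target=0.999, threshold=10.0):
    """Page a human only when the budget burns at `threshold`x or faster.

    Both defaults are illustrative assumptions, not recommendations.
    """
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# 5 failed requests out of 1000 against a 99.9% SLO: elevated, but no page.
print(round(burn_rate(5, 1000, 0.999), 2), should_page(5, 1000))
# 20 failed requests out of 1000: well past the threshold, so page.
print(round(burn_rate(20, 1000, 0.999), 2), should_page(20, 1000))
```

One alert keyed to user-visible failure can replace many per-host or per-check alerts, which is what makes the sustainable on-call described above feasible.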
  33. 42.

    Microservices
    • Software ownership: you build it, you run it
    • Robust, resilient, built for experimentation and testing in prod
    • Human scale, with guard rails for safety
  34. 43.

    Here's the dirty little secret. The next generation of systems won't be built and run by burned out, exhausted people, or command-and-control teams just following orders. It can't be done. They've become too complicated. Too hard.
  35. 44.

    We can no longer fit these systems in our heads and reason about them. If we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.
  36. 45.

    "I don't have time to invest in observability right now. Maybe later."
    You can't afford not to.
  37. 47.

    On call will be shared by everyone who writes code.
    On call must be not-terrible.
    Invest in your deploys, democratize production.
    Curate feedback loops.
    (Don’t be scared by regulations.)