The Sociotechnical Path to High-Performing Teams II

Charity Majors

May 26, 2020

Transcript

  1. 2.
  2. 5.

    autonomy, learning, high-achieving, learned from our mistakes, curious, responsibility, ownership, inspiring, camaraderie, pride, collaboration, career growth, rewarding, motivating

    manual labor, sacred cows, wasted effort, stale tech, ass-covering, fear, fiefdoms, excessive toil, command-and-control, cargo culting, enervating, discouraging, lethargy, indifference
  3. 6.

    they perform

    A high-performing team isn’t just fun to be on. Kind, inclusive coworkers and a great work/life balance are good things, but …
  4. 8.

    1. How frequently do you deploy?
    2. How long does it take for code to go live?
    3. How many of your deploys fail?
    4. How long does it take to recover from an outage?
    5. How often are you paged outside work hours?
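As a rough illustration, the first four questions above can be computed from a log of deploy records. This is a minimal sketch: the `Deploy` record, its field names, and the sample history are hypothetical, not something the talk defines, and the fifth question (out-of-hours pages) would need paging data that is not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One deploy record (hypothetical schema, not from the talk)."""
    merged_at: datetime                      # when the change was merged
    live_at: datetime                        # when it reached production
    failed: bool = False                     # did this deploy cause a failure?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deploys: list[Deploy], period_days: int) -> dict[str, float]:
    """Summarize deploy frequency, lead time, failure rate and recovery time."""
    if not deploys:
        raise ValueError("need at least one deploy record")
    lead_times = [(d.live_at - d.merged_at).total_seconds() / 60 for d in deploys]
    recoveries = [
        (d.recovered_at - d.live_at).total_seconds() / 60
        for d in deploys
        if d.failed and d.recovered_at is not None
    ]
    failures = sum(1 for d in deploys if d.failed)
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_min": median(lead_times),
        "change_failure_rate": failures / len(deploys),
        "median_time_to_recover_min": median(recoveries) if recoveries else 0.0,
    }


# Made-up sample history covering two days.
t0 = datetime(2020, 5, 1, 9, 0)
history = [
    Deploy(t0, t0 + timedelta(minutes=10)),
    Deploy(t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=20)),
    Deploy(
        t0 + timedelta(days=1),
        t0 + timedelta(days=1, minutes=30),
        failed=True,
        recovered_at=t0 + timedelta(days=1, minutes=45),
    ),
    Deploy(t0 + timedelta(days=1, hours=3), t0 + timedelta(days=1, hours=3, minutes=40)),
]
metrics = delivery_metrics(history, period_days=2)
print(metrics)
```

Medians are used rather than means so that one unusually slow deploy does not dominate the summary; that is a design choice of this sketch, not a recommendation from the slides.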
  5. 10.

    It really, really, really, really, really pays off to be on a high-performing team. Like REALLY.
  6. 11.

    Q: What happens when an engineer from the elite yellow bubble joins a team in the blue bubble?

    A: Your productivity tends to rise (or fall) to match that of the team you join.
  7. 13.

    How do we build high-performing teams?

    “Just hire the BEST ENGINEERS”

    (It is probably more accurate to say that high-performing teams produce great engineers than vice versa.)
  8. 14.

    Who will be the better engineer in two years?

    • 3000 deploys/year, 9 outages/year, 6 hours firefighting
    • 5 deploys/year, 65 outages/year, firefighting: constant

    Compelling Anecdata!
  9. 15.

    How do we build high-performing teams?

    This is a systems problem.

    How do we improve the functioning of our sociotechnical system, so that the team can operate at a higher level?
  10. 16.

    sociotechnical (n): “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology.” (Wikipedia)

    If you change the tools people use, you can change how they behave and even who they are.
  11. 21.

    Why are computers hard?

    • Because we don't understand them
    • And we keep shipping things anyway
    • Our tools have rewarded guessing over debugging
    • And vendors have happily misled you for $$$$

    It’s time to change this, by hooking up sociotechnical loops with o11y.
  12. 22.

    tools+processes

    Use your tools and processes to improve your tools and processes.

    “If you change the tools people use, you can change how they behave and even who they are.”

    Practice Observability-Driven Development (ODD).
  13. 23.

    observability (n): “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” (Wikipedia)

    Note: observability is not monitoring, though both are forms of telemetry.
  14. 24.

    o11y for software engineers:

    • Can you understand what’s happening inside your systems, just by asking questions from the outside?
    • Can you figure out what transpired and identify any system state?
    • Can you answer any arbitrary new question … without shipping new code?
  15. 25.

    The Bar: It’s not observability unless it meets these reqs.

    • High cardinality. High dimensionality
    • Composed of arbitrarily-wide structured events (not metrics, not unstructured logs)
    • Exploratory, open-ended investigation instead of dashboards
    • Can visualize in waterfall trace by time if span_id fields are included
    • No indexes, schemas, or predefined structure
    • Bundles the full context of the request across service hops
    • Aggregates only at compute/read time across raw events

    For more, read https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/
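To make "arbitrarily-wide structured events" concrete, here is a minimal sketch that emits exactly one JSON event per request, carrying whatever context the handler chooses to attach. The function names, the `photo-api` service name, and every field except `span_id` are assumptions made for illustration; this is not Honeycomb's SDK or any particular library's API.

```python
import json
import time
import uuid
from typing import Any, Callable


def handle_request(
    user_id: str,
    endpoint: str,
    work: Callable[[dict[str, Any]], None],
    emit: Callable[[str], None] = print,
) -> dict[str, Any]:
    """Run `work` and emit exactly one wide event describing the request."""
    event: dict[str, Any] = {
        "trace_id": uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "service": "photo-api",   # hypothetical service name
        "endpoint": endpoint,
        "user_id": user_id,       # high-cardinality field, kept raw
    }
    start = time.perf_counter()
    try:
        work(event)               # the handler may add any fields it likes
        event["error"] = None
    except Exception as exc:
        # Recorded rather than re-raised, only to keep the sketch short.
        event["error"] = type(exc).__name__
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(event))   # one JSON line per request, no pre-aggregation
    return event


def load_photo(event: dict[str, Any]) -> None:
    # Arbitrary context, added without any predefined schema.
    event["photo_id"] = "p-123"
    event["cache_hit"] = False
    event["db_shard"] = 7


lines: list[str] = []
result = handle_request("user-42", "/photos/p-123", load_photo, emit=lines.append)
```

Because nothing is summed or bucketed at write time, any of these fields can later be used as a filter or grouping key when the events are read back.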
  16. 26.

    You have an observable system when your team can quickly and reliably diagnose any new behavior with no prior knowledge.

    • Observability begins with rich instrumentation, putting you in constant conversation with your code.
    • Well-understood systems require minimal time spent firefighting.
  17. 27.

    Monitoring Examples for a LAMP stack

    “Photos are loading slowly for some people. Why?”

    • The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down.
    • DB queries are slower than normal. It looks like the disk write throughput is saturated on the db data volume.
    • Errors are high. Check the dashboard with a breakdown of error types and look for when it changed.

    Monitor these things.
  18. 28.

    “Photos are loading slowly for some people. Why?”

    • Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model.
    • Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly.
    • Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake.

    wtf do i ‘monitor’ for?! (Parse/Instagram questions, these require o11y)
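Answers like the ones above tend to come from slicing raw events by combinations of fields nobody planned for in advance. The sketch below groups events by any tuple of field names at read time and ranks the groups by median latency. The `breakdown` helper and the sample events are made up for illustration; only the c2.4xlarge / PIOPS / us-east-1b combination echoes the slide.

```python
from collections import defaultdict
from statistics import median
from typing import Any


def breakdown(
    events: list[dict[str, Any]],
    by: tuple[str, ...],
    value: str = "duration_ms",
) -> list[tuple[tuple, int, float]]:
    """Group raw events by arbitrary fields; return (key, count, median), slowest first."""
    groups: dict[tuple, list[float]] = defaultdict(list)
    for e in events:
        key = tuple(e.get(field) for field in by)
        groups[key].append(e[value])
    rows = [(key, len(vals), median(vals)) for key, vals in groups.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample events for illustration only.
events = [
    {"instance_type": "c2.4xlarge", "az": "us-east-1b", "storage": "piops", "duration_ms": 2000},
    {"instance_type": "c2.4xlarge", "az": "us-east-1b", "storage": "piops", "duration_ms": 1800},
    {"instance_type": "c2.4xlarge", "az": "us-east-1a", "storage": "piops", "duration_ms": 100},
    {"instance_type": "m3.large", "az": "us-east-1b", "storage": "gp2", "duration_ms": 90},
    {"instance_type": "m3.large", "az": "us-east-1a", "storage": "gp2", "duration_ms": 110},
    {"instance_type": "c2.4xlarge", "az": "us-east-1a", "storage": "piops", "duration_ms": 120},
]

by_combo = breakdown(events, by=("instance_type", "az", "storage"))
for key, count, med in by_combo:
    print(key, count, med)
```

The grouping fields are chosen by the caller at query time, so the next question (by user, by SDK version, by language pack) needs a different `by` tuple rather than new code in the service.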
  19. 29.

    • “I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays.”
    • “All twenty app microservices have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.”
    • “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”
    • “Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see: they see someone else’s photo! When they refresh, it’s fixed. Actually, we’ve had a few other people report this too, we just didn’t believe them.”
    • “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). It’s different every time.”

    (continued)
  20. 30.

    Why now?

    • Ephemeral and dynamic
    • Far-flung and loosely coupled
    • Partitioned, sharded
    • Distributed and replicated
    • Containers, schedulers
    • Service registries
    • Polyglot persistence strategies
    • Autoscaled, multiple failover
    • Emergent behaviors
    • ... etc

    Complexity is soaring; the ratio of unknown-unknowns to known-unknowns has flipped.
  21. 31.

    With a LAMP stack, you could lean on playbooks, guesses, pattern-matching and monitoring tools.

    Now we have to instrument for observability, or we are screwed.

    2003 -> 2013: known-unknowns -> unknown-unknowns
  22. 32.

    Complexity is exploding everywhere, but our tools were designed for a predictable world.

    Observability is the first step to high-performing teams because most teams are flying in the dark and don’t even know it, and everything gets so much easier once you can SEE.WHERE.YOU.ARE.GOING.

    They are using logs (where you have to know what you’re looking for) or metrics (pre-aggregated, with no support for high cardinality, so you can’t ask any detailed question or iterate/drill down on a question).
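The limitation of pre-aggregated metrics can be shown in a few lines. In this sketch (sample data made up for illustration), the counter keyed by endpoint stands in for a metric whose dimensions were fixed at write time. It can say that `/photos` is erroring, but only the raw events can say for whom.

```python
from collections import Counter

# Made-up sample events for illustration only.
raw_events = [
    {"endpoint": "/photos", "status": 200, "user_id": "u-1"},
    {"endpoint": "/photos", "status": 500, "user_id": "u-2"},
    {"endpoint": "/photos", "status": 500, "user_id": "u-2"},
    {"endpoint": "/login", "status": 200, "user_id": "u-3"},
    {"endpoint": "/photos", "status": 500, "user_id": "u-2"},
]

# Write-time aggregation: only the dimension chosen in advance survives.
errors_per_endpoint = Counter(
    e["endpoint"] for e in raw_events if e["status"] >= 500
)

# Read-time aggregation over raw events: any field can become a dimension later.
errors_per_user = Counter(
    e["user_id"] for e in raw_events if e["status"] >= 500
)

print(errors_per_endpoint)
print(errors_per_user)
```

If only `errors_per_endpoint` had been stored, the fact that a single user accounts for every error would be unrecoverable.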
  23. 33.

    Without observability, your team must resort to guessing, pattern-matching and arguments from authority, and you will struggle to connect simple feedback loops in a timely manner.

    Observability enables you to inspect cause and effect at a granular level: at the level of functions, endpoints and requests. This is a prerequisite for software engineers to own their code in production.

    It’s like putting your glasses on before you drive off down the highway.
  24. 34.

    “I don't have time to invest in observability right now. Maybe later.”

    You can't afford not to.
  25. 35.

    Observability Maturity Model

    1. Resiliency to failure
    2. High-quality code
    3. Manage complexity and technical debt
    4. Predictable releases
    5. Understand user behavior

    … find your weakest category, and tackle that first. Rinse, repeat.

    https://www.honeycomb.io/wp-content/uploads/2019/06/Framework-for-an-Observability-Maturity-Model.pdf
  26. 36.

    observability maturity model (OMM)

    1. Resiliency to failure
    2. High-quality code
    3. Manage complexity and technical debt
    4. Predictable releases
    5. Understand user behavior
  31. 42.

    “O.D.D.”

    • Never accept a pull request unless you can answer, “how will I know when this breaks?” via your instrumentation.
    • Deploy one mergeset at a time.
    • Watch your code roll out, then look thru the lens of your instrumentation and ask: “working as intended? anything else look weird?”
    • And always wrap code in feature flags.
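One way to read the rules above in code: the new path ships dark behind a flag, and every call records which path ran and whether it worked, so "how will I know when this breaks?" has an answer before the flag is flipped. This is a minimal sketch; the in-memory `FLAGS` dict, the `resize_*` functions, and the event field names are hypothetical stand-ins for a real flag service and real instrumentation.

```python
from typing import Any

# Hypothetical in-memory flag store; a real system would use a flag service.
FLAGS: dict[str, bool] = {"new_resize_path": False}


def resize_old(width: int) -> dict[str, Any]:
    return {"width": width, "algo": "old"}


def resize_new(width: int) -> dict[str, Any]:
    return {"width": width, "algo": "new"}


def resize(width: int, event: dict[str, Any]) -> dict[str, Any]:
    """Pick the code path by flag and record the decision and outcome on the event."""
    use_new = FLAGS.get("new_resize_path", False)
    event["flag.new_resize_path"] = use_new   # which path served this request
    impl = resize_new if use_new else resize_old
    try:
        result = impl(width)
        event["resize.ok"] = True
        return result
    except Exception as exc:
        event["resize.ok"] = False
        event["resize.error"] = type(exc).__name__
        raise


# Deployed but not released: the new code is live, the flag is off.
before: dict[str, Any] = {}
out_before = resize(100, before)

# Released: flipping the flag changes behavior without a deploy.
FLAGS["new_resize_path"] = True
after: dict[str, Any] = {}
out_after = resize(100, after)
```

Comparing events where `flag.new_resize_path` is true against those where it is false is what lets the engineer check "working as intended?" for just the new path.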
  32. 43.

    tools+processes: Practice Observability-Driven Development (ODD)

    What you need to do is improve your tools and processes with your tools and processes. For example:

    • Connect output with actor upon action. Include rich context.
    • Shorten the intervals between action and result.
    • Signal-boost warnings, errors, and unexpected results.
    • Ship smaller changes more often, with clear atomic owners.
    • Instrument vigorously. Develop rich conventions and patterns for telemetry.
    • Decouple deploys from releases.
    • Reward curiosity with meaningful answers (and more questions).
    • Make it easy to be data-driven. Make it a cultural virtue.
    • Embrace software engineers into production, build guard rails.
    • Make code go live by default after merge. DTRT by default with no manual action.
  33. 44.

    insidious loop

    1. Engineer merges diff. Hours pass, multiple other engineers merge too.
    2. Someone triggers a deploy with a few days' worth of merges.
    3. The deploy fails, takes down the site, and pages on call,
    4. who manually rolls back, then begins git bisecting.
    5. This eats up her day and multiple other engineers'.
    6. Everybody bitches about how on call sucks.

    50+ engineer-hours to ship this change
  34. 45.

    virtuous loop: it doesn’t have to be that bad.

    1. Engineer merges diff, which kicks off an automatic CI/CD and deploy.
    2. Deploy fails; notifies the engineer who merged, reverts to safety.
    3. The engineer swiftly spots the error via his instrumentation,
    4. then adds tests & instrumentation to better detect it,
    5. and promptly commits a fix.

    eng time to ship this change: 10 min
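The first two steps of the virtuous loop (deploy on every merge; on failure, revert and tell the person who merged) can be sketched as a tiny state machine. Everything here is hypothetical: `Pipeline`, `Merge`, the `health_check` and `notify` callables, and the SHA strings are illustrations, not a real CI/CD system's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Merge:
    sha: str
    author: str


@dataclass
class Pipeline:
    health_check: Callable[[str], bool]   # hypothetical: True if build `sha` is healthy
    notify: Callable[[str, str], None]    # hypothetical: notify(author, message)
    live_sha: str = "initial"
    history: list[str] = field(default_factory=list)

    def on_merge(self, merge: Merge) -> bool:
        """Deploy exactly one merge; on failure revert and notify its author."""
        previous = self.live_sha
        self.live_sha = merge.sha
        self.history.append(f"deploy {merge.sha}")
        if self.health_check(merge.sha):
            return True
        # Revert to the last known-good build, then tell the person who merged.
        self.live_sha = previous
        self.history.append(f"revert to {previous}")
        self.notify(merge.author, f"deploy of {merge.sha} failed; reverted to {previous}")
        return False


messages: list[tuple[str, str]] = []
pipeline = Pipeline(
    health_check=lambda sha: sha != "bad123",   # pretend only this build is broken
    notify=lambda author, msg: messages.append((author, msg)),
)
ok_first = pipeline.on_merge(Merge("abc111", "alice"))
ok_second = pipeline.on_merge(Merge("bad123", "bob"))
```

Because each deploy contains a single merge, the failure maps to one author and one diff, which is what removes the git bisecting from the insidious loop.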
  35. 46.

    Who will be happier and more fulfilled?

    • 3000 deploys/year, 9 outages/year, 6 hours firefighting
    • 5 deploys/year, 65 outages/year, firefighting: constant
  36. 47.

    team of humans, production systems, tools+processes

    Stop flying blind. Instrument for o11y, modernize your toolset, move swiftly with confidence.

    Use the principles of O.D.D. (measure, instrument, test, inspect, repeat) and the four core DORA metrics to ship faster and safer.
  37. 48.

    In order to spend more of your time on productive activities, instrument, observe, and iterate on the tools and processes that gather, validate and ship your collective output as a team.

    Join teams that honor and value this work and are committed to consistently improving how they operate, not just shipping features.

    Look for teams that are humble and relentlessly focused on investing in their core business differentiators.

    Join teams that value junior engineers, and invest in their potential.
  38. 49.

    • Look for ways to save time; ship smaller changesets more often
    • Instrument, observe, measure before you act
    • Connect output directly to the actor with context
    • Shorten intervals between action and effect
    • Instrument vigorously, boost negative signals
    • Decouple deploys and releases
    • Iterate and optimize
  39. 51.

    Here's the dirty little secret.

    The next generation of systems won't be built and run by burned out, exhausted people, or command-and-control teams just following orders.

    It can't be done. They've become too complicated. Too hard.
  40. 52.

    You can no longer model these systems in your head and leap to the solution; you will be readily outcompeted by teams with modern tools.

    Our systems are emergent and unpredictable. Runbooks and canned playbooks be damned; we need your full creative self.
  41. 53.

    On call will be shared by everyone who writes code.

    On call must be not-terrible.

    Invest in your deploys, instrument everything, democratize ownership over production, craft and curate feedback loops.

    (Don’t be scared by regulations.)
  42. 54.

    Your labor is a scarce and precious resource. Lend it to those who are worthy of it.

    You only get one career; high-performing teams will let us spend more time learning and building, not mired in tech debt and shitty processes which are a waste of your life force.