The Sociotechnical Path to High-Performing Teams II

Charity Majors

May 26, 2020

Transcript

  1. @mipsytipsy
    The Socio-Technical Path
    to ✨High-Performing✨ Teams
    Observability and the Glorious Future
    @mipsytipsy


  2. the irreducible building block
    by which we organize ourselves
    and coordinate and scale our labor.
    Teams.


  3. @mipsytipsy
    engineer/cofounder/CTO
    https://charity.wtf


  4. The teams you join will define your career
    more than any other single factor.


  5. autonomy, learning, high-achieving, learned from our mistakes, curious, responsibility,
    ownership, inspiring, camaraderie, pride, collaboration, career growth, rewarding, motivating

    versus:

    manual labor, sacred cows, wasted effort, stale tech, ass-covering, fear, fiefdoms,
    excessive toil, command-and-control, cargo culting, enervating, discouraging, lethargy,
    indifference


  6. they perform
    A high-performing team isn’t just fun to be on.
    Kind, inclusive coworkers and a great
    work/life balance are good things, but …


  7. How well does YOUR team perform?
    4 key metrics.
    https://services.google.com/fh/files/misc/state-of-devops-2019.pdf


  8. 1 — How frequently do you deploy?
    2 — How long does it take for code to go live?
    3 — How many of your deploys fail?
    4 — How long does it take to recover from an outage?
    5 — How often are you paged outside work hours?
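
    The first four questions are the four DORA key metrics. A minimal sketch of computing them from a team's own deploy and incident records; the Deploy/Incident record shapes here are illustrative assumptions, not any tool's actual schema.

```python
# Rough sketch: field names and record shapes are assumptions, not a real tool's schema.
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List

@dataclass
class Deploy:
    merged_at: datetime      # when the change was merged
    deployed_at: datetime    # when it went live
    failed: bool             # did this deploy degrade service?

@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime

def four_key_metrics(deploys: List[Deploy], incidents: List[Incident], days: int) -> dict:
    """The four DORA metrics, computed from your own deploy and incident history."""
    def hours(delta) -> float:
        return delta.total_seconds() / 3600

    return {
        "deploys_per_day": len(deploys) / days,
        "lead_time_hours": median(hours(d.deployed_at - d.merged_at) for d in deploys),
        "change_failure_rate": sum(d.failed for d in deploys) / len(deploys),
        "time_to_restore_hours": median(hours(i.resolved_at - i.started_at) for i in incidents),
    }
```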


  9. There is a wide gap between elite teams and the bottom 50%.


  10. It really, really, really, really, really pays off
    to be on a high-performing team. Like REALLY.


  11. Q: What happens when an
    engineer from the elite yellow
    bubble joins a team in the blue
    bubble?
    A: Your productivity tends
    to rise (or fall) to match that
    of the team you join.


  12. Also, we waste a LOT of time.
    https://stripe.com/reports/developer-coefficient-2018
    42% of developer time!!!


  13. How do we build
    high-performing teams?
    “Just hire the BEST
    ENGINEERS”
    (It is probably more accurate to say that
    high-performing teams produce great engineers
    than vice versa.)


  14. Who will be the better engineer in two years?
    3000 deploys/year, 9 outages/year, 6 hours firefighting
    versus
    5 deploys/year, 65 outages/year, constant firefighting
    Compelling Anecdata!


  15. How do we improve the functioning of
    our sociotechnical system, so that the
    team can operate at a higher level?
    This is a systems problem.
    How do we build
    high-performing teams?


  16. sociotechnical (n)
    “Technology is the sum of ways in which social groups construct the
    material objects of their civilizations. The things made are socially
    constructed just as much as technically constructed. The merging of these
    two things, construction and insight, is sociotechnology” — wikipedia
    if you change the tools people use,
    you can change how they behave and even who they are.


  17. sociotechnical (n)
    Values
    Practices
    Tools


  19. team of humans
    production systems
    tools+processes
    Values
    Practices
    Tools


  20. team of humans
    production systems
    tools+processes
    sociotechnical (n)


  21. Why are computers hard?
    Because we don't understand them
    And we keep shipping things anyway
    Our tools have rewarded guessing over debugging
    And vendors have happily misled you for $$$$
    It’s time to change this, by hooking up sociotechnical loops with o11y


  22. tools+processes
    Use your tools and processes to
    improve your tools and processes.
    “if you change the tools people use, you can change how they behave and even who they are.”
    Practice Observability-Driven Development (ODD)


  23. observability(n):
    “In control theory, observability is a measure of how well
    internal states of a system can be inferred from knowledge of
    its external outputs. The observability** and controllability of a
    system are mathematical duals." — wikipedia
    **observability is not monitoring, though both are forms of telemetry.
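
    For reference, the textbook control-theory formulation behind that definition (standard material, not from the slides): observability of a linear system has a concrete rank test, and the duality with controllability is exact.

```latex
% Linear system
\dot{x} = A x + B u, \qquad y = C x

% Observable iff the observability matrix has full rank n
\mathcal{O} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{\,n-1} \end{bmatrix},
\qquad \operatorname{rank}(\mathcal{O}) = n

% Duality: (A, C) observable \iff (A^{\top}, C^{\top}) controllable
```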


  24. Can you understand what’s happening inside your systems, just
    by asking questions from the outside? Can you figure out what
    transpired and identify any system state?
    Can you answer any arbitrary new question …
    without shipping new code?
    o11y for software engineers:


  25. The Bar: It’s not observability unless it meets these reqs.
    For more — read https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/
    • High cardinality. High dimensionality
    • Composed of arbitrarily-wide structured events (not metrics, not unstructured logs)
    • Exploratory, open-ended investigation instead of dashboards
    • Can be visualized as a waterfall trace over time if span_id fields are included
    • No indexes, schemas, or predefined structure
    • Bundles the full context of the request across service hops
    • Aggregates only at compute/read time across raw events
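
    A minimal sketch of what "arbitrarily-wide structured events" look like at the instrumentation layer; the field names and the emit() helper are illustrative assumptions, not any specific vendor's API.

```python
import json
import time
import uuid

def emit(event: dict) -> None:
    # Hypothetical sink: in practice this goes to your event store / tracing backend.
    print(json.dumps(event))

def serve_photo(path: str) -> int:
    # Stand-in for the real handler.
    return 200

def handle_request(path: str, user_id: str, locale: str, user_agent: str) -> int:
    """Emit one wide structured event per request, per service hop."""
    event = {
        "timestamp": time.time(),
        "trace_id": str(uuid.uuid4()),     # propagate across hops to stitch a waterfall trace
        "span_id": str(uuid.uuid4()),
        "service": "photo-api",
        "endpoint": path,
        "build_id": "abc123",              # which deploy served this request
        # High-cardinality, high-dimensionality fields are the point:
        # any of them can be filtered or grouped on later, with no predefined schema.
        "user_id": user_id,
        "language_pack": locale,
        "device": user_agent,
        "instance_type": "c2.4xlarge",
        "availability_zone": "us-east-1b",
    }
    start = time.time()
    try:
        event["status_code"] = serve_photo(path)
        return event["status_code"]
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.time() - start) * 1000
        emit(event)                        # raw events are kept; aggregation happens at read time
```

    The trace_id and span_id fields are what later let the same raw events be rendered as a waterfall trace, per the list above.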


  26. You have an observable system
    when your team can quickly and reliably diagnose
    any new behavior with no prior knowledge.
    observability begins with
    rich instrumentation, putting you in
    constant conversation with your code
    well-understood systems require
    minimal time spent firefighting


  27. Monitoring Examples for a LAMP stack
    “Photos are loading slowly for some people. Why?” … monitor these things:
    The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression,
    or maybe some app instances are down.
    DB queries are slower than normal. It looks like the disk write throughput is saturated
    on the db data volume.
    Errors are high. Check the dashboard with a breakdown of error types and look for when
    it changed.


  28. “Photos are loading slowly for some people. Why?” … wtf do i ‘monitor’ for?!
    (Parse/Instagram questions; these require o11y)
    Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a
    1/20 chance of running on degraded hardware, and will take 20x longer to complete
    requests that hit the disk with a blocking call. This disproportionately impacts people
    looking at older archives due to our fanout model.
    Canadian users who are using the French language pack on the iPad running iOS 9 are
    hitting a firmware condition which makes saving to the local cache fail … which is why
    it FEELS like photos are loading slowly.
    Our newest SDK makes db queries sequentially if the developer has enabled an optional
    feature flag. Working as intended; the reporters all had debug mode enabled. But the
    flag should be renamed for clarity’s sake.
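
    Answering questions like these means slicing raw, high-cardinality events on arbitrary fields until the guilty group pops out. A rough sketch of that exploration, using plain Python over the wide events sketched earlier as a stand-in for a real query engine:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def p95(values: List[float]) -> float:
    # 95th-percentile latency of a group of events.
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]

def latency_by(events: List[dict], *fields: str) -> Dict[Tuple, float]:
    """Group raw request events by any combination of fields and compare p95 latency."""
    groups: Dict[Tuple, List[float]] = defaultdict(list)
    for e in events:
        groups[tuple(e.get(f) for f in fields)].append(e["duration_ms"])
    return {key: p95(durations) for key, durations in groups.items()}

# "Photos are loading slowly for some people. Why?" -- keep slicing until the slow group shows up:
#   latency_by(events, "availability_zone", "instance_type")
#   latency_by(events, "language_pack", "device")
#   latency_by(events, "user_id")   # one bucket per user is fine when you keep raw events
```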


  29. "I have twenty microservices and a sharded db and
    three other data stores across three regions, and
    everything seems to be getting a little bit slower
    over the past two weeks but nothing has changed
    that we know of, and oddly, latency is usually back to
    the historical norm on Tuesdays.
    “All twenty app microservices have 10% of available
    nodes enter a simultaneous crash loop cycle, about
    five times a day, at unpredictable intervals. They have
    nothing in common afaik and it doesn’t seem to
    impact the stateful services. It clears up before we
    can debug it, every time.”
    “Our users can compose their own queries that we
    execute server-side, and we don’t surface it to them
    when they are accidentally doing full table scans or
    even multiple full table scans, so they blame us.”
    “Disney is complaining that once in a while, but not
    always, they don’t see the photo they expected to
    see — they see someone else’s photo! When they
    refresh, it’s fixed. Actually, we’ve had a few other
    people report this too, we just didn’t believe them.”
    “Sometimes a bot takes off, or an app is featured on
    the iTunes store, and it takes us a long long time to
    track down which app or user is generating
    disproportionate pressure on shared components of
    our system (esp databases). It’s different every time.”
    (continued)


  30. Why now? Complexity is soaring;
    the ratio of unknown-unknowns to known-unknowns has flipped:
    • Ephemeral and dynamic
    • Far-flung and loosely coupled
    • Partitioned, sharded
    • Distributed and replicated
    • Containers, schedulers
    • Service registries
    • Polyglot persistence strategies
    • Autoscaled, multiple failover
    • Emergent behaviors
    • ... etc


  31. 2003 -> 2013: known-unknowns -> unknown-unknowns
    With a LAMP stack, you could lean on playbooks, guesses, pattern-matching and
    monitoring tools. Now we have to instrument for observability … or we are screwed.


  32. Complexity is exploding everywhere,
    but our tools were designed for a predictable world
    Observability is the first step to high-performing teams because most
    teams are flying in the dark and don’t even know it, and everything
    gets so much easier once you can SEE.WHERE.YOU.ARE.GOING.
    They are using logs (where you have to know what you’re looking for) or metrics (pre-aggregated and don’t
    support high cardinality, so you can’t ask any detailed question or iterate/drill down on a question).


  33. Without observability, your team must resort to guessing, pattern-matching and
    arguments from authority, and you will struggle to connect simple feedback loops
    in a timely manner.
    Observability enables you to inspect cause and effect at a granular level — at the
    level of functions, endpoints and requests. It’s like putting your glasses on before
    you drive off down the highway. This is a prerequisite for software engineers to own
    their code in production.


  34. "I don't have time to invest in observability right now. Maybe later”
    You can't afford not to.


  35. Observability Maturity Model
    1. Resiliency to failure
    2. High-quality code
    3. Manage complexity and technical debt
    4. Predictable releases
    5. Understand user behavior
    … find your weakest category, and tackle that first. Rinse, repeat.
    https://www.honeycomb.io/wp-content/uploads/2019/06/Framework-for-an-Observability-Maturity-Model.pdf


  36. 1. Resiliency to failure
    2. High-quality code
    3. Manage complexity and technical debt
    4. Predictable releases
    5. Understand user behavior
    observability maturity model (OMM)


  41. T.D.D. + Prod (production systems) == O.D.D.
    Observability-Driven Development


  42. never accept a pull request unless you can answer,
    “how will I know when this breaks?” via your instrumentation
    deploy one mergeset at a time. watch your code roll out,
    then look thru the lens of your instrumentation and ask:
    “working as intended? anything else look weird?”
    and always wrap code in feature flags.
    “O.D.D.”
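
    A minimal sketch of these practices in code: the new path ships dark behind a flag, and the event answers "how will I know when this breaks?" before the pull request is accepted. flag_enabled(), emit(), and the rank_feed_* functions are hypothetical placeholders, not a specific flag service or vendor API.

```python
import json

def emit(event: dict) -> None:
    print(json.dumps(event))          # hypothetical sink for wide events

def flag_enabled(name: str, user_id: str) -> bool:
    return False                      # hypothetical flag service; ramp up gradually after deploy

def rank_feed_v1(user_id: str) -> list:
    return ["old", "ranking"]         # existing code path

def rank_feed_v2(user_id: str) -> list:
    return ["new", "ranking"]         # new code path, shipped dark behind the flag

def render_feed(user_id: str) -> list:
    # Deploying this code does not release it: flipping the flag is the release.
    if flag_enabled("new_feed_ranking", user_id):
        result, path = rank_feed_v2(user_id), "v2"
    else:
        result, path = rank_feed_v1(user_id), "v1"
    # Tag every event with the code path, so you can watch v2 roll out through your instrumentation.
    emit({"event": "feed_rendered", "user_id": user_id,
          "ranking_path": path, "result_count": len(result)})
    return result
```

    Because the flag gates the release, the deploy itself stays boring; you ramp the flag up while watching the v1/v2 split in your instrumentation and asking "working as intended? anything else look weird?"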


  43. tools+processes
    Practice Observability-Driven Development (ODD)
    What you need to do is improve your tools and processes with your tools and
    processes. For example:
    • Connect output with actor upon action. Include rich context (see the sketch after this list).
    • Shorten the intervals between action and result.
    • Signal-boost warnings, errors, and unexpected results
    • Ship smaller changes more often, with clear atomic owners
    • Instrument vigorously. Develop rich conventions and patterns for telemetry
    • Decouple deploys from releases
    • Reward curiosity with meaningful answers (and more questions)
    • Make it easy to be data-driven. Make it a cultural virtue.
    • Bring software engineers into production; build guard rails
    • Make code go live by default after merge. Do the right thing by default, with no
    manual action.
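
    One small, concrete version of "connect output with actor upon action": emit a deploy marker carrying who shipped which commit, so effects in production line up with a named change. The fields and the emit() sink below are illustrative assumptions.

```python
import getpass
import json
import subprocess
import time

def emit(event: dict) -> None:
    print(json.dumps(event))   # hypothetical sink, alongside your request events

def mark_deploy(service: str) -> None:
    """Emit a deploy marker so every change in the graphs has an actor and a commit attached."""
    # Assumes the deploy runs from a git checkout.
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    emit({
        "event": "deploy",
        "service": service,
        "commit": sha,
        "deployed_by": getpass.getuser(),
        "deployed_at": time.time(),
    })
```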


  44. insidious loop:
    engineer merges diff. hours pass, multiple other engineers merge too
    someone triggers a deploy with a few days’ worth of merges
    the deploy fails, takes down the site, and pages on call
    who manually rolls back, then begins git bisecting
    this eats up her day, and those of multiple other engineers
    everybody bitches about how on call sucks
    50+ engineer-hours to ship this change


  45. it doesn’t have to be that bad. virtuous loop:
    engineer merges diff, which kicks off an automatic CI/CD pipeline and deploy
    deploy fails; it notifies the engineer who merged and reverts to safety
    who swiftly spots the error via his instrumentation
    then adds tests & instrumentation to better detect it
    and promptly commits a fix
    eng time to ship this change: 10 min
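
    A rough sketch of that loop as automation, under the assumption that your pipeline exposes deploy, health-check, rollback, and notify hooks; all four callables here are placeholders for whatever your tooling actually provides.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Build:
    sha: str
    author: str

def ship(build: Build,
         deploy: Callable[[Build], None],
         healthy: Callable[[Build], bool],
         rollback: Callable[[Build], None],
         notify: Callable[[str, str], None]) -> bool:
    """Deploy one mergeset, watch it through instrumentation, revert and notify on failure."""
    deploy(build)
    for _ in range(10):                      # watch the rollout for a few minutes
        time.sleep(30)
        if not healthy(build):
            rollback(build)                  # revert to safety automatically
            notify(build.author, f"deploy {build.sha} reverted; check your instrumentation")
            return False
    notify(build.author, f"deploy {build.sha} looks healthy; anything else look weird?")
    return True
```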


  46. Who will be happier and more fulfilled?
    3000 deploys/year, 9 outages/year, 6 hours firefighting
    versus
    5 deploys/year, 65 outages/year, constant firefighting


  47. team of humans
    production systems
    tools+processes
    stop flying blind.
    instrument for o11y, modernize your toolset,
    move swiftly with confidence.
    use the principles of O.D.D.
    — measure, instrument, test, inspect, repeat —
    and the four core DORA metrics
    to ship faster and safer


  48. In order to spend more of your time on productive activities,
    instrument, observe, and iterate on the tools and processes
    that gather, validate and ship your collective output as a team.
    Join teams that honor and value this work and are committed to
    consistently improving how they operate — not just shipping features.
    Look for teams that are humble and relentlessly focused
    on investing in their core business differentiators.
    Join teams that value junior engineers, and invest in their potential.


  49. iterate and optimize:
    look for ways to save time; ship smaller changesets more often
    instrument, observe, measure before you act
    connect output directly to the actor with context
    shorten intervals between action and effect
    instrument vigorously, boost negative signals
    decouple deploys and releases


  50. where are we going?


  51. Here's the dirty little secret.
    The next generation of systems won't be built and run by burned out, exhausted
    people, or command-and-control teams just following orders.
    It can't be done.
    they've become too complicated. too hard.


  52. You can no longer model these systems in your head
    and leap to the solution -- you will be readily
    outcompeted by teams with modern tools.
    Our systems are emergent and unpredictable. Runbooks and
    canned playbooks be damned; we need your full creative self.


  53. on call will be shared by everyone who writes code.
    on call must be not-terrible.
    invest in your deploys, instrument everything,
    democratize ownership over production,
    craft and curate feedback loops
    (don’t be scared by regulations)


  54. Your labor is a scarce and precious resource.
    Lend it to those who are worthy of it.
    You only get one career; high-performing teams let you spend more time learning and
    building, not mired in tech debt and shitty processes that waste your life force.


  55. we have an opportunity here to make things better
    let's do it <3


  56. Charity Majors
    @mipsytipsy
