The Sociotechnical Path to High-Performing Teams II

Slide 1

Slide 1 text

The Socio-Technical Path to ✨High-Performing✨ Teams: Observability and the Glorious Future. @mipsytipsy

Slide 2

Slide 2 text

Teams: the irreducible building block by which we organize ourselves and coordinate and scale our labor.

Slide 3

Slide 3 text

@mipsytipsy engineer/cofounder/CTO https://charity.wtf

Slide 4

Slide 4 text

The teams you join will define your career more than any other single factor.

Slide 5

Slide 5 text

autonomy, learning, high-achieving, learned from our mistakes, curious, responsibility, ownership, inspiring, camaraderie, pride, collaboration, career growth, rewarding, motivating

manual labor, sacred cows, wasted effort, stale tech, ass-covering, fear, fiefdoms, excessive toil, command-and-control, cargo culting, enervating, discouraging, lethargy, indifference

Slide 6

Slide 6 text

A high-performing team isn’t just fun to be on. Kind, inclusive coworkers and a great work/life balance are good things, but … they perform.

Slide 7

Slide 7 text

How well does YOUR team perform? 4 key metrics. https://services.google.com/fh/files/misc/state-of-devops-2019.pdf

Slide 8

Slide 8 text

1. How frequently do you deploy?
2. How long does it take for code to go live?
3. How many of your deploys fail?
4. How long does it take to recover from an outage?
5. How often are you paged outside work hours?
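As a rough illustration of how a team might answer these five questions from its own records, here is a minimal Python sketch. Everything in it is an assumption made for the example and not something from the talk or the linked report: the `Deploy` record, the `team_metrics` function, the choice of medians, and the definition of "outside work hours" as weekends or any time before 09:00 or after 18:00.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical record of one deploy; field names are illustrative."""
    committed_at: datetime                    # when the change was merged
    deployed_at: datetime                     # when it went live
    failed: bool = False                      # did this deploy cause a failure?
    recovered_at: Optional[datetime] = None   # when service was restored


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def team_metrics(deploys, pages, period_days):
    """Summarize the five questions for one reporting period.

    Expects at least one deploy; `pages` is a list of page timestamps.
    """
    failures = [d for d in deploys if d.failed]
    recoveries = [hours(d.recovered_at - d.deployed_at)
                  for d in failures if d.recovered_at is not None]
    # Assumption: off-hours = weekends, or before 09:00 / from 18:00 onward.
    off_hours = [p for p in pages
                 if p.weekday() >= 5 or not 9 <= p.hour < 18]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_h": median(
            hours(d.deployed_at - d.committed_at) for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_recover_h": median(recoveries) if recoveries else None,
        "off_hours_pages": len(off_hours),
    }


start = datetime(2024, 1, 1, 10, 0)  # a Monday morning
deploys = [
    Deploy(start, start + timedelta(hours=1)),
    Deploy(start, start + timedelta(hours=2)),
    Deploy(start, start + timedelta(hours=3), failed=True,
           recovered_at=start + timedelta(hours=3, minutes=30)),
    Deploy(start, start + timedelta(hours=4)),
]
pages = [datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 14, 0)]
report = team_metrics(deploys, pages, period_days=2)
print(report)
```

The point of the sketch is only that all five numbers fall out of data most teams already have: deploy timestamps, incident timestamps, and pager history.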

Slide 9

Slide 9 text

There is a wide gap between elite teams and the bottom 50%.

Slide 10

Slide 10 text

It really, really, really, really, really pays off to be on a high-performing team. Like REALLY.

Slide 11

Slide 11 text

Q: What happens when an engineer from the elite yellow bubble joins a team in the blue bubble? A: Your productivity tends to rise (or fall) to match that of the team you join.

Slide 12

Slide 12 text

Also, we waste a LOT of time. 42%!!! https://stripe.com/reports/developer-coefficient-2018

Slide 13

Slide 13 text

How do we build high-performing teams? “Just hire the BEST ENGINEERS” (It is probably more accurate to say that high-performing teams produce great engineers than vice versa.)

Slide 14

Slide 14 text

Who will be the better engineer in two years?
3000 deploys/year, 9 outages/year, 6 hours firefighting
5 deploys/year, 65 outages/year, firefighting: constant
Compelling Anecdata!

Slide 15

Slide 15 text

How do we build high-performing teams? This is a systems problem. How do we improve the functioning of our sociotechnical system, so that the team can operate at a higher level?

Slide 16

Slide 16 text

sociotechnical (n): “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology.” (Wikipedia)

If you change the tools people use, you can change how they behave and even who they are.

Slide 17

Slide 17 text

sociotechnical (n): Values, Practices, Tools

Slide 18

Slide 18 text

sociotechnical (n): Values, Practices, Tools

Slide 19

Slide 19 text

team of humans, production systems, tools+processes. Values, Practices, Tools.

Slide 20

Slide 20 text

sociotechnical (n): team of humans, production systems, tools+processes

Slide 21

Slide 21 text

Why are computers hard?
Because we don't understand them.
And we keep shipping things anyway.
Our tools have rewarded guessing over debugging.
And vendors have happily misled you for $$$$.
It’s time to change this, by hooking up sociotechnical loops with o11y.

Slide 22

Slide 22 text

tools+processes: Use your tools and processes to improve your tools and processes. “If you change the tools people use, you can change how they behave and even who they are.” Practice Observability-Driven Development (ODD).

Slide 23

Slide 23 text

observability (n): “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability** and controllability of a system are mathematical duals.” (Wikipedia)

**Observability is not monitoring, though both are forms of telemetry.

Slide 24

Slide 24 text

o11y for software engineers:
• Can you understand what’s happening inside your systems, just by asking questions from the outside?
• Can you figure out what transpired and identify any system state?
• Can you answer any arbitrary new question … without shipping new code?

Slide 25

Slide 25 text

The Bar: It’s not observability unless it meets these reqs.
• High cardinality. High dimensionality
• Composed of arbitrarily-wide structured events (not metrics, not unstructured logs)
• Exploratory, open-ended investigation instead of dashboards
• Can visualize in waterfall trace by time if span_id fields are included
• No indexes, schemas, or predefined structure
• Bundles the full context of the request across service hops
• Aggregates only at compute/read time across raw events
For more, read https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/
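To make "arbitrarily-wide structured events" concrete, here is a minimal Python sketch of one way to collect a single wide event per request per service and emit it once as JSON. It is an illustration under assumptions, not any vendor's SDK: the `WideEvent` class, the `handle_photo_request` handler, and every field name and value other than `span_id` are hypothetical.

```python
import json
import time
import uuid


class WideEvent:
    """Collects every field for one request in one service, then emits once."""

    def __init__(self, **fields):
        self._start = time.monotonic()
        self.fields = dict(fields)

    def add(self, **fields):
        # No schema: any code path can attach whatever context it has.
        self.fields.update(fields)

    def emit(self):
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        # One raw, structured record; nothing is aggregated at write time.
        return json.dumps(self.fields, sort_keys=True)


def handle_photo_request(user_id, trace_id, parent_span_id=None):
    """Hypothetical request handler that builds up one wide event."""
    event = WideEvent(
        service="photo-api",             # hypothetical service name
        trace_id=trace_id,               # shared across service hops
        span_id=uuid.uuid4().hex[:16],   # lets a tool draw a waterfall trace
        parent_span_id=parent_span_id,
        user_id=user_id,                 # high-cardinality field
    )
    event.add(endpoint="/photos", build_id="example-build", cache_hit=False)
    event.add(db_query_count=2, status=200)
    return event.emit()


line = handle_photo_request("user-8675309", trace_id=uuid.uuid4().hex)
print(line)
```

Passing the same `trace_id` (and the caller's `span_id` as `parent_span_id`) to downstream services is what would let the full request context be stitched together across hops.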

Slide 26

Slide 26 text

You have an observable system when your team can quickly and reliably diagnose any new behavior with no prior knowledge.
Observability begins with rich instrumentation, putting you in constant conversation with your code.
Well-understood systems require minimal time spent firefighting.

Slide 27

Slide 27 text

Monitoring Examples for a LAMP stack

“Photos are loading slowly for some people. Why?”

• The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down.
• DB queries are slower than normal. It looks like the disk write throughput is saturated on the db data volume.
• Errors are high. Check the dashboard with a breakdown of error types and look for when it changed.

monitor these things

Slide 28

Slide 28 text

(Parse/Instagram questions; these require o11y)

“Photos are loading slowly for some people. Why?”

Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model.

Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly.

Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake.

wtf do i ‘monitor’ for?!

Slide 29

Slide 29 text

(continued)

“I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays.”

“All twenty app microservices have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.”

“Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”

“Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see; they see someone else’s photo! When they refresh, it’s fixed. Actually, we’ve had a few other people report this too, we just didn’t believe them.”

“Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). It’s different every time.”

Slide 30

Slide 30 text

Why now? Complexity is soaring; the ratio of unknown-unknowns to known-unknowns has flipped.
• Ephemeral and dynamic
• Far-flung and loosely coupled
• Partitioned, sharded
• Distributed and replicated
• Containers, schedulers
• Service registries
• Polyglot persistence strategies
• Autoscaled, multiple failover
• Emergent behaviors
• ... etc

Slide 31

Slide 31 text

With a LAMP stack, you could lean on playbooks, guesses, pattern-matching and monitoring tools.
Now we have to instrument for observability, or we are screwed.
(2003, 2013) known-unknowns -> unknown-unknowns

Slide 32

Slide 32 text

Complexity is exploding everywhere, but our tools were designed for a predictable world. Observability is the first step to high-performing teams because most teams are flying in the dark and don’t even know it, and everything gets so much easier once you can SEE.WHERE.YOU.ARE.GOING. They are using logs (where you have to know what you’re looking for) or metrics (pre-aggregated and don’t support high cardinality, so you can’t ask any detailed question or iterate/drill down on a question).
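The difference between pre-aggregated metrics and raw events can be sketched in a few lines of Python. This is a toy illustration under assumptions, not how any particular metrics or observability product works: the sample events, the field names, and the `count_by` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical raw request events; field names and values are illustrative.
events = [
    {"endpoint": "/photos", "status": 500, "user_id": "u-17", "app_id": "a-3"},
    {"endpoint": "/photos", "status": 500, "user_id": "u-17", "app_id": "a-3"},
    {"endpoint": "/photos", "status": 200, "user_id": "u-42", "app_id": "a-9"},
    {"endpoint": "/login", "status": 500, "user_id": "u-17", "app_id": "a-3"},
    {"endpoint": "/login", "status": 200, "user_id": "u-99", "app_id": "a-9"},
]


def is_error(event):
    return event["status"] >= 500


# Metric-style view: aggregated by a dimension chosen ahead of time.
# If only these counts were stored, user_id and app_id would be gone,
# and "which user is behind the errors?" could not be asked later.
errors_by_endpoint = Counter(e["endpoint"] for e in events if is_error(e))


def count_by(events, fields, where=lambda e: True):
    """Aggregate at read time over raw events, by any combination of fields."""
    return Counter(tuple(e[f] for f in fields) for e in events if where(e))


# Event-style view: a new question, answered without shipping new code.
errors_by_user_app = count_by(events, ["user_id", "app_id"], where=is_error)

print(errors_by_endpoint)
print(errors_by_user_app.most_common(1))
```

Because `count_by` works on the raw events, the next question (by build, by region, by any field that was recorded) is just a different `fields` list, which is the iterate and drill-down loop that pre-aggregation rules out.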

Slide 33

Slide 33 text

Observability enables you to inspect cause and effect at a granular level: the level of functions, endpoints and requests. This is a prerequisite for software engineers to own their code in production. It’s like putting your glasses on before you drive off down the highway. Without observability, your team must resort to guessing, pattern-matching and arguments from authority, and you will struggle to connect simple feedback loops in a timely manner.

Slide 34

Slide 34 text

“I don't have time to invest in observability right now. Maybe later.” You can't afford not to.

Slide 35

Slide 35 text

Observability Maturity Model
1. Resiliency to failure
2. High-quality code
3. Manage complexity and technical debt
4. Predictable releases
5. Understand user behavior
… find your weakest category, and tackle that first. Rinse, repeat.
https://www.honeycomb.io/wp-content/uploads/2019/06/Framework-for-an-Observability-Maturity-Model.pdf