Lessons from building an Internal Platform

Slide 1

Slide 1 text

Lessons from building an Internal . . . Platform

Slide 2

Slide 2 text

on-call CRITICAL VULN PAGES ALERT #423: Critical Exploit in Flarelang v1.5.1 and lower detected. now

Slide 3

Slide 3 text

on-call disrupts

Slide 4

Slide 4 text

on-call mission

Slide 5

Slide 5 text

on-call identify every on-call engineer of every service we run publicly mission

Slide 6

Slide 6 text

affected teams and codebases 100% Platform Team Mapping the impact

Slide 7

Slide 7 text

affected teams and codebases 95% Platform Team Mapping the impact

Slide 8

Slide 8 text

affected teams and codebases 63% 32% Platform Team Mapping the impact

Slide 9

Slide 9 text

find a way to reach the 63% mission

Slide 10

Slide 10 text

Principal Site Reliability Engineer, Internal Platform and Services he / him @martinb3 Martin Smith

Slide 11

Slide 11 text

find who is running the latest release of Nomad mission

Slide 12

Slide 12 text

Senior Developer Advocate, Infrastructure & Orchestration he / him @ksatirli Kerim Satirli

Slide 13

Slide 13 text

Collecting the data

internal_platform.go

    package main

    import "fmt"

    func main() {
        services := []string{"service-1", "service-2", "service-n"}

        // identify affected `services`
        for _, service := range services {
            fmt.Println(service) // placeholder so the loop variable is used
        }
    }
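The loop on this slide leaves the "identify" step empty. As a minimal sketch of one way to fill it in, assume each service reports the version of the vulnerable component it runs (the inventory map, `parseVersion`, `atOrBelow` and `affectedServices` are all hypothetical names, not part of the talk) and that versions are plain `vMAJOR.MINOR.PATCH` strings. A service counts as affected when its version is at or below the one named in the alert:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseVersion turns "v1.5.1" into [1 5 1]. Missing parts count as zero.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) > 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atOrBelow reports whether version v is less than or equal to limit.
func atOrBelow(v, limit [3]int) bool {
	for i := range v {
		if v[i] != limit[i] {
			return v[i] < limit[i]
		}
	}
	return true
}

// affectedServices returns, sorted by name, the services whose version
// is at or below limit.
func affectedServices(inventory map[string]string, limit string) ([]string, error) {
	lim, err := parseVersion(limit)
	if err != nil {
		return nil, err
	}
	var affected []string
	for service, version := range inventory {
		v, err := parseVersion(version)
		if err != nil {
			return nil, err
		}
		if atOrBelow(v, lim) {
			affected = append(affected, service)
		}
	}
	sort.Strings(affected)
	return affected, nil
}

func main() {
	// Hypothetical inventory: service name -> version of the vulnerable component.
	inventory := map[string]string{
		"service-1": "v1.5.1",
		"service-2": "v1.6.0",
		"service-n": "v1.4.9",
	}
	affected, err := affectedServices(inventory, "v1.5.1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(affected) // prints [service-1 service-n]
}
```

The hard part in practice is building the inventory, not comparing versions, which is what the collector slides address.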

Slide 14

Slide 14 text

affected teams and codebases Platform Team 63% 32% Mapping the impact

Slide 15

Slide 15 text

Collectors GitHub Incidents Datadog team health service health product health monitors dashboards errors / traces static collectors teams heuristics groupings Platform Team repo contents api data event driven Collecting the data

Slide 16

Slide 16 text

Rendering the data

Slide 17

Slide 17 text

Rendering the data

Slide 18

Slide 18 text

02 Why we are building it

Slide 19

Slide 19 text

GitHub static Datadog Incidents monitors dashboards errors and traces team health service health product health teams codebases Slack channels repo metadata full source code deploy health Classifying the data

Slide 20

Slide 20 text

GitHub static Datadog Incidents logging best practices service attribution passing SLOs synthetic monitors team attribution escalation policy silenced incidents incident volume known teams Go versions SDK versions partitions code owners PR templates branch protection secret scanning Classifying the data
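Several of the classifications on this slide read like yes/no questions about a service. A minimal sketch of that idea, assuming the collected data has already been reduced to a few boolean and string attributes (the `Service` fields, the `Check` type and `defaultChecks` are hypothetical, and only a subset of the slide's items is shown):

```go
package main

import "fmt"

// Service holds a few hypothetical attributes gathered by collectors.
type Service struct {
	Name             string
	Team             string
	HasCodeOwners    bool
	BranchProtection bool
	PassingSLOs      bool
}

// Check is a named yes/no question about a service.
type Check struct {
	Name string
	Pass func(Service) bool
}

// defaultChecks is an illustrative subset of the classifications on the slide.
func defaultChecks() []Check {
	return []Check{
		{"team attribution", func(s Service) bool { return s.Team != "" }},
		{"code owners", func(s Service) bool { return s.HasCodeOwners }},
		{"branch protection", func(s Service) bool { return s.BranchProtection }},
		{"passing SLOs", func(s Service) bool { return s.PassingSLOs }},
	}
}

// failing returns the names of the checks a service does not pass,
// in the order the checks were given.
func failing(s Service, checks []Check) []string {
	var names []string
	for _, c := range checks {
		if !c.Pass(s) {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	s := Service{Name: "service-1", Team: "platform", BranchProtection: true}
	fmt.Println(s.Name, "fails:", failing(s, defaultChecks()))
	// prints: service-1 fails: [code owners passing SLOs]
}
```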

Slide 21

Slide 21 text

lots of data = lots of problems

Slide 22

Slide 22 text

lots of data = lots of use cases

Slide 23

Slide 23 text

"A collection of tools, services, and infrastructure components that help developers manage the entire application lifecycle."

Slide 24

Slide 24 text

Attribute Ensure Detect Standardize Are there errors? Has it been deployed recently? Has it been scanned for vulnerabilities? Is it on a recent version? Is it using our standard monitoring? Does it lose alerts? Does it have SLOs, and are they being met? Does it have synthetic monitoring? Are there wake-up monitors? What team owns this? Where is this in GitHub, Slack or PagerDuty? Where are the dashboards? Mapping the problems

Slide 25

Slide 25 text

Mapping the personas

Engineering: How far behind is my service on Go or internal SDKs? How do I enable DR for my service?

SREs: Does this service have SLOs and how is it tracking against those SLOs? Is the owner team on-call for this service?

Governance: Is code scanning enabled on this repository? Is this service subject to PCI controls?

Slide 26

Slide 26 text

open and opinionated everyone can submit data everyone can check data one-stop data vending data is ephemeral schemas as safeguards common helpers Mapping the guardrails
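"Schemas as safeguards" pairs naturally with "everyone can submit data": if anyone may submit, submissions need a shape check before they are accepted. The talk does not show its schema format, so the following is only a sketch with a hypothetical `Schema` type that lists required fields and their kinds. Numbers are assumed to arrive as `float64`, as they would after decoding JSON into `map[string]any`.

```go
package main

import (
	"fmt"
	"sort"
)

// Kind is the expected type of a submitted field.
type Kind int

const (
	String Kind = iota
	Bool
	Number
)

// Schema maps each required field name to its expected kind (hypothetical format).
type Schema map[string]Kind

// matches reports whether value v has the dynamic type that kind k expects.
func matches(k Kind, v any) bool {
	switch k {
	case String:
		_, ok := v.(string)
		return ok
	case Bool:
		_, ok := v.(bool)
		return ok
	case Number:
		_, ok := v.(float64)
		return ok
	}
	return false
}

// validate returns a sorted list of problems; an empty list means the
// submission satisfies the schema.
func validate(schema Schema, submission map[string]any) []string {
	var problems []string
	for field, kind := range schema {
		value, present := submission[field]
		switch {
		case !present:
			problems = append(problems, "missing field: "+field)
		case !matches(kind, value):
			problems = append(problems, "wrong type for field: "+field)
		}
	}
	for field := range submission {
		if _, known := schema[field]; !known {
			problems = append(problems, "unknown field: "+field)
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	schema := Schema{"service": String, "team": String, "has_slos": Bool}
	submission := map[string]any{"service": "service-1", "has_slos": "yes", "color": "blue"}
	for _, p := range validate(schema, submission) {
		fmt.Println(p)
	}
}
```

Rejecting unknown fields is a design choice in this sketch; a more permissive platform might accept them and only warn.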

Slide 27

Slide 27 text

Mapping the requirements

Engineering wants a Developer Platform: What developers need to know and have access to for their job and services.

SRE wants a Reliability Platform: What leaders need to be aware of related to customer experience and prioritization.

Governance wants an Intelligence Platform: What security and systems engineering teams need to understand how operators are supporting their services.

Slide 28

Slide 28 text

loose data team health service health product health monitors dashboard errors / traces teams heuristics groupings api data events repo contents structured data product and service teams scenarios and checks service stacks generic assets insights campaigns quality score custom checks . . . profit? better service health

Slide 29

Slide 29 text

Our definition of "service" is strongly tied to how we operate.

Slide 30

Slide 30 text

Data Integration Remediation Community Inability to tag Change over time Outreach Making truth consistency of data new tags or data types discovery of data Freshness Critical path Exporting Result Scoring freshness of data volume and scaling structure tradeoffs Problems we solve

Slide 31

Slide 31 text

03 Looking forward

Slide 32

Slide 32 text

Automation Reliability Targeting Understanding What PCI-covered repos have a Dependabot warning? What are we doing internally when we run our products? Company-wide SRE use cases: workflows and tiering * Feeding data to AI applications could make it much easier to operate services Future direction
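The targeting question on this slide ("What PCI-covered repos have a Dependabot warning?") becomes a simple filter once ownership, compliance scope and alert counts sit in one place. A sketch, assuming a hypothetical `Repo` record that already combines those three pieces of data, with results grouped by owning team so outreach can be aimed at the right people:

```go
package main

import (
	"fmt"
	"sort"
)

// Repo is a hypothetical record combining ownership, compliance scope
// and open Dependabot alert count for one repository.
type Repo struct {
	Name             string
	Team             string
	PCI              bool
	DependabotAlerts int
}

// campaignTargets returns, per team, the sorted names of PCI-covered
// repositories that have at least one open Dependabot alert.
func campaignTargets(repos []Repo) map[string][]string {
	targets := make(map[string][]string)
	for _, r := range repos {
		if r.PCI && r.DependabotAlerts > 0 {
			targets[r.Team] = append(targets[r.Team], r.Name)
		}
	}
	for team := range targets {
		sort.Strings(targets[team])
	}
	return targets
}

func main() {
	repos := []Repo{
		{Name: "payments-api", Team: "payments", PCI: true, DependabotAlerts: 2},
		{Name: "docs-site", Team: "web", PCI: false, DependabotAlerts: 5},
		{Name: "card-vault", Team: "payments", PCI: true, DependabotAlerts: 1},
		{Name: "ledger", Team: "finance", PCI: true, DependabotAlerts: 0},
	}
	targets := campaignTargets(repos)

	teams := make([]string, 0, len(targets))
	for team := range targets {
		teams = append(teams, team)
	}
	sort.Strings(teams)
	for _, team := range teams {
		fmt.Println(team, targets[team])
	}
	// prints: payments [card-vault payments-api]
}
```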

Slide 33

Slide 33 text

Are we able to handle unexpected situations within our SLO budget?

Slide 34

Slide 34 text

Does a more complete set of data equate to higher service quality?

Slide 35

Slide 35 text

Does the company care if a dependency is flaky across many services?

Slide 36

Slide 36 text

Platform(s)

Internal Developer Platform: What developers need to know and have access to for their job and services.

Reliability Data Platform: What leaders need to know to improve our customer experience and feature set.

Service Intelligence Platform: What Governance needs to know to influence how operators run their services.