Lessons from building an Internal Platform

In this talk, Martin Smith (Principal SRE, HashiCorp) and I look at what we learned from building an internal platform that helps developers, site reliability engineers, and governance teams make more educated choices.

This version of the talk was given at DASH by Datadog in June 2024.

Kerim Satirli

June 28, 2024

Transcript

  1. Collecting the data: internal_platform.go: func main() { services := []string{"service-1", "service-2", "service-n"}; for _, service := range services { } } // identify affected `services`
  2. Collecting the data: Collectors GitHub Incidents Datadog team health service health product health monitors dashboards errors / traces static collectors teams heuristics groupings Platform Team repo contents api data event driven
  3. Classifying the data: GitHub static Datadog Incidents monitors dashboards errors and traces team health service health product health teams codebases Slack channels repo metadata full source code deploy health
  4. Classifying the data: GitHub static Datadog Incidents logging best practices service attribution passing SLOs synthetic monitors team attribution escalation policy silenced incidents incident volume known teams Go versions SDK versions partitions code owners PR templates branch protection secret scanning
  5. "A collection of tools, services, and infrastructure components that help developers manage the entire application lifecycle."
  6. Mapping the problems: Attribute, Ensure, Detect, Standardize. Are there errors? Has it been deployed recently? Has it been scanned for vulnerabilities? Is it on a recent version? Is it using our standard monitoring? Does it lose alerts? Does it have SLOs, and are they being met? Does it have synthetic monitoring? Are there wake-up monitors? What team owns this? Where is this in GitHub, Slack or PagerDuty? Where are the dashboards?
  7. Mapping the personas. Engineering: How far behind is my service on Go or internal SDKs? How do I enable DR for my service? SREs: Does this service have SLOs and how is it tracking against those SLOs? Is the owner team on-call for this service? Governance: Is code scanning enabled on this repository? Is this service subject to PCI controls?
  8. Mapping the guardrails: open and opinionated; everyone can submit data; everyone can check data; one-stop data vending; data is ephemeral; schemas as safeguards; common helpers
  9. Mapping the requirements. Engineering wants a Developer Platform: what developers need to know and have access to for their job and services. SRE wants a Reliability Platform: what leaders need to be aware of related to customer experience and prioritization. Governance wants an Intelligence Platform: what security and systems engineering teams need to understand how operators are supporting their services.
  10. loose data team health service health product health monitors dashboard errors / traces teams heuristics groupings api data events repo contents structured data product and service teams scenarios and checks service stacks generic assets insights campaigns quality score custom checks . . . profit? better service health
  11. Problems we solve: Data Integration Remediation Community Inability to tag Change over time Outreach Making truth consistency of data new tags or data types discovery of data Freshness Critical path Exporting Result Scoring freshness of data volume and scaling structure tradeoffs
  12. Future direction: Automation, Reliability, Targeting, Understanding. What PCI-covered repos have a Dependabot warning? What are we doing internally when we run our products? Company-wide SRE use cases: workflows and tiering. * Feeding data to AI applications could make it much easier to operate services.
  13. Platform(s). Internal Developer Platform: what developers need to know and have access to for their job and services. Reliability Data Platform: what leaders need to know to improve our customer experience and feature set. Service Intelligence Platform: what Governance needs to know to influence how operators run their services.
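
The slides mention checks (for example team attribution, passing SLOs, secret scanning, Go versions) and a quality score, but the transcript does not show how either is computed. The Go sketch below is one hypothetical way to turn collected facts into a score, extending the loop from slide 1. The Service fields, the four checks, and the percentage formula are all assumptions made for illustration; none of them come from the talk.

```go
package main

import "fmt"

// Service holds facts a collector might gather about one service.
// Every field here is a hypothetical example, not the platform's real schema.
type Service struct {
	Name             string
	OwnerTeam        string // empty when no team attribution was found
	HasPassingSLOs   bool
	SecretScanning   bool
	GoReleasesBehind int // how many Go releases the service trails by
}

// Check is a named pass/fail rule evaluated against a Service.
type Check struct {
	Name string
	Pass func(Service) bool
}

// DefaultChecks returns an illustrative set of checks loosely based on
// the topics listed on the "Classifying the data" slides.
func DefaultChecks() []Check {
	return []Check{
		{"team attribution", func(s Service) bool { return s.OwnerTeam != "" }},
		{"passing SLOs", func(s Service) bool { return s.HasPassingSLOs }},
		{"secret scanning", func(s Service) bool { return s.SecretScanning }},
		{"recent Go version", func(s Service) bool { return s.GoReleasesBehind <= 1 }},
	}
}

// QualityScore returns the percentage (0-100) of checks that pass.
// An unweighted percentage is an assumption; a real score could weight checks.
func QualityScore(s Service, checks []Check) int {
	if len(checks) == 0 {
		return 0
	}
	passed := 0
	for _, c := range checks {
		if c.Pass(s) {
			passed++
		}
	}
	return passed * 100 / len(checks)
}

func main() {
	services := []Service{
		{Name: "service-1", OwnerTeam: "team-a", HasPassingSLOs: true, SecretScanning: true},
		{Name: "service-2", HasPassingSLOs: true, GoReleasesBehind: 3},
	}
	checks := DefaultChecks()
	for _, service := range services {
		fmt.Printf("%s: %d%%\n", service.Name, QualityScore(service, checks))
	}
}
```

Keeping each check as a small named function is one way to leave room for the "custom checks" the slides mention: a team could append its own Check to the slice without touching the scoring code.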