The SRE journey at RVU

Google’s Site Reliability Engineering books lay out the principles and practices of SRE and the workbooks provide great practical examples of implementing these practices. However, anyone who has tried to roll out such practices across an organisation will no doubt have run into some hurdles.

In this talk we will dig into how RVU got started on its SRE journey, from what prompted the initial discussions to how we rolled out new tooling to automate away some of the pains of adoption. We’ll cover the interfaces we built to engage with teams and the other possibilities we see in the future of our SRE automation journey.

- Data-driven conversations and visibility are the key ingredients in winning over teams and ultimately operating systems more reliably
- Rolling out SRE practices is difficult; find ways to make adoption simpler for everyone involved
- Automation is for everyone and should reach far beyond just deployments and infrastructure

Dewald Viljoen

April 08, 2020

Transcript

  1. 3 Uswitch is the UK’s top comparison website for home services switching. Money is one of the UK’s leading comparison websites for financial services. Bankrate is changing the mortgage market. We create innovative and personalised products that help you make smarter decisions. You know us better than you think
  2. 6 Diagram: Broadband, Mobiles, Energy, Financial (each with ⚙ Services and Data)
  3. 7 Diagram: Cloud Infrastructure; Broadband, Mobiles, Energy, Financial (each with ⚙ Services and Data)
  4. 8 Diagram: Cloud Infrastructure; Financial, Broadband, Mobiles, Energy (each with ⚙ Services and Data)
  5. 10 Heimdall the alerter (https://github.com/uswitch/heimdall)
      - Define alert templates
      - Watches for annotations on Ingress objects
      - Creates alerts using the templates from those annotations
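The flow on this slide (templates plus Ingress annotations producing alerts) can be sketched roughly as follows. This is not Heimdall's actual code or annotation scheme: the annotation prefix `example.com/alert-`, the template name `5xx-rate`, the metric name and the function `render_alerts` are all hypothetical and exist only to illustrate the idea.

```python
from string import Template

# Hypothetical alert templates keyed by name; not Heimdall's real templates.
ALERT_TEMPLATES = {
    "5xx-rate": Template(
        'sum(rate(ingress_requests_total{ingress="$ingress",status=~"5.."}[5m]))'
        " > $threshold"
    ),
}

# Hypothetical annotation prefix used only for this sketch.
ANNOTATION_PREFIX = "example.com/alert-"


def render_alerts(ingress: dict) -> list[dict]:
    """Turn alert annotations on an Ingress-like dict into alert rule dicts."""
    name = ingress["metadata"]["name"]
    annotations = ingress["metadata"].get("annotations", {})
    alerts = []
    for key, value in sorted(annotations.items()):
        if not key.startswith(ANNOTATION_PREFIX):
            continue  # not an alert annotation
        template_name = key[len(ANNOTATION_PREFIX):]
        template = ALERT_TEMPLATES.get(template_name)
        if template is None:
            continue  # unknown template: skip rather than fail
        alerts.append({
            "alert": f"{name}-{template_name}",
            "expr": template.substitute(ingress=name, threshold=value),
        })
    return alerts
```

The point of the design is that a team only adds an annotation with a threshold to its own Ingress; the alert expression itself lives in a shared template.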
  6. 12 This was the start of a new interface between teams, infrastructure and operations
  7. 16 Diagram: Energy, Broadband, Mobiles, Financial (each with ⚙ Services and Data); Cloud Infrastructure
  8. 18 Vault the rotator (https://github.com/uswitch/vault-webhook)
      - Define a binding between a service account and a database
      - Inject sidecars into pods
      - Sidecars fetch credentials and keep them refreshed
  9. 23 Diagram: Energy, Broadband, Mobiles, Financial (each with ⚙ Services and Data); Cloud Infrastructure
  10. 25 Energy: “We’re seeing some performance problems.” Cloud: “Odd, no one else has raised anything.” Energy: “We’ve tested our app and compared it to our existing setup, and performance is definitely worse.” Cloud: “Everything seems ok on our end.”
  11. 26 Two problems: 1. Energy expected performance on the shared platform to match their existing infrastructure. 2. Cloud had no indicator for the expected performance level and whether it was being met.
  12. 27 Timeline
      - 2017: ❤
      - 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
      - 2018: Energy team migration begins; questions arise around the platform’s service levels
  13. 29 Two problems: 1. Energy expected performance on the shared platform to match their existing infrastructure. 2. Cloud had no indicator for the expected performance level and whether it was being met.
  14. 31 1. Define a Service Level Indicator (SLI). 2. Set a Service Level Objective (SLO) for this indicator to specify the desired reliability.
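As a minimal sketch of the two steps on this slide, an SLI can be expressed as the proportion of good events and an SLO as a target for that proportion. The class and function names below are illustrative assumptions, not anything shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    objective: float  # target fraction of good events, e.g. 0.99


def sli(good_events: int, total_events: int) -> float:
    """SLI as the proportion of good events; treated as 1.0 with no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_objective(slo: SLO, good_events: int, total_events: int) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli(good_events, total_events) >= slo.objective
```

For example, 990 good requests out of 1000 give an SLI of 0.99, which just meets a 99% objective.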
  15. 34 With this measured, they put a plan in motion to bring the service level up to the objective of 99%
  16. 35 The Energy team now knows what to expect. Cloud now has a defined service level that allows it to prioritise reliability work. As long as Cloud meets or exceeds its objective, it is free to pursue other tasks.
  17. 36 Timeline
      - 2017: ❤
      - 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
      - 2018: Energy team migration begins; questions arise around the platform’s service levels
      - 2018: Cloud introduces their first Service Level Objectives for the platform
  18. 37 With the value of the first SLOs proven, Cloud continued the rollout to cover more of the platform
  19. 38 Cluster DNS

      | Category | SLI | SLO |
      | --- | --- | --- |
      | Availability | Proportion of successful requests as measured by DNS probes | 99.5% |
      | Latency | Latency of DNS requests as measured by DNS probes < 45ms | 99% |

      Logging

      | Category | SLI | SLO |
      | --- | --- | --- |
      | Freshness | Average time lag (current time - timestamp on newest log) as measured across the core indices < 1 minute | 99% |
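The two Cluster DNS SLIs on this slide could be computed from probe results along these lines. The `ProbeResult` type and function names are hypothetical, and measuring latency only over successful probes is an assumption of this sketch rather than something the slide states.

```python
from typing import NamedTuple, Optional


class ProbeResult(NamedTuple):
    succeeded: bool
    latency_ms: Optional[float]  # None when the probe failed


def availability_sli(probes: list[ProbeResult]) -> float:
    """Proportion of probes that succeeded."""
    if not probes:
        return 1.0
    return sum(1 for p in probes if p.succeeded) / len(probes)


def latency_sli(probes: list[ProbeResult], threshold_ms: float = 45.0) -> float:
    """Proportion of successful probes faster than the threshold."""
    ok = [p for p in probes if p.succeeded and p.latency_ms is not None]
    if not ok:
        return 1.0
    return sum(1 for p in ok if p.latency_ms < threshold_ms) / len(ok)


def evaluate(probes: list[ProbeResult]) -> dict[str, bool]:
    """Compare both SLIs with the objectives from the slide (99.5% and 99%)."""
    return {
        "availability": availability_sli(probes) >= 0.995,
        "latency": latency_sli(probes) >= 0.99,
    }
```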
  20. 39 Timeline
      - 2017: ❤
      - 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
      - 2018: Energy team migration begins; questions arise around the platform’s service levels
      - 2018: Cloud introduces their first Service Level Objectives for the platform
      - 2018: Cloud extends SLO coverage to additional parts of the platform
  21. 41 Diagram: Cloud Infrastructure; Mobiles, Financial, Broadband, Energy (each with ⚙ Services and Data); SRE
  22. 45 SREs at RVU work alongside our platform and product teams to adopt modern operational practices
  23. 46 We help teams adopt SRE practices to better define the relationship between them and their customers
  24. 48 Timeline
      - 2017: ❤
      - 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
      - 2018: Energy team migration begins; questions arise around the platform’s service levels
      - 2018: Cloud introduces their first Service Level Objectives for the platform
      - 2018: Cloud extends SLO coverage to additional parts of the platform
      - 2019: The SRE team is formally announced
  25. 51 SLO Controller for everyone
      - Define a set of SLOs as Metrics, Selectors and Objectives
      - Creates Prometheus recording rules to calculate values
      - Generates dashboards on Grafana for teams to view
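To illustrate how a metric, a set of selectors and an objective might be turned into a Prometheus recording rule, here is a small sketch. It is not the SLO Controller's real implementation or schema: the function names, the `slo:<name>:ratio_rate<window>` naming, the `good_filter` parameter and the example metric are all assumptions.

```python
def selector_string(selectors: dict[str, str]) -> str:
    """Render label selectors as PromQL matchers, sorted for stable output."""
    return ",".join(f'{key}="{value}"' for key, value in sorted(selectors.items()))


def recording_rule(
    name: str,
    metric: str,
    selectors: dict[str, str],
    good_filter: str,
    objective: float,
    window: str = "5m",
) -> dict:
    """Build a recording rule for the ratio of good events to all events."""
    base = selector_string(selectors)
    good = f"{base},{good_filter}" if base else good_filter
    expr = (
        f"sum(rate({metric}{{{good}}}[{window}])) / "
        f"sum(rate({metric}{{{base}}}[{window}]))"
    )
    return {
        "record": f"slo:{name}:ratio_rate{window}",
        "expr": expr,
        # Carry the objective as a label so dashboards can draw the target line.
        "labels": {"objective": str(objective)},
    }
```

A call such as `recording_rule("availability", "http_requests_total", {"service": "energy-web"}, 'code!~"5.."', 0.99)` yields a dict that can be serialised into a Prometheus rule file.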
  26. 54 Teams saw the value of SRE practices, especially after being shown their current performance using SLOs
  27. 55 Teams that have adopted SRE practices show a bias towards talking in relative terms about reasonable reliability
  28. 57 Fastly metrics for Prometheus
      - Add syslog output to Fastly
      - Add JS to log performance metrics to an endpoint in Fastly
      - Ingest logs for services
      - Create metrics for service-side and client-side performance
  29. 61 Incident bot for those bad days
      - Ask it to start an incident with `/incident`
      - Creates a channel in Slack
      - Invites people
      - Generates a report when you tell it you’re done
  30. 64
      - Automated alerts on error budget burn rates
      - Time remaining until error budget depletion
      - Escalation policy from service classifications
      - Automated load testing platform
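The first two bullets rest on simple arithmetic, sketched below under common SRE conventions: the error budget is `1 - objective`, and a burn rate of 1 means the budget would be used up exactly over the SLO window. The function names and the 30-day default window are assumptions, not details from the talk.

```python
def error_budget(objective: float) -> float:
    """Allowed fraction of bad events for a given objective."""
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be between 0 and 1 (exclusive)")
    return 1.0 - objective


def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / error_budget(objective)


def hours_until_depletion(
    budget_remaining: float,  # fraction of the budget still unspent, 0..1
    rate: float,              # current burn rate
    window_hours: float = 30 * 24,
) -> float:
    """Hours until the remaining budget is gone at the current burn rate."""
    if rate <= 0:
        return float("inf")  # not burning budget at all
    return budget_remaining * window_hours / rate
```

For example, with a 99% objective a sustained 2% error ratio is a burn rate of about 2, and with half the budget left over a 30-day window that leaves roughly 180 hours.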
  31. 65
      - Using excess error budget for load and chaos tests in production
      - Error budget and SLOs for non-technical solutions
      - Discovering new measurements for SLIs
  32. 71 And finally, know that SRE is hard to get right, but we can find ways to make it simpler to adopt