The SRE journey at RVU

Google’s Site Reliability Engineering books lay out the principles and practices of SRE and the workbooks provide great practical examples of implementing these practices. However, anyone who has tried to roll out such practices across an organisation will no doubt have run into some hurdles.

In this talk we will dig into how RVU got started on its SRE journey, from what prompted the initial discussions to how we rolled out new tooling to automate away some of the pains of adoption. We’ll cover the interfaces we built to engage with teams and the other possibilities we see for the future of our SRE automation journey.

- Data-driven conversations and visibility are the key ingredients in winning over teams and ultimately operating systems more reliably
- Rolling out SRE practices is difficult, so find ways to make adoption simpler for everyone involved
- Automation is for everyone and should reach far beyond just deployments and infrastructure

Dewald Viljoen

April 08, 2020

Transcript

  1. The SRE journey at RVU Dewald Viljoen Lead SRE @dewald_v

  2. Who exactly is RVU?

  3. Uswitch is the UK’s top comparison website for home services switching. Money is one of the UK’s leading comparison websites for financial services. Bankrate is changing the mortgage market. We create innovative and personalised products that help you make smarter decisions. You know us better than you think.
  4. The journey begins

  5. Uswitch is running on ECS

  6. [Diagram: Broadband, Mobiles, Energy and Financial, each with its own services and data]

  7. [Diagram: Cloud Infrastructure alongside Broadband, Mobiles, Energy and Financial, each with its own services and data]

  8. [Diagram: Cloud Infrastructure alongside Financial, Broadband, Mobiles and Energy, each with its own services and data]

  9. As more teams on-boarded we saw opportunities to simplify operations
  10. Heimdall the alerter https://github.com/uswitch/heimdall
    • Define alert templates
    • Watches for annotations on Ingress objects
    • Creates alerts using the templates from those annotations
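The flow on the Heimdall slide (templates, annotations on Ingress objects, alerts rendered from the two) can be sketched roughly as follows. This is an illustration in Python, not Heimdall's own code: the annotation prefix `example.com/alert-`, the template names and the alert wording are all hypothetical and are not taken from the Heimdall repository.

```python
from string import Template

# Hypothetical annotation prefix; Heimdall's real annotation keys may differ.
ANNOTATION_PREFIX = "example.com/alert-"

# Hypothetical alert templates, keyed by the annotation suffix that enables them.
ALERT_TEMPLATES = {
    "5xx-rate": Template(
        "rate of 5xx responses for ingress $namespace/$name above $threshold"
    ),
    "latency-ms": Template(
        "p99 latency for ingress $namespace/$name above $threshold ms"
    ),
}


def alerts_for_ingress(ingress: dict) -> list[str]:
    """Render one alert description per recognised annotation on an Ingress."""
    meta = ingress["metadata"]
    alerts = []
    for key, threshold in sorted(meta.get("annotations", {}).items()):
        if not key.startswith(ANNOTATION_PREFIX):
            continue  # not an alerting annotation
        template = ALERT_TEMPLATES.get(key[len(ANNOTATION_PREFIX):])
        if template is None:
            continue  # the annotation names a template we do not know
        alerts.append(
            template.substitute(
                namespace=meta["namespace"],
                name=meta["name"],
                threshold=threshold,
            )
        )
    return alerts
```

The point of the design is that a team only adds an annotation with a threshold to its own Ingress; the alert definition itself lives in one shared template.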
  11. Heimdall the alerter https://github.com/uswitch/heimdall

  12. This was the start of a new interface between teams, infrastructure and operations

  13. [Timeline] 2017 ❤

  14. The great migration

  15. Accelerated by the GDPR, more teams on-board onto the Kube platform

  16. [Diagram: Energy, Broadband, Mobiles, Financial and Cloud Infrastructure, with four sets of services and data]

  17. GDPR provided another good reason to expand our interfaces between teams
  18. Vault the rotator https://github.com/uswitch/vault-webhook
    • Define a binding between a service account and a database
    • Inject sidecars into pods
    • Sidecars fetch credentials and keep them refreshed
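The binding-then-inject step on the Vault slide can be pictured with a small sketch. It is written in Python for illustration only and does not reflect how vault-webhook is implemented; the binding table, container name, image, flags and mount path below are all hypothetical.

```python
import copy

# Hypothetical bindings: Kubernetes service account name -> database role.
BINDINGS = {"energy-api": "energy-db-readonly"}


def inject_sidecar(pod: dict, bindings: dict) -> dict:
    """Return a copy of the pod with a credentials sidecar added when its
    service account has a database binding; otherwise return it unchanged."""
    database = bindings.get(pod["spec"].get("serviceAccountName", ""))
    if database is None:
        return pod

    mutated = copy.deepcopy(pod)
    spec = mutated["spec"]
    # Shared in-memory volume where the sidecar writes refreshed credentials.
    spec.setdefault("volumes", []).append(
        {"name": "db-creds", "emptyDir": {"medium": "Memory"}}
    )
    spec["containers"].append(
        {
            "name": "vault-creds",  # hypothetical sidecar name
            "image": "example.com/vault-creds:latest",  # hypothetical image
            "args": ["--database-role=" + database, "--out=/creds/db.json"],
            "volumeMounts": [{"name": "db-creds", "mountPath": "/creds"}],
        }
    )
    return mutated
```

In a real cluster this decision would be made by a mutating admission webhook at pod creation time; the sketch only shows the lookup and the mutation.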
  19. Vault the rotator https://github.com/uswitch/vault-webhook

  20. [Timeline]
    • 2017 ❤
    • 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
  21. The last migration

  22. The biggest of our businesses was yet to on-board

  23. [Diagram: Energy, Broadband, Mobiles, Financial and Cloud Infrastructure, with four sets of services and data]

  24. Our Energy business had a history of managing their own platform very effectively
  25. Energy: We’re seeing some performance problems
    Cloud: Odd, no one else has raised anything
    Energy: We’ve tested our app and compared it to our existing setup and performance is definitely worse
    Cloud: Everything seems ok on our end

  26. Two problems
    1. Energy expected performance on the shared platform to match their existing infrastructure
    2. Cloud had no indicator for the expected performance level and whether it was being met

  27. [Timeline]
    • 2017 ❤
    • 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
    • 2018: Energy team migration begins; questions arise around the platform’s service levels
  28. The first Service Level Objectives emerge

  29. Two problems
    1. Energy expected performance on the shared platform to match their existing infrastructure
    2. Cloud had no indicator for the expected performance level and whether it was being met

  30. Cloud needed a way of defining, measuring and publishing their service levels

  31. 1. Define a Service Level Indicator (SLI) to specify the desired reliability
    2. Set a Service Level Objective (SLO) for this indicator
  32. SLI: Proportion of requests to a dummy application that respond in < 20ms
    SLO: 99%
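As a rough illustration of how an indicator of this shape is evaluated, the sketch below computes the proportion of requests under the 20ms threshold and compares it with the 99% objective. The function names and the list-of-latencies input are assumptions for the example; the talk does not say how Cloud actually collected or stored its measurements.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 20.0) -> float:
    """Proportion of requests that responded in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, objective: float = 0.99) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= objective


# Made-up sample: 995 fast responses and 5 slow ones.
sample = [12.0] * 995 + [35.0] * 5
sli = latency_sli(sample)
print(f"SLI = {sli:.3f}, meets 99% objective: {meets_slo(sli)}")
```

Expressing the indicator as a ratio of good events to all events keeps it comparable across services with very different traffic volumes.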
  33. Cloud started measuring the reliability indicator and found it was around 86%

  34. With this measured they put a plan in motion to bring the service level up to the objective of 99%

  35. • The Energy team now knows what to expect
    • Cloud now has a defined service level that allows it to prioritise reliability work
    • As long as Cloud meets or exceeds its objective it is free to pursue other tasks

  36. [Timeline]
    • 2017 ❤
    • 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
    • 2018: Energy team migration begins; questions arise around the platform’s service levels
    • 2018: Cloud introduces their first Service Level Objectives for the platform

  37. With the value of the first SLOs proven, Cloud continued the roll-out to cover more of the platform
  38. Cluster DNS
    • Availability. SLI: proportion of successful requests as measured by DNS probes. SLO: 99.5%
    • Latency. SLI: latency of DNS requests as measured by DNS probes < 45ms. SLO: 99%
    Logging
    • Freshness. SLI: average time lag (current time - timestamp on newest log) as measured across the core indices < 1 minute. SLO: 99%
  39. [Timeline]
    • 2017 ❤
    • 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
    • 2018: Energy team migration begins; questions arise around the platform’s service levels
    • 2018: Cloud introduces their first Service Level Objectives for the platform
    • 2018: Cloud extends SLO coverage to additional parts of the platform
  40. The birth of SRE @ RVU

  41. [Diagram: Cloud Infrastructure with Mobiles, Financial, Broadband and Energy, each with its own services and data, plus SRE]

  42. Is SRE just another team then?

  43. Yes and no

  44. Teams still build, own and run their services and applications

  45. SREs at RVU work alongside our platform and product teams to adopt modern operational practices

  46. We help teams adopt SRE practices to better define the relationship between them and their customers

  47. Maximise change velocity without breaching service level objectives

  48. [Timeline]
    • 2017 ❤
    • 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
    • 2018: Energy team migration begins; questions arise around the platform’s service levels
    • 2018: Cloud introduces their first Service Level Objectives for the platform
    • 2018: Cloud extends SLO coverage to additional parts of the platform
    • 2019: The SRE team is formally announced
  49. Automated operations for everyone

  50. Remember that interface between teams that Cloud kept expanding?

  51. SLO Controller for everyone
    • Define a set of SLOs as Metrics, Selectors and Objectives
    • Creates Prometheus recording rules to calculate values
    • Generates dashboards on Grafana for teams to view
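To make the "Metrics, Selectors and Objectives" idea concrete, here is a hedged Python sketch that turns one such definition into a Prometheus recording rule. The SLO dictionary layout, the `good` matcher field, the rule naming scheme and the metric `http_requests_total` are assumptions made for this example; the controller's real resource format is not shown in the slides.

```python
def _matchers(selector: dict, extra: str = "") -> str:
    """Build a PromQL label matcher block such as {job="web",code!~"5.."}."""
    parts = [f'{key}="{value}"' for key, value in sorted(selector.items())]
    if extra:
        parts.append(extra)
    return "{" + ",".join(parts) + "}"


def recording_rule(slo: dict, window: str = "5m") -> dict:
    """Turn one SLO definition into a Prometheus recording rule that
    records the ratio of good requests to all requests."""
    metric = slo["metric"]
    good_series = metric + _matchers(slo["selector"], slo["good"])
    all_series = metric + _matchers(slo["selector"])
    return {
        "record": f"slo:{slo['name']}:ratio_rate{window}",
        "expr": (
            f"sum(rate({good_series}[{window}]))"
            f" / sum(rate({all_series}[{window}]))"
        ),
        "labels": {"objective": str(slo["objective"])},
    }


# Hypothetical SLO definition for a service called "web".
web_availability = {
    "name": "web_availability",
    "metric": "http_requests_total",
    "selector": {"job": "web"},
    "good": 'code!~"5.."',  # requests that did not return a 5xx status
    "objective": 0.995,
}
print(recording_rule(web_availability)["expr"])
```

Recording the ratio once means dashboards and alerts can read a single cheap series instead of re-evaluating the full expression.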
  52. SLO Controller for everyone

  53. SLO Controller for everyone

  54. Teams saw the value of SRE practices, especially after being shown their current performance using SLOs

  55. Teams that have adopted SRE practices show a bias towards talking in relative terms about reasonable reliability

  56. Can we apply this even further, beyond server side?

  57. Fastly Metrics for Prometheus
    • Add syslog output to Fastly
    • Add JS to log performance metrics to an endpoint in Fastly
    • Ingest logs for services
    • Create service-side and client-side metrics
  58. Fastly Metrics for Prometheus

  59. Fastly Metrics for Prometheus

  60. What about when things go wrong?

  61. Incident bot for those bad days
    • Ask it to start a `/incident`
    • Creates a channel in Slack
    • Invites people
    • Generates a report when you tell it you’re done

  62. And that was just the beginning

  63. The future of SRE at RVU

  64. • Automated alerts on error budget burn rates
    • Time remaining until error budget depletion
    • Escalation policy from service classifications
    • Automated load testing platform
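The first two items on this slide rest on simple arithmetic, sketched below using the usual definitions: burn rate is the observed error ratio divided by the error budget (1 minus the objective), and at a burn rate of 1 the budget lasts exactly one SLO window. The function names and the 30-day window are assumptions for the example, not details from the talk.

```python
def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    budget = 1.0 - objective
    if budget <= 0:
        raise ValueError("objective must be below 1.0 to leave an error budget")
    return error_ratio / budget


def hours_until_depleted(
    budget_remaining: float, rate: float, window_hours: float = 30 * 24
) -> float:
    """Hours left before the budget runs out at the current burn rate.

    budget_remaining is the fraction of the window's budget still unspent (0..1).
    """
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return budget_remaining * window_hours / rate


# Made-up numbers: 2% of requests failing against a 99% objective,
# with half of a 30-day budget already spent.
rate = burn_rate(error_ratio=0.02, objective=0.99)
print(f"burn rate {rate:.1f}x, {hours_until_depleted(0.5, rate):.0f}h of budget left")
```

An alert on burn rate fires on how quickly reliability is degrading relative to the objective, which is why it pairs naturally with a "time remaining" estimate.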
  65. • Using excess error budget for load and chaos tests in production
    • Error budget and SLOs for non-technical solutions
    • Discovering new measurements for SLIs

  66. Open source as much as possible

  67. The end (for now)

  68. Don’t adopt SRE practices because it’s the next cool thing after Kube

  69. Do listen for those conversations where engineers feel or think that their systems are ok

  70. Automate as much as possible, far beyond just deployments and infrastructure

  71. And finally, know that SRE is hard to get right, but we can find ways to make it simpler to adopt
  72. Thank you! Dewald Viljoen Lead SRE @dewald_v