The SRE journey at RVU

The SRE journey at RVU Dewald Viljoen Lead SRE @dewald_v

2 Who exactly is RVU?

3 Uswitch is the UK’s top comparison website for home services switching.
Money is one of the UK’s leading comparison websites for financial services.
Bankrate is changing the mortgage market.
We create innovative and personalised products that help you make smarter decisions.
You know us better than you think

The journey begins

5 Uswitch is running on ECS

6 [Diagram: Broadband, Mobiles, Energy and Financial, each with its own Services and Data]

7 [Diagram: Cloud Infrastructure; Broadband, Mobiles, Energy and Financial, each with Services and Data]

8 [Diagram: Cloud Infrastructure; Financial, Broadband, Mobiles and Energy, each with Services and Data]

9 As more teams on-boarded we saw opportunities to simplify
operations

10 Heimdall the alerter
https://github.com/uswitch/heimdall
• Define alert templates
• Watches for annotations on Ingress objects
• Creates alerts using the templates from those annotations
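
The flow on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Heimdall's implementation: the annotation prefix `example.com/alert-`, the template text, the metric name `ingress_5xx_ratio` and the function name are all hypothetical.

```python
from string import Template

# Hypothetical annotation prefix, chosen for this sketch only.
ANNOTATION_PREFIX = "example.com/alert-"

# Hypothetical alert templates keyed by name; ${threshold} is filled
# from the annotation value found on the Ingress.
ALERT_TEMPLATES = {
    "5xx-rate": Template(
        "alert: ${namespace}_${ingress}_5xx_rate\n"
        "expr: ingress_5xx_ratio{ingress=\"${ingress}\"} > ${threshold}"
    ),
}


def alerts_for_ingress(ingress: dict) -> list[str]:
    """Render one alert per recognised annotation on an Ingress-like dict."""
    meta = ingress["metadata"]
    rendered = []
    for key, value in sorted(meta.get("annotations", {}).items()):
        if not key.startswith(ANNOTATION_PREFIX):
            continue  # not an alert annotation
        template = ALERT_TEMPLATES.get(key[len(ANNOTATION_PREFIX):])
        if template is None:
            continue  # annotation names a template we do not have
        rendered.append(template.substitute(
            namespace=meta["namespace"],
            ingress=meta["name"],
            threshold=value,
        ))
    return rendered
```

The point of the design is that a team only adds an annotation to its own Ingress; the templates stay with the platform team.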

12 This was the start of a new interface between
teams, infrastructure and operations

13 2017 ❤

The great migration

15 Accelerated by GDPR, more teams on-boarded onto the
Kube platform

16 [Diagram: Energy, Broadband, Mobiles, Financial, Cloud Infrastructure; Services and Data shown four times]

17 GDPR provided another good reason to expand our interfaces
between teams

18 Vault the rotator
https://github.com/uswitch/vault-webhook
• Define a binding between a service account and a database
• Inject sidecars into pods
• Sidecars fetch credentials and keep them refreshed
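
The "fetch credentials and keep them refreshed" step can be pictured as a small cache that re-fetches before a lease expires. This is a minimal sketch under stated assumptions, not the vault-webhook sidecar: `fetch` is a stand-in for whatever call obtains credentials, and the class name and the 50% refresh point are hypothetical choices.

```python
import time
from typing import Callable, Optional, Tuple

# (username, password, lease_seconds); a hypothetical shape for this sketch.
Credentials = Tuple[str, str, float]


class RefreshingCredentials:
    """Caches credentials and re-fetches them before the lease runs out."""

    def __init__(self,
                 fetch: Callable[[], Credentials],
                 refresh_fraction: float = 0.5,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                    # stand-in for the real lookup
        self._refresh_fraction = refresh_fraction
        self._clock = clock                    # injectable for testing
        self._current: Optional[Credentials] = None
        self._refresh_at = 0.0

    def get(self) -> Tuple[str, str]:
        """Return (username, password), refreshing if the lease is ageing."""
        now = self._clock()
        if self._current is None or now >= self._refresh_at:
            self._current = self._fetch()
            lease_seconds = self._current[2]
            # Refresh once a fixed fraction of the lease has elapsed.
            self._refresh_at = now + lease_seconds * self._refresh_fraction
        return self._current[0], self._current[1]
```

Refreshing well before expiry leaves room for a failed fetch to be retried while the old credentials are still valid.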

20 Timeline
• 2017: ❤
• 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced

The last migration

22 The biggest of our businesses was yet to on-board

23 [Diagram: Energy, Broadband, Mobiles, Financial, Cloud Infrastructure; Services and Data shown four times]

24 Our Energy business had a history of managing their
own platform very effectively

25
Energy: We’re seeing some performance problems
Cloud: Odd, no one else has raised anything
Energy: We’ve tested our app and compared it to our existing setup and performance is definitely worse
Cloud: Everything seems ok on our end

26 Two problems
1. Energy expected performance on the shared platform to match their existing infrastructure
2. Cloud had no indicator for the expected performance level and whether it was being met

27 Timeline
• 2017: ❤
• 2018: GDPR helps drive more teams’ migrations; Vault CRDs introduced
• 2018: Energy team migration begins; questions arise around the platform’s service levels

The first Service Level Objectives emerge

29 Two problems
1. Energy expected performance on the shared platform to match their existing infrastructure
2. Cloud had no indicator for the expected performance level and whether it was being met

30 Cloud needed a way of defining, measuring and publishing
their service levels

31 SLI and SLO
1. Define a Service Level Indicator (SLI) to specify the desired reliability
2. Set a Service Level Objective (SLO) for this indicator

32
SLI: Proportion of requests to a dummy application that respond in < 20ms
SLO: 99%
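
As a worked example, the SLI and SLO above amount to a ratio and a comparison. The sketch below assumes the latencies are available as a plain list of milliseconds; the function names are hypothetical, and in practice the ratio would more likely be computed by the monitoring system than in application code.

```python
def latency_sli(latencies_ms, threshold_ms=20.0):
    """Proportion of requests that responded in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, objective=0.99):
    """True when the measured indicator is at or above the objective."""
    return sli >= objective
```

With an indicator of 0.86 and an objective of 0.99, `meets_slo(0.86)` is false, which matches the gap between a measured level and the 99% objective.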

33 Cloud started measuring the reliability indicator and found it
was around 86%

34 With this measured they put a plan in motion
to bring the service level up to the objective of 99%

35
• The Energy team now knows what to expect
• Cloud now has a defined service level that allows it to prioritise reliability work
• As long as Cloud meets or exceeds its objective it is free to pursue other tasks

Timeline
• Vault CRDs introduced
• 2018: Energy team migration begins; questions arise around the platform’s service levels
• 2018: Cloud introduces their first Service Level Objectives for the platform

37 With the value of the first SLOs proven, Cloud
continued the rollout to cover more of the platform

38 Cluster DNS
Category | SLI | SLO
Availability | Proportion of successful requests as measured by DNS probes | 99.5%
Latency | Latency of DNS requests as measured by DNS probes < 45ms | 99%

Logging
Category | SLI | SLO
Freshness | Average time lag (current time - timestamp on newest log) as measured across the core indices < 1 minute | 99%

Timeline
• Vault CRDs introduced
• 2018: Energy team migration begins; questions arise around the platform’s service levels
• 2018: Cloud introduces their first Service Level Objectives for the platform
• 2018: Cloud extends SLO coverage to additional parts of the platform

The birth of SRE @ RVU

41 [Diagram: Mobiles, Financial, Cloud Infrastructure, Broadband, Energy and SRE; Services and Data shown four times]

42 Is SRE just another team then?

43 Yes and no

44 Teams still build, own and run their services and
applications

45 SREs at RVU work alongside our platform and product
teams to adopt modern operational practices

46 We help teams adopt SRE practices to better define
the relationship between them and their customers

47 Maximise change velocity without breaching service level objectives

Timeline
• Vault CRDs introduced
• 2018: Energy team migration begins; questions arise around the platform’s service levels
• 2018: Cloud introduces their first Service Level Objectives for the platform
• 2018: Cloud extends SLO coverage to additional parts of the platform
• 2019: The SRE team is formally announced

Automated operations for everyone

50 Remember that interface between teams that Cloud kept expanding?

51 SLO Controller for everyone
• Define a set of SLOs as Metrics, Selectors and Objectives
• Creates Prometheus recording rules to calculate values
• Generates dashboards on Grafana for teams to view
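
To make the "Metrics, Selectors and Objectives" idea concrete, here is a rough sketch of turning such a definition into a Prometheus recording rule for a latency SLI. The spec shape, the function name and the rule naming scheme are assumptions made for illustration; the talk does not show the controller's actual resource format. It also assumes the metric is a Prometheus histogram with a bucket boundary exactly at the threshold.

```python
def recording_rule(name, metric, selectors, threshold_seconds, window="5m"):
    """Build a recording rule for 'share of requests faster than threshold'.

    Assumes `metric` is a Prometheus histogram, so `<metric>_bucket` and
    `<metric>_count` series exist, and that a bucket boundary (`le`) equal
    to threshold_seconds has been configured.
    """
    labels = ",".join(f'{k}="{v}"' for k, v in sorted(selectors.items()))
    good_labels = ",".join(
        part for part in (labels, f'le="{threshold_seconds}"') if part
    )
    good = f"sum(rate({metric}_bucket{{{good_labels}}}[{window}]))"
    total = f"sum(rate({metric}_count{{{labels}}}[{window}]))"
    return {
        "record": f"slo:{name}:ratio_rate{window}",  # hypothetical naming
        "expr": f"{good} / {total}",
    }
```

For example, `recording_rule("dummy_latency", "http_request_duration_seconds", {"app": "dummy"}, 0.02)` yields an expression dividing the rate of requests in the `le="0.02"` bucket by the rate of all requests for that app.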

54 Teams saw the value of SRE practices, especially after
being shown their current performance using SLOs

55 Teams that have adopted SRE practices show a bias
towards talking in relative terms about reasonable reliability

56 Can we apply this even further, beyond server side?

57 Fastly Metrics for Prometheus
• Add syslog output to Fastly
• Add JS to log performance metrics to an endpoint in Fastly
• Ingest logs for services
• Create service-side and client-side metrics
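
The "ingest logs, create metrics" step can be illustrated with a small log-to-counter sketch. The log format here (space-separated `key=value` fields with `service` and `status`) is invented for the example; the talk does not describe the real Fastly log format or the exporter's internals.

```python
from collections import Counter


def parse_line(line):
    """Parse space-separated key=value fields; other tokens are ignored."""
    return dict(field.split("=", 1) for field in line.split() if "=" in field)


def count_requests(lines):
    """Count requests per (service, status class), e.g. ('energy', '5xx')."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        service = fields.get("service")
        status = fields.get("status")
        if service and status:
            counts[(service, status[0] + "xx")] += 1
    return counts
```

Counters like these are the raw material for availability SLIs: successful requests divided by total requests per service.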

60 What about when things go wrong?

61 Incident bot for those bad days
• Ask it to start an incident with `/incident`
• Creates a channel in Slack
• Invites people
• Generates a report when you tell it you’re done

62 And that was just the beginning

The future of SRE at RVU

64
• Automated alerts on error budget burn rates
• Time remaining until error budget depletion
• Escalation policy from service classifications
• Automated load testing platform
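
The first two items rest on simple arithmetic. The sketch below uses the common definition of burn rate (observed error ratio divided by the error budget, where a rate of 1 spends the budget exactly over the SLO window); the function names and the 30-day window are assumptions for illustration, not details from the talk.

```python
def burn_rate(error_ratio, objective):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - objective
    if budget <= 0:
        raise ValueError("objective must be below 1.0")
    return error_ratio / budget


def hours_until_depleted(budget_remaining_fraction, rate, window_hours=30 * 24):
    """Hours until the remaining budget is gone at the current burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return budget_remaining_fraction * window_hours / rate
```

For instance, with half the budget left and a burn rate of 2 over a 30-day window, `hours_until_depleted(0.5, 2.0)` gives 180 hours.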

65
• Using excess error budget for load and chaos tests in production
• Error budget and SLOs for non-technical solutions
• Discovering new measurements for SLIs

66 Open source as much as possible

The end (for now)

68 Don’t adopt SRE practices because it’s the next cool
thing after Kube

69 Do listen for those conversations where engineers feel or
think that their systems are ok

70 Automate as much as possible, far beyond just deployments
and infrastructure

71 And finally, know that SRE is hard to get
right, but we can find ways to make it simpler to adopt

Thank you! Dewald Viljoen Lead SRE @dewald_v
