Slide 1

Slide 1 text

Teemu Tossavainen
Retty Inc.
Ensuring highly available services
Infrastructure and service monitoring
Tech cafe
Retty SRE

Slide 2

Slide 2 text

Motivation Why?

Slide 3

Slide 3 text

How can we make people of the world happy?
Are our users happy?
What can we do to make them happier?

Slide 4

Slide 4 text

To provide new features rapidly while retaining the quality of service
I Goals
▸ Provide better quality and better service at the same time
  ▹ “To meet the requirements of a user”
▸ Improve reliability
  ▹ By reducing errors and downtime
II Ways
▸ Facilitate teamwork
  ▹ Common goals
  ▹ Better communication
▸ Lifecycle management and ownership
▸ Continuous improvement
▸ Monitor and measure

Slide 5

Slide 5 text

Infrastructure and service monitoring

Slide 6

Slide 6 text

Monitoring goals
▸ Facilitate information flow
▸ Provide accurate information about the system... But!
  ▹ Display the essential metrics that reflect the system state
▸ Human readability
  ▹ Big picture at one glance
  ▹ Trends
  ▹ Easily accessible details
▸ Retain information for analysis

Slide 7

Slide 7 text

Aspects of monitoring
Quality
- System transparency
- Performance analysis
- Better resource utilization
- Performance metrics
- End user experience
Recovery & error handling
- System awareness
- Early detection
- Alerting
- Disaster and error recovery
- Big picture
- Measures
- Trend
Improvement
- Feedback
- Information
- Experience
- KPI
- Error logs
- System awareness and understanding

Slide 8

Slide 8 text

Monitoring systems Others

Slide 9

Slide 9 text

Retty command center

Slide 10

Slide 10 text

So what is missing???

Slide 11

Slide 11 text

They are tools of active monitoring
Just like eyesight, active monitoring provides lots of information, but it is focused on what is in front of you...
Hearing is a remote sense that can alert you to events all around you.

Slide 12

Slide 12 text

Case: Retty SRE + PagerDuty + ...

Slide 13

Slide 13 text

PagerDuty
▸ 24-hour alerting service
▸ Multi-channel
  ▹ Integrations
▸ Teams
▸ Statistics
▸ Integrations
  ▹ Chat services, Slack
  ▹ Custom services
▸ Metrics
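As an illustration of the "custom services" integration point, here is a minimal sketch of how a monitoring script might build a trigger event for the PagerDuty Events API v2. The function names, routing key, host name, and alert text are hypothetical and do not come from the talk. The sketch only builds and serializes the event; actually sending it would mean an HTTP POST of the encoded body to the endpoint with a JSON content type.

```python
import json

# PagerDuty Events API v2 endpoint; events are POSTed here as JSON.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build a 'trigger' event dict (hypothetical helper, not from the talk)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,    # integration key of the target service
        "event_action": "trigger",     # opens (or re-triggers) an alert
        "payload": {
            "summary": summary,        # short human-readable description
            "source": source,          # affected host or component
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


def encode_event(event):
    """Serialize the event to the JSON bytes that would be POSTed."""
    return json.dumps(event).encode("utf-8")


if __name__ == "__main__":
    example = build_trigger_event(
        routing_key="EXAMPLE_ROUTING_KEY",   # placeholder, not a real key
        summary="Disk usage above 90%",
        source="web-01",
        severity="warning",
        dedup_key="disk-web-01",
    )
    print(EVENTS_URL)
    print(encode_event(example).decode("utf-8"))
```

Keeping the payload construction separate from the transport makes the event format easy to check without contacting the service.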

Slide 14

Slide 14 text

Schedules

Slide 15

Slide 15 text

Escalation rules
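The general idea behind escalation rules can be sketched as follows: an unacknowledged alert moves to the next responder once the current level's timeout has passed. This is a simplified model for illustration, not PagerDuty's implementation and not Retty's actual configuration. The responder names and timeouts are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    responder: str          # who is notified at this level
    timeout_minutes: int    # how long to wait before escalating further


def current_responder(levels, minutes_unacknowledged):
    """Return who should be notified after the alert has gone
    unacknowledged for the given number of minutes."""
    if not levels:
        raise ValueError("at least one escalation level is required")
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Past the final timeout: stay with the last level.
    return levels[-1].responder


# Hypothetical three-level policy.
POLICY = [
    EscalationLevel("primary on-call", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("team lead", 30),
]

if __name__ == "__main__":
    for minutes in (0, 15, 30, 120):
        print(minutes, current_responder(POLICY, minutes))
```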

Slide 16

Slide 16 text

Retty Infrastructure Retty monitoring + Retty team

Slide 17

Slide 17 text

How to facilitate SRE

Slide 18

Slide 18 text

Retty Infrastructure Retty monitoring +

Slide 19

Slide 19 text

DEMO

Slide 20

Slide 20 text

Playbook integration
▸ Contains the information for operations
▸ Seamless integration
▸ Provides support in critical situations
▸ Facilitates
  ▹ Faster response times
  ▹ Correct procedures
  ▹ Reduced human error
  ▹ Knowing who to contact for help
▸ Communal learning
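One way to picture playbook integration is a lookup that attaches the relevant procedure and contact to an alert before it reaches the responder. The sketch below assumes a simple in-memory mapping. The check names, steps, and contacts are hypothetical, and the talk does not describe how the real integration is implemented.

```python
# Hypothetical playbook: check name -> procedure and contact.
PLAYBOOK = {
    "db-replication-lag": {
        "steps": [
            "Check replica status on the database dashboard",
            "Restart replication if the replica thread has stopped",
        ],
        "contact": "database on-call",
    },
    "web-5xx-rate": {
        "steps": [
            "Check recent deployments",
            "Roll back the latest release if errors started after it",
        ],
        "contact": "application on-call",
    },
}

# Fallback used when no playbook entry exists for a check.
DEFAULT_ENTRY = {
    "steps": ["No playbook entry yet: investigate and document the fix"],
    "contact": "SRE team",
}


def annotate_alert(alert):
    """Return a copy of the alert with its playbook entry attached."""
    entry = PLAYBOOK.get(alert["check"], DEFAULT_ENTRY)
    return {**alert, "playbook": entry}


if __name__ == "__main__":
    alert = {"check": "db-replication-lag", "host": "db-02"}
    annotated = annotate_alert(alert)
    print(annotated["playbook"]["contact"])
    for step in annotated["playbook"]["steps"]:
        print("-", step)
```

The default entry asks the responder to document the fix, which is one way such a lookup could support the communal learning mentioned on the slide.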

Slide 21

Slide 21 text

Results
▸ We have the right information at the right time
▸ We facilitate Retty SRE
▸ We improve response times and advocate early response
▸ We facilitate teamwork, proficiency, and ownership
▸ Communal learning
▸ Better service and reliability
▸ Happier users