
2024-07-03 Eliminating toil with LLM

Naka Masato
July 03, 2024

Transcript

  1. naka, Mercari Hallo SRE TL • Joined Mercari as an SRE in May 2022. • Moved to Mercari Hallo in June 2023.
  2. Agenda: 1. About Mercari Hallo 2. Lots of alerts 3. Alert response is a toil 4. Approaches 5. Future Work
  3. (image slide)

  4. (image slide)

  5. SLO-based alerts are not perfect yet. SLO-based alerts couldn’t catch low-volume but important errors, so we set threshold-based informative alerts. As a result, ...
  6. Too many non-paging alerts. We have many informative non-paging alerts, and they are tough on on-call members. 1. All developers are on-call members. 2. Time-consuming and exhausting.
  7. Alert response is a toil. 1. Manual: yes! 2. Repetitive: yes! 3. Automatable: to some extent, yeah! 4. Tactical: yesssssssss! 5. No enduring value: sometimes yes! Unless you find a bug to fix, which stops the alert from firing next time! Let’s eliminate it!!!!!!
  8. Approaches: 1. Improve alerts a. Reduce/adjust alerts to make all the alerts actionable. b. User journey-based SLO alerts (WIP) 2. Automate alert response
  9. Improvements: 1. Reduce manual and repetitive work a. Use the same filter condition for the same alert. b. The patterns of alerts are limited. c. Refer to previous alert responses: i. “This case can be ignored.” ii. “This alert is the same as this one.” iii. “I'll see if it recurs.” 2. Accumulate knowledge a. A thread in the alert channel can be very valuable and informative, but it tends to get lost.
  10. Reduce manual and repetitive work: 1. Rule-based auto-alert-response Slack bot 2. Generate filter conditions a. Time range (from 30 minutes before the event to the event timestamp) b. App name, e.g. backend c. Module name, e.g. graphql_server d. etc. 3. Search logs/traces a. Error count b. Top 5 gRPC methods and GraphQL queries/mutations c. Links New on-call members can easily know what to check when receiving an alert!
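The filter-generation step above can be sketched in Python. This is a minimal, illustrative version: the slide only names the inputs (a time range, an app name, and a module name), so the `alert` dictionary fields and the query syntax here are assumptions, not the bot's actual implementation.

```python
from datetime import datetime, timedelta

def build_log_filter(alert: dict) -> str:
    """Build a log filter string for an alert.

    The alert fields (timestamp, app, module) and the query syntax
    are hypothetical; the slide only lists which inputs are used.
    """
    event_time = datetime.fromisoformat(alert["timestamp"])
    start = event_time - timedelta(minutes=30)  # 30 minutes before the event
    conditions = [
        f'timestamp >= "{start.isoformat()}"',
        f'timestamp <= "{event_time.isoformat()}"',
        f'app = "{alert["app"]}"',        # e.g. backend
        f'module = "{alert["module"]}"',  # e.g. graphql_server
    ]
    return " AND ".join(conditions)

alert = {"timestamp": "2024-07-03T10:00:00", "app": "backend", "module": "graphql_server"}
print(build_log_filter(alert))
```

Because the same alert always produces the same filter, the bot can post this query (or a pre-filtered log link) into the alert thread automatically.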
  11. Accumulate knowledge: 1. Human response: store the thread with the human replies. 2. Second alert: search the past human responses for the same alert.
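The two steps above amount to a keyed store of Slack threads. A minimal sketch, assuming identical alerts can be matched by a shared policy name (the slide does not say how alerts are matched, so that key is an assumption):

```python
# In-memory stand-in for wherever the bot persists past threads.
knowledge_base: dict[str, list[list[str]]] = {}

def store_thread(policy_name: str, replies: list[str]) -> None:
    """First alert: store the human replies from its Slack thread."""
    knowledge_base.setdefault(policy_name, []).append(replies)

def past_responses(policy_name: str) -> list[list[str]]:
    """Second alert: look up past human responses for the same alert."""
    return knowledge_base.get(policy_name, [])

store_thread("backend-error-rate", ["This case can be ignored.", "I'll see if it recurs."])
print(past_responses("backend-error-rate"))
```

In practice the threads would live in a database rather than memory, but the lookup-by-alert-identity shape is the point: the second occurrence of an alert surfaces what humans said the first time.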
  12. WIP: Summarize past responses with an LLM. As-is: it still requires a human to read and interpret them. To-be: the LLM summarizes the past responses: 1. Status: Observing, Ticket Created, WIP, etc. 2. Error: the error shared on Slack, if it exists 3. Link: As-Is To-Be (sample)
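The to-be flow above hinges on asking an LLM for exactly those three fields. A sketch of the prompt-assembly half (the wording and the sample thread text are illustrative; the slide only names the fields, and the actual model call is omitted):

```python
def build_summary_prompt(past_threads: list[str]) -> str:
    """Assemble an LLM prompt asking for the three fields on the slide.

    The prompt wording is an assumption; only the Status/Error/Link
    structure comes from the slide.
    """
    joined = "\n---\n".join(past_threads)
    return (
        "Summarize the past alert-response threads below into three fields:\n"
        "1. Status: Observing, Ticket Created, WIP, etc.\n"
        "2. Error: the error message shared on Slack, if any.\n"
        "3. Link: any ticket or dashboard link mentioned.\n\n"
        f"Threads:\n{joined}"
    )

prompt = build_summary_prompt(
    ["Alert fired; ticket TICKET-123 created.", "Observing; no recurrence so far."]
)
print(prompt)
```

The returned string would then be sent to whichever LLM API the bot uses, and the structured answer posted into the new alert's thread instead of the raw past threads.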
  13. Future Work Other than summarizing human responses, there are many things to do! 1. Automate on-call handover generation. 2. Automate playbook/runbook (placeholder) generation based on Slack conversations. 3. etc. LLM is like a seasoning for me: it’s not always the main dish (work), but it can make it more delicious (fun).
  14. We’re looking for new SRE members! Software Engineer, Site Reliability - Mercari / new business in the HR domain (Mercari Hallo)