Slide 1

Eliminating toil with LLM
2024.07.03 Mercari x Treasure Data SRE Meetup
by @naka

Slide 2

naka, Mercari Hallo SRE TL
● Joined Mercari as an SRE in May 2022.
● Moved to Mercari Hallo in June 2023.

Slide 3

Agenda
1. About Mercari Hallo
2. Lots of alerts
3. Alert response is a toil
4. Approaches
5. Future Work

Slide 4


Slide 5


Slide 6

Behind the successful start...

Slide 7

SLO-based alerts are not perfect yet
SLO-based alerts couldn’t catch low-volume but important errors, so we set threshold-based informative alerts. As a result...

Slide 8

Too many non-paging alerts
We have many informative non-paging alerts, and responding to them is tough for on-call members.
1. All developers are on-call members.
2. Time-consuming and exhausting.

Slide 9

Alert response is a toil
1. Manual: yes!
2. Repetitive: yes!
3. Automatable: to some extent, yeah!
4. Tactical: yesssssssss!
5. No enduring value: sometimes yes! Unless you find a bug to fix, which stops the alert from firing next time.
Let’s eliminate it!!!!!!

Slide 10

Approaches
1. Improve alerts
   a. Reduce/adjust alerts to make all the alerts actionable.
   b. User journey-based SLO alerts (WIP)
2. Automate alert response

Slide 11

Improvements
1. Reduce manual and repetitive work
   a. Use the same filter condition for the same alert.
   b. The patterns of alerts are limited.
   c. Refer to previous alert responses:
      i. “This case can be ignored.”
      ii. “This alert is the same as this one.”
      iii. “I'll see if it recurs.”
2. Accumulate knowledge
   a. A thread in the alert channel can be very valuable and informative, but it tends to get lost.

Slide 12

Reduce manual and repetitive work
1. Rule-based auto alert response Slack bot
2. Generate filter conditions
   a. Time range (from 30 mins before the event up to the event timestamp)
   b. App name, e.g. backend
   c. Module name, e.g. graphql_server
   d. etc.
3. Search logs/traces
   a. Error count
   b. Top 5 gRPC methods, GraphQL queries/mutations
   c. Link
New on-call members can easily know what to check when receiving an alert!
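The filter-generation step above could be sketched roughly as follows. This is a minimal illustration, not the actual bot: the alert field names (`event_time`, `app`, `module`) and the label-based filter syntax are assumptions for the example.

```python
from datetime import datetime, timedelta

def build_log_filter(alert: dict) -> str:
    """Build a log-search filter string from alert metadata.

    Hypothetical sketch: field names and filter syntax are
    illustrative, not Mercari's actual schema.
    """
    event_time = datetime.fromisoformat(alert["event_time"])
    # Search window: 30 minutes before the event, up to the event itself
    start = event_time - timedelta(minutes=30)
    clauses = [
        f'timestamp>="{start.isoformat()}"',
        f'timestamp<="{event_time.isoformat()}"',
        'severity>=ERROR',
        f'labels.app="{alert["app"]}"',        # e.g. backend
        f'labels.module="{alert["module"]}"',  # e.g. graphql_server
    ]
    return " AND ".join(clauses)

# Hypothetical alert payload
alert = {"event_time": "2024-07-03T10:00:00",
         "app": "backend", "module": "graphql_server"}
print(build_log_filter(alert))
```

Because the bot derives the filter deterministically from the alert payload, every occurrence of the same alert gets the same search, which is exactly the "same filter condition for the same alert" idea from the previous slide.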

Slide 13

Accumulate Knowledge
1. Human response: store the thread with human replies.
2. Second alert: search the past human responses to the same alert.
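The two steps above can be sketched as a small store keyed by alert name. This is an assumption-heavy sketch: a real bot would persist the threads (e.g. in a database) and fetch replies via the Slack API, and the alert name and replies below are made up for illustration.

```python
# In-memory knowledge store: alert name -> list of past reply threads.
past_responses: dict[str, list[list[str]]] = {}

def store_thread(alert_name: str, replies: list[str]) -> None:
    """Step 1: save the human replies from an alert's Slack thread."""
    if replies:  # only keep threads where a human actually responded
        past_responses.setdefault(alert_name, []).append(replies)

def find_past_responses(alert_name: str) -> list[list[str]]:
    """Step 2: on a repeat alert, look up how humans handled it before."""
    return past_responses.get(alert_name, [])

# Hypothetical usage
store_thread("high-error-rate-backend",
             ["This case can be ignored.", "I'll see if it recurs."])
print(find_past_responses("high-error-rate-backend"))
```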

Slide 14

WIP: Summarize past responses with an LLM
As-Is: still requires a human to read and interpret them.
To-Be: the LLM summarizes the past responses:
1. Status: Observing, Ticket Created, WIP, etc.
2. Error: the error shared on Slack, if any
3. Link:

Slide 15

Future Work
Beyond summarizing human responses, there are many things to do!
1. Automate on-call handover generation.
2. Automate playbook/runbook (placeholder) generation based on Slack conversations.
3. etc.
LLM is like a seasoning for me. It’s not always the main dish (work), but it can make it more delicious (fun).

Slide 16

We’re looking for new SRE members!
Software Engineer, Site Reliability - Mercari's new HR-domain business (Mercari Hallo)

Slide 17

Thank you!