
2024-07-03 Eliminating toil with LLM

Naka Masato
July 03, 2024

Transcript

  1. naka, Mercari Hallo SRE TL • Joined Mercari as an SRE in May 2022. • Moved to Mercari Hallo in June 2023.
  2. Agenda: 1. About Mercari Hallo 2. Lots of alerts 3. Alert response is a toil 4. Approaches 5. Future Work
  3. (image slide)

  4. (image slide)

  5. SLO-based alerts are not perfect yet. SLO-based alerts couldn’t catch low-volume but important errors, so we set threshold-based informative alerts. As a result, ...
  6. Too many non-paging alerts. We have many informative non-paging alerts, and they are tough on on-call members. 1. All developers are on-call members. 2. Time-consuming and exhausting.
  7. Alert response is a toil. 1. Manual: yes! 2. Repetitive: yes! 3. Automatable: to some extent, yeah! 4. Tactical: yesssssssss! 5. No enduring value: sometimes yes! Unless you find a bug to fix, which stops the alert from firing next time! Let’s eliminate it!!!!!!
  8. Approaches: 1. Improve alerts a. Reduce/adjust alerts to make all the alerts actionable. b. User journey-based SLO alerts (WIP) 2. Automate alert response
  9. Improvements: 1. Reduce manual and repetitive work a. Use the same filter condition for the same alert. b. The patterns of alerts are limited. c. Refer to previous alert responses: i. “This case can be ignored.” ii. “This alert is the same as this one.” iii. “I'll see if it recurs.” 2. Accumulate knowledge a. A thread in the alert channel can be very valuable and informative, but it tends to get lost.
  10. Reduce manual and repetitive work: 1. Rule-based auto-alert-response Slack bot 2. Generate filter conditions a. Time range (from 30 minutes before the event to the event timestamp) b. App name, e.g. backend c. Module name, e.g. graphql_server d. etc. 3. Search logs/traces a. Error count b. Top 5 gRPC methods and GraphQL queries/mutations c. Links New on-call members can easily know what to check when receiving an alert!
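The filter-generation step above can be sketched in Python. This is a minimal, illustrative version: the slide only names the inputs (a time range, an app name, and a module name), so the `alert` dictionary fields and the query syntax here are assumptions, not the bot's actual implementation.

```python
from datetime import datetime, timedelta

def build_log_filter(alert: dict) -> str:
    """Build a log filter string for an alert.

    The alert fields (timestamp, app, module) and the query syntax
    are hypothetical; the slide only lists which inputs are used.
    """
    event_time = datetime.fromisoformat(alert["timestamp"])
    start = event_time - timedelta(minutes=30)  # 30 minutes before the event
    conditions = [
        f'timestamp >= "{start.isoformat()}"',
        f'timestamp <= "{event_time.isoformat()}"',
        f'app = "{alert["app"]}"',        # e.g. backend
        f'module = "{alert["module"]}"',  # e.g. graphql_server
    ]
    return " AND ".join(conditions)

alert = {"timestamp": "2024-07-03T10:00:00", "app": "backend", "module": "graphql_server"}
print(build_log_filter(alert))
```

Because the same alert always produces the same filter, the bot can post this query (or a pre-filtered log link) into the alert thread automatically.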
  11. Accumulate knowledge: 1. Human response: store the thread with the human replies. 2. Second alert: search the past human responses for the same alert.
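The two steps above amount to a keyed store of Slack threads. A minimal sketch, assuming identical alerts can be matched by a shared policy name (the slide does not say how alerts are matched, so that key is an assumption):

```python
# In-memory stand-in for wherever the bot persists past threads.
knowledge_base: dict[str, list[list[str]]] = {}

def store_thread(policy_name: str, replies: list[str]) -> None:
    """First alert: store the human replies from its Slack thread."""
    knowledge_base.setdefault(policy_name, []).append(replies)

def past_responses(policy_name: str) -> list[list[str]]:
    """Second alert: look up past human responses for the same alert."""
    return knowledge_base.get(policy_name, [])

store_thread("backend-error-rate", ["This case can be ignored.", "I'll see if it recurs."])
print(past_responses("backend-error-rate"))
```

In practice the threads would live in a database rather than memory, but the lookup-by-alert-identity shape is the point: the second occurrence of an alert surfaces what humans said the first time.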
  12. WIP: Summarize past responses with an LLM. As-is: it still requires a human to read and interpret them. To-be: the LLM summarizes the past responses: 1. Status: Observing, Ticket Created, WIP, etc. 2. Error: the error shared on Slack, if it exists 3. Link: As-Is To-Be (sample)
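The to-be flow above hinges on asking an LLM for exactly those three fields. A sketch of the prompt-assembly half (the wording and the sample thread text are illustrative; the slide only names the fields, and the actual model call is omitted):

```python
def build_summary_prompt(past_threads: list[str]) -> str:
    """Assemble an LLM prompt asking for the three fields on the slide.

    The prompt wording is an assumption; only the Status/Error/Link
    structure comes from the slide.
    """
    joined = "\n---\n".join(past_threads)
    return (
        "Summarize the past alert-response threads below into three fields:\n"
        "1. Status: Observing, Ticket Created, WIP, etc.\n"
        "2. Error: the error message shared on Slack, if any.\n"
        "3. Link: any ticket or dashboard link mentioned.\n\n"
        f"Threads:\n{joined}"
    )

prompt = build_summary_prompt(
    ["Alert fired; ticket TICKET-123 created.", "Observing; no recurrence so far."]
)
print(prompt)
```

The returned string would then be sent to whichever LLM API the bot uses, and the structured answer posted into the new alert's thread instead of the raw past threads.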
  13. Future Work Other than summarizing human responses, there are many things to do! 1. Automate on-call handover generation. 2. Automate playbook/runbook (placeholder) generation based on Slack conversations. 3. etc. LLM is like a seasoning for me: it’s not always the main dish (work), but it can make it more delicious (fun).
  14. We’re looking for new SRE members! Software Engineer, Site Reliability - Mercari / new business in the HR domain (Mercari Hallo)