Where are we going with observability? KubeDay Australia

This was a presentation of "Where are we going with observability?" at KubeDay Australia. It covered observability and GenAI separately, then both together in each direction: observing GenAI systems and applying GenAI to observability. The audience was engaged and tolerated my humor well, and I met several people afterward. I would love to go again!

https://events.linuxfoundation.org/kubeday-australia/

Adrian Cole

October 15, 2024

Transcript

  1. I'm Adrian from the Elastic Observability team; I mostly work on OpenTelemetry and am only a few months into GenAI programming.
  2. Agenda • Quick sync on Observability • Overview of GenAI • A quick look at GenAI Observability • A deeper look at Observability with GenAI
  3. OpenTelemetry is a CNCF project that addresses Observability in terms of specifications, SDKs and tools. The next slides cover traditional observability primitives and challenges.
  4. Traditional Observability focuses on data collection and primary signals • We normalize key metrics or correlation fields we care about most • We spend a lot of energy on instrumentation (a minimal sketch follows below) • We collect a lot of logs, but may struggle to analyze their impact
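     As a concrete reference, a minimal sketch of this kind of manual instrumentation using the OpenTelemetry Python SDK (the service name, span, metric and attribute here are illustrative, not from the talk):

         # pip install opentelemetry-sdk
         from opentelemetry import trace, metrics
         from opentelemetry.sdk.trace import TracerProvider
         from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

         trace.set_tracer_provider(TracerProvider())
         trace.get_tracer_provider().add_span_processor(
             SimpleSpanProcessor(ConsoleSpanExporter()))

         tracer = trace.get_tracer("checkout-service")  # hypothetical service
         meter = metrics.get_meter("checkout-service")
         orders = meter.create_counter("orders.processed")  # a key metric we normalize

         # One span per unit of work, carrying a correlation field we care about.
         with tracer.start_as_current_span("process_order") as span:
             span.set_attribute("order.id", "o-123")
             orders.add(1)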
  5. CNCF TAG Runtime includes the Cloud Native AI Working Group. This is an expert group you can follow for anything from landscape to solutions. The next slides cover basics of GenAI, including some wins and challenges.
  6. Generative Models generate new data based on known data • Causal language models are unidirectional: they predict words based on previous ones • Large Language Models (LLMs) can perform functions like chat, code completion, etc. The LLM can answer correctly if geography was part of the text it was trained on. User: Which ocean contains Bouvet Island? Assistant: Bouvet Island is located in the South Atlantic Ocean.
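     As a reference for the chat pattern on this and the following slides, a hedged sketch using the openai Python client (the model name is an assumption; any chat-capable model or OpenAI-compatible endpoint would do):

         # pip install openai
         from openai import OpenAI

         client = OpenAI()  # reads OPENAI_API_KEY from the environment
         resp = client.chat.completions.create(
             model="gpt-4o-mini",  # assumption: any chat-capable model
             messages=[{"role": "user",
                        "content": "Which ocean contains Bouvet Island?"}],
         )
         print(resp.choices[0].message.content)
         # e.g. "Bouvet Island is located in the South Atlantic Ocean."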
  7. LLMs can know what you mean • If you search for something in a database and misspell it → no result • If you mess up syntax for a query → error. The LLM is trained on language, so it can compensate for slightly off input. User: Which ocean cntains the bouvet island? Assistant: Bouvet Island is located in the South Atlantic Ocean.
  8. LLMs know multiple languages • Models like qwen2.5 know dozens of written languages • LLMs can respond in the user's language even if the source material is English. User (Malay): Lautan manakah yang mengandungi pulau bouvet? ("Which ocean contains Bouvet Island?") Assistant: Pulau Bouvet terletak dalam Laut Atlantik Selatan. ("Bouvet Island is located in the South Atlantic Ocean.")
  9. LLMs can hallucinate, giving irrelevant, nonsense or factually incorrect answers • LLMs rely on statistical correlation and may invent things to fill gaps • Mitigate by selecting relevant models, prompt engineering, etc. User: What is the capital of the Bouvet island? Assistant: The capital of the Bouvet Island is Punta Arenas. There is no capital! It is only penguins and researchers!
  10. There is no correlation between the recency or version of a model and what it knows about • LLMs have a training date which is often not documented • LLMs are trained differently and may have no knowledge of a specific topic • LLMs might hallucinate when asked about their training date or knowledge cut-off! User: What is the latest version of go? Assistant: As of my knowledge cutoff date in September 2023… version 1.21 (Should be Stanley). The easiest way to add new information is to pass it into the context of an existing model; this is called RAG.
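     A minimal sketch of that RAG idea: retrieve a fresh fact and pass it in the model's context. Here retrieve() is a hypothetical stand-in for a real vector or keyword search, and the model name is an assumption:

         from openai import OpenAI

         client = OpenAI()

         def retrieve(question: str) -> str:
             # Hypothetical retrieval step; a real system would query an index.
             return "Go 1.23 was released in August 2024."

         def answer(question: str) -> str:
             context = retrieve(question)
             resp = client.chat.completions.create(
                 model="gpt-4o-mini",  # assumption
                 messages=[
                     {"role": "system",
                      "content": f"Answer using this context:\n{context}"},
                     {"role": "user", "content": question},
                 ],
             )
             return resp.choices[0].message.content

         print(answer("What is the latest version of go?"))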
  11. GenAI is frequently used for assistants • They can re-use the strengths of an LLM • Add or restrict the data it can use to answer questions using RAG • Humans can validate answers and accept high latency.
  12. These concepts connect both ways! • AI Observability is about LLM performance and cost • Observability using AI is about improving your process with LLMs.
  13. OpenTelemetry has a GenAI/LLM Special Interest Group (SIG) that creates Semantic Conventions (schemas) for Traces, Metrics and Log events. github.com/open-telemetry/community
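     To make the schema concrete, a sketch of a client span carrying a few attributes from the incubating GenAI semantic conventions (attribute names as of late 2024; they may still change, and the token counts here are placeholders):

         from opentelemetry import trace

         tracer = trace.get_tracer("genai-demo")
         # Convention: span name is "{operation} {model}", e.g. "chat gpt-4o-mini".
         with tracer.start_as_current_span("chat gpt-4o-mini") as span:
             span.set_attribute("gen_ai.system", "openai")
             span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
             # ... perform the call, then record usage from the response:
             span.set_attribute("gen_ai.usage.input_tokens", 21)
             span.set_attribute("gen_ai.usage.output_tokens", 42)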
  14. Langtrace defines three pillars of GenAI observability • Traces - usage, cost, latency, success rate and model completions • Evaluations - manual and automated, for feedback • Dataset management - for fine-tuning and bootstrapping few-shot demos. https://langtrace.ai
  15. GenAI Observability is normal observability with some key challenges • OpenAI is dominant within an otherwise very diverse ecosystem • Expectation of immediate support for features not planned openly • Norms of full request/response collection for debugging chats • Token counts are tied to spend, and are both a system and a business metric (see the sketch below) • Vendors desire common telemetry tools, but they aren't there yet • Accuracy is subjective and depends on the use case.
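     Since token counts drive spend, a small sketch of turning reported usage into a cost estimate (the rates are placeholders, not real prices; check your provider's rate card):

         # USD per million tokens; placeholder rates, not a real price list.
         USD_PER_MTOK = {"input": 0.15, "output": 0.60}

         def estimate_cost(input_tokens: int, output_tokens: int) -> float:
             return (input_tokens * USD_PER_MTOK["input"]
                     + output_tokens * USD_PER_MTOK["output"]) / 1_000_000

         # e.g. a chat turn with 1200 prompt tokens and 300 completion tokens:
         print(f"${estimate_cost(1200, 300):.6f}")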
  16. Observability can leverage GenAI in many ways • Assisting analysis of existing tools • Extracting patterns or system entities out of logs • Configuring or training data feeds • Connecting telemetry data to help desk tickets or source code.
  17. Logs are always there; AI can get more out of them • Logs start before other instrumentation like tracing • Some problems exist, like changing log formats (think k8s) • Logs often have stack traces which can be cross-checked with code.
  18. Observability AI Assistant: helps SREs interpret log messages and errors, optimize code, write reports, and even identify and execute a runbook! There are other assistants in Kibana like Search, Security and Integration!
  19. Observability with AI isn't always GenAI! • bert-base-NER is a bidirectional model for Named Entity Recognition • This can be used for precise identification of PII in a redaction pipeline. www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1
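     A sketch of that NER step with the Hugging Face transformers pipeline (dslim/bert-base-NER is the commonly published checkpoint; treating every detected entity as PII, and replacing by matched text rather than offsets, are simplifications):

         # pip install transformers torch
         from transformers import pipeline

         ner = pipeline("ner", model="dslim/bert-base-NER",
                        aggregation_strategy="simple")

         line = "2024-10-15 login failure for Jane Doe from Sydney office"
         for ent in ner(line):
             # Each entity has a group (PER, LOC, ORG, MISC), score and offsets.
             line = line.replace(ent["word"], f"<{ent['entity_group']}>")
         print(line)  # e.g. "2024-10-15 login failure for <PER> from <LOC> office"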
  20. GenAI helps humans in Observability • Assist in laborious tasks or root cause analysis • Flex with changing log formats or new types of frameworks • Get more value out of your logs, generating configuration like SLOs • Consider more fields than a human can comprehend • Integrate non-traditional facts such as pull requests, issues and code. Want more? • Elastic Observability Labs -> www.elastic.co/observability-labs • Runtime TAG -> Cloud Native AI Working Group • OpenTelemetry -> Semantic Conventions: LLM SIG
  21. Observability Assistants use RAG over retrieved data • Understand language in help desk tickets or issue trackers • Connect stack traces to source code • Identify connections between events which are inconsistently labeled, such as code and PRs.
  22. Example of LLM for Observability: a log analyzer. The "USER" enters log lines instead of a question.

     SYSTEM """
     You are a log file analyzer. You will receive a set of lines from a log file for some software application, find the errors and other interesting aspects of the logs, and explain them so a new user can understand what they mean. If there are any steps they can do to resolve them, list the steps in your answer.
     """

     –snip–
     2023-11-10 07:17:40 /docker-entrypoint.sh: Configuration complete; ready for start up
     2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: using the "epoll" event method
     2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: nginx/1.25.3
     –snip–
     2023-11-10 07:18:53 2023/11/10 13:18:53 [error] 29#29: *1 open() "/usr/share/nginx/html/ahstat" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /ahstat HTTP/1.1", host: "localhost:8080"
     –snip–
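     Wiring this prompt up takes only a few lines; a sketch with the openai client (the model name and log file path are assumptions):

         from openai import OpenAI

         SYSTEM = """You are a log file analyzer. You will receive a set of lines
         from a log file for some software application, find the errors and other
         interesting aspects of the logs, and explain them so a new user can
         understand what they mean. If there are any steps they can do to resolve
         them, list the steps in your answer."""

         client = OpenAI()
         with open("nginx.log") as f:  # hypothetical log file
             logs = f.read()
         resp = client.chat.completions.create(
             model="gpt-4o-mini",  # assumption; any chat-capable model
             messages=[{"role": "system", "content": SYSTEM},
                       {"role": "user", "content": logs}],
         )
         print(resp.choices[0].message.content)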
  23. The LLM responds with a summary of what it can figure out: Here are the logs for your Nginx server from 2023-11-10 13:17:40 to 13:18:53. The first log entry shows that a worker process has started, which is normal and expected behavior. The next three entries show requests made by a client with IP address 192.168.65.1 to the server at localhost:8080 for the root path "/", "/favicon.ico", and "/ahstat". The first request returned a 200 status code, which indicates success. However, the requests for "/favicon.ico" and "/ahstat" both failed with a 404 status code, indicating that those resources were not found on the server. The final log entry shows another request from the same client to "/ahstat", also resulting in a 404 error. There is nothing inherently wrong or unusual about these logs, but it's worth noting that Nginx could not find the "/favicon.ico" and "/ahstat" resources requested by the client, which may indicate a misconfiguration or missing files on your server. If you expect to serve those resources, you should ensure they exist in the correct location (by default, /usr/share/nginx/html) and that Nginx is properly configured to serve them.
  24. Extract signals from logs without instrumentation. Typically instrumentation code has to be very precise, particularly in how it represents errors; these rules are usually captured in a schema. The LLM derives impact and semantics from plain logs, without any preparation, even knowing a 404 is not necessarily an error.