Where are we going with observability? KubeDay Australia

This was a presentation of "Where are we going with observability?" at KubeDay Australia. It covered observability and GenAI separately, then both together in each direction: observing GenAI systems and applying GenAI to observability. The audience was engaged and tolerated my humor well, and I met several people afterward. I would love to go again!

https://events.linuxfoundation.org/kubeday-australia/

Adrian Cole

October 15, 2024

Transcript

  1. I'm Adrian from the Elastic Observability team; I mostly work on OpenTelemetry and am only a few months into GenAI programming.
  2. Agenda • Quick sync on Observability • Overview of GenAI • A quick look at GenAI Observability • A deeper look at Observability with GenAI
  3. OpenTelemetry is a CNCF project that addresses Observability in terms of specifications, SDKs and tools. The next slides cover traditional observability primitives and challenges.
  4. Traditional Observability focuses on data collection and primary signals • We normalize key metrics or correlation fields we care about most • We spend a lot of energy on instrumentation (a minimal sketch follows below) • We collect a lot of logs, but may struggle to analyze their impact
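     As a concrete reference, a minimal sketch of this kind of manual instrumentation using the OpenTelemetry Python SDK (the service name, span, metric and attribute here are illustrative, not from the talk):

         # pip install opentelemetry-sdk
         from opentelemetry import trace, metrics
         from opentelemetry.sdk.trace import TracerProvider
         from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

         trace.set_tracer_provider(TracerProvider())
         trace.get_tracer_provider().add_span_processor(
             SimpleSpanProcessor(ConsoleSpanExporter()))

         tracer = trace.get_tracer("checkout-service")  # hypothetical service
         meter = metrics.get_meter("checkout-service")
         orders = meter.create_counter("orders.processed")  # a key metric we normalize

         # One span per unit of work, carrying a correlation field we care about.
         with tracer.start_as_current_span("process_order") as span:
             span.set_attribute("order.id", "o-123")
             orders.add(1)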
  5. CNCF TAG Runtime includes the Cloud Native AI Working Group. This is an expert group you can follow for anything from landscape to solutions. The next slides cover basics of GenAI, including some wins and challenges.
  6. Generative Models generate new data based on known data • Causal language models are unidirectional: they predict words based on previous ones • Large Language Models (LLMs) can perform functions like chat, code completion, etc. The LLM can answer correctly if geography was part of the text it was trained on. User: Which ocean contains Bouvet Island? Assistant: Bouvet Island is located in the South Atlantic Ocean.
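     As a reference for the chat pattern on this and the following slides, a hedged sketch using the openai Python client (the model name is an assumption; any chat-capable model or OpenAI-compatible endpoint would do):

         # pip install openai
         from openai import OpenAI

         client = OpenAI()  # reads OPENAI_API_KEY from the environment
         resp = client.chat.completions.create(
             model="gpt-4o-mini",  # assumption: any chat-capable model
             messages=[{"role": "user",
                        "content": "Which ocean contains Bouvet Island?"}],
         )
         print(resp.choices[0].message.content)
         # e.g. "Bouvet Island is located in the South Atlantic Ocean."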
  7. LLMs can know what you mean • If you search for something in a database and misspell it → no result • If you mess up syntax for a query → error. The LLM is trained on language, so it can compensate for slightly off input. User: Which ocean cntains the bouvet island? Assistant: Bouvet Island is located in the South Atlantic Ocean.
  8. LLMs know multiple languages • Models like qwen2.5 know dozens of written languages • LLMs can respond in the user's language even if the source material is English. User (Malay): Lautan manakah yang mengandungi pulau bouvet? ("Which ocean contains Bouvet Island?") Assistant: Pulau Bouvet terletak dalam Laut Atlantik Selatan. ("Bouvet Island is located in the South Atlantic Ocean.")
  9. LLMs can hallucinate, giving irrelevant, nonsense or factually incorrect answers • LLMs rely on statistical correlation and may invent things to fill gaps • Mitigate by selecting relevant models, prompt engineering, etc. User: What is the capital of the Bouvet island? Assistant: The capital of the Bouvet Island is Punta Arenas. There is no capital! It is only penguins and researchers!
  10. There is no correlation between the recency or version of a model and what it knows about • LLMs have a training date which is often not documented • LLMs are trained differently and may have no knowledge of a specific topic • LLMs might hallucinate when asked about their training date or knowledge cut-off! User: What is the latest version of go? Assistant: As of my knowledge cutoff date in September 2023… version 1.21 (Should be Stanley). The easiest way to add new information is to pass it into the context of an existing model; this is called RAG.
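     A minimal sketch of that RAG idea: retrieve a fresh fact and pass it in the model's context. Here retrieve() is a hypothetical stand-in for a real vector or keyword search, and the model name is an assumption:

         from openai import OpenAI

         client = OpenAI()

         def retrieve(question: str) -> str:
             # Hypothetical retrieval step; a real system would query an index.
             return "Go 1.23 was released in August 2024."

         def answer(question: str) -> str:
             context = retrieve(question)
             resp = client.chat.completions.create(
                 model="gpt-4o-mini",  # assumption
                 messages=[
                     {"role": "system",
                      "content": f"Answer using this context:\n{context}"},
                     {"role": "user", "content": question},
                 ],
             )
             return resp.choices[0].message.content

         print(answer("What is the latest version of go?"))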
  11. GenAI is frequently used for assistants • They can re-use the strengths of an LLM • Add or restrict the data it can use to answer questions using RAG • Humans can validate answers and accept high latency.
  12. These concepts connect both ways! • AI Observability is about LLM performance and cost • Observability using AI is about improving your process with LLMs.
  13. OpenTelemetry has a GenAI/LLM Special Interest Group (SIG) that creates Semantic Conventions (schemas) for Traces, Metrics and Log events. github.com/open-telemetry/community
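     To make the schema concrete, a sketch of a client span carrying a few attributes from the incubating GenAI semantic conventions (attribute names as of late 2024; they may still change, and the token counts here are placeholders):

         from opentelemetry import trace

         tracer = trace.get_tracer("genai-demo")
         # Convention: span name is "{operation} {model}", e.g. "chat gpt-4o-mini".
         with tracer.start_as_current_span("chat gpt-4o-mini") as span:
             span.set_attribute("gen_ai.system", "openai")
             span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
             # ... perform the call, then record usage from the response:
             span.set_attribute("gen_ai.usage.input_tokens", 21)
             span.set_attribute("gen_ai.usage.output_tokens", 42)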
  14. Langtrace defines three pillars of GenAI observability • Traces - usage, cost, latency, success rate and model completions • Evaluations - manual and automated, for feedback • Dataset management - for fine-tuning and bootstrapping few-shot demos. https://langtrace.ai
  15. GenAI Observability is normal observability with some key challenges • OpenAI is dominant within an otherwise very diverse ecosystem • Expectation of immediate support for features not planned openly • Norms of full request/response collection for debugging chats • Token counts are tied to spend, and are both a system and a business metric (see the sketch below) • Vendors desire common telemetry tools, but they aren't there yet • Accuracy is subjective and depends on the use case.
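     Since token counts drive spend, a small sketch of turning reported usage into a cost estimate (the rates are placeholders, not real prices; check your provider's rate card):

         # USD per million tokens; placeholder rates, not a real price list.
         USD_PER_MTOK = {"input": 0.15, "output": 0.60}

         def estimate_cost(input_tokens: int, output_tokens: int) -> float:
             return (input_tokens * USD_PER_MTOK["input"]
                     + output_tokens * USD_PER_MTOK["output"]) / 1_000_000

         # e.g. a chat turn with 1200 prompt tokens and 300 completion tokens:
         print(f"${estimate_cost(1200, 300):.6f}")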
  16. Observability can leverage GenAI in many ways • Assisting analysis of existing tools • Extracting patterns or system entities out of logs • Configuring or training data feeds • Connecting telemetry data to help desk tickets or source code.
  17. Logs are always there; AI can get more out of them • Logs start before other instrumentation like tracing • Some problems exist, like changing log formats (think k8s) • Logs often have stack traces which can be cross-checked with code.
  18. Observability AI Assistant: helps SREs interpret log messages and errors, optimize code, write reports, and even identify and execute a runbook! There are other assistants in Kibana like Search, Security and Integration!
  19. Observability with AI isn't always GenAI! • bert-base-NER is a bidirectional model for Named Entity Recognition • This can be used for precise identification of PII in a redaction pipeline. www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1
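     A sketch of that NER step with the Hugging Face transformers pipeline (dslim/bert-base-NER is the commonly published checkpoint; treating every detected entity as PII, and replacing by matched text rather than offsets, are simplifications):

         # pip install transformers torch
         from transformers import pipeline

         ner = pipeline("ner", model="dslim/bert-base-NER",
                        aggregation_strategy="simple")

         line = "2024-10-15 login failure for Jane Doe from Sydney office"
         for ent in ner(line):
             # Each entity has a group (PER, LOC, ORG, MISC), score and offsets.
             line = line.replace(ent["word"], f"<{ent['entity_group']}>")
         print(line)  # e.g. "2024-10-15 login failure for <PER> from <LOC> office"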
  20. GenAI helps humans in Observability • Assist in laborious tasks or root cause analysis • Flex with changing log formats or new types of frameworks • Get more value out of your logs, generating configuration like SLOs • Consider more fields than a human can comprehend • Integrate non-traditional facts such as pull requests, issues and code. Want more? • Elastic Observability Labs -> www.elastic.co/observability-labs • Runtime TAG -> Cloud Native AI Working Group • OpenTelemetry -> Semantic Conventions: LLM SIG
  21. Observability Assistants use RAG over retrieved data • Understand language in help desk tickets or issue trackers • Connect stack traces to source code • Identify connections between events which are inconsistently labeled, such as code and PRs.
  22. Example of LLM for Observability: a log analyzer. The "USER" enters log lines instead of a question.

     SYSTEM """
     You are a log file analyzer. You will receive a set of lines from a log file for some software application, find the errors and other interesting aspects of the logs, and explain them so a new user can understand what they mean. If there are any steps they can do to resolve them, list the steps in your answer.
     """

     –snip–
     2023-11-10 07:17:40 /docker-entrypoint.sh: Configuration complete; ready for start up
     2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: using the "epoll" event method
     2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: nginx/1.25.3
     –snip–
     2023-11-10 07:18:53 2023/11/10 13:18:53 [error] 29#29: *1 open() "/usr/share/nginx/html/ahstat" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /ahstat HTTP/1.1", host: "localhost:8080"
     –snip–
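     Wiring this prompt up takes only a few lines; a sketch with the openai client (the model name and log file path are assumptions):

         from openai import OpenAI

         SYSTEM = """You are a log file analyzer. You will receive a set of lines
         from a log file for some software application, find the errors and other
         interesting aspects of the logs, and explain them so a new user can
         understand what they mean. If there are any steps they can do to resolve
         them, list the steps in your answer."""

         client = OpenAI()
         with open("nginx.log") as f:  # hypothetical log file
             logs = f.read()
         resp = client.chat.completions.create(
             model="gpt-4o-mini",  # assumption; any chat-capable model
             messages=[{"role": "system", "content": SYSTEM},
                       {"role": "user", "content": logs}],
         )
         print(resp.choices[0].message.content)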
  23. The LLM responds with a summary of what it can figure out: Here are the logs for your Nginx server from 2023-11-10 13:17:40 to 13:18:53. The first log entry shows that a worker process has started, which is normal and expected behavior. The next three entries show requests made by a client with IP address 192.168.65.1 to the server at localhost:8080 for the root path "/", "/favicon.ico", and "/ahstat". The first request returned a 200 status code, which indicates success. However, the requests for "/favicon.ico" and "/ahstat" both failed with a 404 status code, indicating that those resources were not found on the server. The final log entry shows another request from the same client to "/ahstat", also resulting in a 404 error. There is nothing inherently wrong or unusual about these logs, but it's worth noting that Nginx could not find the "/favicon.ico" and "/ahstat" resources requested by the client, which may indicate a misconfiguration or missing files on your server. If you expect to serve those resources, you should ensure they exist in the correct location (by default, /usr/share/nginx/html) and that Nginx is properly configured to serve them.
  24. Extract signals from logs without instrumentation. Typically instrumentation code has to be very precise, particularly in how it represents errors; these rules are usually captured in a schema. The LLM derives impact and semantics from plain logs, without any preparation, even knowing a 404 is not necessarily an error.