Slide 1

Slide 1 text

KubeDay Australia: Where are we going with observability? Adrian Cole, Principal Engineer, @adrianfcole

Slide 2

Slide 2 text

Iʼm Adrian from the Elastic Observability team; I mostly work on OpenTelemetry, and Iʼm only a few months into GenAI programming.

Slide 3

Slide 3 text

Agenda
● Quick sync on Observability
● Overview of GenAI
● A quick look at GenAI Observability
● A deeper look at Observability with GenAI

Slide 4

Slide 4 text

OpenTelemetry is a CNCF project that addresses Observability in terms of specifications, SDKs and tools. The next slides cover traditional observability primitives and challenges.
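To ground the "traditional observability primitives" the next slides refer to, here is a toy sketch of the three primary signals — a trace span, a counter metric, and a structured log line. This is illustrative pure Python, not the real OpenTelemetry SDK; the class and metric names are made up for the example.

```python
import json
import time

class Span:
    """Toy span: a named, timed unit of work with attributes (not the OTel SDK)."""
    def __init__(self, name):
        self.name, self.attributes = name, {}
    def __enter__(self):
        self.start = time.monotonic()
        return self
    def __exit__(self, *exc):
        self.duration_ms = (time.monotonic() - self.start) * 1000

metrics = {}  # metric name -> running counter

def incr(name, value=1):
    """Toy counter metric."""
    metrics[name] = metrics.get(name, 0) + value

# One request produces all three signals: a span, a metric increment,
# and a structured log correlated to the span by name.
with Span("GET /checkout") as span:
    span.attributes["http.status_code"] = 200
    incr("http.requests")

log_line = json.dumps({"level": "info", "span": span.name,
                       "duration_ms": round(span.duration_ms, 2)})
print(log_line)
```

Real SDKs add context propagation, exporters, and schemas on top, but the shape of the data is the same.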

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Traditional Observability focuses on data collection and primary signals
● We normalize the key metrics and correlation fields we care about most
● We spend a lot of energy on instrumentation
● We collect a lot of logs, but may struggle to analyze their impact

Slide 10

Slide 10 text

CNCF TAG Runtime includes the Cloud Native AI Working Group. This is an expert group you can follow for anything from landscape to solutions. The next slides cover the basics of GenAI, including some wins and challenges.

Slide 11

Slide 11 text

Generative Models generate new data based on known data
● Causal language models are unidirectional: they predict words based on previous ones
● Large Language Models (LLMs) can perform functions like chat, code completion, etc.
The LLM can answer correctly if geography was part of the text it was trained on.
User: Which ocean contains bouvet island?
Assistant: Bouvet Island is located in the South Atlantic Ocean.
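The "predict the next word from the previous ones" idea can be sketched with a toy bigram model — a deliberately tiny stand-in for a real causal LM, using a two-sentence corpus invented for the example:

```python
from collections import Counter, defaultdict

# Toy bigram "causal language model": predicts the next word from the
# previous word only, mirroring unidirectional next-token prediction.
corpus = ("bouvet island is located in the south atlantic ocean . "
          "the south atlantic ocean contains bouvet island .").split()

# Count which word follows which in the training text.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word` during training."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("atlantic"))  # "ocean" in this tiny corpus
```

Real LLMs condition on long contexts with learned weights rather than raw counts, but the directionality — past tokens predict the next one — is the same.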

Slide 12

Slide 12 text

LLMs can know what you mean
● If you search for something in a database and misspell it → nope
● If you mess up the syntax of a query → nope
The LLM is trained on language, so it can compensate for slightly off input.
User: Which ocean cntains the bouvet island?
Assistant: Bouvet Island is located in the South Atlantic Ocean.

Slide 13

Slide 13 text

LLMs know multiple languages
● Models like qwen2.5 know dozens of written languages
● LLMs can respond in the userʼs language even if the source material is English
User: Lautan manakah yang mengandungi pulau bouvet? (Malay: Which ocean contains Bouvet Island?)
Assistant: Pulau Bouvet terletak dalam Laut Atlantik Selatan. (Malay: Bouvet Island is located in the South Atlantic Ocean.)

Slide 14

Slide 14 text

LLMs can hallucinate, giving irrelevant, nonsensical or factually incorrect answers
● LLMs rely on statistical correlation and may invent things to fill gaps
● Mitigate by selecting relevant models, prompt engineering, etc.
User: What is the capital of the Bouvet island?
Assistant: The capital of the Bouvet Island is Punta Arenas.
(There is no capital! It is only penguins and researchers!)

Slide 15

Slide 15 text

There is no correlation between the recency or version of a model and what it knows about.
● LLMs have a training cut-off date which is often not documented
● LLMs are trained differently and may have no knowledge of a specific topic
● LLMs might hallucinate when asked about their training date or knowledge cut-off!
User: What is the latest version of go?
Assistant: As of my knowledge cutoff date in September 2023… version 1.21
(Should be Stanley)
The easiest way to add new information is to pass it into the context of an existing model; this is called RAG.
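The RAG step mentioned above can be sketched in a few lines: retrieve the most relevant document and prepend it to the prompt, so the model answers from fresh context rather than stale training data. The documents and the keyword-overlap scoring below are illustrative stand-ins for a real retriever (which would typically use vector embeddings).

```python
import re

# Toy knowledge base; a real system would query a search or vector index.
documents = [
    "Go 1.21 was released in August 2023.",
    "Bouvet Island is located in the South Atlantic Ocean.",
]

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question):
    """Pick the document sharing the most words with the question."""
    q = tokenize(question)
    return max(documents, key=lambda d: len(q & tokenize(d)))

def build_prompt(question):
    """RAG: retrieved context is passed into the model's prompt."""
    return (f"Use this context to answer.\n"
            f"Context: {retrieve(question)}\n"
            f"Question: {question}")

prompt = build_prompt("Which version of Go was released in 2023?")
print(prompt)
```

The model never changes; only the context it is handed does, which is why RAG is the easiest way to add new information.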

Slide 16

Slide 16 text

GenAI is frequently used for assistants
● They can re-use the strengths of an LLM
● Add or restrict the data it can use to answer questions using RAG
● Humans can validate answers and tolerate high latency

Slide 17

Slide 17 text

Where do Observability and GenAI connect?

Slide 18

Slide 18 text

These concepts connect both ways!
● AI Observability is about LLM performance and cost
● Observability using AI is about improving your process with LLMs

Slide 19

Slide 19 text

GenAI Observability

Slide 20

Slide 20 text

OpenTelemetry has a GenAI/LLM Special Interest Group (SIG) that creates Semantic Conventions (schemas) for Traces, Metrics and Log events. github.com/open-telemetry/community
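To make "semantic conventions" concrete, here is the kind of span attribute set the GenAI SIG defines for an LLM call. Attribute names follow the published OpenTelemetry GenAI conventions, but the conventions are still evolving, so verify names against the current spec; the model name and token counts are invented for the example.

```python
# Span attributes for one chat completion, per OTel GenAI semantic conventions.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o-mini",   # illustrative model name
    "gen_ai.usage.input_tokens": 57,          # illustrative counts
    "gen_ai.usage.output_tokens": 113,
}

# A shared schema lets any backend compute per-model cost or usage
# without vendor-specific parsing.
total_tokens = (span_attributes["gen_ai.usage.input_tokens"]
                + span_attributes["gen_ai.usage.output_tokens"])
print(total_tokens)
```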

Slide 21

Slide 21 text

Langtrace defines three pillars of GenAI observability
● Traces - usage, cost, latency, success rate and model completions
● Evaluations - manual and automated, for feedback
● Dataset management - for fine-tuning and bootstrapping few-shot demos
https://langtrace.ai

Slide 22

Slide 22 text

GenAI Observability is normal observability with some key challenges
● OpenAI is dominant within an otherwise very diverse ecosystem
● Expectation of immediate support for features not planned openly
● Norms of full request/response collection for debugging chats
● Token counts are tied to spend, making them both a system and a business metric
● Vendors want common telemetry tools, but they arenʼt there yet
● Accuracy is subjective and depends on the use case
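The "token counts are tied to spend" point is worth making concrete: once input/output token counts are on your telemetry, cost is a trivial derived metric. The per-million-token prices below are made-up placeholders; real prices vary by provider and model.

```python
# Hypothetical USD prices per million tokens (illustrative, not real pricing).
PRICE_PER_MILLION = {"input": 0.15, "output": 0.60}

def request_cost(input_tokens, output_tokens):
    """Derive spend for one LLM call from its token counts."""
    return (input_tokens * PRICE_PER_MILLION["input"]
            + output_tokens * PRICE_PER_MILLION["output"]) / 1_000_000

cost = request_cost(input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")
```

Aggregated over traces, the same counts answer both the system question ("which endpoint is token-heavy?") and the business question ("what does this feature cost?").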

Slide 23

Slide 23 text

Observability using GenAI

Slide 24

Slide 24 text

Observability can leverage GenAI in many ways
● Assisting analysis in existing tools
● Extracting patterns or system entities out of logs
● Configuring or training data feeds
● Connecting telemetry data to help desk tickets or source code

Slide 25

Slide 25 text

Logs are always there, and AI can get more out of them
● Logs start before other instrumentation like tracing
● Some problems exist, like changing log formats (think k8s)
● Logs often contain stack traces which can be cross-checked with code
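The classic alternative to AI here is hand-written parsing, which is exactly what breaks when formats change. As a baseline for comparison, this sketch extracts structured fields from one nginx-style error line; the regex targets only this sample format (an assumption), which is the brittleness the slide alludes to.

```python
import re

# One nginx-style error line, as in the later log-analyzer example.
line = ('2023/11/10 13:18:53 [error] 29#29: *1 open() '
        '"/usr/share/nginx/html/ahstat" failed (2: No such file or directory)')

# Hand-written pattern: timestamp, then bracketed severity. Any change to
# the log format silently breaks this — the gap an LLM can help bridge.
m = re.search(r'^(?P<ts>\S+ \S+) \[(?P<level>\w+)\]', line)
record = m.groupdict()
print(record)
```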

Slide 26

Slide 26 text

What are some real examples?

Slide 27

Slide 27 text

K8sGPT converts errors and vulnerabilities to plain English!

Slide 28

Slide 28 text

Observability AI Assistant: helps SREs interpret log messages and errors, optimize code, write reports, and even identify and execute a runbook! There are other assistants in Kibana, like Search, Security and Integration!

Slide 29

Slide 29 text

Elastic Automatic Import: automates the creation of custom data integrations, given sample data.

Slide 30

Slide 30 text

Observability with AI isnʼt always GenAI!
● bert-base-NER is a bidirectional model for Named Entity Recognition
● This can be used for precise identification of PII in a redaction pipeline
www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1
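The linked post pairs an NER model with regexes: regexes catch well-formed identifiers, the model catches free-form names. Here is a minimal sketch of the regex pass alone; the two patterns are illustrative, not an exhaustive PII ruleset.

```python
import re

# Illustrative PII patterns: structured identifiers regexes handle well.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text):
    """Replace each match with its label, e.g. '<EMAIL>'."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

clean = redact("client 192.168.65.1 reported by ops@example.com")
print(clean)
```

A model like bert-base-NER would run after this pass to catch entities (person names, organizations) that no regex can enumerate.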

Slide 31

Slide 31 text

Letʼs close up

Slide 32

Slide 32 text

GenAI helps humans in Observability
● Assist with laborious tasks and root cause analysis
● Flex with changing log formats or new types of frameworks
● Get more value out of your logs, generating configuration like SLOs
● Consider more fields than a human can comprehend
● Integrate non-traditional facts such as pull requests, issues and code
Want more?
● Elastic Observability Labs -> www.elastic.co/observability-labs
● Runtime TAG -> Cloud Native AI Working Group
● OpenTelemetry -> Semantic Conventions: LLM SIG

Slide 33

Slide 33 text

More slides

Slide 34

Slide 34 text

Observability Assistants use retrieved data (RAG)
● Understand language in help desk tickets or issue trackers
● Connect stack traces to source code
● Identify connections between events which are inconsistently labeled
(Diagram: Code, PR)

Slide 35

Slide 35 text

Example of LLM for Observability: a log analyzer. The “USER” enters log lines instead of a question.

SYSTEM:
"""
You are a log file analyzer. You will receive a set of lines from a log file for some software application, find the errors and other interesting aspects of the logs, and explain them so a new user can understand what they mean. If there are any steps they can do to resolve them, list the steps in your answer.
"""

USER:
–snip–
2023-11-10 07:17:40 /docker-entrypoint.sh: Configuration complete; ready for start up
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: using the "epoll" event method
2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: nginx/1.25.3
–snip–
2023-11-10 07:18:53 2023/11/10 13:18:53 [error] 29#29: *1 open() "/usr/share/nginx/html/ahstat" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /ahstat HTTP/1.1", host: "localhost:8080"
–snip–
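This prompt could be assembled as a chat-completion payload like so: the system prompt frames the task and the raw log lines take the place of the user's question. The `{"role", "content"}` message shape follows the common chat-API convention; adapt it to whatever client library you use.

```python
# System prompt from the slide (abridged), framing the analyzer task.
SYSTEM = ("You are a log file analyzer. You will receive a set of lines from "
          "a log file for some software application, find the errors and "
          "other interesting aspects of the logs, and explain them so a new "
          "user can understand what they mean.")

# Raw log lines go in verbatim as the "user" turn — no parsing needed.
log_lines = [
    '2023/11/10 13:17:40 [notice] 1#1: nginx/1.25.3',
    '2023/11/10 13:18:53 [error] 29#29: *1 open() '
    '"/usr/share/nginx/html/ahstat" failed',
]

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "\n".join(log_lines)},
]
print(messages[1]["content"])
```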

Slide 36

Slide 36 text

The LLM responds with a summary of what it can figure out:

Here are the logs for your Nginx server from 2023-11-10 13:17:40 to 13:18:53. The first log entry shows that a worker process has started, which is normal and expected behavior.

The next three entries show requests made by a client with IP address 192.168.65.1 to the server at localhost:8080 for the root path "/", "/favicon.ico", and "/ahstat". The first request returned a 200 status code, which indicates success. However, the requests for "/favicon.ico" and "/ahstat" both failed with a 404 status code, indicating that those resources were not found on the server. The final log entry shows another request from the same client to "/ahstat", also resulting in a 404 error.

There is nothing inherently wrong or unusual about these logs, but it's worth noting that Nginx could not find the "/favicon.ico" and "/ahstat" resources requested by the client, which may indicate a misconfiguration or missing files on your server. If you expect to serve those resources, you should ensure they exist in the correct location (by default, /usr/share/nginx/html) and that Nginx is properly configured to serve them.

Slide 37

Slide 37 text

Extract signals from logs without instrumentation. Typically instrumentation code has to be very precise, particularly in how it represents errors; these rules are usually captured in a schema. The LLM derives impact and semantics from plain logs, without any preparation, even knowing that a 404 is not necessarily an error.