Slide 1

Slide 1 text

Practical GenAI with Go
GopherCon Singapore

Slide 2

Slide 2 text

Introductions! I’m Adrian from the Elastic Observability team; I mostly work on OpenTelemetry. I’m a baby in GenAI programming (<9 months full-time), but I’ve spent 100+ hours on this deck! A couple of years ago, I co-led wazero, a zero-dependency WebAssembly runtime for Go. There are tons of portability projects in my open source history. github.com/codefromthecrypt x.com/adrianfcole

Slide 3

Slide 3 text

Agenda
● Introduction to GenAI
● Running models locally with llama.cpp
● Remote access to LLMs using the OpenAI REST API
● Simplifying model hosting with Ollama
● Coding GenAI with Parakeet
● Context, RAG and tools (the secret behind agents)
● Choose which demo you want
● Summing it up and things we left out

Slide 4

Slide 4 text

Introduction to GenAI
Generative models generate new data based on known data.
● Causal language models are unidirectional: they predict words based on previous ones
● Large Language Models (LLMs) are typically trained for chat, code completion, etc.
The LLM can answer correctly if geography was part of the text it was trained on.
User: Answer in up to 3 words: Which ocean contains Bouvet Island?
Assistant: South Atlantic Ocean

Slide 5

Slide 5 text

Context window
The context window includes all messages considered when generating a response. In other words, any chat message replays all prior questions and answers.
● It can include chat history, instructions, documents, images (if multimodal), etc.
● Limits are not in characters or words, but in tokens, roughly ¾ of a word each.
User: Answer in up to 3 words: Which ocean contains Bouvet Island?
Assistant: South Atlantic Ocean
User: What’s the capital?
Assistant: As Bouvet Island is an uninhabited territory, it does not have a capital.
34 tokens in the context window, 18 tokens generated
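Since limits are in tokens rather than words, a rule of thumb helps budget the context window; llama-tokenize (shown later) gives exact counts. A minimal Go sketch of the ¾-word heuristic:

package main

import (
	"fmt"
	"strings"
)

// estimateTokens applies the rough rule of thumb that one token is
// about ¾ of a word, so tokens ≈ words * 4 / 3.
func estimateTokens(text string) int {
	return len(strings.Fields(text)) * 4 / 3
}

func main() {
	prompt := "Answer in up to 3 words: Which ocean contains Bouvet Island?"
	fmt.Printf("~%d tokens\n", estimateTokens(prompt)) // prints ~14
}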

Slide 6

Slide 6 text

What’s our architecture? LLM I need an LLM of some kind

Slide 7

Slide 7 text

Some popular LLMs
OpenAI is an organization and cloud platform, 49% owned by Microsoft
● OpenAI develops the GPT (Generative Pre-trained Transformer) LLM
● GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT
Llama is an LLM developed by Meta
● You can download it for free from Meta and run it with llama-stack
● It is also hosted on platforms, and downloadable in llama.cpp’s GGUF format
Thousands of LLMs exist
● Many types of LLMs, differing in source, size, training, and modality (photo, sound, etc.)
● We’ll use the small and well-documented Qwen2.5 LLM developed by Alibaba

Slide 8

Slide 8 text

What’s our architecture?
Qwen2.5-0.5B Instruct
Where do I get this?
The latest version, Qwen2.5, has the following features:
● Dense, easy-to-use, decoder-only language models, available in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes, and base and instruct variants.
● Pretrained on our latest large-scale dataset, encompassing up to 18T tokens.
● Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON.
● More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
● Context length support up to 128K tokens and can generate up to 8K tokens.
● Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Slide 9

Slide 9 text

huggingface.co
Hugging Face Hub hosts thousands of models
Repo: Qwen/Qwen2.5-0.5B-Instruct-GGUF
File: qwen2.5-0.5b-instruct-q5_k_m.gguf
I want GGUF for portability

Slide 10

Slide 10 text

What’s our architecture?
Next: llama.cpp
I’ll get a portable (GGUF) file from Hugging Face
Why not Go? There’s no pure-Go GGUF-capable runtime; gotzmann/llama.go is the closest attempt.
Qwen2.5-0.5B Instruct

Slide 11

Slide 11 text

llama.cpp github.com/ggerganov/llama.cpp
llama.cpp is a dependency-free LLM library and binary distribution
● It defines the GGUF format and can convert from Hugging Face (hf) and other formats
● MIT licensed, community project with first implementation in March 2023
llama-cli runs, llama-server hosts, llama-tokenize counts

Slide 12

Slide 12 text

What’s our architecture? llama-cli How do I use this? Qwen2.5-0.5B Instruct

Slide 13

Slide 13 text

Hello llama-cli + Hugging Face
$ llama-cli --log-disable --no-display-prompt \
    --hf-repo Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    --hf-file qwen2.5-0.5b-instruct-q5_k_m.gguf \
    -no-cnv --prompt 'Which ocean contains Bouvet Island?'

Slide 14

Slide 14 text

Is it a question to answer or an example to follow?
User: Which ocean contains Bouvet Island?
<|im_start|>user
Which ocean contains Bouvet Island?
<|im_end|>
<|im_start|>assistant
Qwen2 documents prompt formatting as ChatML. Doing otherwise can create ambiguity.
● Control tokens: <|im_start|>, <|im_end|> and <|endoftext|>
● Roles: system, user and assistant

Slide 15

Slide 15 text

Hello reading the Qwen2.5 docs!
$ llama-cli -no-cnv \
    --hf-repo Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    --hf-file qwen2.5-0.5b-instruct-q5_k_m.gguf \
    --prompt '<|im_start|>user
Which ocean contains Bouvet Island?
<|im_end|>
<|im_start|>assistant
'

Slide 16

Slide 16 text

What’s our architecture? llama-server curl How do I use this? Qwen2.5-0.5B Instruct

Slide 17

Slide 17 text

OpenAI API is a de facto standard github.com/openai/openai-openapi
$ llama-server --log-disable \
    --hf-repo Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    --hf-file qwen2.5-0.5b-instruct-q5_k_m.gguf
$ curl -s -X POST localhost:8080/v1/completions \
    -H "Content-Type: application/json" -d '{
  "prompt": "<|im_start|>user\nWhich ocean contains Bouvet Island?\n<|im_end|>\n<|im_start|>assistant\n"
}' | jq -r '.content'
The completions endpoint accepts the same prompt as llama-cli
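The same request works from Go with just the standard library. A minimal sketch, assuming llama-server is still running on localhost:8080 as above; the content field is what the jq filter extracts:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// The same ChatML-formatted prompt as the curl example above.
	body, _ := json.Marshal(map[string]string{
		"prompt": "<|im_start|>user\nWhich ocean contains Bouvet Island?\n<|im_end|>\n<|im_start|>assistant\n",
	})
	resp, err := http.Post("http://localhost:8080/v1/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result struct {
		Content string `json:"content"`
	}
	json.NewDecoder(resp.Body).Decode(&result)
	fmt.Println(result.Content)
}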

Slide 18

Slide 18 text

What’s our architecture? llama-server curl Can we simplify this part? Qwen2.5-0.5B Instruct

Slide 19

Slide 19 text

Hello Ollama github.com/ollama/ollama
● Written in Go
● Docker-like experience
● llama.cpp backend
● Simplified model flow
$ ollama serve
$ ollama pull qwen2.5:0.5b
$ ollama run qwen2.5:0.5b "Which ocean contains Bouvet Island?"
Bouvet Island are located in the South Atlantic Ocean.

Slide 20

Slide 20 text

Ollama at a glance
ollama pull qwen2.5:0.5b → fetches from registry.ollama.ai into ~/.ollama/models
ollama serve → spawns ollama_llama_server with qwen2.5:0.5b
ollama run qwen2.5:0.5b "Which ocean contains Bouvet Island?"
Ollama gets models from a registry and bundles a custom llama-server into its binary with //go:embed!
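That //go:embed trick is standard Go, not Ollama magic. A minimal sketch of the idea (not Ollama’s actual code), assuming a file named llama-server sits next to the source at build time:

package main

import (
	_ "embed"
	"fmt"
)

// The compiler embeds the bytes of the llama-server file into the
// binary, the way Ollama bundles its custom llama-server build.
//go:embed llama-server
var llamaServer []byte

func main() {
	fmt.Printf("embedded llama-server: %d bytes\n", len(llamaServer))
}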

Slide 21

Slide 21 text

What’s our architecture? curl How do I use this? ollama serve registry.ollama.ai qwen2.5:0.5b

Slide 22

Slide 22 text

You can use Ollama’s API or OpenAI’s API
$ curl -s -X POST localhost:11434/v1/completions \
    -H "Content-Type: application/json" -d '{
  "model": "qwen2.5:0.5b",
  "prompt": "Which ocean contains Bouvet Island?"
}' | jq -r .choices[0].text
If OpenAI, specify the model, but you don’t need to format the prompt! Ollama’s Server.GenerateHandler can format the prompt for you! pkg.go.dev/text/template
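Here’s a hedged sketch of how prompt formatting with text/template can work. The template below is a simplified ChatML-style one I wrote for illustration, not Ollama’s actual qwen2.5 template:

package main

import (
	"os"
	"text/template"
)

type Message struct {
	Role, Content string
}

// A simplified ChatML-style template. Ollama’s real template ships as a
// registry layer (see the next slide) and is rendered the same way.
var chatML = template.Must(template.New("chatml").Parse(
	"{{range .}}<|im_start|>{{.Role}}\n{{.Content}}<|im_end|>\n{{end}}<|im_start|>assistant\n"))

func main() {
	messages := []Message{
		{Role: "user", Content: "Which ocean contains Bouvet Island?"},
	}
	// Prints the ChatML prompt we had to hand-write for llama-cli.
	chatML.Execute(os.Stdout, messages)
}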

Slide 23

Slide 23 text

Ollama registry includes models and configuration
$ curl -s https://registry.ollama.ai/v2/library/qwen2.5/manifests/0.5b | jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:005f95c7475154a17e84b85cd497949d6dd2a4f9d77c096e3c66e4d9c32acaf5",
    "size": 490
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:c5396e06af294bd101b30dce59131a76d2b773e76950acc870eda801d3ab0515",
      "size": 397807936
    },
    {
      "mediaType": "application/vnd.ollama.image.system",
      "digest": "sha256:66b9ea09bd5b7099cbb4fc820f31b575c0366fa439b08245566692c6784e281e",
      "size": 68
    },
    {
      "mediaType": "application/vnd.ollama.image.template",
      "digest": "sha256:eb4402837c7829a690fa845de4d7f3fd842c2adee476d5341da8a46ea9255175",
      "size": 1482
    },
    {
      "mediaType": "application/vnd.ollama.image.license",
      "digest": "sha256:832dd9e00a68dd83b3c3fb9f5588dad7dcf337a0db50f7d9483f310cd292e92e",
      "size": 11343
    }
  ]
}
Similar to container registries!

Slide 24

Slide 24 text

What’s our architecture? ??? ollama serve registry.ollama.ai I want to use Go! qwen2.5:0.5b

Slide 25

Slide 25 text

Hello Parakeet github.com/parakeet-nest/parakeet

package main

import (
	"fmt"

	"github.com/parakeet-nest/parakeet/completion"
	"github.com/parakeet-nest/parakeet/llm"
)

func main() {
	url := "http://localhost:11434"
	model := "qwen2.5:0.5b"
	question := llm.Query{
		Model:  model,
		Prompt: "Which ocean contains Bouvet Island?",
	}
	answer, _ := completion.Generate(url, question)
	fmt.Println(answer.Response)
}

Slide 26

Slide 26 text

Passing context github.com/parakeet-nest/parakeet

question := llm.Query{
	Model:  model,
	Prompt: "Which ocean contains Bouvet Island?",
}
answer, _ := completion.Generate(url, question)
fmt.Println(answer.Response)
fmt.Println()

secondQuestion := llm.Query{
	Model:   model,
	Prompt:  "What's the capital?",
	Context: answer.Context,
}
answer, _ = completion.Generate(url, secondQuestion)
fmt.Println(answer.Response)

In Parakeet, replay messages by passing the previous context to the next question.

Slide 27

Slide 27 text

Hallucination
LLMs can hallucinate, giving irrelevant, nonsensical or factually incorrect answers
● LLMs rely on statistical correlation and may invent things to fill gaps
● Mitigate by selecting relevant models, prompt engineering, etc.
User: What is the capital of Bouvet Island?
Assistant: The capital of Bouvet Island is Punta Arenas.
$ ollama ls
qwen2.5:7b    845dbda0ea48    4.7 GB    4 hours ago
qwen2.5:0.5b  a8b0c5157701    397 MB    4 weeks ago
More parameters can be more expensive, but might give more relevant answers

Slide 28

Slide 28 text

What’s our architecture? ollama serve registry.ollama.ai main.go I want to learn about Go 1.24! qwen2.5:7b

Slide 29

Slide 29 text

Limits to available knowledge
There is no correlation between the recency or version of a model and what it knows about.
● LLMs have a training cut-off date, which is often not documented
● LLMs are trained differently and may have no knowledge of a specific topic
● LLMs might hallucinate when asked about their training date or knowledge cut-off!
User: What is the latest version of go?
Assistant: As of my last update in October 2023, the latest stable release of Go (Golang) is typically found at the official Go website. The most recent major releases have been:
- **Go 1.20** (December 2023)
- **Go 1.21** (March 2024)
¯\_(ツ)_/¯
The easiest way to add new information is to pass it into the context of an existing model.

Slide 30

Slide 30 text

Passing a document to the chat context
If you already have a document, set up the system messages before executing the user’s question.

message := `You are a Golang expert.
Using only the below provided context, answer the user's question
to the best of your ability using only the resources provided.
`

context := `
### New benchmark function

Benchmarks may now use the faster and less error-prone [testing.B.Loop](/pkg/testing#B.Loop)
`

question := `Summarize what's new with benchmarks in 3 bullet points. Be succinct`

query := llm.Query{
	Model: model,
	Messages: []llm.Message{
		{Role: "system", Content: message},
		{Role: "system", Content: context},
		{Role: "user", Content: question},
	},
	Stream: false,
}

Slide 31

Slide 31 text

Retrieval-augmented generation (RAG)
RAG is a technique to retrieve relevant information based on the user’s question.
● Embedding models convert text into numeric vectors
● A VectorDB stores vectors and associated text, and exposes similarity queries
First, data is chunked and stored in the VectorDB
● Each chunk of data is vectorized by the embedding model
● The resulting vectors and raw text are stored in the VectorDB
Each query is vectorized and used in a similarity query (see the sketch below)
● The user query is vectorized by the same embedding model
● These vectors are used as input to a VectorDB similarity query
● The results of that query are chunks of text to put into the context
www.mixedbread.ai/docs/embeddings/mxbai-embed-large-v1
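A minimal sketch of the whole flow, assuming Ollama is running with mxbai-embed-large pulled, and using an in-memory slice with cosine similarity in place of Elasticsearch:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math"
	"net/http"
)

// embed vectorizes text with the mxbai-embed-large model via Ollama.
func embed(text string) []float64 {
	body, _ := json.Marshal(map[string]string{
		"model": "mxbai-embed-large", "prompt": text,
	})
	resp, err := http.Post("http://localhost:11434/api/embeddings",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var result struct {
		Embedding []float64 `json:"embedding"`
	}
	json.NewDecoder(resp.Body).Decode(&result)
	return result.Embedding
}

// cosine measures the similarity of two vectors of equal length.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// First, each chunk is vectorized and stored alongside its text.
	chunks := []string{
		"Benchmarks may now use testing.B.Loop.",
		"Bouvet Island is in the South Atlantic Ocean.",
	}
	vectors := make([][]float64, len(chunks))
	for i, chunk := range chunks {
		vectors[i] = embed(chunk)
	}

	// Each query is vectorized and used in a similarity search.
	question := embed("What's new with benchmarks?")
	best := 0
	for i := range vectors {
		if cosine(vectors[i], question) > cosine(vectors[best], question) {
			best = i
		}
	}
	// The winning chunk is what goes into the chat context.
	fmt.Println(chunks[best])
}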

Slide 32

Slide 32 text

What’s our architecture? ollama serve registry.ollama.ai main.go Elasticsearch is also a VectorDB qwen2.5:7b elasticsearch mxbai-embed-large embeddings chat

Slide 33

Slide 33 text

LLM + tools = agent
When calling an LLM, you can add metadata about tools it can use.
● Along with your messages, you include function signatures and documentation.
When the LLM thinks a tool could help answer, it replies with a tool call request
● The calling code, not the LLM, is what actually invokes the tool.
If you send back a failed call, the LLM may be able to figure out how to progress
● It might change parameters, try another tool, give up, or … hallucinate!
We can decouple functions and system prompts from the agent! There’s modelcontextprotocol.io, with a few Go implementations like github.com/mark3labs/mcp-go
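Before the Parakeet code on the next slide, here is a hedged sketch of what this exchange looks like on the wire, assuming Ollama’s /api/chat endpoint and its OpenAI-style tool schema; the shell tool mirrors our demo:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Function signature and documentation, sent along with the messages.
	request := map[string]any{
		"model":  "qwen2.5:7b",
		"stream": false,
		"messages": []map[string]string{
			{"role": "user", "content": "List the files in this directory"},
		},
		"tools": []map[string]any{{
			"type": "function",
			"function": map[string]any{
				"name":        "shell",
				"description": "Runs a shell command and returns its output",
				"parameters": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"command": map[string]string{
							"type":        "string",
							"description": "The command to run",
						},
					},
					"required": []string{"command"},
				},
			},
		}},
	}

	body, _ := json.Marshal(request)
	resp, err := http.Post("http://localhost:11434/api/chat",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// When the LLM thinks the tool could help, message.tool_calls holds
	// the function name and arguments for the calling code to invoke.
	var result struct {
		Message struct {
			ToolCalls []struct {
				Function struct {
					Name      string         `json:"name"`
					Arguments map[string]any `json:"arguments"`
				} `json:"function"`
			} `json:"tool_calls"`
		} `json:"message"`
	}
	json.NewDecoder(resp.Body).Decode(&result)
	for _, call := range result.Message.ToolCalls {
		fmt.Println(call.Function.Name, call.Function.Arguments)
	}
}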

Slide 34

Slide 34 text

The LLM may request you to invoke tools
If you provide Tools, the LLM might return ToolCalls needed to complete your prompt

prompt := `Write me a README.md for the code in this directory`

q := llm.Query{
	Model: model,
	Tools: []llm.Tool{
		{
			Type: "function",
			Function: llm.Function{
				Name: "shell",
				–snip–
	Messages: []llm.Message{
		{Role: "system", Content: prompt},
		{Role: "user", Content: prompt},
	},
}

a, _ := completion.Chat(ollamaURL, q)
// Loop to let the LLM handle tool glitches
for len(a.Message.ToolCalls) == 1 {
	toolCall := a.Message.ToolCalls[0]
	out := callFunction(toolCall.Function)
	q.Messages = append(q.Messages, a.Message,
		llm.Message{Role: "tool", Content: out})
	a, _ = completion.Chat(ollamaURL, q)
}

Slide 35

Slide 35 text

What’s our architecture? ollama serve registry.ollama.ai main.go I want you to write my README.md file qwen2.5:7b shell(command) read_file(path) write_file(path,content)

Slide 36

Slide 36 text

Before our demo, thank Philippe (k33g)! Not just thanks for Parakeet.. or even for these demos.. but for months of hard thinking on GenAI and in open source! github.com/k33g x.com/k33g_org

Slide 37

Slide 37 text

Choose your demo: RAG or Agent! github.com/codefromthecrypt/practical-genai-go

Slide 38

Slide 38 text

Takeaways and Thanks!
Use Go as much as you like in GenAI! We showed how..
Ollama is written in Go and eases access to models and the powerful llama.cpp backend
● We didn’t discuss Modelfile, which can bake your system context into a model image
Parakeet is a Go library with handy features for RAG and markdown, HTML, etc.
● We didn’t discuss the official OpenAI Go SDK, which works with Ollama, or MCP
Elasticsearch VectorDB features help with your custom embeddings
● We didn’t discuss the semantic_text field type, which can automatically create embeddings
Adrian at Elastic github.com/codefromthecrypt x.com/adrianfcole
Thank Philippe from Parakeet! github.com/k33g x.com/k33g_org