

Practical GenAI with Go - Elastic and Golang Sydney

My first time giving a presentation as an Elastic employee!

This talk gives an overview of GenAI from both a "what things are" and a "nuts-and-bolts" perspective, with an emphasis on technology written in Go. Notably, the server is Ollama and the code examples use Parakeet. We covered how Ollama works and, since this was at an Elastic office, even how to use Elasticsearch as the VectorDB for RAG workflows! The dozens present in Sydney were fantastic participants, offering insights along the way and a healthy amount of feedback at the end.

https://www.meetup.com/en-AU/sydney-elastic-fantastics/events/302290920/

Adrian Cole

August 29, 2024



Transcript

  1. Introductions! I’m Adrian from the Elastic Observability team; I mostly work on OpenTelemetry. I’m a baby in GenAI programming (<3 months full-time), but I spent 100+ hours on this deck! A couple of years ago I co-led wazero, a zero-dependency WebAssembly runtime for Go. Tons of portability things in my open source history. github.com/codefromthecrypt x.com/adrianfcole
  2. Agenda
     • Introduction to GenAI
     • Running models locally with llama.cpp
     • Remote access to LLMs using the OpenAI REST API
     • Simplifying model hosting with Ollama
     • Coding GenAI with Parakeet
     • Loading new data via context, and a RAG demo
     • Summing it up and things we left out
  3. Introduction to GenAI
     Generative models generate new data based on known data.
     • Causal language models are unidirectional: they predict words based on previous ones
     • Large Language Models (LLMs) are typically trained for chat, code completion, etc.
     The LLM can answer correctly if geography was part of the text it was trained on:
     User: Which ocean contains the falkland islands?
     Assistant: The Falkland Islands are located in the South Atlantic Ocean.
  4. Context window
     The context window includes all messages considered when generating a response. In other words, any chat message replays all prior questions and answers.
     • It can include chat history, instructions, documents, images (if multimodal), etc.
     • Limits are not in characters or words, but in tokens: a token is roughly ¾ of a word
     User: Which ocean contains the falkland islands?
     Assistant: The Falkland Islands are located in the South Atlantic Ocean.
     User: What’s the capital?
     (41 tokens in the context window)
     Assistant: The capital of the Falkland Islands is Stanley.
     (11 tokens generated)
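     Since limits are counted in tokens rather than words, it helps to estimate usage before hitting them. Below is a minimal Go sketch of the ¾-of-a-word rule of thumb from this slide. The helper name is hypothetical; real counts vary by tokenizer, and llama-tokenize (mentioned later) gives exact numbers for a given model.

     package main

     import (
         "fmt"
         "strings"
     )

     // estimateTokens applies the rule of thumb that a token is about ¾ of a
     // word (so 3 words ≈ 4 tokens). An approximation only; use the model's
     // tokenizer for exact counts.
     func estimateTokens(text string) int {
         words := len(strings.Fields(text))
         return words * 4 / 3
     }

     func main() {
         fmt.Println(estimateTokens("Which ocean contains the falkland islands?")) // 6 words ≈ 8 tokens
     }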
  5. Some popular LLMs
     OpenAI is an organization and cloud platform, 49% owned by Microsoft
     • OpenAI develops the GPT (Generative Pre-trained Transformer) LLM
     • GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT
     LLaMa is an LLM developed by Meta
     • You can download it for free from Meta and run it with llama-stack
     • It is also hosted on platforms, and downloadable in llama.cpp’s GGUF format
     Thousands of LLMs exist
     • Many types of LLMs, differing in source, size, training, and modality (photo, sound, etc.)
     • We’ll use the small and well-documented Qwen2 LLM developed by Alibaba
  6. What’s our architecture? Qwen2-0.5B Instruct. Where do I get this? From the Qwen2 release announcement:
     “After months of efforts, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:
     • Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B;
     • Having been trained on data in 27 additional languages besides English and Chinese;
     • State-of-the-art performance in a large number of benchmark evaluations;
     • Significantly improved performance in coding and mathematics;
     • Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct”
  7. huggingface.co
     Hugging Face Hub hosts thousands of models.
     Repo: Qwen/Qwen2-0.5B-Instruct-GGUF
     File: qwen2-0_5b-instruct-q5_k_m.gguf
     I want GGUF for portability.
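     If you’d rather script the download than click around, Hugging Face serves raw repo files over HTTPS. A minimal Go sketch, assuming the Hub’s /resolve/main/ URL convention for the repo and file named above:

     package main

     import (
         "io"
         "net/http"
         "os"
     )

     func main() {
         // Hugging Face serves raw repo files at /<repo>/resolve/<revision>/<file>.
         // llama-cli's --hf-repo/--hf-file flags automate this same download.
         url := "https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-q5_k_m.gguf"
         resp, err := http.Get(url)
         if err != nil {
             panic(err)
         }
         defer resp.Body.Close()

         out, err := os.Create("qwen2-0_5b-instruct-q5_k_m.gguf")
         if err != nil {
             panic(err)
         }
         defer out.Close()
         io.Copy(out, resp.Body) // ~352 MB for this quantization
     }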
  8. What’s our architecture? Next: llama.cpp. I’ll get a portable (GGUF) file from Hugging Face for Qwen2-0.5B Instruct. Why not Go? There’s no pure-Go GGUF-capable runtime; gotzmann/llama.go is the closest attempt.
  9. llama.cpp github.com/ggerganov/llama.cpp
     llama.cpp is a dependency-free LLM library and binary distribution.
     • It defines the GGUF format and can convert from Hugging Face (hf) and other formats
     • MIT licensed, community project with first implementation in March 2023
     llama-cli runs • llama-server hosts • llama-tokenize counts
  10. Hello llama-cli + Hugging Face
     $ llama-cli --log-disable --no-display-prompt \
         --hf-repo Qwen/Qwen2-0.5B-Instruct-GGUF \
         --hf-file qwen2-0_5b-instruct-q5_k_m.gguf \
         --prompt 'Which ocean contains the falkland islands?'
  11. Is it a question to answer or an example to follow?
     A bare prompt like “User: Which ocean contains the falkland islands?” is ambiguous. Qwen2 documents its prompt formatting as ChatML; doing otherwise can create ambiguity.
     • Control tokens: <|im_start|>, <|im_end|> and <|endoftext|>
     • Roles: system, user and assistant
     <|im_start|>user
     Which ocean contains the falkland islands?
     <|im_end|>
     <|im_start|>assistant
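     To make the formatting concrete, here is a small Go sketch that wraps messages in the ChatML control tokens described above. The helper is illustrative, not code from the talk:

     package main

     import (
         "fmt"
         "strings"
     )

     // chatML wraps role/content pairs in Qwen2's documented ChatML control
     // tokens, then leaves the prompt open for the assistant to complete.
     func chatML(messages [][2]string) string {
         var b strings.Builder
         for _, m := range messages {
             b.WriteString("<|im_start|>" + m[0] + "\n" + m[1] + "<|im_end|>\n")
         }
         b.WriteString("<|im_start|>assistant\n") // the model generates from here
         return b.String()
     }

     func main() {
         fmt.Print(chatML([][2]string{
             {"user", "Which ocean contains the falkland islands?"},
         }))
     }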
  12. Hello reading the Qwen2 docs!
     $ llama-cli --log-disable --no-display-prompt \
         --hf-repo Qwen/Qwen2-0.5B-Instruct-GGUF \
         --hf-file qwen2-0_5b-instruct-q5_k_m.gguf \
         --prompt '<|im_start|>user
     Which ocean contains the falkland islands?
     <|im_end|>
     <|im_start|>assistant
     '
  13. The OpenAI API is a de facto standard github.com/openai/openai-openapi
     $ llama-server --log-disable \
         --hf-repo Qwen/Qwen2-0.5B-Instruct-GGUF \
         --hf-file qwen2-0_5b-instruct-q5_k_m.gguf
     $ curl -s -X POST localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
         "prompt": "<|im_start|>user\nWhich ocean contains the falkland islands?\n<|im_end|>\n<|im_start|>assistant\n"
       }' | jq -r '.content'
     The completions endpoint accepts the same prompt as llama-cli.
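     The same request works from Go with only the standard library. A minimal sketch, assuming llama-server is listening on localhost:8080 as started above; the prompt and the content response field mirror the curl and jq usage on this slide:

     package main

     import (
         "bytes"
         "encoding/json"
         "fmt"
         "net/http"
     )

     func main() {
         // Same ChatML-formatted prompt the slide sends with curl.
         body, _ := json.Marshal(map[string]string{
             "prompt": "<|im_start|>user\nWhich ocean contains the falkland islands?\n<|im_end|>\n<|im_start|>assistant\n",
         })
         resp, err := http.Post("http://localhost:8080/v1/completions",
             "application/json", bytes.NewReader(body))
         if err != nil {
             panic(err)
         }
         defer resp.Body.Close()

         // The generated text comes back in a "content" field, which is
         // what the jq filter above extracts.
         var result struct {
             Content string `json:"content"`
         }
         json.NewDecoder(resp.Body).Decode(&result)
         fmt.Println(result.Content)
     }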
  14. Hello Ollama github.com/ollama/ollama
     Written in Go • Docker-like experience • llama.cpp backend • Simplified model flow
     $ ollama serve
     $ ollama pull qwen2:0.5b
     $ ollama run qwen2:0.5b "Which ocean contains the falkland islands?"
     The Falkland Islands are located in the South Atlantic Ocean.
  15. Ollama at a glance
     [Diagram: ollama pull qwen2:0.5b fetches from registry.ollama.ai into ~/.ollama/models; ollama serve spawns ollama_llama_server to load qwen2:0.5b for ollama run qwen2:0.5b "Which ocean contains the falkland islands?"]
     Ollama gets models from a registry and bundles a custom llama-server into its binary with //go:embed!
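     A quick aside on //go:embed, since it is how Ollama bundles its server: the directive compiles a file into the binary at build time. This is a generic sketch of the mechanism with a hypothetical file name, not Ollama’s actual source:

     package main

     import (
         _ "embed" // required for //go:embed with a plain []byte variable
         "fmt"
     )

     // The directive below bundles hello.txt (which must exist at build
     // time) into the compiled binary. Ollama uses the same mechanism to
     // ship its custom llama-server.
     //
     //go:embed hello.txt
     var payload []byte

     func main() {
         fmt.Printf("embedded %d bytes\n", len(payload))
     }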
  16. What’s our architecture? How do I use this?
     [Diagram: curl → ollama serve (which pulls qwen2:0.5b from registry.ollama.ai)]
  17. You can use Ollama’s API or OpenAI’s API
     $ curl -s -X POST localhost:11434/v1/completions -H "Content-Type: application/json" -d '{
         "model": "qwen2:0.5b",
         "prompt": "Which ocean contains the falkland islands?"
       }' | jq -r .choices[0].text
     With the OpenAI API, specify the model, but you don’t need to format the prompt: Ollama’s Server.GenerateHandler can format it for you! pkg.go.dev/text/template
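     Since the slide points at pkg.go.dev/text/template, here is an illustrative sketch of how a model’s prompt template can be rendered with Go’s text/template. The template below is in ChatML style for demonstration; it is not the exact template shipped with qwen2:

     package main

     import (
         "os"
         "text/template"
     )

     func main() {
         // Ollama stores a prompt template alongside each model (the
         // vnd.ollama.image.template layer on the next slide) and renders
         // it with text/template before calling the backend.
         tmpl := template.Must(template.New("chatml").Parse(
             "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"))
         tmpl.Execute(os.Stdout, map[string]string{
             "Prompt": "Which ocean contains the falkland islands?",
         })
     }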
  18. The Ollama registry includes models and configuration. Similar to container registries!
     $ curl -s https://registry.ollama.ai/v2/library/qwen2/manifests/0.5b | jq .
     {
       "schemaVersion": 2,
       "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
       "config": {
         "digest": "sha256:0123",
         "mediaType": "application/vnd.docker.container.image.v1+json",
         "size": 488
       },
       "layers": [
         { "digest": "sha256:4567", "mediaType": "application/vnd.ollama.image.model", "size": 352151968 },
         { "digest": "sha256:89ab", "mediaType": "application/vnd.ollama.image.template", "size": 182 },
         { "digest": "sha256:cdef", "mediaType": "application/vnd.ollama.image.license", "size": 11344 },
         { "digest": "sha256:0011", "mediaType": "application/vnd.ollama.image.params", "size": 59 }
       ]
     }
  19. Hello Parakeet github.com/parakeet-nest/parakeet
     package main

     import (
         "fmt"

         "github.com/parakeet-nest/parakeet/completion"
         "github.com/parakeet-nest/parakeet/llm"
     )

     func main() {
         url := "http://localhost:11434"
         model := "qwen2:0.5b"
         question := llm.Query{
             Model:  model,
             Prompt: "Which ocean contains the falkland islands?",
         }
         answer, _ := completion.Generate(url, question)
         fmt.Println(answer.Response)
     }
  20. Passing context github.com/parakeet-nest/parakeet
     question := llm.Query{
         Model:  model,
         Prompt: "Which ocean contains the falkland islands?",
     }
     answer, _ := completion.Generate(url, question)
     fmt.Println(answer.Response)
     fmt.Println()

     secondQuestion := llm.Query{
         Model:   model,
         Prompt:  "What’s the capital?",
         Context: answer.Context,
     }
     answer, _ = completion.Generate(url, secondQuestion)
     fmt.Println(answer.Response)

     In Parakeet, replay messages by passing the previous answer’s context to the next question.
  21. Hallucination
     LLMs can hallucinate, giving irrelevant, nonsensical or factually incorrect answers.
     • LLMs rely on statistical correlation and may invent things to fill gaps
     • Mitigate by selecting relevant models, prompt engineering, etc.
     User: What is the capital of the falkland islands?
     Assistant: The capital of the Falkland Islands is Punta Arenas.
     $ ollama ls
     qwen2:7b    e0d4e1163c58    4.4 GB    3 seconds ago
     qwen2:0.5b  6f48b936a09f    352 MB    7 seconds ago
     More parameters can be more expensive, but might give more relevant answers *
  22. Limits to available knowledge
     There is no correlation between the recency or version of a model and what it knows about.
     • LLMs have a training date which is often not documented
     • LLMs are trained differently and may have no knowledge of a specific topic
     • LLMs might hallucinate when asked about their training date or knowledge cut-off!
     User: What is the latest version of go?
     Assistant: As of my knowledge cutoff date in September 2023… version 1.21
     The easiest way to add new information is to pass it to the context of an existing model.
  23. Passing a document to the chat context
     If you already have a document, set up the system messages before executing the user’s question.
     message := `You are a Golang expert. Using only the below provided context, answer the user's question to the best of your ability using only the resources provided.`

     context := `<context>
     <doc>
     The TLS client now supports the Encrypted Client Hello draft specification. This feature can be enabled by setting the Config.EncryptedClientHelloConfigList field to an encoded ECHConfigList for the host that is being connected to.
     </doc>
     </context>`

     question := `Summarize what's new with TLS client in 3 bullet points. Be succinct`

     query := llm.Query{
         Model: model,
         Messages: []llm.Message{
             {Role: "system", Content: message},
             {Role: "system", Content: context},
             {Role: "user", Content: question},
         },
         Stream: false,
     }
  24. Retrieval-augmented generation (RAG)
     RAG is a technique to retrieve relevant information based on the user’s question.
     • Embedding models convert text into numeric vectors
     • A VectorDB stores vectors and associated text, and exposes similarity queries
     First, data is chunked and stored in the VectorDB:
     • Each chunk of data is vectorized by the embedding model
     • The resulting vectors and raw text are stored in the VectorDB
     Each query is vectorized and used in a similarity query:
     • The user query is vectorized by the same embedding model
     • These vectors are used as input to a VectorDB similarity query
     • The results of that query are chunks of text to put into the context (sketched below)
     www.mixedbread.ai/docs/embeddings/mxbai-embed-large-v1
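     To ground the flow above, here is a minimal, self-contained Go sketch: it vectorizes chunks and the question with Ollama’s embeddings endpoint and uses brute-force cosine similarity in place of a real VectorDB (Elasticsearch plays that role in the demo). The endpoint and field names follow Ollama’s API; the chunks are illustrative, and it assumes you have pulled mxbai-embed-large:

     package main

     import (
         "bytes"
         "encoding/json"
         "fmt"
         "math"
         "net/http"
     )

     // embed vectorizes text with Ollama's embeddings endpoint, using the
     // embedding model referenced on the slide.
     func embed(text string) []float64 {
         body, _ := json.Marshal(map[string]string{
             "model":  "mxbai-embed-large",
             "prompt": text,
         })
         resp, err := http.Post("http://localhost:11434/api/embeddings",
             "application/json", bytes.NewReader(body))
         if err != nil {
             panic(err)
         }
         defer resp.Body.Close()
         var r struct {
             Embedding []float64 `json:"embedding"`
         }
         json.NewDecoder(resp.Body).Decode(&r)
         return r.Embedding
     }

     // cosine computes vector similarity; a VectorDB such as Elasticsearch
     // does this (plus indexing) for you at scale.
     func cosine(a, b []float64) float64 {
         var dot, na, nb float64
         for i := range a {
             dot += a[i] * b[i]
             na += a[i] * a[i]
             nb += b[i] * b[i]
         }
         return dot / (math.Sqrt(na) * math.Sqrt(nb))
     }

     func main() {
         // 1. Chunk and store: vectorize each chunk, keep vector + raw text.
         chunks := []string{
             "The Falkland Islands are in the South Atlantic Ocean.",
             "Stanley is the capital of the Falkland Islands.",
         }
         vectors := make([][]float64, len(chunks))
         for i, c := range chunks {
             vectors[i] = embed(c)
         }

         // 2. Query: vectorize the question with the same model, then find
         // the most similar chunk to place into the LLM context.
         q := embed("What's the capital?")
         best := 0
         for i := range vectors {
             if cosine(vectors[i], q) > cosine(vectors[best], q) {
                 best = i
             }
         }
         fmt.Println("context chunk:", chunks[best])
     }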
  25. Thanks to Philippe (k33g)! Not just thanks for Parakeet, or even for this Elastic demo, but for weeks of hard thinking out loud and in open source! github.com/k33g x.com/k33g_org
  26. Takeaways and Thanks!
     Use Go as much as you like in GenAI! We showed how.
     • Ollama is written in Go and eases access to models and the powerful llama.cpp backend. We didn’t discuss the Modelfile, which can prep images with your system context!
     • Parakeet is a Go library with handy features for RAG, markdown, HTML, etc. We didn’t discuss tool calling, which can do things for you like take screenshots!
     • Elasticsearch VectorDB features help with your custom embeddings. We didn’t discuss the semantic_text field type, which can automatically create embeddings.
     Adrian at Elastic: github.com/codefromthecrypt x.com/adrianfcole
     Thank Philippe from Parakeet! github.com/k33g x.com/k33g_org