Slide 1

Slide 1 text

Building Large Language Model Powered Apps: Best Practices Kacper Łukawski, Developer Advocate, Qdrant

Slide 2

Slide 2 text

Vector Search in production Qdrant is a vector search database using HNSW, one of the most promising algorithms for Approximate Nearest Neighbours. ● Written in Rust. ● HTTP / gRPC APIs + official SDKs. ● Local in-memory, Docker & Cloud. ● Metadata filtering built into the vector search phase.
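A minimal sketch of the points above in Python, assuming the qdrant-client package; the collection name, vector size, and payload fields are invented for illustration:

```python
from qdrant_client import QdrantClient, models

# Local in-memory mode - the same client also works against Docker or Cloud.
client = QdrantClient(":memory:")

client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=4, distance=models.Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[
        models.PointStruct(
            id=1,
            vector=[0.1, 0.2, 0.3, 0.4],
            payload={"source": "faq", "lang": "en"},
        )
    ],
)

# The metadata filter is applied inside the vector search phase,
# not as a separate post-filtering step.
hits = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    query_filter=models.Filter(
        must=[models.FieldCondition(key="lang", match=models.MatchValue(value="en"))]
    ),
    limit=3,
)
print(hits)
```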

Slide 3

Slide 3 text

Why is my model so much worse than ChatGPT?

Slide 4

Slide 4 text

Source: https://www.youtube.com/watch?v=bZQun8Y4L2A

Slide 5

Slide 5 text

Source: https://towardsdatascience.com/fine-tuning-large-language-models-llms-23473d763b91

Slide 6

Slide 6 text

Source: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Slide 7

Slide 7 text

Open source vs “Open” models

Slide 8

Slide 8 text

Freeware/Freemium is not open source 1. “Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.“ 2. “You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).” Source: https://opensourceconnections.com/blog/2023/07/19/is-llama-2...definition-of-open/

Slide 9

Slide 9 text

Fine-tuning

Slide 10

Slide 10 text

Choosing the strategy 1. Self-supervised - predict the next token in a document. 2. Supervised - based on ideal responses to given prompts (see the sketch below). 3. RLHF - produce the best answer, based on feedback from a different model.
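A tiny sketch of what strategy 2 (supervised fine-tuning) consumes; the record and the template below are invented, not a specific library's format:

```python
# A made-up instruction-tuning record: the prompt plus the ideal response
# we want the model to imitate (strategy 2, supervised).
record = {
    "prompt": "Summarize the refund policy in one sentence.",
    "response": "Purchases can be refunded within 30 days if the item is unused.",
}

# One common way to turn it into a training example: render prompt and response
# with a fixed template, then train with next-token prediction over the result.
def to_training_text(rec: dict) -> str:
    return f"### Instruction:\n{rec['prompt']}\n\n### Response:\n{rec['response']}"

print(to_training_text(record))
```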

Slide 11

Slide 11 text

Tokenizers limit the fine-tuning abilities!
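One way to see this limit in practice; the sketch below assumes the transformers package and uses the GPT-2 tokenizer purely as an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The same sentence in English and Polish: the tokenizer, trained mostly on
# English, fragments the Polish one into many more (and less meaningful) pieces.
for text in ["Vector search is fast.", "Wyszukiwanie wektorowe jest szybkie."]:
    tokens = tokenizer.tokenize(text)
    print(len(tokens), tokens)

# Fine-tuning keeps the original vocabulary, so anything the tokenizer splits
# badly stays expensive to represent, no matter how much extra training you do.
```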

Slide 12

Slide 12 text

Source: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

Slide 13

Slide 13 text

To fine-tune or not to fine-tune

Slide 14

Slide 14 text

Retrieval Augmented Generation Extending prompts with context information to convert a knowledge-oriented task into a language task.
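A bare-bones sketch of that idea; the instruction wording and the helper name are arbitrary, and any retriever (for example a Qdrant query) can supply the context documents:

```python
def build_rag_prompt(question: str, context_docs: list[str]) -> str:
    # Paste the retrieved documents into the prompt so the LLM can answer
    # a knowledge question as a plain language task over the given context.
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["Qdrant applies metadata filters during the vector search phase."]
print(build_rag_prompt("Does Qdrant support filtering?", docs))
```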

Slide 15

Slide 15 text

RAG vs longer context window

Slide 16

Slide 16 text

Common issues with Retrieval Augmented Generation 1. Wrong chunking strategy (see the sketch below). 2. Poor embedding models. 3. Too few (or too many) documents put into the context. 4. Bad prompts.
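To make point 1 concrete, here is a deliberately naive fixed-size chunker; the sizes are arbitrary, and real pipelines usually split on sentence or section boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Fixed-size windows with overlap: simple, but happily cuts sentences in
    # half, which hurts both the embeddings and the context handed to the LLM.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks
```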

Slide 17

Slide 17 text

Fine-tuning checklist ❏ Our language model cannot produce responses in the way we expect it to. ❏ Prompt engineering doesn’t help. ❏ RAG retrieves relevant results, but the LLM does not use them properly.

Slide 18

Slide 18 text

SkyPilot A framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution. License: Apache 2.0 Author: UC Berkeley’s RISELab

Slide 19

Slide 19 text

Do I need LangChain / LlamaIndex / Haystack / NameYourTool?
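For simple cases the answer may be no: a chain is often just a prompt template plus one API call. A hedged sketch, assuming the openai package (v1 client) and an example model name:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def answer_with_context(question: str, context: str) -> str:
    # Roughly what a "stuff documents" QA chain does under the hood:
    # one hand-written prompt and one chat completion call.
    prompt = (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model, swap for whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```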

Slide 20

Slide 20 text

The default prompt for one of the available chains Source: https://github.com/langchain-ai/...langchain/chains/qa_with_sources/stuff_prompt.py

Slide 21

Slide 21 text

Moving to production

Slide 22

Slide 22 text

Using LLMs, the proper way: ● Dataset versioning ● Model versioning ● Model evaluation ● Prompt versioning

Slide 23

Slide 23 text

Source: https://shreyar.github.io/guardrails/

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

FastChat An open platform for training, serving, and evaluating large language model based chatbots. License: Apache 2.0 Author: The Large Model System Organization (https://lmsys.org)

Slide 26

Slide 26 text

Questions? Kacper Łukawski Developer Advocate, Qdrant https://www.linkedin.com/in/kacperlukawski/ https://twitter.com/LukawskiKacper https://github.com/kacperlukawski