Building Large Language Model Powered Apps: Best Practices

Kacper Łukawski, Developer Advocate, Qdrant

Vector Search in production
Qdrant is a vector search database using HNSW, one of the most promising algorithms for Approximate Nearest Neighbours.
• Written in Rust.
• HTTP / gRPC APIs + official SDKs.
• Local in-memory, Docker & Cloud.
• Metadata filtering built into the vector search phase.
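
A minimal sketch of what filtering inside the search phase means: the metadata condition is checked while candidates are scored, not applied to a finished result list. This is a brute-force stand-in in plain Python, not Qdrant's API and not HNSW; the `POINTS` data, `cosine`, and `filtered_search` are hypothetical names made up for illustration.

```python
import math

# Hypothetical toy collection: each point has a vector and a metadata payload.
POINTS = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"lang": "en"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"lang": "pl"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"lang": "en"}},
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query, must, limit=2):
    """Score only the points whose payload matches every key in `must`."""
    hits = [
        (cosine(query, point["vector"]), point["id"])
        for point in POINTS
        if all(point["payload"].get(k) == v for k, v in must.items())
    ]
    # Highest similarity first, then cut to the requested number of results.
    return [point_id for _, point_id in sorted(hits, reverse=True)[:limit]]


print(filtered_search([1.0, 0.0], {"lang": "en"}))
```

Filtering after an unfiltered top-k search could return fewer than `limit` results, because the non-matching point 2 would take one of the slots.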

Why is my model so much worse than ChatGPT?

Source: https://www.youtube.com/watch?v=bZQun8Y4L2A

Source: https://towardsdatascience.com/fine-tuning-large-language-models-llms-23473d763b91

Source: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Open source vs “Open” models

Freeware/Freemium is not open source
1. “Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.”
2. “You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).”
Source: https://opensourceconnections.com/blog/2023/07/19/is-llama-2...definition-of-open/

Fine-tuning

1. Self-supervised - predict the next token in a document.
2. Supervised - based on ideal responses to given prompts.
3. RLHF - produce the best answer, based on feedback from a different model.
Choosing the strategy
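
The three strategies differ mainly in what a training example looks like. The sketch below shows only those data shapes, not a training loop; all function names are hypothetical, and `toy_reward` is a made-up word-overlap score standing in for a real feedback model.

```python
def next_token_pairs(tokens):
    """Self-supervised: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def supervised_example(prompt, ideal_response):
    """Supervised: a prompt paired with the ideal response to it."""
    return {"prompt": prompt, "completion": ideal_response}


def pick_preferred(prompt, candidates, reward_fn):
    """RLHF-style feedback: a separate model scores the candidate answers."""
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


def toy_reward(prompt, answer):
    # Hypothetical stand-in for a reward model: counts shared words.
    return len(set(prompt.lower().split()) & set(answer.lower().split()))


print(next_token_pairs(["vector", "search", "works"]))
print(supervised_example("What is Qdrant?", "A vector search database."))
print(pick_preferred(
    "what is qdrant",
    ["no idea", "qdrant is a vector database"],
    toy_reward,
))
```

In actual RLHF the feedback score drives a policy update; here it only selects the preferred answer, which is the part the slide describes.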

Tokenizers limit the fine-tuning abilities!
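
One possible reading of this claim, sketched under assumptions: the vocabulary is fixed before fine-tuning, so a domain term that is missing from it gets split into many small pieces. The greedy longest-match tokenizer and the tiny `VOCAB` below are hypothetical and far simpler than a real subword tokenizer.

```python
def greedy_tokenize(word, vocab):
    """Longest-match-first split of `word` into pieces from a fixed vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matched, not even a single character.
            pieces.append("<unk>")
            i += 1
    return pieces


# Hypothetical vocabulary: two whole words plus a few single characters.
VOCAB = {"search", "vector", "q", "d", "r", "a", "n", "t"}

print(greedy_tokenize("search", VOCAB))
print(greedy_tokenize("qdrant", VOCAB))
```

The known word stays a single token, while the unseen one becomes six single-character tokens that the model has to learn to combine.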

Source: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

To fine-tune or not to fine-tune

Retrieval Augmented Generation
Extending prompts with context information to convert a knowledge-oriented task into a language task.
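
A minimal sketch of that idea: retrieve the most relevant documents and place them in the prompt, so the model only has to read and rephrase. The word-overlap ranking is a stand-in for a real vector search, and `build_rag_prompt` and the prompt wording are assumptions made for illustration.

```python
def build_rag_prompt(question, documents, top_k=2):
    """Prepend the most relevant documents to the question."""
    q_words = set(question.lower().split())
    # Rank by shared words; a production setup would use embeddings instead.
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    context = "\n".join(f"- {doc}" for doc in ranked[:top_k])
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


DOCS = [
    "Qdrant is written in Rust.",
    "Bananas are yellow.",
    "HNSW is a graph index.",
]
print(build_rag_prompt("What language is Qdrant written in?", DOCS, top_k=1))
```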

RAG vs longer context window

Common issues with Retrieval Augmented Generation
1. Wrong chunking strategy.
2. Poor embedding models.
3. Too few (or too many) documents put into the context.
4. Bad prompts.
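
For the first issue, one common baseline is fixed-size chunks with some overlap, so a sentence cut at a boundary still appears whole in a neighbouring chunk. This is a sketch under assumptions: `chunk_words` is a hypothetical helper, and the word-based sizes are arbitrary defaults, not a recommendation from the talk.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a chunk has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
```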

Fine-tuning checklist
❏ Our language model cannot produce responses in the way we expect it to.
❏ Prompt engineering doesn’t help.
❏ RAG introduces relevant results, but the LLM handles them improperly.

SkyPilot
A framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
License: Apache 2.0
Author: UC Berkeley’s RISELab

Do I need Langchain / LlamaIndex / Haystack / NameYourTool?

The default prompt for one of the available chains
Source: https://github.com/langchain-ai/...langchain/chains/qa_with_sources/stuff_prompt.py

Moving to production

Using LLMs, the proper way
- Dataset versioning
- Model versioning
- Model evaluation
- Prompt versioning
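
Prompt versioning can be as simple as identifying each template by a hash of its content, so that evaluation results can be tied to the exact prompt text. The `PromptRegistry` class below is a hypothetical in-memory sketch; a real setup would more likely keep prompts in version control or a database.

```python
import hashlib


class PromptRegistry:
    """Stores prompt templates under a content-derived version id."""

    def __init__(self):
        self._versions = {}

    def register(self, name, template):
        # The same text always yields the same id; any edit yields a new one.
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        self._versions[(name, digest)] = template
        return digest

    def render(self, name, version, **variables):
        """Fill in the template stored under the given name and version."""
        return self._versions[(name, version)].format(**variables)


registry = PromptRegistry()
v1 = registry.register("qa", "Q: {question}")
v2 = registry.register("qa", "Question: {question}")
print(v1 != v2, registry.render("qa", v1, question="What is HNSW?"))
```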

Source: https://shreyar.github.io/guardrails/

FastChat
An open platform for training, serving, and evaluating large language model based chatbots.
License: Apache 2.0
Author: The Large Model System Organization (https://lmsys.org)

Questions?
Kacper Łukawski
Developer Advocate, Qdrant
https://www.linkedin.com/in/kacperlukawski/
https://twitter.com/LukawskiKacper
https://github.com/kacperlukawski
