Slide 1

Best practices for building LLM-based applications
Kacper Łukawski, Developer Advocate, Qdrant

Slide 2

Agenda
1. Overview of the most popular LLM frameworks.
   a. LangChain
   b. LlamaIndex
   c. Haystack
2. Experiment toolkit.
3. Challenges in productionizing LLMs.
4. Real-world application: the “Ask Qdrant” bot.

Slide 3

LLM Frameworks

Slide 4

🦜🔗 LangChain ⭐ 57.9k
Building applications with LLMs through composability.
Currently, the most popular framework for building LLM-based applications.

Slide 5

Document - a basic entity that has content and, optionally, some key-value metadata.
Docstore / Document Loader - used to store and load documents (e.g. UnstructuredHTMLLoader).
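
A rough sketch of both pieces (classic LangChain import paths, which move between versions; `page.html` is a made-up local file):

```python
# A Document is content plus optional key-value metadata; a loader
# produces Documents from an external source.
from langchain.schema import Document
from langchain.document_loaders import UnstructuredHTMLLoader

doc = Document(
    page_content="Qdrant is a vector database.",
    metadata={"source": "docs"},
)
docs = UnstructuredHTMLLoader("page.html").load()  # made-up local file
```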

Slide 6

LLM vs embedding models
Embeddings - an abstraction over a model that can transform given texts into vector representations. Might be remote (e.g. OpenAIEmbeddings, CohereEmbeddings) or local (e.g. HuggingFaceEmbeddings).
LLM - a system capable of producing text, given a prompt. Large Language Models might be remote, accessed through some sort of API (e.g. OpenAI, Cohere), or local (e.g. GPT4All).
Prompt / PromptTemplate - an input to the LLM, with some placeholders, so we can provide additional context.
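
A minimal sketch of the three abstractions, assuming classic LangChain imports and an OpenAI API key in the environment:

```python
# Embeddings turn text into vectors; LLMs turn prompts into text.
from langchain.embeddings import HuggingFaceEmbeddings  # local model
from langchain.llms import OpenAI                       # remote LLM
from langchain.prompts import PromptTemplate

embeddings = HuggingFaceEmbeddings()
vector = embeddings.embed_query("What is Qdrant?")  # list of floats

prompt = PromptTemplate.from_template(
    "Answer based on the context.\nContext: {context}\nQuestion: {question}"
)
llm = OpenAI()
answer = llm(prompt.format(context="...", question="What is Qdrant?"))
```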

Slide 7

Tool - an interface to interact with an external system. Each tool has a description used by an agent to choose which one to use for a particular job.

AmadeusFlightSearch - “Use this tool to search for a single flight between the origin and destination airports at a departure between an earliest and latest datetime.”

YouTubeSearchTool - “Search for youtube videos associated with a person. The input to this tool should be a comma separated list, the first part contains a person name and the second a number that is the maximum number of video results to return aka num_results. The second part is optional.”

PythonREPLTool - “A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.”
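
A hedged sketch of a custom tool; the `search_qdrant_docs` helper and its description are invented for illustration:

```python
# A custom tool: the description is what the agent reasons about.
from langchain.agents import Tool

def search_qdrant_docs(query: str) -> str:
    """Hypothetical helper that searches the documentation."""
    return "...search results..."

docs_tool = Tool(
    name="QdrantDocsSearch",
    func=search_qdrant_docs,
    description="Use this tool to search the Qdrant documentation. "
                "Input should be a plain-text question.",
)
```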

Slide 8

Retrievers & Vector Stores
Retriever - returns documents based on text queries without storing them.
Vector Store - keeps documents along with their embeddings and performs semantic search over them. Usually acts as Long Term Memory. Each vector store can also be used as a retriever.
Qdrant is the only vector store that supports a fully asynchronous API.
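
A sketch of Qdrant as a LangChain vector store exposed as a retriever, assuming the classic `langchain.vectorstores.Qdrant` integration; the texts are made up:

```python
# Qdrant as a vector store, then exposed as a retriever.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

vector_store = Qdrant.from_texts(
    ["Qdrant is a vector database.", "HNSW is an ANN index."],
    embedding=HuggingFaceEmbeddings(),
    location=":memory:",  # local mode, handy for prototyping
    collection_name="demo",
)
retriever = vector_store.as_retriever()
docs = retriever.get_relevant_documents("What is Qdrant?")
```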

Slide 9

Chain
A sequence of actions to be performed. Usually involves calling an LLM at least once. A chain may call other components, like vector stores or retrievers. Each chain may use Memory, which keeps the state of the chain between runs. Useful in chat-like applications.
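
For example, a chain with memory might look like this (classic LangChain API; an OpenAI key is assumed):

```python
# A chain that calls the LLM and keeps conversational state in Memory.
from langchain.chains import ConversationChain
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

chain = ConversationChain(
    llm=OpenAI(),
    memory=ConversationBufferMemory(),  # state kept between runs
)
chain.run("Hi, I'm building a search system.")
chain.run("What did I say I was building?")  # answered thanks to memory
```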

Slide 10

Agent - uses LLMs to choose the next action to perform. It might be based, for example, on tool descriptions.
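
A sketch of wiring an agent, assuming the classic `initialize_agent` helper and the `docs_tool` defined in the earlier sketch:

```python
# The agent picks a tool based on its description at each step.
from langchain.agents import AgentType, initialize_agent
from langchain.llms import OpenAI

agent = initialize_agent(
    tools=[docs_tool],  # the custom tool from the earlier sketch
    llm=OpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
agent.run("How do I create a collection in Qdrant?")
```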

Slide 11

The default prompt for one of the available chains.
Source: https://github.com/langchain-ai/...langchain/chains/qa_with_sources/stuff_prompt.py

Slide 12

LlamaIndex 🦙 ⭐ 19.9k
LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data.
The second most popular LLM framework; it focuses more on custom input data.

Slide 13

Documents & Nodes
Document is an abstraction over a data source, such as a file or a database entry. Node is a chunk of a document, such as a text chunk, but also an image or anything else. Both documents and nodes have metadata and relationships to other documents and nodes. Documents might be loaded by using Data Connectors (LlamaHub).
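
A sketch with 0.x-era LlamaIndex import paths (they have moved in later releases); `./data` is a hypothetical folder:

```python
# Loading documents and splitting them into nodes.
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader("./data").load_data()
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(documents)
```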

Slide 14

References to LangChain
- LLMs
- Embeddings - it’s even possible to reuse the embedding models implemented in LangChain through LangchainEmbedding
- Vector Stores & Retrievers
- Tools
- Agents
LlamaIndex is integrated with LangChain, but also shares some concepts.
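
For instance, reusing a LangChain embedding model inside LlamaIndex might look like this (0.x-era imports):

```python
# Reusing a LangChain embedding model inside LlamaIndex through the
# LangchainEmbedding wrapper.
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
service_context = ServiceContext.from_defaults(embed_model=embed_model)
```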

Slide 15

Query engine / Chat engine - speak to your data through chat or question/answer systems.
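
A sketch building on the nodes from the earlier snippet (0.x-era API; uses the default OpenAI models unless a service context is passed):

```python
# Building an index over the nodes and talking to it through a query
# engine or a chat engine.
from llama_index import VectorStoreIndex

index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine()
print(query_engine.query("What is Qdrant?"))

chat_engine = index.as_chat_engine()
print(chat_engine.chat("How does it compare to a SQL database?"))
```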

Slide 16

Haystack ⭐ 10.3k
LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (Models, Vector DBs, File Converters) to Pipelines or Agents that can interact with your data. With advanced retrieval methods, it’s best suited for building RAG, question answering, semantic search, and conversational agents.
The oldest of the three tools mentioned; it incorporated LLMs only recently. Previously, it focused on semantic search.

Slide 17

Documents
A single document defines a piece of information, which might be text, a table, or an image, along with metadata and some other details.

Slide 18

Document Stores
A storage layer that retains documents and lets you query them effectively. It might be an SQL database, but also a vector search engine. QdrantDocumentStore is one of the options, and it supports documents with and without embeddings out of the box.
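
A sketch assuming Haystack 1.x and the separate `qdrant-haystack` package; the import paths and arguments may differ between releases:

```python
# Haystack 1.x with the qdrant-haystack integration.
from haystack.schema import Document
from qdrant_haystack import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",            # local mode; a server URL works as well
    index="documents",
    embedding_dim=384,
    recreate_index=True,
)
document_store.write_documents([
    Document(content="Qdrant supports documents with and without embeddings."),
])
```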

Slide 19

Nodes & pipelines
Haystack is built on top of a directed acyclic graph (DAG). We build a pipeline that consists of multiple nodes processing given input in a defined order. The processing doesn’t have to be linear, as there are decision components that allow choosing a different path (see the sketch after the list below).
Haystack offers not only support for LLMs, but also some additional features:
- dense / sparse search
- classification
- text to speech
- translations
- and many more…
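
A sketch of a two-node pipeline (Haystack 1.x API; `document_store` comes from the previous snippet, and the model choices are illustrative):

```python
# The retriever feeds documents to a PromptNode that wraps an LLM.
# Nodes form a DAG via the `inputs` references.
from haystack import Pipeline
from haystack.nodes import EmbeddingRetriever, PromptNode

retriever = EmbeddingRetriever(
    document_store=document_store,  # from the previous snippet
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)
prompt_node = PromptNode()  # model and prompt configuration omitted

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
result = pipeline.run(query="What is Qdrant?")
```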

Slide 20

Agents
Haystack introduced agents and tools, similar to LangChain and LlamaIndex.

Slide 21

Source: https://haystack.deepset.ai/blog/introducing-haystack-agents

Slide 22

Haystack is not as focused on data connectors as LangChain or LlamaIndex.

Slide 23

Prototyping

Slide 24

Wishlist for prototypes
- Rich integration suites for many data sources
- A variety of embedding models and LLMs to try out different configurations
- Different storage systems implemented
- In-memory or local mode for data storage, to keep things simple
- An easy way to glue different blocks together and try it out

Slide 25

The majority of libraries focus on making it easy to build demos. Running systems in production requires different means.

Slide 26

Production challenges

Slide 27

Output quality

Slide 28

Source: https://eugeneyan.com/writing/llm-patterns/#evals-to-measure-performance

Slide 29

Source: https://shreyar.github.io/guardrails/

Slide 30

Hallucination
Hallucination is a confident response by an AI that does not seem to be justified by its input source.

Slide 31

Retrieval Augmented Generation
Extending prompts with context information to convert a knowledge-oriented task into a language task.
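
Stripped of any framework, the idea fits in a few lines (`retriever` and `llm` as in the earlier LangChain sketches):

```python
# Retrieve context, stuff it into the prompt, let the LLM answer.
question = "How does Qdrant speed up semantic search?"
context = "\n".join(
    doc.page_content for doc in retriever.get_relevant_documents(question)
)
answer = llm(
    "Answer using only the context below.\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
```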

Slide 32

Semantic Search in production
● Integrated with all major LLM frameworks.
● Includes efficient metadata filtering to enrich semantic search with additional criteria.
● Runs in local in-memory/persisted mode, on-premise, and in Qdrant Cloud.
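
A sketch of filtered semantic search with the raw `qdrant-client`; the collection, vector size, and payload field are made up:

```python
# Semantic search restricted by a metadata (payload) filter.
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # or a server URL / Qdrant Cloud endpoint
hits = client.search(
    collection_name="documents",
    query_vector=[0.1] * 384,  # the embedding of the query text
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="lang", match=models.MatchValue(value="en")
            )
        ]
    ),
    limit=5,
)
```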

Slide 33

Speeding up semantic search
● Tweaking the HNSW indexing and search parameters
● Scalar Quantization
● Product Quantization
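
A sketch of what that tuning looks like when creating a collection with `qdrant-client`; the parameter values are illustrative, not recommendations:

```python
# Tuning HNSW and enabling int8 scalar quantization at collection
# creation time.
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="documents",
    vectors_config=models.VectorParams(
        size=384, distance=models.Distance.COSINE
    ),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=100),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep compressed vectors in RAM for speed
        )
    ),
)
```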

Slide 34

IO-bound vs CPU-bound applications

Slide 35

Asynchronous programming
If we use external services, such as OpenAI or Cohere embeddings or LLMs, and an actual vector database, such as Qdrant, we can improve time efficiency with asynchronous calls. LangChain and LlamaIndex partially support asynchronous methods. Qdrant is the only vector store with full async support in LangChain.
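
A sketch of overlapping several searches with `asyncio`, assuming the vector store exposes async variants such as `asimilarity_search` (availability depends on the version):

```python
# Overlapping several remote calls instead of waiting for each one.
import asyncio

async def search_many(vector_store, questions):
    # All searches run concurrently; each awaits network I/O.
    results = await asyncio.gather(
        *(vector_store.asimilarity_search(q) for q in questions)
    )
    return dict(zip(questions, results))

answers = asyncio.run(
    search_many(vector_store, ["What is Qdrant?", "What is HNSW?"])
)
```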

Slide 36

Caching
Calling external LLMs and embedding models costs money, as those APIs are usually billed per token. Caching may reduce latency and cost at the same time.
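
For example, LangChain ships a global LLM cache (classic API; an OpenAI key is assumed):

```python
# Enabling LangChain's global in-memory LLM cache.
import langchain
from langchain.cache import InMemoryCache
from langchain.llms import OpenAI

langchain.llm_cache = InMemoryCache()

llm = OpenAI()
llm("What is a vector database?")  # billed API call
llm("What is a vector database?")  # identical prompt, served from cache
```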

Slide 37

Additional challenges

Slide 38

Deployment
Any production system aiming to work for multiple users simultaneously must be deployed to scale, not as a console application. An LLM will usually serve a specific purpose in the whole application, but won’t be its only component. Such systems are often exposed as REST APIs. Haystack provides a way to expose a pipeline through a FastAPI-based REST API with no extra effort.
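
Not Haystack’s built-in REST API, just a minimal hand-rolled illustration of exposing a pipeline as a web service (`chain` as in the earlier sketches; run with e.g. `uvicorn app:app`):

```python
# A minimal FastAPI wrapper around an LLM pipeline.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query) -> dict:
    # Delegate the actual work to the LLM pipeline.
    return {"answer": chain.run(query.question)}
```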

Slide 39

Monitoring
Relevant not only for LLM-based applications, but for all sorts of systems in general. Not much different from any other application, but it is useful to have a framework that logs activity and lets users analyze it down the road.

Slide 40

Human feedback
Users judging the usefulness of the system outputs give us ground truth that might be used later on to fine-tune the system.

Slide 41

Example: “Ask Qdrant”

Slide 42

The “Ask Qdrant” bot
The Qdrant Discord community is the best way to get quick support when building applications with Qdrant.

Slide 43

No content

Slide 44

The output of the profiler running on both the LangChain-based and the custom version of the agent - the custom one is almost 3x faster.

Slide 45

Building an abstraction over LLM agents

Slide 46

Questions?
Kacper Łukawski
Developer Advocate, Qdrant
https://www.linkedin.com/in/kacperlukawski/
https://twitter.com/LukawskiKacper
https://github.com/kacperlukawski