Best practices for building LLM-based applications

Kacper Łukawski, Developer Advocate, Qdrant

Agenda
1. Overview of the most popular LLM frameworks.
   a. LangChain
   b. LlamaIndex
   c. Haystack
2. Experiment toolkit.
3. Challenges in productionizing LLMs.
4. Real-world application: the “Ask Qdrant” bot.

LLM Frameworks

🦜🔗 LangChain ⭐ 57.9k
Building applications with LLMs through composability.
Currently the most popular framework for building LLM-based applications.

Document - a basic entity that has content and, optionally, some key-value metadata.
Docstore / Document Loader - used to store and load the documents (e.g. UnstructuredHTMLLoader).

LLM vs embedding models
Embeddings - an abstraction over a model that transforms given texts into vector representations. It might be a remote model (e.g. OpenAIEmbeddings, CohereEmbeddings) or a local one (e.g. HuggingFaceEmbeddings).
LLM - a system capable of producing text, given a prompt. Large Language Models might be remote, accessed through an API (e.g. OpenAI, Cohere), or local (e.g. GPT4All).
Prompt / PromptTemplate - an input to the LLM, with some placeholders, so we can provide additional context.

Tool - an interface to interact with an external system. Each tool has a description used by an agent to choose which one to use for a particular job.
AmadeusFlightSearch: Use this tool to search for a single flight between the origin and destination airports at a departure between an earliest and latest datetime.
YouTubeSearchTool: Search for youtube videos associated with a person. The input to this tool should be a comma separated list, the first part contains a person name and the second a number that is the maximum number of video results to return aka num_results. The second part is optional.
PythonREPLTool: A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.
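To make the role of tool descriptions concrete, here is a minimal sketch in plain Python. It is not LangChain code: the `Tool` dataclass, the `choose_tool` helper and the tool names are hypothetical, and the word-overlap scoring only stands in for the decision that a real agent delegates to an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """Hypothetical stand-in for a framework tool: a name, a description, a callable."""
    name: str
    description: str
    run: Callable[[str], str]


def choose_tool(task: str, tools: list[Tool]) -> Tool:
    """Toy selection: pick the tool whose description shares the most words with the task.

    A real agent would put the descriptions into a prompt and let the LLM decide.
    """
    task_words = set(task.lower().split())
    return max(tools, key=lambda t: len(task_words & set(t.description.lower().split())))


tools = [
    Tool("flight_search", "search for a single flight between airports",
         lambda q: f"flights for {q}"),
    Tool("video_search", "search for youtube videos associated with a person",
         lambda q: f"videos for {q}"),
]

chosen = choose_tool("find youtube videos of a person", tools)
print(chosen.name)  # video_search
```

The point of the sketch is that the description, not the implementation, is what the selection step sees, so vague descriptions lead to wrong tool choices.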

Retrievers & Vector Stores
Retriever - returns documents based on text queries, without storing them.
Vector Store - keeps documents along with their embeddings and performs semantic search over them. Usually acts as Long Term Memory. Each vector store might also be used as a retriever.
Qdrant is the only vector store in LangChain that supports a full asynchronous API.
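As an illustration of what a vector store does, the following sketch keeps documents next to their embeddings and ranks them by cosine similarity. Everything here is an assumption made for the example: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `InMemoryVectorStore` is a hypothetical class, not the API of Qdrant or of any framework.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: bag-of-words counts. A real model returns dense float vectors."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Stores (text, embedding) pairs and returns the most similar texts for a query."""

    def __init__(self):
        self._items = []

    def add(self, text: str) -> None:
        self._items.append((text, toy_embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = toy_embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = InMemoryVectorStore()
store.add("qdrant is a vector search engine")
store.add("haystack builds pipelines from nodes")
results = store.search("which engine does vector search")
print(results)  # ['qdrant is a vector search engine']
```

Used this way, `store.search` already behaves like a retriever: it takes a text query and returns documents.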

Chain
A sequence of actions to be performed. Usually involves calling an LLM at least once. A chain may call some other components, like vector stores or retrievers.
Each chain may use Memory that keeps the state of the chain between the runs. Useful in chat-like applications.
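The interplay between a chain and its memory can be sketched without any framework. In this hypothetical example, `fake_llm` is a stub that only reports the prompt size, and `ConversationChain` is an illustrative class name, not LangChain's implementation.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM stub that reports the prompt size instead of calling a model."""
    return f"(answer based on {len(prompt.splitlines())} prompt lines)"


class ConversationChain:
    """Builds a prompt from the stored history plus the new input, then calls the LLM."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.memory: list[str] = []  # state kept between runs

    def run(self, user_input: str) -> str:
        prompt = "\n".join(self.memory + [f"Human: {user_input}", "AI:"])
        answer = self.llm(prompt)
        self.memory += [f"Human: {user_input}", f"AI: {answer}"]
        return answer


chain = ConversationChain(fake_llm)
first = chain.run("What is Qdrant?")
second = chain.run("Does it support filtering?")
print(first)   # (answer based on 2 prompt lines)
print(second)  # (answer based on 4 prompt lines)
```

The second call sees a longer prompt because the first exchange was kept in memory, which is what makes follow-up questions work in chat-like applications.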

Agent - uses LLMs to choose the next action to perform. The choice might be based, for example, on tool descriptions.

The default prompt for one of the available chains
Source: https://github.com/langchain-ai/...langchain/chains/qa_with_sources/stuff_prompt.py

LlamaIndex 🦙 ⭐ 19.9k
LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data.
The second most popular LLM framework, which focuses more on custom input data.

Documents & Nodes
A Document is an abstraction over a data source, such as a file or a database entry. A Node is a chunk of a document, such as a text chunk, but also an image or anything else. Both documents and nodes have metadata and relationships to other documents and nodes.
Documents might be loaded by using Data Connectors (LlamaHub).

References to LangChain
- LLMs
- Embeddings - it’s even possible to reuse the embedding models implemented in LangChain through LangchainEmbedding
- Vector Stores & Retrievers
- Tools
- Agents
LlamaIndex is integrated with LangChain, but also shares some concepts.

Query engine / Chat engine - speak to your data through chat or question/answer systems.

Haystack ⭐ 10.3k
LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (Models, Vector DBs, File Converters) to Pipelines or Agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search and conversational agents.
The oldest of the three tools mentioned, though it incorporated LLMs only recently. Previously, it was focused on semantic search.

Documents
A single document defines a piece of information, which might be a text, a table or an image, along with the metadata and some other details.

Document Stores
A storage layer which retains documents and lets you query them effectively. It might be an SQL database, but also a vector search engine.
QdrantDocumentStore is one of the options, and it supports documents with and without embeddings out of the box.

Nodes & pipelines
Haystack is built on top of a directed acyclic graph (DAG). We build a pipeline that consists of multiple nodes that process the given input in a defined order. The processing doesn’t have to be linear, as there are decision components which allow choosing a different path.
Haystack offers not only support for LLMs, but also some additional features:
- dense / sparse search
- classification
- text to speech
- translations
- and many more…
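The idea of a pipeline with a decision component can be sketched in a few lines. This is not Haystack's API: `classify`, `dense_retriever`, `sparse_retriever` and `run_pipeline` are hypothetical names, and the routing rule (questions go to dense search, keyword queries to sparse search) is only an assumption chosen for the illustration.

```python
def classify(query: str) -> str:
    """Decision node: route questions one way and keyword queries another."""
    return "question" if query.strip().endswith("?") else "keywords"


def dense_retriever(query: str) -> str:
    """Placeholder node standing in for embedding-based search."""
    return f"dense results for: {query}"


def sparse_retriever(query: str) -> str:
    """Placeholder node standing in for keyword-based search."""
    return f"sparse results for: {query}"


# Edges of the graph: each decision outcome leads to a different downstream node.
ROUTES = {"question": dense_retriever, "keywords": sparse_retriever}


def run_pipeline(query: str) -> str:
    branch = classify(query)
    return ROUTES[branch](query)


print(run_pipeline("How do I filter by metadata?"))  # dense results for: How do I filter by metadata?
print(run_pipeline("metadata filter"))               # sparse results for: metadata filter
```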

Agents
Haystack introduced agents and tools similarly to LangChain and LlamaIndex.

Source: https://haystack.deepset.ai/blog/introducing-haystack-agents

Haystack is not as focused on data connectors as LangChain or LlamaIndex.

Prototyping

Wishlist for prototypes
- Rich integration suites for many data sources
- Variety of embedding models and LLMs to try out different configurations
- Different storage systems implemented
- In-memory or local mode for data storage, to keep things simple
- Easy way to glue different blocks together and try them out

The majority of libraries focus on making demos easy to build. Running systems in production requires different means.

Production challenges

Output quality

Source: https://eugeneyan.com/writing/llm-patterns/#evals-to-measure-performance

Source: https://shreyar.github.io/guardrails/

Hallucination
Hallucination is a confident response by an AI that does not seem to be justified by its input source.

Retrieval Augmented Generation
Extending prompts with context information to convert a knowledge-oriented task into a language task.
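A minimal sketch of the prompt-extension step, assuming the relevant documents have already been retrieved. The template wording and the `build_rag_prompt` helper are hypothetical and not taken from any of the frameworks above.

```python
# Hypothetical template: the placeholders are filled with retrieved context and the user question.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_rag_prompt(question: str, retrieved: list[str]) -> str:
    """Put the retrieved documents into the prompt so the LLM only has to read and rephrase."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_rag_prompt(
    "Does Qdrant support metadata filtering?",
    ["Qdrant includes efficient metadata filtering."],
)
print(prompt)
```

With the facts supplied in the context, the model no longer needs to recall them from its weights, which is what turns the knowledge task into a language task.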

Semantic Search in production
• Integrated with all major LLM frameworks.
• Includes efficient metadata filtering to enrich semantic search with additional criteria.
• Runs in local in-memory/persisted mode, on-premise and Qdrant Cloud.

Speeding up semantic search
• Tweaking the HNSW indexing and search parameters
• Scalar Quantization
• Product Quantization
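To give an intuition for scalar quantization, the sketch below maps each float of a vector to one of 256 integer levels using a simple min-max scheme. This is a generic illustration and an assumption on my part, not a description of how Qdrant implements it; `quantize` and `dequantize` are hypothetical helpers.

```python
def quantize(vector: list[float]) -> tuple[list[int], float, float]:
    """Map floats to integer codes in 0..255 using the vector's own min and max."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for constant vectors
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale


def dequantize(codes: list[int], lo: float, scale: float) -> list[float]:
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]


vector = [-1.0, 0.0, 0.5, 1.0]
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, max_error)
```

Each component then fits in one byte instead of the four bytes of a float32, at the cost of a reconstruction error bounded by half a quantization step.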

IO-bound vs CPU-bound applications

Asynchronous programming
If we use external services, such as OpenAI or Cohere embeddings or LLMs, and an actual vector database, such as Qdrant, the application spends most of its time waiting for I/O, so asynchronous calls can improve time efficiency.
LangChain and LlamaIndex partially support asynchronous methods. Qdrant is the only vector store with full async support in LangChain.
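The gain for IO-bound work can be shown with the standard library alone. In this sketch `fake_embed` is a hypothetical coroutine that simulates a remote embedding call with a 0.1 s sleep; no real service or framework API is involved.

```python
import asyncio
import time


async def fake_embed(text: str) -> list[float]:
    """Hypothetical remote call: waits 0.1 s, as if a network round trip took place."""
    await asyncio.sleep(0.1)
    return [float(len(text))]


async def embed_all(texts: list[str]) -> list[list[float]]:
    """Start all requests at once and wait for them together."""
    return await asyncio.gather(*(fake_embed(t) for t in texts))


start = time.perf_counter()
vectors = asyncio.run(embed_all(["a", "bb", "ccc"]))
elapsed = time.perf_counter() - start
print(vectors)            # [[1.0], [2.0], [3.0]]
print(f"{elapsed:.2f}s")  # close to 0.1 s rather than 0.3 s
```

The three waits overlap, so the total time is roughly that of the slowest call instead of the sum of all calls.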

Caching
Calling external LLMs and embedding models means costs. Those APIs are usually charged based on tokens. Caching may help to reduce the latency and the cost at the same time.
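A minimal sketch of exact-match caching, using only `functools.lru_cache` from the standard library. The `cached_completion` function is a hypothetical stand-in for a paid LLM call, and the `calls` counter exists only to show how many times the "API" was actually hit.

```python
import functools

calls = 0  # counts how often the (simulated) paid API is actually invoked


@functools.lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    """Hypothetical LLM call; identical prompts are served from the cache."""
    global calls
    calls += 1
    return f"response to: {prompt}"


cached_completion("What is Qdrant?")
cached_completion("What is Qdrant?")     # cache hit, no new call
cached_completion("What is LangChain?")  # different prompt, new call
print(calls)  # 2
```

Exact-match caching only helps when prompts repeat verbatim; anything less strict, such as matching semantically similar prompts, is a separate design decision.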

Additional challenges

Deployment
Any production system aiming to work for multiple users simultaneously must be deployed to scale, not as a console application. The LLM will usually serve a specific purpose in the whole application, but won’t be its only component. These kinds of systems are often exposed as a REST API.
Haystack provides a way to expose the pipeline through a FastAPI-based REST API with no extra effort.

Monitoring
Relevant not only for LLM-based applications, but for all sorts of systems in general. It is not much different from monitoring any other application, but it is useful to have a framework that logs activity and gives users the option to analyze it down the road.

Human feedback
Users judging the usefulness of the system outputs give us ground truth that might be used later on to fine-tune the system.

Example: “Ask Qdrant”

The “Ask Qdrant” bot
The Qdrant Discord community is the best way to get quick support when building applications with Qdrant.

The output of the profiler running on both the LangChain and the custom version of the agent - almost 3x faster

Building an abstraction over LLM agents

Questions?
Kacper Łukawski, Developer Advocate, Qdrant
https://www.linkedin.com/in/kacperlukawski/
https://twitter.com/LukawskiKacper
https://github.com/kacperlukawski
