Best practices for building LLM-based applications

Many businesses have started incorporating Large Language Models into their applications. There are, however, several challenges that may impact such systems, and it helps to be aware of them before you start. During the talk, we will review the existing tools and see how to move from development to production without a headache.

Kacper Łukawski

August 11, 2023


  1. Agenda
    1. Overview of the most popular LLM frameworks: (a) LangChain, (b) LlamaIndex, (c) Haystack.
    2. Experiment toolkit.
    3. Challenges in productionizing LLMs.
    4. Real-world application: the “Ask Qdrant” bot.
  2. 🦜🔗 LangChain (⭐ 57.9k)
    Building applications with LLMs through composability.
    Currently, the most popular framework for building LLM-based applications.
  3. Document - a basic entity that has content and, optionally, some key-value metadata.
    Docstore / Document Loader - used to store and load the documents (e.g. UnstructuredHTMLLoader).
  4. LLM vs embedding models
    Embeddings - an abstraction over a model which can transform given texts into vector representations. The model might be remote (e.g. OpenAIEmbeddings, CohereEmbeddings) or local (e.g. HuggingFaceEmbeddings).
    LLM - a system capable of producing text, given a prompt. Large Language Models might be remote, accessed through some sort of API (e.g. OpenAI, Cohere), or local (e.g. GPT4All).
    Prompt / PromptTemplate - an input to the LLM, with some placeholders, so we can provide additional context.
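
The idea of a prompt with placeholders can be sketched with the Python standard library alone. This is a minimal illustration, not LangChain's PromptTemplate: the template text, the placeholder names and the `build_prompt` helper are all assumptions made for the example.

```python
from string import Template

# Hypothetical template; the wording and placeholder names are illustrative only.
QA_TEMPLATE = Template(
    "Answer the question using only the context below.\n"
    "Context: $context\n"
    "Question: $question\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the placeholders so the LLM receives the additional context."""
    return QA_TEMPLATE.substitute(context=context, question=question)


prompt = build_prompt(
    context="Qdrant is a vector search engine.",
    question="What is Qdrant?",
)
print(prompt)
```

The filled-in string is what would be sent to the LLM; the template itself stays fixed while the context changes per request.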
  5. Tool - an interface to interact with an external system. Each tool has a description used by an agent to choose which one to use for a particular job.
    - AmadeusFlightSearch: Use this tool to search for a single flight between the origin and destination airports at a departure between an earliest and latest datetime.
    - YouTubeSearchTool: Search for youtube videos associated with a person. The input to this tool should be a comma separated list, the first part contains a person name and the second a number that is the maximum number of video results to return aka num_results. The second part is optional.
    - PythonREPLTool: A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.
  6. Retrievers & Vector Stores
    Retriever - returns documents based on text queries without storing them.
    Vector Store - keeps documents along with their embeddings and performs semantic search over them. Usually acts as Long Term Memory. Each vector store can also be used as a retriever.
    Qdrant is the only vector store that supports a fully asynchronous API.
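
What a vector store does can be shown with a toy in-memory version. This is a sketch under heavy assumptions: `embed` is a keyword-counting stand-in for a real embedding model, and `InMemoryVectorStore` is a hypothetical class, not the API of Qdrant or any framework.

```python
import math
from dataclasses import dataclass, field


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: counts of a few fixed keywords."""
    vocabulary = ["vector", "search", "flight", "video"]
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class InMemoryVectorStore:
    """Keeps documents along with their embeddings and searches over them."""

    documents: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.documents.append(text)
        self.vectors.append(embed(text))

    def search(self, query: str, k: int = 1) -> list[str]:
        query_vector = embed(query)
        ranked = sorted(
            range(len(self.documents)),
            key=lambda i: cosine(query_vector, self.vectors[i]),
            reverse=True,
        )
        return [self.documents[i] for i in ranked[:k]]


store = InMemoryVectorStore()
store.add("vector search engine")
store.add("cheap flight tickets")
print(store.search("semantic search"))
```

A production vector store replaces the linear scan with an approximate index, but the contract stays the same: text in, most similar documents out.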
  7. Chain
    A sequence of actions to be performed. Usually involves calling an LLM at least once. A chain may call some other components, like vector stores or retrievers. Each chain may use Memory that keeps the state of the chain between runs, which is useful in chat-like applications.
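
A chain that keeps state between runs might look roughly like the sketch below. Everything here is hypothetical: `fake_llm` replaces a real model call, and `ConversationChain` is an illustrative class written for this example, not a framework class.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: reports how many prompt lines it received."""
    return f"(reply based on {len(prompt.splitlines())} prompt lines)"


class ConversationChain:
    """Calls the LLM once per run and keeps the dialogue in memory."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.memory: list[str] = []

    def run(self, user_input: str) -> str:
        # The prompt contains all previous turns plus the new user message.
        prompt = "\n".join(self.memory + [f"User: {user_input}"])
        answer = self.llm(prompt)
        # Persist both sides of the turn so that the next run can see them.
        self.memory.append(f"User: {user_input}")
        self.memory.append(f"Assistant: {answer}")
        return answer


chain = ConversationChain(fake_llm)
print(chain.run("Hi"))
print(chain.run("And then?"))
```

The second call sees a longer prompt than the first, which is the whole point of memory in chat-like applications.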
  8. Agent - uses LLMs to choose the next action to perform. The choice might be based, for example, on tool descriptions.
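
The selection step can be illustrated without a model. In the sketch below, a word-overlap score is swapped in for the LLM's decision, which is a deliberate simplification; the `Tool` dataclass, `choose_tool` and the shortened tool descriptions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def choose_tool(task: str, tools: list[Tool]) -> Tool:
    """Stand-in for the LLM decision: pick the tool whose description
    shares the most words with the task."""
    task_words = set(task.lower().split())
    return max(
        tools,
        key=lambda tool: len(task_words & set(tool.description.lower().split())),
    )


tools = [
    Tool(
        name="flight_search",
        description="search for a flight between two airports",
        run=lambda query: f"flights for: {query}",
    ),
    Tool(
        name="video_search",
        description="search for videos associated with a person",
        run=lambda query: f"videos for: {query}",
    ),
]

chosen = choose_tool("find videos of a person giving a talk", tools)
print(chosen.name, "->", chosen.run("conference talk"))
```

A real agent would put the tool descriptions into the prompt and let the LLM name the tool, which is why clear descriptions matter.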
  9. The default prompt for one of the available chains Source:

  10. LlamaIndex 🦙 (⭐ 19.9k)
    LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data.
    The second most popular LLM framework, which focuses more on custom input data.
  11. Documents & Nodes
    A Document is an abstraction over a data source, such as a file or a database entry. A Node is a chunk of a document: typically a text chunk, but it can also be an image or anything else. Both documents and nodes have metadata and relationships to other documents and nodes. Documents might be loaded by using Data Connectors (LlamaHub).
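
The document/node relationship can be sketched as follows. The `Document` and `Node` dataclasses and the fixed-size `split_into_nodes` helper are hypothetical and only mimic the concept; they are not LlamaIndex classes, and real splitters usually respect sentence or token boundaries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Node:
    text: str
    metadata: dict
    previous: Optional["Node"] = None  # relationship to the preceding chunk


def split_into_nodes(document: Document, chunk_size: int = 20) -> list[Node]:
    """Cut the document into fixed-size chunks that inherit its metadata."""
    nodes: list[Node] = []
    for start in range(0, len(document.text), chunk_size):
        node = Node(
            text=document.text[start:start + chunk_size],
            metadata=dict(document.metadata),
            previous=nodes[-1] if nodes else None,
        )
        nodes.append(node)
    return nodes


doc = Document(text="a" * 45, metadata={"source": "example.txt"})
nodes = split_into_nodes(doc, chunk_size=20)
print([len(node.text) for node in nodes])
```

Each node carries a copy of the document metadata and a link to its predecessor, so a retrieved chunk can be traced back to its source and its neighbours.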
  12. References to LangChain
    - LLMs
    - Embeddings - it’s even possible to reuse the embedding models implemented in LangChain through LangchainEmbedding
    - Vector Stores & Retrievers
    - Tools
    - Agents
    LlamaIndex integrates with LangChain and also shares some of its concepts.
  13. Query engine / Chat engine - speak to your data through chat or question/answer systems.
  14. Haystack (⭐ 10.3k)
    LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (Models, Vector DBs, File Converters) to Pipelines or Agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search and conversational agents.
    The oldest of the three tools mentioned; it incorporated LLMs only recently. Previously, it was focused on semantic search.
  15. Documents
    A single document defines a piece of information, which might be text, a table or an image, along with the metadata and some other details.
  16. Document Stores
    A storage layer which retains documents and lets you query them effectively. It might be an SQL database, but also a vector search engine.
    QdrantDocumentStore is one of the options and it supports documents with and without embeddings out of the box.
  17. Nodes & pipelines
    Haystack is built on top of a directed acyclic graph (DAG). We build a pipeline that consists of multiple nodes that process the given input in a defined order. The processing doesn’t have to be linear, as there are decision components which allow choosing a different path.
    Haystack offers not only support for LLMs, but also some additional features:
    - dense / sparse search
    - classification
    - text to speech
    - translations
    - and many more…
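
A non-linear pipeline with a decision component can be sketched in plain Python. This does not use the Haystack API: the node functions, the routing rule (short queries go to sparse search) and the `BRANCHES` table are all assumptions chosen for illustration.

```python
from typing import Callable


def classify_query(query: str) -> str:
    """Decision node: route keyword-like queries to sparse search, others to dense."""
    return "sparse" if len(query.split()) <= 2 else "dense"


def sparse_search(query: str) -> str:
    return f"sparse results for '{query}'"


def dense_search(query: str) -> str:
    return f"dense results for '{query}'"


def format_answer(results: str) -> str:
    """Final node shared by both branches."""
    return results.upper()


# Outgoing edges of the decision node: exactly one branch runs per query.
BRANCHES: dict[str, Callable[[str], str]] = {
    "sparse": sparse_search,
    "dense": dense_search,
}


def run_pipeline(query: str) -> str:
    branch = classify_query(query)
    results = BRANCHES[branch](query)
    return format_answer(results)


print(run_pipeline("qdrant filters"))
print(run_pipeline("how do I filter by metadata"))
```

Both branches join again at the final node, so the graph stays acyclic while the path through it depends on the input.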
  18. Wishlist for prototypes
    - Rich integration suites for many data sources
    - Variety of embedding models and LLMs to try out different configurations
    - Different storage systems implemented
    - In-memory or local mode for data storage, to keep things simple
    - Easy way to glue different blocks together and try them out
  19. The majority of libraries focus more on building demos in an easy way. Running systems in production requires different means.
  20. Hallucination
    Hallucination is a confident response by an AI that does not seem to be justified by its input source.
  21. Semantic Search in production
    • Integrated with all major LLM frameworks.
    • Includes efficient metadata filtering to enrich semantic search with additional criteria.
    • Runs in local in-memory/persisted mode, on-premise and in Qdrant Cloud.
  22. Speeding up semantic search
    • Tweaking the HNSW indexing and search parameters
    • Scalar Quantization
    • Product Quantization
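
To make scalar quantization concrete, here is a minimal sketch that maps each float of a vector to an 8-bit integer using the vector's own min/max range. It only illustrates the general technique; it is not how Qdrant implements it, and the `quantize` / `dequantize` helpers are hypothetical.

```python
def quantize(vector: list[float]) -> tuple[list[int], float, float]:
    """Map floats to integer codes in 0..255; return codes, offset and scale."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale


def dequantize(codes: list[int], lo: float, scale: float) -> list[float]:
    """Approximate reconstruction of the original floats."""
    return [lo + code * scale for code in codes]


vector = [-1.0, 0.0, 0.5, 1.0]
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)
print(codes)
print(restored)
```

Each component now takes one byte instead of four (for float32), at the price of a reconstruction error of at most half a quantization step per component.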
  23. Asynchronous programming
    If we use external services, such as OpenAI or Cohere embeddings or LLMs, and an actual vector database, such as Qdrant, we can improve time efficiency by calling them asynchronously instead of blocking on each request. LangChain and LlamaIndex partially support asynchronous methods. Qdrant is the only vector store with full async support in LangChain.
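
The gain from asynchronous calls can be demonstrated with `asyncio` from the standard library. In this sketch `fake_embedding_call` simulates a remote embedding API with a sleep; the function names and the 0.2 s delay are assumptions, and no real service is contacted.

```python
import asyncio
import time


async def fake_embedding_call(text: str, delay: float = 0.2) -> list[float]:
    """Stand-in for a remote embedding request; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return [float(len(text))]


async def embed_all(texts: list[str]) -> list[list[float]]:
    """Issue all requests concurrently and wait for every result."""
    return await asyncio.gather(*(fake_embedding_call(text) for text in texts))


start = time.perf_counter()
vectors = asyncio.run(embed_all(["a", "bb", "ccc"]))
elapsed = time.perf_counter() - start
print(vectors, f"{elapsed:.2f}s")
```

Three sequential calls would take roughly three times the delay; issued concurrently, the total stays close to a single delay because the waiting overlaps.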
  24. Caching
    Calling external LLMs and embedding models means costs. Those APIs are usually charged based on tokens. Caching may help to reduce the latency and cost at the same time.
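
An exact-match cache in front of the model is the simplest form of this idea. The `CachedLLM` wrapper below is a hypothetical sketch: it keys on a hash of the prompt and keeps everything in a dictionary, whereas a real deployment would likely use a shared store and an eviction policy.

```python
import hashlib
from typing import Callable


class CachedLLM:
    """Wraps an LLM callable and reuses answers for identical prompts."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.cache: dict[str, str] = {}
        self.calls = 0  # number of real (billable) calls made

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.llm(prompt)
        return self.cache[key]


# The lambda is a stand-in for a paid API call.
cached = CachedLLM(lambda prompt: prompt.upper())
cached("what is qdrant?")
cached("what is qdrant?")  # served from the cache, no second call
print(cached.calls)
```

Only the first occurrence of a prompt reaches the model, so repeated questions cost neither tokens nor latency.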
  25. Deployment
    Any production system aiming to work for multiple users simultaneously must be deployed to scale, not as a console application. An LLM will usually serve a specific purpose in the whole application but won’t be its only component. These kinds of systems are often exposed as a REST API. Haystack provides a way to expose the pipeline through a FastAPI-based REST API with no extra effort.
  26. Monitoring
    Relevant not only for LLM-based applications, but for all sorts of systems in general. Monitoring an LLM-based application is not much different from monitoring any other application, but it is useful to have a framework that logs activity and lets users analyze it later.
  27. Human feedback
    Users judging the usefulness of the system outputs give us ground truth that might be used later on to fine-tune the system.
  28. The “Ask Qdrant” bot
    The Qdrant Discord community is the best way to get quick support when building applications with Qdrant.
  29. The output of the profiler run on both the LangChain and the custom version of the agent - almost 3x faster