
5 Painful Lessons using LLMs

Over the last year at Anyscale, we've built a number of our own LLM applications and helped our customers build theirs. Through that process we've arrived at 5 design patterns, or principles, that we believe significantly increase the chances of success in building LLM applications. This talk discusses those 5 patterns and how to build them into your application. It addresses:

1. What exactly an LLM application is.
2. How to design for easy evaluation and testability.
3. Making LLM components reusable across applications.

Anyscale

August 31, 2023

Transcript

  1. Summary: 1. Three ways to use LLMs. Choose wisely. 2. Know what fine-tuning helps with. Know the other tools. 3. Speed, cost and scale matter from day one. 4. Murphy's law also applies to LLMs. 5. Two design patterns we keep coming back to.
  2. 3 ways to deploy?
     - Commercial closed (OpenAI, Anthropic, Cohere, etc.)
     - Hosted OSS: per machine (Hugging Face, Replicate, Vertex AI, etc.) or per token (Anyscale Endpoints)
     - Self-hosted: run it yourself on top of machines
  4. Comparing the three approaches:
     Approach         | Quality | Cost | Ease of Use | Privacy
     Commercial APIs  | 😃      | 😦   | 😃          | 😦
     Hosted OSS       | 😐→😃   | 😐   | 😐          | 😐
     Self hosted      | 😐→😃   | ?    | 😦          | 😃
  5. Quality: Commercial APIs (in particular GPT-4) are the best. Llama 2 models set a new bar for OSS models, and the gap continues to close. (This morning) Google will now serve Llama 2 models.
  6. Llama 2 models: released 5 weeks ago + last week. 3 sizes: 7b, 13b, 70b. Permissive license: can be used commercially, can't be used to train other models, and companies with > 700 million MAU need a license. Completely changed the game; nobody even talks about Falcon much. Last week: Code Llama 7b, 13b, 34b.
  7. Factuality eval: summary ranking established in the literature. Example of comparable quality:
     "insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head."
     A: insiders say the row brought tensions between the contrasting pair.
     B: insiders say the row brought simmering tensions between miliband's ear.
  8. Self-hosted: good serving options now! github.com/ray-project/aviary, built on top of Ray Serve (TGI used to be good but …). It can be cheaper, but not always; it depends on your workload, the complexity of autoscaling, etc. $1 per million tokens is pretty hard to beat, because we aggregate across users.
  9. Self-hosted Llama 2 models:
     - Llama 2 7B: one g5.2xlarge is ~$7,000/yr. It can do ~700 tokens/s, with no autoscaling or redundancy.
     - Llama 2 70B: you need 4x A100 80GB (like hen's teeth to get). At Lambda Labs' $2/GPU/hr, we're talking ~$70,000/yr.
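To put those figures in per-token terms, here is a back-of-envelope calculation (ours, not from the deck) built only from the numbers on the slide; it assumes the machines run flat out all year, which real workloads rarely do:

```python
# Rough cost math from the slide's own figures. Assumes 100% utilization,
# 24/7, which overstates what a bursty workload actually achieves.
HOURS_PER_YEAR = 24 * 365  # 8760

# Llama 2 7B on one g5.2xlarge: ~$7,000/yr at ~700 tokens/s.
tokens_per_year = 700 * 3600 * HOURS_PER_YEAR   # ~22 billion tokens
print(7_000 / (tokens_per_year / 1e6))          # ~$0.32 per million tokens

# Llama 2 70B on 4x A100 80GB at $2/GPU/hr:
print(4 * 2 * HOURS_PER_YEAR)                   # $70,080/yr
```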
  10. Ease of use: APIs are easy, and Anyscale Endpoints is API compatible with OpenAI (a 3 line change, sketched below). But: ChatGPT follows instructions; Llama 2 doesn't always.
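A minimal sketch of what that 3 line change looks like with the pre-1.0 openai Python client; the base URL and model name here are illustrative and should be checked against the provider's docs:

```python
import openai

openai.api_base = "https://api.endpoints.anyscale.com/v1"  # 1. point at the new endpoint
openai.api_key = "YOUR_ANYSCALE_API_KEY"                   # 2. use the new key
response = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",                # 3. pick an OSS model
    messages=[{"role": "user", "content": "Please give an A or a B."}],
)
print(response["choices"][0]["message"]["content"])
```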
  11. Example. What we asked for: "Please give an A or a B." What we got from GPT-4: "A". What we got from Llama 2 70b: "The correct answer is A: those who receive centrelink payments made up half of radio rental's income last year. Explanation: Summary A accurately summarizes the article sentence by mentioning that those who receive centrelink payments made up half of radio rental's income last year. It maintains the same meaning and information as the original sentence. On the other hand, Summary B is inconsistent with the article sentence. It suggests that the ABC's report only mentioned that those who receive centrelink payments made up radio rental's income last year, which is not entirely accurate. The article sentence explicitly states that the ABC reported that those who receive centrelink payments made up half of radio rental's income last year. Therefore, Summary A is the better choice."
  12. How we fixed it: we had to write our own little answer extractor using another LLM!
     System prompt: You are a helpful assistant that carefully follows instruction. You provide only answers, no explanations.
     User prompt: Determine if the following text says whether the answer is A, B or other. Only output a single word, either: A B or other. Text: {query}
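A minimal sketch of that extractor as a function. The two prompts are from the slide; the choice of extractor model and the function name are our own assumptions:

```python
import openai

SYSTEM_PROMPT = (
    "You are a helpful assistant that carefully follows instruction. "
    "You provide only answers, no explanations."
)
USER_TEMPLATE = (
    "Determine if the following text says whether the answer is A, B or other. "
    "Only output a single word, either: A B or other\n"
    "Text: {query}"
)

def extract_answer(raw_output: str) -> str:
    """Boil a verbose model answer down to a single label: A, B, or other."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: any instruction-following model works here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(query=raw_output)},
        ],
        temperature=0,  # extraction should be deterministic
    )
    return response["choices"][0]["message"]["content"].strip()
```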
  13. Privacy: changing terrain. It used to be that self-hosted was clearly the safest, but now Azure and OpenAI offer to run in your cloud.
  14. What is fine-tuning? Incremental retraining of existing models with domain-specific data. Example: fine-tuning on Shakespeare.
  15. Pre fine-tuning: "Once upon a time, there was a horse. But this particular horse was too big to be put into a normal stall. Instead, the animal was moved into an indoor pasture, where it could take a few hours at a time out of the stall. The problem was that this pasture was so roomy that the horse would often get a little bored being stuck inside."
  16. Post fine-tuning: "Once upon a time there was a horse. This horse was in my youth, a little unruly, but yet the best of all. I have, sir; I know every horse in the field, and the best that I have known is the dead. And now I thank the gods, and take my leave."
  17. Does it work? Yes! See the work by Kourosh Hakhameneshi and Rehaan Ahmad: "Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications".
  18. Big improvements: Llama 2 13B went from 42% -> 89%, outperforming the general GPT-4 (80%). Amazing because, according to rumors, GPT-4 has 1.4T parameters, so at least 100x more. But it's not just an open source thing any more … last week GPT-3.5-Turbo tuning was released.
  19. GPT-3.5 fine-tuning (released Tuesday!): impressive results! End to end took about 75 minutes, and we spent more time on the jsonl conversion than on the fine-tuning itself. It just works: kick off the job, and 40 min / $35 / 4M tokens later you get an email. Then change 1 line to use the new model.
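For reference, the jsonl conversion the slide mentions produces one chat-format training example per line. A minimal sketch, with invented content, in the messages schema that OpenAI's chat fine-tuning expects:

```python
import json

# One training example per line of train.jsonl; the content here is made up.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You provide only answers, no explanations."},
            {"role": "user", "content": "Which summary is better, A or B?"},
            {"role": "assistant", "content": "A"},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```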
  20. Fine tuning solves everything? No. It helps with the form, with the shape, with the vocabulary, with the "feel" (the Shakespeare example). But it does not help with facts.
  21. One technique that does work: Retrieval Augmented Generation. Hit a database of facts and provide that to the LLM.
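A minimal sketch of that pattern; `embed`, `vector_db`, and `llm` are hypothetical stand-ins for whatever embedding model, vector store, and LLM client you use:

```python
def answer_with_rag(question: str, llm, vector_db, embed, k: int = 3) -> str:
    """Retrieve k relevant facts, then let the LLM answer from them."""
    # 1. Hit the database of facts.
    facts = vector_db.search(embed(question), top_k=k)
    # 2. Provide the facts to the LLM alongside the question.
    context = "\n".join(f"- {fact}" for fact in facts)
    prompt = (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)
```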
  22. Speed, cost and performance still matter. The key is really distributed computing; you need to think holistically. Example: RAG, and how do we speed it up? Ray is your friend.
  23. Cade is going to dive into an example of this in serving.
  24. Only the most advanced machines run Llama 2 70b. GPUs we use:
     - A10Gs w/ 24GB: super slow and not much memory. Only good for Llama 7b models over 2 GPUs, or Llama 13b models over 2 or 4 GPUs.
     - 8x A100s w/ 80GB: awesome if you can get them; we've been fighting to get them. p4de.24xlarge: none available in AWS, and if you can get them it's $40/hr.
     - The new up-and-comers: Lambda Labs (~$2/GPU/hr = $16/hr for an 8xA100 box) and Coreweave.
  25. Murphy's Law example for LLMs: "If it can go wrong, it will go wrong." It's not a pessimistic statement; it's an engineering principle. So you just have to prepare for it.
  26. Remember this? Summary ranking established in the literature:
     "insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head."
     A: insiders say the row brought tensions between the contrasting pair.
     B: insiders say the row brought simmering tensions between miliband's ear.
  27. Anyone see the Murphy's law problem? We ran this for GPT-3.5-Turbo. It got 96% accuracy; human is 84%. Amazing! If it's too good to be true, it probably is. Why?
  28. Ordering bias! We had done our testing with A always being the correct answer, and GPT-3.5-Turbo always chose A. What happens if we make B the correct answer? Accuracy drops to 60%.
  29. Murphy's law in practice: what if we make B the correct answer in a second batch and compare the two answers?
     - A then B = correct
     - A then A = bias to A
     - B then B = bias to B
     - B then A = incorrect (but at least consistent)
     How to deal with this? See the sketch below.
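A minimal sketch of that order-swap check; `ask` is a hypothetical function that presents the two candidate summaries to the model and returns "A" or "B":

```python
def classify_pair(ask, article, correct, incorrect):
    """Run the same question twice with the answer positions swapped."""
    first = ask(article, a=correct, b=incorrect)   # correct answer sits at A
    second = ask(article, a=incorrect, b=correct)  # correct answer sits at B
    outcomes = {
        ("A", "B"): "correct",
        ("A", "A"): "bias to A",
        ("B", "B"): "bias to B",
        ("B", "A"): "incorrect (but at least consistent)",
    }
    return outcomes[(first, second)]
```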
  30. Principle 1: one LLM does one job. Don't ask too much of an LLM; ask it to do one thing only.
  31. Example 2: Ansari, a pipeline of single-job steps: Is additional info required? -> What search term should I use? -> Vector DB -> Augment query -> Respond to query.
  32. Corollary: One Agent != One Application. An LLM application is not just an LLM + a set of prompts. An LLM application is a combination of:
     - Agents: LLM-based processes
     - Tools: things that Agents can query
     - Presenters: the surface to the user (e.g. stdio vs Slack vs Gradio)
     One Agent is the primary Agent that the user talks to.
  33. Design for swappability (https://github.com/anyscale/factuality-eval). Make it possible to swap out:
     - Prompts
     - LLMs (pass in LLMs)
     - External tools
     - Presentation
     - Other Agents
     A sketch follows this list.
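A minimal sketch of the idea: the agent takes its LLM, prompt and tools as constructor arguments, so each can be swapped (or mocked in tests) without touching the agent itself. All names here are illustrative:

```python
from typing import Callable, Dict

class Agent:
    """An agent whose LLM, prompt and tools are all injected, hence swappable."""

    def __init__(self, llm: Callable[[str], str], prompt_template: str,
                 tools: Dict[str, Callable[[str], str]]):
        self.llm = llm
        self.prompt_template = prompt_template
        self.tools = tools

    def run(self, query: str) -> str:
        # Tools (e.g. a vector DB) are looked up by name, so a mock can
        # stand in during tests.
        context = self.tools["retriever"](query)
        return self.llm(self.prompt_template.format(query=query, context=context))

# Swap in a mock LLM and a mock tool for a fast, deterministic test:
agent = Agent(
    llm=lambda prompt: "A",
    prompt_template="Context: {context}\nQuestion: {query}",
    tools={"retriever": lambda q: "stub facts"},
)
assert agent.run("Which summary is better?") == "A"
```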
  34. Helps with:
     - Experimentation: try different prompts, but do so systematically; incrementally change one thing at a time.
     - Testing: can put "mock" objects in.
     - Presentation: can easily swap out different modes of interaction.
  35. Summarizing the whole talk:
     1. Different ways to use LLMs, each with pros and cons. Biased view: use Anyscale Endpoints Preview or Aviary.
     2. Fine tuning helps with form, but use RAG for facts.
     3. Speed still matters for LLMs: for preprocessing, fine tuning and serving. Biased view: use Ray or the Anyscale Platform (managed Ray).
     4. Design for Murphy's Law. Example: present choices in a different order for multi-choice.
     5. Two design patterns: (a) one agent, one task; (b) design for swappability.
  36. Thank You! RAY SUMMIT, 18-20 September! Endpoints: endpoints.anyscale.com. Aviary: github.com/ray-project/aviary. Details: anyscale.com/blog. Numbers: llm-numbers.ray.io. Ray: ray.io. Anyscale: anyscale.com. Me: [email protected]