
5 Painful Lessons using LLMs


Over the last year at Anyscale, we’ve built a number of our own LLM applications and helped our customers build theirs. Through that process we’ve arrived at 5 design patterns, or principles, that we believe significantly increase the chances of success in building LLM applications. We discuss those 5 patterns and how to build them into your application. The talk addresses:

1. What exactly an LLM application is.
2. How to design for easy evaluation and testability.
3. How to make LLM components reusable across applications.

Anyscale

August 31, 2023

Transcript

  1. 5 Painful Lessons using LLMs
    M Waleed Kadous
    Chief Scientist, Anyscale


  2. 1. Three ways to use LLMs. Choose wisely.
    2. Know what fine-tuning helps with. Know the other tools.
    3. Speed, cost and scale matter from day one.
    4. Murphy’s law also applies to LLMs.
    5. Two design patterns we keep coming back to.
    Summary


  3. Three ways to use LLMs. Choose wisely.


  4. - Commercial Closed (OpenAI, Anthropic, Cohere etc).
    - Hosted OSS
    - Per machine (Hugging Face, Replicate, Vertex AI, etc)
    - Per token (Anyscale Endpoints)
    - Self hosted
    - Run it yourself on top of machines
    3 ways to deploy?



  6. Approach | Quality | Cost | Ease of Use | Privacy
     Commercial APIs | 😃 | 😦 | 😃 | 😦
     Hosted OSS | 😐→😃 | 😐 | 😐 | 😐
     Self hosted | 😐→😃 | ? | 😦 | 😃


  7. - Commercial APIs (in particular GPT-4) are the best
    - Llama 2 models set a new bar for OSS models
    - Gap continues to close
    - (This morning) Google will now serve Llama 2 models
    Quality


  8. Llama 2 models
    Released 5 weeks ago + last week
    3 sizes: 7b, 13b, 70b
    Permissive license
    - Can be used commercially
    - Can’t be used to train other models
    - Companies w/ > 700 million MAU need a license
    Completely changed the game
    - Nobody even talks about Falcon much
    - Last Week: Code Llama 7b, 13b, 34b.


  9. Summary Ranking: a task established in the literature.
    “insiders say the row brought simmering
    tensions between the starkly contrasting
    pair -- both rivals for miliband's ear --
    to a head.”
    A: insiders say the row brought tensions between
    the contrasting pair.
    B: insiders say the row brought simmering tensions
    between miliband's ear.
    Example of comparable quality: Factuality eval


  10. (image slide)

  11. GPT-4 is expensive – ~30x the cost of Llama 2 70b for similar performance
    Cost


  12. Can now readily get hosted version of Llama 2
    endpoints.anyscale.com
    Hosted OSS


  13. - Good serving options now!
    - github.com/ray-project/aviary
    - Built on top of Ray Serve
    - TGI used to be good but …
    - Can be cheaper but not always
    - Depends on your workload
    - Complexity of autoscaling etc
    - $1 per million tokens is pretty hard to beat
    - We aggregate across users
    Self-Hosted


  14. Llama 2 7B: One g5.2xlarge is ~$7000/yr
    - Can do ~700 tokens/s
    - No autoscaling or redundancy
    Llama 2 70B:
    - You need 4x A100 80GB – rare as hen’s teeth
    - Lambda Labs: $2/GPU/hr, so we’re talking ~$70,000/yr
    Self hosted Llama 2 Models
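A quick back-of-envelope on the 7B numbers above – a sketch, assuming perfect 100% utilization, which the slide itself cautions against ("no autoscaling or redundancy"):

```python
# Cost per million tokens implied by the slide's numbers for Llama 2 7B
# on one g5.2xlarge (~$7000/yr, ~700 tokens/s).
# Assumes 100% utilization, which real deployments won't hit.
yearly_cost_usd = 7000
tokens_per_sec = 700

seconds_per_year = 365 * 24 * 3600                   # 31,536,000
tokens_per_year = tokens_per_sec * seconds_per_year  # ~22 billion tokens
cost_per_million_tokens = yearly_cost_usd / (tokens_per_year / 1e6)

print(round(cost_per_million_tokens, 2))  # ~0.32 USD per million tokens
```

At realistic utilization the effective cost per token is several times higher, which is why "$1 per million tokens is pretty hard to beat" in the next slide.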


  15. - APIs are easy
    - Anyscale Endpoints is API compatible. 3 line change.
    - But:
    - ChatGPT follows instructions
    - Llama 2 doesn’t always
    Ease of use


  16. What we asked for: Please give an A or a B.
    What we got from GPT-4: A
    What we got from Llama 2 70b:
    ‘The correct answer is A: those who receive centrelink payments made up half of radio rental's
    income last year. Explanation: Summary A accurately summarizes the article sentence by
    mentioning that those who receive centrelink payments made up half of radio rental's income
    last year. It maintains the same meaning and information as the original sentence. On the other
    hand, Summary B is inconsistent with the article sentence. It suggests that the ABC's report
    only mentioned that those who receive centrelink payments made up radio rental's income last
    year, which is not entirely accurate. The article sentence explicitly states that the ABC reported
    that those who receive centrelink payments made up half of radio rental's income last year.
    Therefore, Summary A is the better choice’.
    Example


  17. We had to write our own little answer extractor using another LLM!
    System prompt
    You are a helpful assistant that carefully follows
    instruction. You provide only answers, no explanations.
    User prompt
    Determine if the following text says whether the answer is A, B
    or other.
    Only output a single word, either: A B or other
    Text: {query}
    How we fixed it
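The two prompts above can be packaged for any OpenAI-compatible chat API. A minimal sketch – the function name `build_extractor_messages` is ours, the prompt text is from the slide, and the actual LLM call is left out:

```python
# Package the answer-extractor prompts from the slide as chat messages
# for an OpenAI-compatible API. Any chat endpoint that accepts a
# messages list would work as the second "extractor" LLM.

SYSTEM_PROMPT = (
    "You are a helpful assistant that carefully follows instruction. "
    "You provide only answers, no explanations."
)

USER_TEMPLATE = (
    "Determine if the following text says whether the answer is A, B or other.\n"
    "Only output a single word, either: A B or other\n"
    "Text: {query}"
)

def build_extractor_messages(query: str) -> list:
    """Build the messages for the second 'extractor' LLM call."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(query=query)},
    ]
```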


  18. Privacy
    Changing terrain
    Used to be that Self Hosted was clearly the safest
    But now Azure and OpenAI offer to run in your cloud


  19. Fine tuning is not a magic wand


  20. What is fine tuning?
    Incremental retraining of existing models with domain
    specific data.
    Example: fine tuning on Shakespeare


  21. Once upon a time, there was a horse.
    But this particular horse was too big to be put into a
    normal stall. Instead, the animal was moved into an
    indoor pasture, where it could take a few hours at a
    time out of the stall. The problem was that this pasture
    was so roomy that the horse would often get a little
    bored being stuck inside.
    Pre fine tuning


  22. Once upon a time there was a horse. This horse
    was in my youth, a little unruly, but yet the
    best of all. I have, sir; I know every horse in
    the field, and the best that I have known is
    the dead. And now I thank the gods, and take my
    leave.
    Post fine tuning


  23. Work by Kourosh Hakhamaneshi and Rehaan Ahmad
    Fine-Tuning Llama-2: A
    Comprehensive Case Study for
    Tailoring Models to Unique
    Applications
    Does it work? Yes!


  24. (image slide)

  25. (image slide)

  26. Big improvements
    - Llama 2 13B went from 42% -> 89%
    - Outperformed GPT-4, which scored 80% as a general model
    - Amazing because according to rumors, GPT-4 has 1.4T
    parameters – so at least 100x more
    But not just an open source thing any more … last week
    GPT-3.5-Turbo tuning released


  27. GPT-3.5 fine-tuning (released Tuesday!)
    End to end took about 75 minutes
    Spent more time on jsonl conversion than FT
    It just works
    Kick off → 40 min / $35 / 4M tokens later → Email → Change 1 line
    Impressive results!
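The jsonl conversion step the slide mentions can be sketched as follows. This assumes OpenAI's chat fine-tuning format, where each training example is one JSON line with a `messages` list; the example pairs are invented for illustration:

```python
import json

# Convert (prompt, completion) pairs into the jsonl format used for
# GPT-3.5-Turbo chat fine-tuning: one JSON object per line, each with
# a "messages" list. The example pairs here are placeholders.
pairs = [
    ("Summarize: the meeting ran long.", "The meeting ran long."),
    ("Summarize: sales rose 10% in Q2.", "Q2 sales rose 10%."),
]

def to_jsonl(pairs, system="You are a concise summarizer."):
    lines = []
    for prompt, completion in pairs:
        record = {"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

print(to_jsonl(pairs))
```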


  28. No.
    It helps with the form, with the shape, with the vocabulary,
    with the “feel” – Shakespeare example
    But does not help with facts.
    Fine tuning solves everything?


  29. (image slide)

  30. A more holistic approach


  31. One technique that does work
    Retrieval Augmented Generation
    - Hit a database of facts and provide that to the LLM
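A toy sketch of the idea. Real systems embed the query and search a vector DB; the keyword-overlap retriever below is a deliberately simple stand-in, and the facts are illustrative:

```python
# Toy Retrieval Augmented Generation: fetch the most relevant facts,
# then prepend them to the prompt so the LLM answers from them.
FACTS = [
    "Llama 2 was released in July 2023.",
    "Llama 2 comes in 7b, 13b and 70b sizes.",
    "Ray Serve can autoscale model replicas.",
]

def retrieve(query, facts, k=2):
    """Rank facts by word overlap with the query (stand-in for a vector DB)."""
    qwords = set(query.lower().split())
    scored = sorted(facts,
                    key=lambda f: len(qwords & set(f.lower().split())),
                    reverse=True)
    return scored[:k]

def augmented_prompt(query):
    context = "\n".join(retrieve(query, FACTS))
    return f"Answer using only these facts:\n{context}\n\nQuestion: {query}"

print(augmented_prompt("What sizes does Llama 2 come in?"))
```

The key point from the slide: facts come from the database at query time, not from fine-tuning.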


  32. (image slide)

  33. Speed, Cost and Performance
    important


  34. Speed, Cost and Performance still matter
    - The key is really distributed computing
    - Need to think holistically
    - Example: RAG – how do we speed it up?
    - Ray is your friend


  35. (image slide)

  36. Speed, Cost and Performance still matter
    - The key is really distributed computing
    - Need to think holistically
    - Example: RAG – how do we speed it up?
    - Ray is your friend
    - Cade is going to dive into an example of this in serving


  37. Only the most advanced machines run Llama 2 70b
    2 GPUs we use:
    - A10Gs w/ 24GB – super slow and not much memory
    - Only good for Llama 7b models over 2 GPUs
    - Only good for Llama 13b models over 2 or 4 GPUs.
    - 8x A100s w/ 80GB – awesome if you can get them
    - We’ve been fighting to get them
    - p4de.24xlarge
    - None in AWS and if you can get them it’s $40/hr.
    - The new up and comers:
    - Lambda Labs (~$2/GPU/hr = $16/hr for an 8xA100 box)
    - Coreweave
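Why 70B needs GPUs this big is simple arithmetic – a sketch that counts only fp16/bf16 weights, ignoring KV cache and activations, which push the requirement higher still:

```python
# Rough memory needed just for the Llama 2 70B weights in fp16/bf16.
params = 70e9
bytes_per_param = 2  # fp16/bf16
weight_gb = params * bytes_per_param / 1e9

print(weight_gb)  # 140.0 GB of weights alone
```

140 GB of weights does not fit in any single GPU, hence the multi-A100 setups above.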


  38. Murphy’s Law & LLMs


  39. Murphy’s Law example for LLMs
    “If it can go wrong, it will go wrong.”
    It’s not a pessimistic statement; it’s an engineering principle.
    So you just have to prepare for it.


  40. Summary Ranking established in literature.
    “insiders say the row brought simmering
    tensions between the starkly contrasting
    pair -- both rivals for miliband's ear --
    to a head.”
    A: insiders say the row brought tensions between
    the contrasting pair.
    B: insiders say the row brought simmering tensions
    between miliband's ear.
    Remember this?


  41. Anyone see the Murphy’s law problem?
    We ran this for GPT-3.5-Turbo
    It got 96% accuracy. Human accuracy is 84%. Amazing!
    If it’s too good to be true, it probably is.
    Why?


  42. We had done our testing with A being the correct answer.
    GPT-3.5-Turbo always chose A.
    What happens if we make B the correct answer?
    Accuracy drops to 60%
    Ordering bias!


  43. Murphy’s law in practice
    What if we make B the correct answer in a second batch?
    A B = correct
    A A = bias to A
    B B = bias to B
    B A = incorrect (but at least consistent)
    How to deal with this?


  44. order_bias = abs(AA_ratio - BB_ratio)
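The metric can be computed from two runs of the same eval with the options flipped, as described in the previous slide. A minimal sketch (variable and function names are ours):

```python
# Order-bias metric: run each pair twice, once with the correct summary
# as option A and once flipped, then measure how often the model sticks
# with the same letter regardless of content.
def order_bias(run_a_first, run_b_first):
    """run_a_first: answers when the correct option is A;
    run_b_first: answers for the same items with the options flipped."""
    pairs = list(zip(run_a_first, run_b_first))
    aa_ratio = sum(x == "A" and y == "A" for x, y in pairs) / len(pairs)
    bb_ratio = sum(x == "B" and y == "B" for x, y in pairs) / len(pairs)
    return abs(aa_ratio - bb_ratio)

# A model that always answers "A" has maximal order bias:
print(order_bias(["A", "A", "A", "A"], ["A", "A", "A", "A"]))  # 1.0
```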


  45. Two Design Patterns we are
    seeing


  46. Principle 1 : One LLM does one job
    Don’t ask too much of an LLM
    Ask it to do one thing only


  47. Example 1: Factual summarization
    What’s the correct answer?
    Is it A or B?


  48. Example 2: Ansari
    Addl info required? → What search term should I use? →
    Vector DB → Augment Query → Respond to Query


  49. - An LLM application is not just an LLM + Set of Prompts
    - An LLM Application is a combination of:
    - Agents: LLM based process
    - Tools: Things that Agents can query
    - Presenters: Surface to user
    (e.g. stdio vs Slack vs Gradio)
    - One Agent is the primary Agent that the user talks to
    One Agent != One Application
    Corollary


  50. Design for swappability
    https://github.com/anyscale/factuality-eval
    Make it possible to swap out:
    - Prompts
    - LLMs
    - Pass in LLMs
    - External tools
    - Presentation
    - Other Agents
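One way to sketch the swappability idea. All names here are illustrative, not the repo's actual API: the agent takes its LLM, prompt, and tools as arguments, so any of them can be swapped:

```python
# Swappable agent: the LLM, prompt template and tools are passed in,
# not hard-coded, so experiments and tests can replace any of them.
class Agent:
    def __init__(self, llm, prompt_template, tools=None):
        self.llm = llm                      # callable: prompt -> str
        self.prompt_template = prompt_template
        self.tools = tools or {}

    def run(self, query: str) -> str:
        prompt = self.prompt_template.format(query=query)
        return self.llm(prompt)

# Swapping in a "mock" LLM for testing:
mock_llm = lambda prompt: "A"
agent = Agent(mock_llm, "Answer A or B only.\n{query}")
print(agent.run("Which summary is factual?"))  # A
```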


  51. - Experimentation
    - Try different prompts, but do so systematically
    - Incrementally change one thing at a time
    - Testing
    - Can put “mock” objects in.
    - Presentation
    - Can easily swap out different modes of interaction
    Helps with


  52. 1. Different ways to use LLMs - each with pros and cons
    Biased view: Use Anyscale Endpoints Preview or Aviary
    2. Fine tuning helps with form, but use RAG for facts
    3. Speed still matters for LLMs for preprocessing, fine-tuning and serving
    Biased View: Use Ray or Anyscale Platform (managed Ray)
    4. Design for Murphy’s Law
    Example: Present in different order for multichoice
    5. 2 Design Patterns
    a. One agent, one task
    b. Design for swappability
    Summarizing the whole talk


  53. Thank You!
    RAY SUMMIT - 18 - 20 September!
    Endpoints: endpoints.anyscale.com
    Aviary: github.com/ray-project/aviary
    Details: anyscale.com/blog
    Numbers: llm-numbers.ray.io
    Ray: ray.io
    Anyscale: anyscale.com
    Me: [email protected]
