
5 Painful Lessons using LLMs

Over the last year at Anyscale, we've built a number of our own LLM applications and helped our customers build theirs. Through that process we've arrived at 5 design patterns, or principles, that we believe significantly increase the chances of success in building LLM applications. This talk discusses those 5 patterns and how to build them into your application. It addresses:

1. What exactly an LLM application is.
2. How to design for easy evaluation and testability.
3. Making LLM components reusable across applications.

Anyscale

August 31, 2023

Transcript

  1. Summary: 1. Three ways to use LLMs. Choose wisely. 2. Know what fine-tuning helps with. Know the other tools. 3. Speed, cost and scale matter from day one. 4. Murphy's law also applies to LLMs. 5. Two design patterns we keep coming back to.
  2. 3 ways to deploy?
     - Commercial closed (OpenAI, Anthropic, Cohere, etc.)
     - Hosted OSS: per machine (Hugging Face, Replicate, Vertex AI, etc.) or per token (Anyscale Endpoints)
     - Self-hosted: run it yourself on top of machines
  4. Comparing the three approaches:
     Approach         | Quality | Cost | Ease of Use | Privacy
     Commercial APIs  | 😃      | 😦   | 😃          | 😦
     Hosted OSS       | 😐→😃   | 😐   | 😐          | 😐
     Self hosted      | 😐→😃   | ?    | 😦          | 😃
  5. Quality: Commercial APIs (in particular GPT-4) are the best. Llama 2 models set a new bar for OSS models, and the gap continues to close. (This morning) Google will now serve Llama 2 models.
  6. Llama 2 models: released 5 weeks ago + last week. 3 sizes: 7b, 13b, 70b. Permissive license: can be used commercially, can't be used to train other models, and companies with > 700 million MAU need a license. Completely changed the game; nobody even talks about Falcon much. Last week: Code Llama 7b, 13b, 34b.
  7. Factuality eval: summary ranking established in the literature. Example of comparable quality:
     "insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head."
     A: insiders say the row brought tensions between the contrasting pair.
     B: insiders say the row brought simmering tensions between miliband's ear.
  8. Self-hosted: good serving options now! github.com/ray-project/aviary, built on top of Ray Serve (TGI used to be good but …). It can be cheaper, but not always; it depends on your workload, the complexity of autoscaling, etc. $1 per million tokens is pretty hard to beat, because we aggregate across users.
  9. Self-hosted Llama 2 models:
     - Llama 2 7B: one g5.2xlarge is ~$7,000/yr. It can do ~700 tokens/s, with no autoscaling or redundancy.
     - Llama 2 70B: you need 4x A100 80GB (like hen's teeth to get). At Lambda Labs' $2/GPU/hr, we're talking ~$70,000/yr.
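To put those figures in per-token terms, here is a back-of-envelope calculation (ours, not from the deck) built only from the numbers on the slide; it assumes the machines run flat out all year, which real workloads rarely do:

```python
# Rough cost math from the slide's own figures. Assumes 100% utilization,
# 24/7, which overstates what a bursty workload actually achieves.
HOURS_PER_YEAR = 24 * 365  # 8760

# Llama 2 7B on one g5.2xlarge: ~$7,000/yr at ~700 tokens/s.
tokens_per_year = 700 * 3600 * HOURS_PER_YEAR   # ~22 billion tokens
print(7_000 / (tokens_per_year / 1e6))          # ~$0.32 per million tokens

# Llama 2 70B on 4x A100 80GB at $2/GPU/hr:
print(4 * 2 * HOURS_PER_YEAR)                   # $70,080/yr
```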
  10. Ease of use: APIs are easy, and Anyscale Endpoints is API compatible with OpenAI (a 3 line change, sketched below). But: ChatGPT follows instructions; Llama 2 doesn't always.
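A minimal sketch of what that 3 line change looks like with the pre-1.0 openai Python client; the base URL and model name here are illustrative and should be checked against the provider's docs:

```python
import openai

openai.api_base = "https://api.endpoints.anyscale.com/v1"  # 1. point at the new endpoint
openai.api_key = "YOUR_ANYSCALE_API_KEY"                   # 2. use the new key
response = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",                # 3. pick an OSS model
    messages=[{"role": "user", "content": "Please give an A or a B."}],
)
print(response["choices"][0]["message"]["content"])
```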
  11. Example. What we asked for: "Please give an A or a B." What we got from GPT-4: "A". What we got from Llama 2 70b: "The correct answer is A: those who receive centrelink payments made up half of radio rental's income last year. Explanation: Summary A accurately summarizes the article sentence by mentioning that those who receive centrelink payments made up half of radio rental's income last year. It maintains the same meaning and information as the original sentence. On the other hand, Summary B is inconsistent with the article sentence. It suggests that the ABC's report only mentioned that those who receive centrelink payments made up radio rental's income last year, which is not entirely accurate. The article sentence explicitly states that the ABC reported that those who receive centrelink payments made up half of radio rental's income last year. Therefore, Summary A is the better choice."
  12. How we fixed it: we had to write our own little answer extractor using another LLM!
     System prompt: You are a helpful assistant that carefully follows instruction. You provide only answers, no explanations.
     User prompt: Determine if the following text says whether the answer is A, B or other. Only output a single word, either: A B or other. Text: {query}
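A minimal sketch of that extractor as a function. The two prompts are from the slide; the choice of extractor model and the function name are our own assumptions:

```python
import openai

SYSTEM_PROMPT = (
    "You are a helpful assistant that carefully follows instruction. "
    "You provide only answers, no explanations."
)
USER_TEMPLATE = (
    "Determine if the following text says whether the answer is A, B or other. "
    "Only output a single word, either: A B or other\n"
    "Text: {query}"
)

def extract_answer(raw_output: str) -> str:
    """Boil a verbose model answer down to a single label: A, B, or other."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumption: any instruction-following model works here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(query=raw_output)},
        ],
        temperature=0,  # extraction should be deterministic
    )
    return response["choices"][0]["message"]["content"].strip()
```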
  13. Privacy: changing terrain. It used to be that self-hosted was clearly the safest, but now Azure and OpenAI offer to run in your cloud.
  14. What is fine-tuning? Incremental retraining of existing models with domain-specific data. Example: fine-tuning on Shakespeare.
  15. Pre fine-tuning: "Once upon a time, there was a horse. But this particular horse was too big to be put into a normal stall. Instead, the animal was moved into an indoor pasture, where it could take a few hours at a time out of the stall. The problem was that this pasture was so roomy that the horse would often get a little bored being stuck inside."
  16. Post fine-tuning: "Once upon a time there was a horse. This horse was in my youth, a little unruly, but yet the best of all. I have, sir; I know every horse in the field, and the best that I have known is the dead. And now I thank the gods, and take my leave."
  17. Does it work? Yes! See the work by Kourosh Hakhameneshi and Rehaan Ahmad: "Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications".
  18. Big improvements: Llama 2 13B went from 42% -> 89%, outperforming the general GPT-4 (80%). Amazing because, according to rumors, GPT-4 has 1.4T parameters, so at least 100x more. But it's not just an open source thing any more … last week GPT-3.5-Turbo tuning was released.
  19. GPT-3.5 fine-tuning (released Tuesday!): impressive results! End to end took about 75 minutes, and we spent more time on the jsonl conversion than on the fine-tuning itself. It just works: kick off the job, and 40 min / $35 / 4M tokens later you get an email. Then change 1 line to use the new model.
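For reference, the jsonl conversion the slide mentions produces one chat-format training example per line. A minimal sketch, with invented content, in the messages schema that OpenAI's chat fine-tuning expects:

```python
import json

# One training example per line of train.jsonl; the content here is made up.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You provide only answers, no explanations."},
            {"role": "user", "content": "Which summary is better, A or B?"},
            {"role": "assistant", "content": "A"},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```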
  20. Fine tuning solves everything? No. It helps with the form, with the shape, with the vocabulary, with the "feel" (the Shakespeare example). But it does not help with facts.
  21. One technique that does work: Retrieval Augmented Generation. Hit a database of facts and provide that to the LLM.
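A minimal sketch of that pattern; `embed`, `vector_db`, and `llm` are hypothetical stand-ins for whatever embedding model, vector store, and LLM client you use:

```python
def answer_with_rag(question: str, llm, vector_db, embed, k: int = 3) -> str:
    """Retrieve k relevant facts, then let the LLM answer from them."""
    # 1. Hit the database of facts.
    facts = vector_db.search(embed(question), top_k=k)
    # 2. Provide the facts to the LLM alongside the question.
    context = "\n".join(f"- {fact}" for fact in facts)
    prompt = (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)
```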
  22. Speed, cost and performance still matter. The key is really distributed computing; you need to think holistically. Example: RAG, and how do we speed it up? Ray is your friend.
  23. Cade is going to dive into an example of this in serving.
  24. Only the most advanced machines run Llama 2 70b. GPUs we use:
     - A10Gs w/ 24GB: super slow and not much memory. Only good for Llama 7b models over 2 GPUs, or Llama 13b models over 2 or 4 GPUs.
     - 8x A100s w/ 80GB: awesome if you can get them; we've been fighting to get them. p4de.24xlarge: none available in AWS, and if you can get them it's $40/hr.
     - The new up-and-comers: Lambda Labs (~$2/GPU/hr = $16/hr for an 8xA100 box) and Coreweave.
  25. Murphy's Law example for LLMs: "If it can go wrong, it will go wrong." It's not a pessimistic statement; it's an engineering principle. So you just have to prepare for it.
  26. Remember this? Summary ranking established in the literature:
     "insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head."
     A: insiders say the row brought tensions between the contrasting pair.
     B: insiders say the row brought simmering tensions between miliband's ear.
  27. Anyone see the Murphy's law problem? We ran this for GPT-3.5-Turbo. It got 96% accuracy; human is 84%. Amazing! If it's too good to be true, it probably is. Why?
  28. Ordering bias! We had done our testing with A always being the correct answer, and GPT-3.5-Turbo always chose A. What happens if we make B the correct answer? Accuracy drops to 60%.
  29. Murphy's law in practice: what if we make B the correct answer in a second batch and compare the two answers?
     - A then B = correct
     - A then A = bias to A
     - B then B = bias to B
     - B then A = incorrect (but at least consistent)
     How to deal with this? See the sketch below.
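A minimal sketch of that order-swap check; `ask` is a hypothetical function that presents the two candidate summaries to the model and returns "A" or "B":

```python
def classify_pair(ask, article, correct, incorrect):
    """Run the same question twice with the answer positions swapped."""
    first = ask(article, a=correct, b=incorrect)   # correct answer sits at A
    second = ask(article, a=incorrect, b=correct)  # correct answer sits at B
    outcomes = {
        ("A", "B"): "correct",
        ("A", "A"): "bias to A",
        ("B", "B"): "bias to B",
        ("B", "A"): "incorrect (but at least consistent)",
    }
    return outcomes[(first, second)]
```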
  30. Principle 1: one LLM does one job. Don't ask too much of an LLM; ask it to do one thing only.
  31. Example 2: Ansari, a pipeline of single-job steps: Is additional info required? -> What search term should I use? -> Vector DB -> Augment query -> Respond to query.
  32. Corollary: One Agent != One Application. An LLM application is not just an LLM + a set of prompts. An LLM application is a combination of:
     - Agents: LLM-based processes
     - Tools: things that Agents can query
     - Presenters: the surface to the user (e.g. stdio vs Slack vs Gradio)
     One Agent is the primary Agent that the user talks to.
  33. Design for swappability (https://github.com/anyscale/factuality-eval). Make it possible to swap out:
     - Prompts
     - LLMs (pass in LLMs)
     - External tools
     - Presentation
     - Other Agents
     A sketch follows this list.
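A minimal sketch of the idea: the agent takes its LLM, prompt and tools as constructor arguments, so each can be swapped (or mocked in tests) without touching the agent itself. All names here are illustrative:

```python
from typing import Callable, Dict

class Agent:
    """An agent whose LLM, prompt and tools are all injected, hence swappable."""

    def __init__(self, llm: Callable[[str], str], prompt_template: str,
                 tools: Dict[str, Callable[[str], str]]):
        self.llm = llm
        self.prompt_template = prompt_template
        self.tools = tools

    def run(self, query: str) -> str:
        # Tools (e.g. a vector DB) are looked up by name, so a mock can
        # stand in during tests.
        context = self.tools["retriever"](query)
        return self.llm(self.prompt_template.format(query=query, context=context))

# Swap in a mock LLM and a mock tool for a fast, deterministic test:
agent = Agent(
    llm=lambda prompt: "A",
    prompt_template="Context: {context}\nQuestion: {query}",
    tools={"retriever": lambda q: "stub facts"},
)
assert agent.run("Which summary is better?") == "A"
```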
  34. Helps with:
     - Experimentation: try different prompts, but do so systematically; incrementally change one thing at a time.
     - Testing: can put "mock" objects in.
     - Presentation: can easily swap out different modes of interaction.
  35. Summarizing the whole talk:
     1. Different ways to use LLMs, each with pros and cons. Biased view: use Anyscale Endpoints Preview or Aviary.
     2. Fine tuning helps with form, but use RAG for facts.
     3. Speed still matters for LLMs: for preprocessing, fine tuning and serving. Biased view: use Ray or the Anyscale Platform (managed Ray).
     4. Design for Murphy's Law. Example: present choices in a different order for multi-choice.
     5. Two design patterns: (a) one agent, one task; (b) design for swappability.
  36. Thank You! RAY SUMMIT, 18-20 September! Endpoints: endpoints.anyscale.com. Aviary: github.com/ray-project/aviary. Details: anyscale.com/blog. Numbers: llm-numbers.ray.io. Ray: ray.io. Anyscale: anyscale.com. Me: [email protected]