Slide 1

Slide 1 text

5 Painful Lessons using LLMs M Waleed Kadous Chief Scientist, Anyscale

Slide 2

Slide 2 text

Summary
1. Three ways to use LLMs. Choose wisely.
2. Know what fine-tuning helps with. Know the other tools.
3. Speed, cost and scale matter from day one.
4. Murphy’s law also applies to LLMs.
5. Two design patterns we keep coming back to.

Slide 3

Slide 3 text

Three ways to use LLMs. Choose wisely.

Slide 4

Slide 4 text

3 ways to deploy?
- Commercial Closed (OpenAI, Anthropic, Cohere, etc.)
- Hosted OSS
  - Per machine (Hugging Face, Replicate, Vertex AI, etc.)
  - Per token (Anyscale Endpoints)
- Self hosted
  - Run it yourself on top of machines

Slide 6

Slide 6 text

Approach        | Quality | Cost | Ease of Use | Privacy
Commercial APIs | 😃      | 😦   | 😃          | 😦
Hosted OSS      | 😐→😃   | 😐   | 😐          | 😐
Self hosted     | 😐→😃   | ?    | 😦          | 😃

Slide 7

Slide 7 text

Quality
- Commercial APIs (in particular GPT-4) are the best
- Llama 2 models set a new bar for OSS models
- The gap continues to close
- (This morning) Google will now serve Llama 2 models

Slide 8

Slide 8 text

Llama 2 models
- Released 5 weeks ago + last week
- 3 sizes: 7b, 13b, 70b
- Permissive licence:
  - Can be used commercially
  - Can’t be used to train other models
  - Companies w/ > 700 million MAU need a license
- Completely changed the game
  - Nobody even talks about Falcon much
- Last week: Code Llama 7b, 13b, 34b

Slide 9

Slide 9 text

Factuality eval
Example of comparable quality. Summary ranking established in literature.
Article: “insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head.”
A: insiders say the row brought tensions between the contrasting pair.
B: insiders say the row brought simmering tensions between miliband's ear.

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Cost
GPT-4 is expensive – ~30x the cost of Llama 2 70b for similar performance.

Slide 12

Slide 12 text

Hosted OSS
Can now readily get a hosted version of Llama 2: endpoints.anyscale.com

Slide 13

Slide 13 text

Self-Hosted
- Good serving options now!
  - github.com/ray-project/aviary
  - Built on top of Ray Serve
  - TGI used to be good but …
- Can be cheaper, but not always
  - Depends on your workload
  - Complexity of autoscaling etc.
- $1 per million tokens is pretty hard to beat
  - We aggregate across users

Slide 14

Slide 14 text

Self hosted Llama 2 models
Llama 2 7B:
- One g5.2xlarge is ~$7,000/yr
- Can do ~700 tokens/s
- No autoscaling or redundancy
Llama 2 70B:
- You need 4x A100 80GB – as rare as hen’s teeth
- Lambda Labs: $2/GPU/hr, so we’re talking ~$70,000/yr
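
The yearly figures follow from simple hourly-rate arithmetic; a quick sketch (rates as quoted above, assuming 24/7 operation — the implied g5.2xlarge hourly rate is back-derived, not a published price):

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours

# Llama 2 70B: 4x A100 80GB at the quoted ~$2/GPU/hr (Lambda Labs)
gpus = 4
rate_per_gpu_hr = 2.00
yearly_70b = gpus * rate_per_gpu_hr * HOURS_PER_YEAR  # ~$70,080/yr

# Llama 2 7B: the ~$7,000/yr figure for one g5.2xlarge implies
# roughly $0.80/hr, i.e. a discounted/reserved rate rather than on-demand
implied_hourly = 7000 / HOURS_PER_YEAR
```

Note that both figures are for always-on capacity with no autoscaling or redundancy, which is exactly the caveat above.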

Slide 15

Slide 15 text

Ease of use
- APIs are easy
  - Anyscale Endpoints is API compatible: a 3-line change
- But:
  - ChatGPT follows instructions
  - Llama 2 doesn’t always

Slide 16

Slide 16 text

Example
What we asked for: Please give an A or a B.
What we got from GPT-4: A
What we got from Llama 2 70b: ‘The correct answer is A: those who receive centrelink payments made up half of radio rental's income last year. Explanation: Summary A accurately summarizes the article sentence by mentioning that those who receive centrelink payments made up half of radio rental's income last year. It maintains the same meaning and information as the original sentence. On the other hand, Summary B is inconsistent with the article sentence. It suggests that the ABC's report only mentioned that those who receive centrelink payments made up radio rental's income last year, which is not entirely accurate. The article sentence explicitly states that the ABC reported that those who receive centrelink payments made up half of radio rental's income last year. Therefore, Summary A is the better choice’.

Slide 17

Slide 17 text

How we fixed it
We had to write our own little answer extractor using another LLM!
System prompt:
  You are a helpful assistant that carefully follows instruction. You provide only answers, no explanations.
User prompt:
  Determine if the following text says whether the answer is A, B or other. Only output a single word, either: A B or other
  Text: {query}
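
A minimal sketch of how such an extractor can be wired up, using the two prompts above verbatim; the actual chat call (to any OpenAI-compatible endpoint) is left out, and the `normalize` fallback for non-single-word replies is an assumption, not from the talk:

```python
EXTRACTOR_SYSTEM = (
    "You are a helpful assistant that carefully follows instruction. "
    "You provide only answers, no explanations."
)

def extractor_messages(query: str) -> list:
    """Build the chat messages sent to the answer-extractor LLM."""
    user = (
        "Determine if the following text says whether the answer is A, B or other. "
        "Only output a single word, either: A B or other\n"
        f"Text: {query}"
    )
    return [
        {"role": "system", "content": EXTRACTOR_SYSTEM},
        {"role": "user", "content": user},
    ]

def normalize(raw: str) -> str:
    """Map the extractor's (hopefully single-word) reply onto A/B/other."""
    word = raw.strip().strip(".'\"").upper()
    return word if word in ("A", "B") else "other"
```

With this, even a rambling Llama 2 answer gets funneled back into the three labels the evaluation needs.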

Slide 18

Slide 18 text

Privacy
- Changing terrain
- It used to be that self hosted was clearly the safest
- But now Azure and OpenAI offer to run in your cloud

Slide 19

Slide 19 text

Fine tuning is not a magic wand

Slide 20

Slide 20 text

What is fine tuning?
Incremental retraining of existing models with domain-specific data.
Example: fine tuning on Shakespeare

Slide 21

Slide 21 text

Once upon a time, there was a horse. But this particular horse was too big to be put into a normal stall. Instead, the animal was moved into an indoor pasture, where it could take a few hours at a time out of the stall. The problem was that this pasture was so roomy that the horse would often get a little bored being stuck inside. Pre fine tuning

Slide 22

Slide 22 text

Once upon a time there was a horse. This horse was in my youth, a little unruly, but yet the best of all. I have, sir; I know every horse in the field, and the best that I have known is the dead. And now I thank the gods, and take my leave. Post fine tuning

Slide 23

Slide 23 text

Work by Kourosh Hakhameneshi and Rehaan Ahmad Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications Does it work? Yes!

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Big improvements
- Llama 2 13B went from 42% -> 89%
- Outperformed GPT-4: 80% general
- Amazing because, according to rumors, GPT-4 has 1.4T parameters – at least 100x more
But it’s not just an open source thing any more … last week GPT-3.5-Turbo tuning was released

Slide 27

Slide 27 text

GPT-3.5 fine-tuning (released Tuesday!)
- End to end took about 75 minutes
- Spent more time on jsonl conversion than on the FT itself
- It just works: kick off -> 40 min / $35 / 4M tokens later -> email -> change 1 line -> impressive results!
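
Since the jsonl conversion took longer than the fine-tune itself, here is a minimal sketch of that step, assuming training data arrives as (prompt, target) pairs — the pair format and system message are illustrative assumptions, not from the talk:

```python
import json

def to_finetune_jsonl(examples, system="You are a helpful assistant."):
    """Convert (prompt, target) pairs into chat-format JSONL,
    one training example per line, as the GPT-3.5 fine-tuning API expects."""
    lines = []
    for prompt, target in examples:
        record = {"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": target},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Write the result to a `.jsonl` file and upload it; after that, kicking off the job really is the easy part.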

Slide 28

Slide 28 text

Fine tuning solves everything?
No. It helps with the form, the shape, the vocabulary, the “feel” – the Shakespeare example. But it does not help with facts.

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

A more holistic approach

Slide 31

Slide 31 text

One technique that does work: Retrieval Augmented Generation
- Hit a database of facts and provide the results to the LLM
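
A toy sketch of the RAG loop: here retrieval is naive keyword overlap standing in for a real vector DB (embeddings + similarity search), and the prompt wording is illustrative, not from the talk:

```python
def retrieve(query: str, facts: list, k: int = 2) -> list:
    """Rank facts by word overlap with the query and return the top k.
    A real system would embed both and query a vector DB instead."""
    q = set(query.lower().split())
    scored = sorted(facts, key=lambda f: -len(q & set(f.lower().split())))
    return scored[:k]

def augment(query: str, facts: list) -> str:
    """Prepend retrieved facts so the LLM answers from them, not from memory."""
    context = "\n".join(f"- {f}" for f in retrieve(query, facts))
    return f"Answer using only these facts:\n{context}\n\nQuestion: {query}"
```

The augmented prompt then goes to the LLM as usual; the model supplies the form, the database supplies the facts.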

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Speed, Cost and Performance important

Slide 34

Slide 34 text

Speed, Cost and Performance still matter
- The key is really distributed computing
- Need to think holistically
- Example: RAG – how do we speed it up?
- Ray is your friend

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

Speed, Cost and Performance still matter
- The key is really distributed computing
- Need to think holistically
- Example: RAG – how do we speed it up?
- Ray is your friend
- Cade is going to dive into an example of this in serving

Slide 37

Slide 37 text

Only the most advanced machines run Llama 70b
2 GPUs we use:
- A10Gs w/ 24GB – super slow and not much memory
  - Only good for Llama 7b models over 2 GPUs
  - Only good for Llama 13b models over 2 or 4 GPUs
- 8x A100s w/ 80GB – awesome if you can get them
  - We’ve been fighting to get them
  - p4de.24xlarge
  - None in AWS, and if you can get them it’s $40/hr
The new up-and-comers:
- Lambda Labs (~$2/GPU/hr = $16/hr for an 8x A100 box)
- Coreweave

Slide 38

Slide 38 text

Murphy’s Law & LLMs

Slide 39

Slide 39 text

Murphy’s Law example for LLMs
“If it can go wrong, it will go wrong.”
It’s not a pessimistic statement; it’s an engineering principle. So you just have to prepare for it.

Slide 40

Slide 40 text

Remember this?
Summary ranking established in literature.
Article: “insiders say the row brought simmering tensions between the starkly contrasting pair -- both rivals for miliband's ear -- to a head.”
A: insiders say the row brought tensions between the contrasting pair.
B: insiders say the row brought simmering tensions between miliband's ear.

Slide 41

Slide 41 text

Why?
Anyone see the Murphy’s law problem?
We ran this for GPT-3.5-Turbo. It got 96% accuracy; human is 84%. Amazing!
If it’s too good to be true, it probably is.

Slide 42

Slide 42 text

Ordering bias!
We had done our testing with A being the correct answer, and GPT-3.5-Turbo always chose A.
What happens if we make B the correct answer? Accuracy drops to 60%.

Slide 43

Slide 43 text

Murphy’s law in practice
How to deal with this? Make B the correct answer in a second batch, then compare answers across the two runs:
A B = correct
A A = bias to A
B B = bias to B
B A = incorrect (but at least consistent)

Slide 44

Slide 44 text

order_bias = abs(AA_ratio - BB_ratio)
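
The formula can be computed directly from the two paired runs; a small sketch (the run encoding — a list of per-question answers, correct answer presented as A in run 1 and as B in run 2 — is an assumed representation):

```python
def order_bias(first_run: list, second_run: list) -> float:
    """abs(AA_ratio - BB_ratio) over paired runs.

    (A, A) means the model said A both times: bias toward A.
    (B, B) means it said B both times: bias toward B.
    """
    pairs = list(zip(first_run, second_run))
    n = len(pairs)
    aa = sum(1 for x, y in pairs if (x, y) == ("A", "A")) / n
    bb = sum(1 for x, y in pairs if (x, y) == ("B", "B")) / n
    return abs(aa - bb)
```

A bias of 0 means any position preference is at least symmetric; a large value means the model is picking a letter, not an answer.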

Slide 45

Slide 45 text

Two Design Patterns we are seeing

Slide 46

Slide 46 text

Principle 1: One LLM does one job
Don’t ask too much of an LLM. Ask it to do one thing only.

Slide 47

Slide 47 text

Example 1: Factual summarization What’s the correct answer? Is it A or B?

Slide 48

Slide 48 text

Example 2: Ansari (agent flow)
Addl info required? If so: what search term should I use? -> Vector DB -> Augment query. Then: respond to query.

Slide 49

Slide 49 text

Corollary: One Agent != One Application
- An LLM application is not just an LLM + a set of prompts
- An LLM application is a combination of:
  - Agents: LLM-based processes
  - Tools: things that Agents can query
  - Presenters: the surface to the user (e.g. stdio vs Slack vs Gradio)
- One Agent is the primary Agent that the user talks to

Slide 50

Slide 50 text

Design for swappability
https://github.com/anyscale/factuality-eval
Make it possible to swap out:
- Prompts
- LLMs
  - Pass in LLMs
- External tools
- Presentation
- Other Agents
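
One way to sketch the swappable design in Python: every dependency (LLM, prompt, tools) is injected, so any of them — including a mock LLM for testing — can be swapped without touching the agent. All names here are illustrative, not taken from the factuality-eval repo:

```python
from typing import Callable, Optional

class Agent:
    """One agent, one job: dependencies are passed in, so prompts,
    LLMs, tools and presenters can each be swapped independently."""

    def __init__(self, llm: Callable, prompt_template: str,
                 tools: Optional[dict] = None):
        self.llm = llm                  # any callable str -> str
        self.template = prompt_template # swap prompts without code changes
        self.tools = tools or {}        # name -> callable str -> str

    def run(self, query: str) -> str:
        # Gather tool output, fill the prompt, delegate to whichever LLM
        # was injected (real endpoint or a mock).
        context = "\n".join(t(query) for t in self.tools.values())
        return self.llm(self.template.format(context=context, query=query))

def present_stdio(text: str) -> None:
    """Presenter for the terminal; swap for Slack/Gradio without touching Agent."""
    print(text)
```

In tests, `Agent(lambda p: "A", template)` replaces the real endpoint with a mock, which is exactly the "can put mock objects in" benefit on the next slide.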

Slide 51

Slide 51 text

Helps with
- Experimentation
  - Try different prompts, but do so systematically
  - Incrementally change one thing at a time
- Testing
  - Can put “mock” objects in
- Presentation
  - Can easily swap out different modes of interaction

Slide 52

Slide 52 text

Summarizing the whole talk
1. Different ways to use LLMs – each with pros and cons
   - Biased view: use Anyscale Endpoints Preview or Aviary
2. Fine tuning helps with form, but use RAG for facts
3. Speed still matters for LLMs: preproc, fine tuning and serving
   - Biased view: use Ray or the Anyscale Platform (managed Ray)
4. Design for Murphy’s Law
   - Example: present in a different order for multiple choice
5. Two design patterns:
   a. One agent, one task
   b. Design for swappability

Slide 53

Slide 53 text

Thank You!
RAY SUMMIT – 18-20 September!
- Endpoints: endpoints.anyscale.com
- Aviary: github.com/ray-project/aviary
- Details: anyscale.com/blog
- Numbers: llm-numbers.ray.io
- Ray: ray.io
- Anyscale: anyscale.com
- Me: [email protected]