Slide 1

Slide 1 text

Evaluating LLMs: For your real world application
How you can measure the reliability of your LLM application outputs
by Tanuj Jain (Team AI)
© ideas engineering 2023

Slide 2

Slide 2 text

Team AI
Eating, sleeping and pooping AI since much before ChatGPT

Slide 3

Slide 3 text

Agenda
• What’s the deal with evaluation?
• Why do I need to do it?
• Why is it hard?
• Issues with current ways of evaluation
• Our approach with HeyBild
• Way forward
Ideas Review

Slide 4

Slide 4 text

Why do I need to evaluate my LLM app?

Slide 5

Slide 5 text

What’s so hard about Evaluation?

Slide 6

Slide 6 text

What’s so hard about Evaluation?
Q: How’s the weather in Berlin today?
A1: It’s 30 degrees Celsius

Slide 7

Slide 7 text

What’s so hard about Evaluation?
Q: How’s the weather in Berlin today?
A1: It’s 30 degrees Celsius
A2: It’s 86 degrees Fahrenheit

Slide 8

Slide 8 text

What’s so hard about Evaluation?
Q: How’s the weather in Berlin today?
A1: It’s 30 degrees Celsius
A2: It’s 86 degrees Fahrenheit
A3: It’s windy with a 50% chance of rain

Slide 9

Slide 9 text

What’s so hard about Evaluation?
Q: How’s the weather in Berlin today?
A1: It’s 30 degrees Celsius
A2: It’s 86 degrees Fahrenheit
A3: It’s windy with a 50% chance of rain
A4: Do you need a better conversation starter?
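
The answers above suggest why naive string comparison breaks down as an evaluation method. A minimal sketch, assuming a hypothetical `exact_match` check and a hypothetical `temperature_celsius` helper (neither comes from the talk): A1 and A2 state the same temperature yet fail an exact match, while A3 is a plausible answer that contains no temperature at all.

```python
import re
from typing import Optional

A1 = "It's 30 degrees Celsius"
A2 = "It's 86 degrees Fahrenheit"
A3 = "It's windy with a 50% chance of rain"


def exact_match(expected: str, answer: str) -> bool:
    """Naive check: case-insensitive string equality."""
    return expected.strip().lower() == answer.strip().lower()


def temperature_celsius(text: str) -> Optional[float]:
    """Hypothetical helper: extract a temperature and normalise it to Celsius."""
    m = re.search(
        r"(-?\d+(?:\.\d+)?)\s*degrees\s+(celsius|fahrenheit)", text, re.IGNORECASE
    )
    if m is None:
        return None  # the answer does not mention a temperature
    value, unit = float(m.group(1)), m.group(2).lower()
    return value if unit == "celsius" else (value - 32) * 5 / 9


print(exact_match(A1, A2))                               # False
print(temperature_celsius(A1), temperature_celsius(A2))  # 30.0 30.0
print(temperature_celsius(A3))                           # None
```

Unit normalisation rescues A2, but no hand-written rule of this kind covers A3 or A4, which is one reason a fixed comparison function is rarely enough on its own.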

Slide 10

Slide 10 text

How the majority evaluates

Slide 11

Slide 11 text

I can call this hot API and do AI now

Slide 12

Slide 12 text

Gimme weather
86 degrees

Slide 13

Slide 13 text

Is it actually 86 degrees?

Slide 14

Slide 14 text

Fast typing
Yes

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Ran correctly 7 times, my life’s awesome!

Slide 17

Slide 17 text

DEPLOY TO PROD!!!!

Slide 18

Slide 18 text

Gimme weather
86 degrees
IN CELSIUSSSS!!!
30 degrees

Slide 19

Slide 19 text

Fast typing

Slide 20

Slide 20 text

Update prompt to always work with Celsius

Slide 21

Slide 21 text


Slide 22

Slide 22 text

Problems
- No thoroughness
- Manual
- No regression check
- Unreliable
- Unscalable
- Could be REALLLLY dangerous for your brand

Slide 23

Slide 23 text

Problems: Kill your brand value

Slide 24

Slide 24 text

Our approach with HeyBild

Slide 25

Slide 25 text

HeyBild eval
1. Classify user query types
2. Manually list down a few hundred user prompts and expected correct answers → Test Set
3. Classify response types
4. Use the existing system to answer the test set
5. Use Judge LLM(s) to match generated answers with expected answers
6. Get a score
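
The steps above can be sketched as a small evaluation loop. This is a minimal illustration under stated assumptions, not the HeyBild implementation: `EvalCase`, `run_eval`, `stub_system`, `stub_judge`, the query type labels and the two test cases are all hypothetical. The system under test and the judge are passed in as plain functions, so the loop stays independent of any particular LLM provider. Step 3 (classifying response types) is not modelled here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    query_type: str  # step 1: label from query classification (hypothetical values)
    prompt: str      # step 2: user prompt
    expected: str    # step 2: expected correct answer


def run_eval(
    test_set: List[EvalCase],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Steps 4 to 6: answer every prompt, let the judge compare, return accuracy in percent."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = 0
    for case in test_set:
        generated = answer_fn(case.prompt)                   # step 4
        if judge_fn(case.prompt, case.expected, generated):  # step 5
            correct += 1
    return 100.0 * correct / len(test_set)                   # step 6


# Stand-ins so the sketch runs without any LLM. A real setup would call the
# deployed system and a judge model here instead.
canned_answers = {
    "How's the weather in Berlin today?": "It's 30 degrees Celsius",
    "What is the capital of Germany?": "Munich",
}


def stub_system(prompt: str) -> str:
    return canned_answers[prompt]


def stub_judge(prompt: str, expected: str, generated: str) -> bool:
    # Substring match stands in for the judge LLM's comparison.
    return expected.lower() in generated.lower()


test_set = [
    EvalCase("weather", "How's the weather in Berlin today?", "30 degrees Celsius"),
    EvalCase("facts", "What is the capital of Germany?", "Berlin"),
]

print(run_eval(test_set, stub_system, stub_judge))  # 50.0
```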

Slide 26

Slide 26 text

HeyBild eval
• Classify user query types
  • Which team won the last European Championship?
  • Is Israel to blame for the war in Gaza?
  • Is there a driver downloader for Arch Linux?
  • When does the 2024 European Football Championship begin?

Slide 27

Slide 27 text

HeyBild eval
• Manually list down a few hundred user prompts and expected correct answers → Test Set

Slide 28

Slide 28 text

HeyBild eval
• Classify response types

Slide 29

Slide 29 text

HeyBild eval
• Use the existing system to answer the test set

Slide 30

Slide 30 text

HeyBild eval
• Use Judge LLM(s) to match generated answers with expected answers
Accuracy on Test Set = x%
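
One way to read "Judge LLM(s)" is as several judge models voting on each answer. The sketch below is an assumption about how that could look, not the prompt or logic used for HeyBild: the prompt template, `build_judge_prompt`, `parse_verdict` and `judge_by_majority` are hypothetical, and each judge is represented by a plain function that takes a prompt string and returns the model's raw reply.

```python
from typing import Callable, Sequence

# Hypothetical judge prompt; the wording is illustrative only.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Expected answer: {expected}\n"
    "Generated answer: {generated}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def build_judge_prompt(question: str, expected: str, generated: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, expected=expected, generated=generated
    )


def parse_verdict(raw: str) -> bool:
    """Treat anything other than a clear CORRECT as a failure."""
    return raw.strip().upper().rstrip(".") == "CORRECT"


def judge_by_majority(
    llm_calls: Sequence[Callable[[str], str]],
    question: str,
    expected: str,
    generated: str,
) -> bool:
    """Ask every judge the same question and accept on a strict majority."""
    prompt = build_judge_prompt(question, expected, generated)
    votes = [parse_verdict(call(prompt)) for call in llm_calls]
    return sum(votes) * 2 > len(votes)


# Fake judges standing in for real model calls.
fake_judges = [
    lambda prompt: "CORRECT",
    lambda prompt: "correct.",
    lambda prompt: "INCORRECT",
]

print(judge_by_majority(
    fake_judges,
    "How's the weather in Berlin today?",
    "It's 30 degrees Celsius",
    "It's 86 degrees Fahrenheit",
))  # True
```

Parsing strictly and counting unclear replies as failures keeps a chatty or malformed judge reply from inflating the accuracy figure.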

Slide 31

Slide 31 text

Update your system
- Update prompt to improve accuracy
- Use a different LLM at the backend
- Write a set of programmatic rules to guide LLM towards correct answers
- Improve retrieval system to give more relevant answers
- Use Guardrails to keep my brand safe
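
Any of the changes above can fix some answers while breaking others, so it helps to compare per-prompt results before and after a change. A minimal sketch, assuming each evaluation run is stored as a mapping from prompt to pass/fail; the function names and the sample results are hypothetical.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Prompts that passed before the change but fail (or are missing) after it."""
    return sorted(p for p, ok in before.items() if ok and not after.get(p, False))


def find_fixes(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Prompts that pass after the change but did not pass before it."""
    return sorted(p for p, ok in after.items() if ok and not before.get(p, False))


# Hypothetical judge results for the same test set, before and after a prompt update.
before = {"weather in Berlin": False, "capital of Germany": True}
after = {"weather in Berlin": True, "capital of Germany": False}

print(find_regressions(before, after))  # ['capital of Germany']
print(find_fixes(before, after))        # ['weather in Berlin']
```

In this example the overall accuracy is unchanged at 50%, yet one answer regressed, which a single aggregate score would hide.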

Slide 32

Slide 32 text

What do you need?
- Human(s) who:
  - Are very motivated
  - Live close to the real system
  - Have access to live system usage data
  - Can put in the manual work to compose a test set
- A dev team and a leadership which:
  - Understand the importance of evaluation
  - Are willing to look beyond the mindset of “Let’s-stitch-some-APIs-together-and-declare-success”
  - Do not treat evaluation as an afterthought but as a necessary component of the system

Slide 33

Slide 33 text

ATTITUDE CHANGE: Evaluation is not an afterthought

Slide 34

Slide 34 text

Way forward

Slide 35

Slide 35 text

- Incorporate a system for assessing retrieval quality
- Eval queries with changing answers
- Make Eval a part of CI/CD
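
Making eval part of CI/CD could be as simple as a gate that fails the pipeline when accuracy drops below a stored baseline. The sketch below is one possible shape under assumptions: `ci_gate`, the tolerance and the sample numbers are hypothetical and do not describe an existing HeyBild pipeline.

```python
def ci_gate(accuracy: float, baseline: float, tolerance: float = 1.0) -> int:
    """Return a process exit code: 0 if accuracy is within `tolerance`
    percentage points of the baseline (or better), otherwise 1."""
    return 0 if accuracy >= baseline - tolerance else 1


# Hypothetical numbers: the last accepted run scored 90.0%, the new run 87.5%.
exit_code = ci_gate(accuracy=87.5, baseline=90.0)
print("eval gate:", "pass" if exit_code == 0 else "fail")  # eval gate: fail

# In a CI script the code would be handed to the runner, for example via
# sys.exit(exit_code), so that a failing gate blocks the deployment.
```

The tolerance leaves room for run-to-run noise from the judge; how large it should be depends on the size of the test set.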

Slide 36

Slide 36 text

Thanks!

Slide 37

Slide 37 text

Zimmerstr. 50, 10117 Berlin
[email protected]
Ideas-engineering.io