Ideas Review

Agenda
• …evaluation?
• Why do I need to do it?
• Why is it hard?
• Issues with current ways of evaluation
• Our approach with HeyBild
• Way forward
What's so hard about Evaluation?

Q: …Berlin today?
A1: It's 30 degrees Celsius
A2: It's 86 degrees Fahrenheit
A3: It's windy with a 50% chance of rain
A4: Do you need a better conversation starter?
[Illustration: speech bubbles — "Gimme weather" / "86 degrees" / "IN CELSIUSSSS!!!" / "30 degrees"]
HeyBild eval

2. Manually write out a few hundred user prompts and their expected correct answers → test set
3. Classify response types
4. Use the existing system to answer the test set
5. Use Judge LLM(s) to match generated answers against expected answers
6. Get a score
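The scoring loop behind these steps can be sketched as follows. This is a minimal illustration, not the HeyBild implementation: the `judge` function here is a naive stand-in for a Judge LLM call, and the test set and system are made-up examples.

```python
from typing import Callable

def judge(generated: str, expected: str) -> bool:
    """Stand-in for a Judge LLM: a naive containment check.
    A real judge would be an LLM prompted to compare the two answers."""
    return expected.lower() in generated.lower()

def score(test_set: list[tuple[str, str]],
          answer_fn: Callable[[str], str]) -> float:
    """Answer every test-set prompt with the system under test (step 4),
    judge each answer (step 5), and return the fraction correct (step 6)."""
    hits = sum(judge(answer_fn(prompt), expected)
               for prompt, expected in test_set)
    return hits / len(test_set)

# Hypothetical test set and a mock system that always answers correctly
test_set = [
    ("What's the weather in Berlin today?", "30 degrees Celsius"),
    ("When does Euro 2024 start?", "14 June 2024"),
]
answers = dict(test_set)
print(score(test_set, lambda prompt: answers[prompt]))  # → 1.0
```

In practice the judge is where most of the subtlety lives — "86 degrees Fahrenheit" and "30 degrees Celsius" should both count as correct, which is exactly why an LLM judge is used instead of string matching.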
HeyBild eval

• Which team won the last European Championship?
• Is Israel to blame for the war in Gaza?
• Is there a driver downloader for Arch Linux?
• When does the 2024 European Championship start?
Update your system

- …accuracy
- Use a different LLM at the backend
- Write a set of programmatic rules to guide the LLM towards correct answers
- Improve the retrieval system to return more relevant answers
- Use guardrails to keep my brand safe
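A programmatic guardrail can be as simple as a screening function that runs over every response before it reaches the user. The sketch below is illustrative only — the blocked-topic list and fallback message are invented for the example, and a production guardrail would use far more robust classification than substring matching.

```python
# Hypothetical blocklist and fallback; real systems would use a
# classifier or a dedicated guardrails library instead of substrings.
BLOCKED_TOPICS = ("medical advice", "legal advice")
FALLBACK = "Sorry, I can't help with that topic."

def apply_guardrail(response: str) -> str:
    """Return the response unchanged unless it touches a blocked topic,
    in which case substitute a safe fallback message."""
    lowered = response.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return FALLBACK
    return response

print(apply_guardrail("It's 30 degrees Celsius in Berlin."))
print(apply_guardrail("Here is some medical advice: ..."))  # fallback fires
```

The same hook point is where brand-safety rules live: anything the system should never say gets intercepted in one place, independent of which LLM sits at the backend.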
What do you need?

- …very motivated
- Lives close to the real system
- Has access to live system usage data
- Can put in the manual work to compose a test set
- A dev team and a leadership which:
  - Understands the importance of evaluation
  - Is willing to look beyond the "let's-stitch-some-APIs-together-and-declare-success" mindset
  - Treats evaluation not as an afterthought but as a necessary component of the system