AI Camp Berlin 2024: Evaluating LLMs for your real world application

This talk details a real-life use case of evaluating the quality of an LLM-powered chatbot with millions of users.

Tanuj

October 07, 2024

Transcript

  1. Evaluating LLMs: For your real world application. How you can measure the reliability of your LLM application outputs. By Tanuj Jain (Team AI)
  2. Team AI: Eating, sleeping and pooping AI since much before ChatGPT
  3. Agenda: • What's the deal with evaluation? • Why do I need to do it? • Why is it hard? • Issues with current ways of evaluation • Our approach with HeyBild • Way forward
  4. Why do I need to evaluate my LLM app?
  5. What's so hard about Evaluation? Q: How's the weather in Berlin today? A1: It's 30 degrees Celsius
  6. What's so hard about Evaluation? Q: How's the weather in Berlin today? A1: It's 30 degrees Celsius A2: It's 86 degrees Fahrenheit
  7. What's so hard about Evaluation? Q: How's the weather in Berlin today? A1: It's 30 degrees Celsius A2: It's 86 degrees Fahrenheit A3: It's windy with a 50% chance of rain
  8. What's so hard about Evaluation? Q: How's the weather in Berlin today? A1: It's 30 degrees Celsius A2: It's 86 degrees Fahrenheit A3: It's windy with a 50% chance of rain A4: Do you need a better conversation starter?
  9. "I can call this hot API and do AI now"
  10. "Gimme weather" "86 degrees"
  11. "Is it actually 86 degrees?"
  12. (Fast typing) "Yes"
  13. "Ran correctly 7 times, my life's awesome!"
  14. DEPLOY TO PROD!!!!
  15. "Gimme weather" "86 degrees" "IN CELSIUSSSS!!!" "30 degrees"
  16. (Fast typing)
  17. Update prompt to: "Always work with Celsius"
  18. Problems: - No thoroughness - Manual - No regression check - Unreliable - Unscalable - Could be REALLY dangerous for your brand
  19. HeyBild eval: 1. Classify user query types 2. Manually list down a few hundred user prompts and expected correct answers → Test Set 3. Classify response types 4. Use the existing system to answer the test set 5. Use Judge LLM(s) to match generated answers with expected answers 6. Get a score
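
These six steps can be wired together as a small batch pipeline. The sketch below shows one minimal way to do that in Python; it is not the HeyBild implementation, and the helper names (classify_query, run_chatbot, judge_match) and the test_set.jsonl path are assumptions for illustration only. The following slides show what could sit behind each helper.

```python
import json

# Hypothetical helpers -- stand-ins for the real classifier, chatbot and judge.
def classify_query(question: str) -> str:
    """Step 1 (and 3 for responses): assign a type such as 'factual' or 'sensitive'."""
    return "factual"  # placeholder

def run_chatbot(question: str) -> str:
    """Step 4: ask the existing LLM-powered system."""
    return "It's 30 degrees Celsius"  # placeholder

def judge_match(question: str, expected: str, generated: str) -> bool:
    """Step 5: a judge LLM decides whether the generated answer matches the expected one."""
    return expected.lower() in generated.lower()  # placeholder heuristic, not a real judge

def evaluate(test_set_path: str) -> float:
    """Steps 2-6: run the test set through the system and return an accuracy score."""
    with open(test_set_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    hits = 0
    for record in records:
        record["query_type"] = classify_query(record["question"])
        generated = run_chatbot(record["question"])
        hits += judge_match(record["question"], record["expected_answer"], generated)
    return hits / len(records)  # Step 6: accuracy on the test set

if __name__ == "__main__":
    print(f"Accuracy on test set: {evaluate('test_set.jsonl'):.1%}")
```
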
  20. HeyBild eval: • Classify user query types, e.g. (translated from German): "Which team won the last European Championship?" "Is Israel to blame for the war in Gaza?" "Is there a driver downloader for Linux Arch?" "When does the 2024 European Football Championship begin?"
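
One way to implement the query-type classification these examples hint at is to let an LLM label each incoming question. The sketch below is an assumption about how that could look; the category names and the call_llm callable are illustrative placeholders, not the taxonomy or client actually used for HeyBild.

```python
from typing import Callable

# Illustrative category names; the real HeyBild query taxonomy is not given in the talk.
CATEGORIES = ["factual", "time-sensitive", "sensitive-topic", "out-of-scope"]

CLASSIFY_PROMPT = """Classify the user question into exactly one of these types:
{categories}

Question: {question}
Answer with the type only."""

def classify_query(question: str, call_llm: Callable[[str], str]) -> str:
    """call_llm is a placeholder for whatever completion client the project already uses."""
    prompt = CLASSIFY_PROMPT.format(categories=", ".join(CATEGORIES), question=question)
    label = call_llm(prompt).strip().lower()
    return label if label in CATEGORIES else "out-of-scope"

# Example with a dummy client that always answers "factual":
print(classify_query("Which team won the last European Championship?", lambda _: "factual"))
```
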
  21. HeyBild eval: • Manually list down a few hundred user prompts and expected correct answers → Test Set
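
Such a test set can live in a very simple format, for example one JSON record per line holding the question, the expected answer and the query type. The layout below is only a plausible schema, not the one actually used for HeyBild.

```python
from dataclasses import dataclass
import json

# test_set.jsonl -- one record per line, for example (schema is an assumption):
# {"question": "When does the 2024 European Football Championship begin?",
#  "expected_answer": "It begins on 14 June 2024 in Munich.",
#  "query_type": "time-sensitive"}

@dataclass
class TestCase:
    question: str
    expected_answer: str
    query_type: str

def load_test_set(path: str) -> list[TestCase]:
    with open(path, encoding="utf-8") as f:
        return [TestCase(**json.loads(line)) for line in f if line.strip()]
```
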
  22. HeyBild eval: • Use the existing system to answer the test set
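
Answering the test set with the existing system is a batch loop over whatever interface the deployed chatbot exposes. The sketch below assumes a hypothetical HTTP endpoint and payload shape; both are made up for illustration, since the talk does not describe how HeyBild is invoked.

```python
import json

import requests  # assumes the chatbot is reachable over HTTP

# Hypothetical endpoint and payload shape -- placeholders, not the real HeyBild API.
CHATBOT_URL = "https://chatbot.example.internal/chat"

def answer_test_set(test_cases: list[dict], out_path: str = "generated_answers.jsonl") -> None:
    """Ask the live system every question in the test set and store its answers."""
    with open(out_path, "w", encoding="utf-8") as out:
        for case in test_cases:
            resp = requests.post(CHATBOT_URL, json={"message": case["question"]}, timeout=30)
            resp.raise_for_status()
            case["generated_answer"] = resp.json().get("answer", "")
            out.write(json.dumps(case, ensure_ascii=False) + "\n")
```
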
  23. HeyBild eval: • Use Judge LLM(s) to match generated answers with expected answers. Accuracy on Test Set = x%
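
The judge step can be a second LLM that sees the question, the expected answer and the generated answer and returns a verdict. The sketch below uses the OpenAI Python SDK as one possible judge client; the model name, prompt wording and single-judge setup are assumptions, since the slide only says "Judge LLM(s)".

```python
from openai import OpenAI  # one possible judge client; any sufficiently capable LLM works

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading a chatbot answer.
Question: {question}
Expected answer: {expected}
Generated answer: {generated}
Does the generated answer convey the same facts as the expected answer?
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_match(question: str, expected: str, generated: str) -> bool:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; use whichever judge model you trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, generated=generated)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")

def accuracy(cases: list[dict]) -> float:
    """Fraction of test cases where the judge accepts the generated answer."""
    verdicts = [judge_match(c["question"], c["expected_answer"], c["generated_answer"])
                for c in cases]
    return sum(verdicts) / len(verdicts)
```
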
  24. Update your system: - Update prompt to improve accuracy - Use a different LLM at the backend - Write a set of programmatic rules to guide the LLM towards correct answers - Improve the retrieval system to give more relevant answers - Use guardrails to keep my brand safe
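
The "programmatic rules" and guardrails bullets can be as lightweight as deterministic checks applied to the model output before it reaches the user. Picking up the Celsius example from earlier in the talk, here is a hypothetical post-processing rule; it is an illustration, not the guardrail stack actually in place.

```python
import re

def enforce_celsius(answer: str) -> str:
    """Hypothetical output rule: rewrite Fahrenheit temperatures as Celsius."""
    def to_celsius(match: re.Match) -> str:
        fahrenheit = float(match.group(1))
        return f"{(fahrenheit - 32) * 5 / 9:.0f} degrees Celsius"
    return re.sub(r"(-?\d+(?:\.\d+)?)\s*degrees?\s*Fahrenheit", to_celsius, answer)

print(enforce_celsius("It's 86 degrees Fahrenheit"))  # -> It's 30 degrees Celsius
```
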
  25. What do you need? - Human(s) who: are very motivated, live close to the real system, have access to live system usage data, and can put in the manual work to compose a test set - A dev team and a leadership which: understands the importance of evaluation, is willing to look beyond the "let's-stitch-some-APIs-together-and-declare-success" mindset, and does not treat evaluation as an afterthought but as a necessary component of the system
  26. Way forward: - Incorporate a system for assessing retrieval quality - Eval queries with changing answers - Make eval a part of CI/CD
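
Making eval part of CI/CD can start as a single test that runs the evaluation set and fails the pipeline when accuracy drops below a threshold. The pytest sketch below is an assumption about how that could look; the eval_pipeline module and the 0.85 gate are placeholders.

```python
# test_llm_eval.py -- run by pytest inside the CI pipeline (sketch, not the real setup)
import pytest

from eval_pipeline import evaluate  # hypothetical module wrapping the steps above

ACCURACY_THRESHOLD = 0.85  # placeholder regression gate

@pytest.mark.slow  # eval runs are expensive; gate them behind a marker or a nightly job
def test_accuracy_does_not_regress():
    accuracy = evaluate("test_set.jsonl")
    assert accuracy >= ACCURACY_THRESHOLD, (
        f"Eval accuracy {accuracy:.1%} fell below the {ACCURACY_THRESHOLD:.0%} gate"
    )
```
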