Ideas Review

Agenda
• …evaluation?
• Why do I need to do it?
• Why is it hard?
• Issues with current ways of evaluation
• Our approach with HeyBild
• Way forward
What's so hard about Evaluation?

Q: …Berlin today?
A1: It's 30 degrees Celsius
A2: It's 86 degrees Fahrenheit
A3: It's windy with a 50% chance of rain
A4: Do you need a better conversation starter?
[Illustration: speech bubbles — "Gimme weather" / "86 degrees" / "IN CELSIUSSSS!!!" / "30 degrees"]
HeyBild eval

2. Manually write out a few hundred user prompts and their expected correct answers → test set
3. Classify response types
4. Use the existing system to answer the test set
5. Use Judge LLM(s) to match generated answers against expected answers
6. Get a score
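The scoring loop behind these steps can be sketched as follows. This is a minimal illustration, not the HeyBild implementation: the `judge` function here is a naive stand-in for a Judge LLM call, and the test set and system are made-up examples.

```python
from typing import Callable

def judge(generated: str, expected: str) -> bool:
    """Stand-in for a Judge LLM: a naive containment check.
    A real judge would be an LLM prompted to compare the two answers."""
    return expected.lower() in generated.lower()

def score(test_set: list[tuple[str, str]],
          answer_fn: Callable[[str], str]) -> float:
    """Answer every test-set prompt with the system under test (step 4),
    judge each answer (step 5), and return the fraction correct (step 6)."""
    hits = sum(judge(answer_fn(prompt), expected)
               for prompt, expected in test_set)
    return hits / len(test_set)

# Hypothetical test set and a mock system that always answers correctly
test_set = [
    ("What's the weather in Berlin today?", "30 degrees Celsius"),
    ("When does Euro 2024 start?", "14 June 2024"),
]
answers = dict(test_set)
print(score(test_set, lambda prompt: answers[prompt]))  # → 1.0
```

In practice the judge is where most of the subtlety lives — "86 degrees Fahrenheit" and "30 degrees Celsius" should both count as correct, which is exactly why an LLM judge is used instead of string matching.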
HeyBild eval

• Which team won the last European Championship?
• Is Israel to blame for the war in Gaza?
• Is there a driver downloader for Arch Linux?
• When does the 2024 European Championship start?
Update your system

- …accuracy
- Use a different LLM at the backend
- Write a set of programmatic rules to guide the LLM towards correct answers
- Improve the retrieval system to return more relevant answers
- Use guardrails to keep my brand safe
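A programmatic guardrail can be as simple as a screening function that runs over every response before it reaches the user. The sketch below is illustrative only — the blocked-topic list and fallback message are invented for the example, and a production guardrail would use far more robust classification than substring matching.

```python
# Hypothetical blocklist and fallback; real systems would use a
# classifier or a dedicated guardrails library instead of substrings.
BLOCKED_TOPICS = ("medical advice", "legal advice")
FALLBACK = "Sorry, I can't help with that topic."

def apply_guardrail(response: str) -> str:
    """Return the response unchanged unless it touches a blocked topic,
    in which case substitute a safe fallback message."""
    lowered = response.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return FALLBACK
    return response

print(apply_guardrail("It's 30 degrees Celsius in Berlin."))
print(apply_guardrail("Here is some medical advice: ..."))  # fallback fires
```

The same hook point is where brand-safety rules live: anything the system should never say gets intercepted in one place, independent of which LLM sits at the backend.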
What do you need?

- …very motivated
- Lives close to the real system
- Has access to live system usage data
- Can put in the manual work to compose a test set
- A dev team and a leadership which:
  - Understands the importance of evaluation
  - Is willing to look beyond the "let's-stitch-some-APIs-together-and-declare-success" mindset
  - Treats evaluation not as an afterthought but as a necessary component of the system