𝙃𝙤𝙬 𝙩𝙤 𝙧𝙪𝙣 𝙚𝙫𝙖𝙡𝙨 𝙖𝙩 𝙨𝙘𝙖𝙡𝙚? 𝙏𝙝𝙞𝙣𝙠𝙞𝙣𝙜 𝙗𝙚𝙮𝙤𝙣𝙙 𝘼𝙘𝙘𝙪𝙧𝙖𝙘𝙮 𝙤𝙧 𝙎𝙞𝙢𝙞𝙡𝙖𝙧𝙞𝙩𝙮

How to run evals at scale? Thinking beyond accuracy or
similarity

Have you seen this before?

Have you seen these questions before? How do I test
applications when the outputs are non-deterministic and require subjective judgements? If I change the prompt, how do I know I'm not breaking something else? What metrics should I track? What tools should I use? Which models are best?

And the answer is…. Evals

Why evals Matter? BUSINESS IMPACTS MEASURE NON- DETERMINISTIC LLM OUTPUT
ALIGNING SYSTEM WITH THE GOALS IMPROVEMENT DRIVER TRUST AND ACCOUNTABILITY

Data is your friend Start from somewhere (synthetic data, manually
curated) Keep refining and add a pipeline Label data for different cases, for success and failure Continuous process flow and iteration (real world is not stopping) Multiple data sets: No data is perfect or covers everything

Evaluate everything  Define goals and objectives  Modular design
 Optimize data handling  Flows  Paths  Outputs

Adaptive Evals RAG: Accuracy, Similarity, Usefulness, Conciseness Code generation: Functional
Correctness, Robustness (Adversarial Code Evals), Efficiency, Code Quality, HITL Agents: Trajectory, Multiturn Simulation Tool calls: Correctness, Test Suites, Pass@K

Scaling Evals Cache intermediate results, regressions Orchestration and parallelism Aggregate
Results Run frequent experimentations and improve Standardize Metrics and iterate often Use what fits best (one or combination of human in the loop, Automated pipeline with refinement) Rely on process over tools

Takeaways Evals are the most important aspect for AI Applications
(Eval driven development is real) Define Evals based on the use cases Focus on Positive and Negative cases Focus on data (synthetic and continuously improved) Remember to measure, monitor, analyze and iterate in a loop Always take a balance approach Fidelity v/s Speed Questions? Let’s chat:

Appendix https://www.anthropic.com/news/a- new-initiative-for-developing-third- party-model-evaluations Anthropic Guide for Good evals: https://www.anthropic.com/news/a-
new-initiative-for-developing-third- party-model-evaluations Writing good evals: https://eugeneyan.com/writing/eval- process/

𝙃𝙤𝙬 𝙩𝙤 𝙧𝙪𝙣 𝙚𝙫𝙖𝙡𝙨 𝙖𝙩 𝙨𝙘𝙖𝙡𝙚? 𝙏𝙝𝙞𝙣𝙠𝙞𝙣𝙜 𝙗𝙚𝙮𝙤𝙣𝙙 𝘼𝙘𝙘𝙪...

𝙃𝙤𝙬 𝙩𝙤 𝙧𝙪𝙣 𝙚𝙫𝙖𝙡𝙨 𝙖𝙩 𝙨𝙘𝙖𝙡𝙚? 𝙏𝙝𝙞𝙣𝙠𝙞𝙣𝙜 𝙗𝙚𝙮𝙤𝙣𝙙 𝘼𝙘𝙘𝙪𝙧𝙖𝙘𝙮 𝙤𝙧 𝙎𝙞𝙢𝙞𝙡𝙖𝙧𝙞𝙩𝙮

Muktesh

More Decks by Muktesh

Other Decks in Technology

Featured

Transcript

How to run evals at scale? Thinking beyond accuracy or

Have you seen this before?

Have you seen these questions before? How do I test

And the answer is…. Evals

Why evals Matter? BUSINESS IMPACTS MEASURE NON- DETERMINISTIC LLM OUTPUT

Data is your friend Start from somewhere (synthetic data, manually

Evaluate everything  Define goals and objectives  Modular design

Adaptive Evals RAG: Accuracy, Similarity, Usefulness, Conciseness Code generation: Functional

Scaling Evals Cache intermediate results, regressions Orchestration and parallelism Aggregate

Takeaways Evals are the most important aspect for AI Applications

Appendix https://www.anthropic.com/news/a- new-initiative-for-developing-third- party-model-evaluations Anthropic Guide for Good evals: https://www.anthropic.com/news/a-