How to run evals at scale? Thinking beyond Accuracy or Similarity

Lightning talk from the AI Engineer World's Fair

Muktesh

June 07, 2025

Transcript

  1. Have you seen these questions before?

    - How do I test applications when the outputs are non-deterministic and require subjective judgements?
    - If I change the prompt, how do I know I'm not breaking something else?
    - What metrics should I track?
    - What tools should I use?
    - Which models are best?
  2. Why evals matter?

    - Business impacts
    - Measure non-deterministic LLM output
    - Aligning system with the goals
    - Improvement driver
    - Trust and accountability
  3. Data is your friend

    - Start from somewhere (synthetic data, manually curated)
    - Keep refining and add a pipeline
    - Label data for different cases, for both success and failure
    - Continuous process flow and iteration (the real world does not stop)
    - Multiple data sets: no data set is perfect or covers everything
  4. Evaluate everything

    - Define goals and objectives
    - Modular design
    - Optimize data handling
    - Flows
    - Paths
    - Outputs
  5. Adaptive Evals

    - RAG: Accuracy, Similarity, Usefulness, Conciseness
    - Code generation: Functional Correctness, Robustness (Adversarial Code Evals), Efficiency, Code Quality, HITL
    - Agents: Trajectory, Multi-turn Simulation
    - Tool calls: Correctness, Test Suites, Pass@K
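
Slide 5 lists Pass@K among the tool-call metrics without defining it. The sketch below assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass the test suite. The slide does not say which definition the speaker has in mind, and the function name and example numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for the task (assumed parameter name)
    c: how many of those n samples passed the test suite
    k: budget of attempts being evaluated, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets made only of failing samples;
    # it is 0 when fewer than k samples failed, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples, 3 of which passed.
print(round(pass_at_k(10, 3, 1), 2))  # about 0.30
print(round(pass_at_k(10, 3, 5), 2))  # about 0.92
```

Averaging this value over all tasks in a test suite gives one aggregate number per k, which is the usual way such a metric is reported.
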
  6. Scaling Evals

    - Cache intermediate results, regressions
    - Orchestration and parallelism
    - Aggregate results
    - Run frequent experiments and improve
    - Standardize metrics and iterate often
    - Use what fits best (one or a combination of human in the loop and an automated pipeline with refinement)
    - Rely on process over tools
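
The caching, parallelism, and aggregation points on slide 6 could be combined roughly as follows. This is a sketch under assumptions, not the speaker's pipeline: `score_case`, the case fields (`id`, `output`, `expected`), and the `version` string are hypothetical stand-ins for a real model call and grader, and an in-memory dict stands in for a persistent cache.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def cache_key(case: dict, version: str) -> str:
    """Stable key, so the same case under the same version reuses its score."""
    payload = json.dumps({"case": case, "version": version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def score_case(case: dict) -> float:
    """Hypothetical grader: 1.0 if the expected string appears in the output."""
    return 1.0 if case["expected"] in case["output"] else 0.0


def run_evals(cases, version, scorer=score_case, cache=None, max_workers=8):
    """Score cases in parallel, skip cached ones, and aggregate the results."""
    cases = list(cases)
    cache = {} if cache is None else cache

    def cached_score(case):
        key = cache_key(case, version)
        # Two threads may race on the same key and both compute it;
        # that wastes a call but does not corrupt the dict.
        if key not in cache:
            cache[key] = scorer(case)
        return cache[key]

    # Threads suit I/O-bound scorers such as remote model calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(cached_score, cases))

    return {
        "n": len(scores),
        "mean_score": mean(scores) if scores else None,
        "failures": [c["id"] for c, s in zip(cases, scores) if s < 1.0],
    }


cases = [
    {"id": "a", "output": "Paris is the capital", "expected": "Paris"},
    {"id": "b", "output": "I am not sure", "expected": "Berlin"},
]
report = run_evals(cases, version="prompt-v1")
print(report["mean_score"], report["failures"])
```

Keying the cache on both the case and the version means a prompt change invalidates old scores, while unchanged cases are not re-scored on reruns. The list of failing ids gives a simple starting point for regression checks.
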
  7. Takeaways

    - Evals are the most important aspect of AI applications (eval-driven development is real)
    - Define evals based on the use cases
    - Focus on positive and negative cases
    - Focus on data (synthetic and continuously improved)
    - Remember to measure, monitor, analyze and iterate in a loop
    - Always take a balanced approach: fidelity vs. speed
    - Questions? Let's chat:
  8. Appendix

    - Anthropic guide for good evals: https://www.anthropic.com/news/a-new-initiative-for-developing-third-party-model-evaluations
    - Writing good evals: https://eugeneyan.com/writing/eval-process/