Evaluating AI with Haystack

Learn about all the different evaluation metrics that you can use with Haystack, as well as how to streamline the evaluation process over multiple metrics and datasets.

YouTube video: https://www.youtube.com/live/Dy-n_yC3Cto?si=F2dp0FZ-1o2t1GiT

Bilge Yücel

October 30, 2024

Transcript

  1. Hi! 👋 Bilge Yücel
     • 🥑 Developer Relations Engineer at deepset
     • 🏗 Open source LLM framework: Haystack
     • 📍 Istanbul, Turkey
     • Twitter: @bilgeycl • LinkedIn: in/bilgeyucel
  2. Agenda
     • 01 - Metrics
     • 02 - Evaluation in Haystack
     • 03 - Demo
     • 04 - Next Steps
     • 05 - Q&A
  3. Metrics (a usage sketch follows this list)
     • Answer Exact Match - ground-truth answers + predicted answers
     • Semantic Answer Similarity (SAS) - ground-truth answers + predicted answers
     • Document Mean Average Precision (MAP) - ground-truth docs + predicted docs
     • Document Recall (multi-hit, single-hit) - ground-truth docs + predicted docs
     • Document Mean Reciprocal Rank (MRR) - ground-truth docs + predicted docs
     • Document Normalized Discounted Cumulative Gain (NDCG) - ground-truth docs + predicted docs
     • Faithfulness - question + predicted docs + predicted answer
     • Context Relevance - question + predicted docs
     • LLM-based custom metrics
     • Integrations: Ragas, FlowJudge, DeepEval
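The answer- and document-level metrics above ship as ready-made evaluator components in Haystack 2.x. A minimal sketch, assuming `haystack-ai` is installed; the answers and documents are made-up placeholders, not data from the talk:

```python
from haystack import Document
from haystack.components.evaluators import (
    AnswerExactMatchEvaluator,
    DocumentMRREvaluator,
    SASEvaluator,
)

ground_truth_answers = ["Paris", "Berlin"]   # placeholder examples
predicted_answers = ["Paris", "Munich"]

# Answer Exact Match: fraction of predictions that match the ground truth exactly.
em = AnswerExactMatchEvaluator()
print(em.run(ground_truth_answers=ground_truth_answers,
             predicted_answers=predicted_answers)["score"])  # 0.5

# Semantic Answer Similarity: embedding-based similarity between answer pairs.
sas = SASEvaluator(model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
sas.warm_up()  # loads the sentence-transformers model
print(sas.run(ground_truth_answers=ground_truth_answers,
              predicted_answers=predicted_answers)["score"])

# Document MRR: reciprocal rank of the first relevant document per query.
mrr = DocumentMRREvaluator()
result = mrr.run(
    ground_truth_documents=[[Document(content="Paris is the capital of France")]],
    retrieved_documents=[[Document(content="Berlin is a large city"),
                          Document(content="Paris is the capital of France")]],
)
print(result["score"])  # relevant doc at rank 2 -> 0.5
```

Each evaluator returns both an aggregate `score` and per-sample `individual_scores`, so the same components work standalone or inside an evaluation pipeline.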
  4. Evaluation in Haystack
     • Benchmarking Haystack Pipelines for Optimal Performance
     • Evaluation Walkthrough
     • haystack-evaluation
     • EvaluationHarness (haystack-experimental; see the aggregation sketch below)
     • Evaluation tutorial
     • Evaluation Docs
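EvaluationHarness lives in haystack-experimental and its API has been evolving, so treat the docs there as the source of truth. The stable piece it builds on is `EvaluationRunResult`, which collects results from several evaluators into one report. A hedged sketch with placeholder inputs and scores (import path as of haystack-ai 2.x around the time of the talk):

```python
from haystack.evaluation.eval_run_result import EvaluationRunResult

# Inputs are column-oriented: one list per field, aligned by sample index.
inputs = {
    "question": ["What is the capital of France?"],
    "ground_truth_answer": ["Paris"],
    "predicted_answer": ["Paris"],
}

# Each entry mirrors the {"score", "individual_scores"} dict the evaluators return.
# The numbers here are illustrative placeholders.
results = {
    "exact_match": {"score": 1.0, "individual_scores": [1]},
    "semantic_answer_similarity": {"score": 0.98, "individual_scores": [0.98]},
}

report = EvaluationRunResult(run_name="rag_eval", inputs=inputs, results=results)
print(report.score_report())  # aggregate score per metric
print(report.to_pandas())     # per-sample breakdown as a DataFrame
```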
  5. Demo (sketch below)
     • ARAGOG dataset
     • Metrics: Semantic Answer Similarity, Context Relevance, Faithfulness
     • Streamline with EvaluationHarness
     • Flow Judge, Ragas
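The two LLM-based metrics from the demo can also be run standalone. A minimal sketch on a toy question/context/answer triple rather than the ARAGOG data; both evaluators use OpenAI by default, so `OPENAI_API_KEY` must be set in the environment:

```python
from haystack.components.evaluators import (
    ContextRelevanceEvaluator,
    FaithfulnessEvaluator,
)

questions = ["Who created the Python language?"]
contexts = [["Python was created by Guido van Rossum and first released in 1991."]]
predicted_answers = ["Guido van Rossum created Python."]

# Context Relevance: are the retrieved contexts relevant to the question?
context_relevance = ContextRelevanceEvaluator()
print(context_relevance.run(questions=questions, contexts=contexts)["score"])

# Faithfulness: is the generated answer grounded in the retrieved contexts?
faithfulness = FaithfulnessEvaluator()
print(faithfulness.run(questions=questions, contexts=contexts,
                       predicted_answers=predicted_answers)["score"])
```

Because these are LLM-as-a-judge metrics, scores can vary between runs; the slide's point about Flow Judge and local models is that the judge can be swapped for a smaller, locally hosted one.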
  6. Next Steps
     • Add a Ranker to your pipeline to increase the context relevance (see the sketch after this list)
     • Use local models to evaluate your pipeline
     • Change the splitting strategy
     • Generate ground-truth documents for retrieval evaluation 💡
     • Check out the Evaluation Walkthrough
     • Use a tracing tool for continuous evaluation: Langfuse, Arize
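A hedged sketch of the first next step: inserting a cross-encoder Ranker between the retriever and the rest of the pipeline, so only the most relevant documents reach the LLM. The retriever, model name, and top_k values below are illustrative choices, not the talk's configuration:

```python
from haystack import Pipeline
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assume documents are already indexed

pipeline = Pipeline()
# Retrieve generously, then let the cross-encoder re-rank down to the best few.
pipeline.add_component("retriever",
                       InMemoryBM25Retriever(document_store=document_store, top_k=20))
pipeline.add_component("ranker",
                       TransformersSimilarityRanker(model="BAAI/bge-reranker-base",
                                                    top_k=5))
pipeline.connect("retriever.documents", "ranker.documents")

# The ranker needs the query at run time as well:
# pipeline.run({"retriever": {"query": q}, "ranker": {"query": q}})
```

After a change like this, re-running the same evaluation (e.g. Context Relevance on the ranked documents) shows whether the re-ranker actually improved what the LLM sees.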