Evaluating AI with Haystack

Learn about all the different evaluation metrics that you can use with Haystack, as well as how to streamline the evaluation process over multiple metrics and datasets.

YouTube video: https://www.youtube.com/live/Dy-n_yC3Cto?si=F2dp0FZ-1o2t1GiT

Bilge Yücel

October 30, 2024

Transcript

  1. Hi! 👋 Bilge Yücel
     • 🥑 Developer Relations Engineer at deepset
     • 🏗 Open source LLM framework: Haystack
     • 📍 Istanbul, Turkey
     • Twitter: @bilgeycl • LinkedIn: in/bilgeyucel
  2. Agenda
     • 01 - Metrics
     • 02 - Evaluation in Haystack
     • 03 - Demo
     • 04 - Next Steps
     • 05 - Q&A
  3. Metrics (a usage sketch follows this list)
     • Answer Exact Match - ground-truth answers + predicted answers
     • Semantic Answer Similarity (SAS) - ground-truth answers + predicted answers
     • Document Mean Average Precision (MAP) - ground-truth docs + predicted docs
     • Document Recall (multi-hit, single-hit) - ground-truth docs + predicted docs
     • Document Mean Reciprocal Rank (MRR) - ground-truth docs + predicted docs
     • Document Normalized Discounted Cumulative Gain (NDCG) - ground-truth docs + predicted docs
     • Faithfulness - question + predicted docs + predicted answer
     • Context Relevance - question + predicted docs
     • LLM-based custom metrics
     • Integrations: Ragas, FlowJudge, DeepEval
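The answer- and document-level metrics above ship as ready-made evaluator components in Haystack 2.x. A minimal sketch, assuming `haystack-ai` is installed; the answers and documents are made-up placeholders, not data from the talk:

```python
from haystack import Document
from haystack.components.evaluators import (
    AnswerExactMatchEvaluator,
    DocumentMRREvaluator,
    SASEvaluator,
)

ground_truth_answers = ["Paris", "Berlin"]   # placeholder examples
predicted_answers = ["Paris", "Munich"]

# Answer Exact Match: fraction of predictions that match the ground truth exactly.
em = AnswerExactMatchEvaluator()
print(em.run(ground_truth_answers=ground_truth_answers,
             predicted_answers=predicted_answers)["score"])  # 0.5

# Semantic Answer Similarity: embedding-based similarity between answer pairs.
sas = SASEvaluator(model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
sas.warm_up()  # loads the sentence-transformers model
print(sas.run(ground_truth_answers=ground_truth_answers,
              predicted_answers=predicted_answers)["score"])

# Document MRR: reciprocal rank of the first relevant document per query.
mrr = DocumentMRREvaluator()
result = mrr.run(
    ground_truth_documents=[[Document(content="Paris is the capital of France")]],
    retrieved_documents=[[Document(content="Berlin is a large city"),
                          Document(content="Paris is the capital of France")]],
)
print(result["score"])  # relevant doc at rank 2 -> 0.5
```

Each evaluator returns both an aggregate `score` and per-sample `individual_scores`, so the same components work standalone or inside an evaluation pipeline.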
  4. Evaluation in Haystack
     • Benchmarking Haystack Pipelines for Optimal Performance
     • Evaluation Walkthrough
     • haystack-evaluation
     • EvaluationHarness (haystack-experimental; see the aggregation sketch below)
     • Evaluation tutorial
     • Evaluation Docs
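EvaluationHarness lives in haystack-experimental and its API has been evolving, so treat the docs there as the source of truth. The stable piece it builds on is `EvaluationRunResult`, which collects results from several evaluators into one report. A hedged sketch with placeholder inputs and scores (import path as of haystack-ai 2.x around the time of the talk):

```python
from haystack.evaluation.eval_run_result import EvaluationRunResult

# Inputs are column-oriented: one list per field, aligned by sample index.
inputs = {
    "question": ["What is the capital of France?"],
    "ground_truth_answer": ["Paris"],
    "predicted_answer": ["Paris"],
}

# Each entry mirrors the {"score", "individual_scores"} dict the evaluators return.
# The numbers here are illustrative placeholders.
results = {
    "exact_match": {"score": 1.0, "individual_scores": [1]},
    "semantic_answer_similarity": {"score": 0.98, "individual_scores": [0.98]},
}

report = EvaluationRunResult(run_name="rag_eval", inputs=inputs, results=results)
print(report.score_report())  # aggregate score per metric
print(report.to_pandas())     # per-sample breakdown as a DataFrame
```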
  5. Demo (sketch below)
     • ARAGOG dataset
     • Metrics: Semantic Answer Similarity, Context Relevance, Faithfulness
     • Streamline with EvaluationHarness
     • Flow Judge, Ragas
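The two LLM-based metrics from the demo can also be run standalone. A minimal sketch on a toy question/context/answer triple rather than the ARAGOG data; both evaluators use OpenAI by default, so `OPENAI_API_KEY` must be set in the environment:

```python
from haystack.components.evaluators import (
    ContextRelevanceEvaluator,
    FaithfulnessEvaluator,
)

questions = ["Who created the Python language?"]
contexts = [["Python was created by Guido van Rossum and first released in 1991."]]
predicted_answers = ["Guido van Rossum created Python."]

# Context Relevance: are the retrieved contexts relevant to the question?
context_relevance = ContextRelevanceEvaluator()
print(context_relevance.run(questions=questions, contexts=contexts)["score"])

# Faithfulness: is the generated answer grounded in the retrieved contexts?
faithfulness = FaithfulnessEvaluator()
print(faithfulness.run(questions=questions, contexts=contexts,
                       predicted_answers=predicted_answers)["score"])
```

Because these are LLM-as-a-judge metrics, scores can vary between runs; the slide's point about Flow Judge and local models is that the judge can be swapped for a smaller, locally hosted one.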
  6. Next Steps
     • Add a Ranker to your pipeline to increase the context relevance (see the sketch after this list)
     • Use local models to evaluate your pipeline
     • Change the splitting strategy
     • Generate ground-truth documents for retrieval evaluation 💡
     • Check out the Evaluation Walkthrough
     • Use a tracing tool for continuous evaluation: Langfuse, Arize
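A hedged sketch of the first next step: inserting a cross-encoder Ranker between the retriever and the rest of the pipeline, so only the most relevant documents reach the LLM. The retriever, model name, and top_k values below are illustrative choices, not the talk's configuration:

```python
from haystack import Pipeline
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assume documents are already indexed

pipeline = Pipeline()
# Retrieve generously, then let the cross-encoder re-rank down to the best few.
pipeline.add_component("retriever",
                       InMemoryBM25Retriever(document_store=document_store, top_k=20))
pipeline.add_component("ranker",
                       TransformersSimilarityRanker(model="BAAI/bge-reranker-base",
                                                    top_k=5))
pipeline.connect("retriever.documents", "ranker.documents")

# The ranker needs the query at run time as well:
# pipeline.run({"retriever": {"query": q}, "ranker": {"query": q}})
```

After a change like this, re-running the same evaluation (e.g. Context Relevance on the ranked documents) shows whether the re-ranker actually improved what the LLM sees.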