Slide 1

Evaluating AI with Haystack
Bilge Yücel
October 30

Slide 2

Hi! 👋
Bilge Yücel
● 🥑 Developer Relations Engineer at deepset
● 🏗 Open-source LLM framework: Haystack
● 📍 Istanbul, Turkey
Twitter: @bilgeycl | LinkedIn: in/bilgeyucel

Slide 3

Agenda
01 - Metrics
02 - Evaluation in Haystack
03 - Demo
04 - Next Steps
05 - Q&A

Slide 4

Metrics
● Answer Exact Match - ground-truth answers + predicted answers
● Semantic Answer Similarity - ground-truth answers + predicted answers
● Document Mean Average Precision (MAP) - ground-truth docs + predicted docs
● Document Recall (multi-hit, single-hit) - ground-truth docs + predicted docs
● Document Mean Reciprocal Rank (MRR) - ground-truth docs + predicted docs
● Document Normalized Discounted Cumulative Gain (NDCG) - ground-truth docs + predicted docs
● Faithfulness - question + predicted docs + predicted answer
● Context Relevance - question + predicted docs
● LLM-based custom metrics
● Ragas + FlowJudge + DeepEval
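A minimal sketch of computing two of these metrics with Haystack's built-in evaluator components; it assumes haystack-ai 2.x, sentence-transformers installed for SAS, and an OPENAI_API_KEY for the LLM-based Faithfulness evaluator. The questions and answers below are made-up placeholders.

```python
from haystack.components.evaluators import FaithfulnessEvaluator, SASEvaluator

# Semantic Answer Similarity: embedding-based comparison of predicted answers
# against ground-truth answers, instead of exact string matching.
sas = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
sas.warm_up()
sas_result = sas.run(
    ground_truth_answers=["Paris is the capital of France."],
    predicted_answers=["The capital of France is Paris."],
)
print(sas_result["score"])  # aggregate score between 0 and 1

# Faithfulness: LLM-based check that the predicted answer is grounded in the
# retrieved documents (uses an OpenAI model by default).
faithfulness = FaithfulnessEvaluator()
faithfulness_result = faithfulness.run(
    questions=["What is the capital of France?"],
    contexts=[["Paris is the capital and largest city of France."]],
    predicted_answers=["The capital of France is Paris."],
)
print(faithfulness_result["individual_scores"])
```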

Slide 5

Metrics
Source: https://haystack.deepset.ai/tutorials/guide_evaluation

Slide 6

Evaluation in Haystack
● Benchmarking Haystack Pipelines for Optimal Performance
● Evaluation Walkthrough
● haystack-evaluation
● EvaluationHarness (haystack-experimental)
● Evaluation tutorial
● Evaluation Docs
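As a rough sketch of the workflow these resources (and EvaluationHarness) streamline: evaluators can be added to a regular Haystack Pipeline and the outputs summarized with EvaluationRunResult. This assumes haystack-ai 2.x; the inputs are placeholders that would normally come from running your RAG pipeline over an evaluation dataset, and report method names may differ between Haystack versions.

```python
from haystack import Pipeline
from haystack.components.evaluators import FaithfulnessEvaluator, SASEvaluator
from haystack.evaluation.eval_run_result import EvaluationRunResult

# Placeholder data; in practice these come from running the RAG pipeline
# over an evaluation set with ground-truth answers.
questions = ["What is the capital of France?"]
contexts = [["Paris is the capital and largest city of France."]]
ground_truth_answers = ["Paris"]
predicted_answers = ["The capital of France is Paris."]

eval_pipeline = Pipeline()
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator())
eval_pipeline.add_component("sas", SASEvaluator())

results = eval_pipeline.run(
    {
        "faithfulness": {
            "questions": questions,
            "contexts": contexts,
            "predicted_answers": predicted_answers,
        },
        "sas": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
    }
)

# Collect everything into one report: an aggregate score per metric,
# with per-question scores available via to_pandas().
report = EvaluationRunResult(
    run_name="rag_eval_run",
    inputs={
        "question": questions,
        "answer": ground_truth_answers,
        "predicted_answer": predicted_answers,
    },
    results={
        "faithfulness": results["faithfulness"],
        "semantic_answer_similarity": results["sas"],
    },
)
print(report.score_report())
```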

Slide 7

Demo Time!

Slide 8

Demo
● ARAGOG dataset
● Semantic Answer Similarity, Context Relevance, Faithfulness
● Streamline with EvaluationHarness
● Flow Judge, Ragas

Slide 9

Next Steps
● Add a Ranker to your pipeline to increase context relevance (see the sketch after this list)
● Use local models to evaluate your pipeline
● Change the splitting strategy
● Generate ground-truth documents for retrieval evaluation 💡
● Check out the Evaluation Walkthrough
● Use a tracing tool for continuous evaluation: Langfuse, Arize
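For the first item, a minimal sketch of slotting a cross-encoder ranker between retrieval and the rest of a RAG pipeline, assuming haystack-ai 2.x with an in-memory BM25 retriever; the component names and query are illustrative, and the document store is assumed to be populated already.

```python
from haystack import Pipeline
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # assumed to already contain your documents

pipeline = Pipeline()
# Retrieve generously, then let the ranker keep only the most relevant documents;
# passing fewer but better documents to the prompt typically improves context relevance.
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=20))
pipeline.add_component("ranker", TransformersSimilarityRanker(model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=5))
pipeline.connect("retriever.documents", "ranker.documents")

query = "What is Haystack?"
result = pipeline.run({"retriever": {"query": query}, "ranker": {"query": query}})
print(result["ranker"]["documents"])  # re-ranked documents to pass on to the prompt builder
```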

Slide 10

Q&A
Cookbook: Evaluating AI with Haystack
@bilgeycl | in/bilgeyucel