applications when the outputs are non-deterministic and require subjective judgements? If I change the prompt, how do I know I'm not breaking something else? What metrics should I track? What tools should I use? Which models are best?
curated) Keep refining and add a pipeline Label data for different cases, for success and failure Continuous process flow and iteration (real world is not stopping) Multiple data sets: No data is perfect or covers everything
Results Run frequent experimentations and improve Standardize Metrics and iterate often Use what fits best (one or combination of human in the loop, Automated pipeline with refinement) Rely on process over tools
(Eval driven development is real) Define Evals based on the use cases Focus on Positive and Negative cases Focus on data (synthetic and continuously improved) Remember to measure, monitor, analyze and iterate in a loop Always take a balance approach Fidelity v/s Speed Questions? Letโs chat: