How we test and deploy prompts at FirstQuadrant

Slide 1

Slide 1 text

How we test and deploy prompts at FirstQuadrant
Anand Chowdhary
💼 Founder of AI sales platform FirstQuadrant
🚀 Y Combinator founder · Angel investor
💻 OSS contributor · GitHub Star since 2021
📍 AI Community Day
🎄 Dec 10 2024
Sections: 01 Hello · 02 About · 03 Talk · 04 Demo · 05 Prompts · 06 Platforms · 07 Learnings · 08 Questions?

Slide 2

Slide 2 text

CC-BY-SA 4.0 © Jimmy on the Run Studio Amsterdam, 2023
👋 Hello
I’m Anand, an engineer, designer, and tech entrepreneur from New Delhi living in Utrecht and running San Francisco-based AI sales platform FirstQuadrant, funded by Y Combinator.
Featured in Forbes 30 Under 30 and publications like TechCrunch, CSS Tricks, HuffPost, Time Out, GitHub, Het Financieele Dagblad, BrowserStack, BusinessWorld Disrupt, among others. Awarded GitHub Stars in 2021, 2022, 2023, 2024.

Slide 3

Slide 3 text

Talk
This talk will focus on prompt engineering techniques, testing methods, and deployment best practices.
Demo: Quick demo of how FirstQuadrant works.
Prompts: Quick dive into how we store and version control prompts.
Learnings: Quick overview of everything I’ve learned over the past 2 years.

Slide 4

Slide 4 text

Demo
A short demo of how FirstQuadrant works. Sign up at 1Q.ai

Slide 5

Slide 5 text

Prompts
A short demo of how we store and iterate on prompt versions in production.

Slide 6

Slide 6 text

Platforms
#1 KEYWORDS AI: KeywordsAI.co is a leading platform designed for monitoring LLMs specifically tailored for AI startups. It enables developers to easily integrate LLM monitoring into their applications with just two lines of code, akin to services like Datadog for AI.
#2 LANGBASE: Langbase.com is a serverless AI developer platform designed to facilitate the building, collaboration, and deployment of AI agents and applications.
#3 HUMANLOOP: Humanloop is an enterprise-grade platform designed for evaluating LLMs. It offers advanced prompt management and observability tools, enabling product teams to develop and deploy reliable AI features.

Slide 7

Slide 7 text

Slide 8

Slide 8 text

LLM?
● Start by evaluating if an LLM is actually needed for the specific use case (alternatives include AST parsing, regex, ETLs, and traditional ML)
● Match the model to your needs: small/dumb model + fine-tune ≈ big/smart model for a specific task
● Consider the tradeoffs between:
  ○ Basic prompting (for low volume, simple tasks)
  ○ RAG (for changing/large information needs)
  ○ Fine-tuning (for style consistency or high-volume specific tasks)
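As an illustration of the first bullet, here is a minimal sketch of trying a deterministic tool before reaching for an LLM. The task (pulling an order number out of an email), the `ORD-12345` format, and the `llm_fallback` callable are all hypothetical and not taken from the talk.

```python
import re
from typing import Callable, Optional

# Hypothetical ID format used only for this example.
ORDER_ID = re.compile(r"\bORD-\d{5}\b")


def extract_order_id(
    text: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try a cheap regex first; call the (optional) LLM only when it finds nothing."""
    match = ORDER_ID.search(text)
    if match:
        return match.group(0)
    if llm_fallback is not None:
        # Assumption: the fallback wraps whatever LLM client you use.
        return llm_fallback(text)
    return None


print(extract_order_id("Re: ORD-12345 is delayed"))  # ORD-12345
```

If the regex covers most real inputs, the LLM is only paid for on the remainder, and the regex path stays fully testable.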

Slide 9

Slide 9 text

Prompting
● Breaking complex tasks into smaller, specific steps yields better results
● Use structured output (JSON) when possible for easier processing
● Explicitly restate references instead of using pronouns to avoid confusion when referring to “it”
● Give the model “thinking space” with phrases like “think about this step by step” or “reasoning”
● Ask the model to explain its understanding back to validate comprehension

Slide 10

Slide 10 text

Reliability
● Having domain experts involved in prompt iteration improves results
● Implement robust testing and regular manual QA
● Consider breaking large tasks into subtasks (categorization, summarization, etc.)
● Use the model's self-critique capabilities by asking it to review its own output
● Build detection systems for non-desired behaviors
● Implement rapid testing workflows for prompt iteration
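The "detection systems" bullet could be sketched as a set of named checks run over every model output. The three rules below (unfilled template placeholders, AI self-references, overly long text) are made-up examples for a sales-email setting, not FirstQuadrant's actual rules.

```python
import re
from typing import Callable, Dict, List

# Each check returns True when the undesired behavior is present.
# All rules and thresholds here are illustrative assumptions.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "unfilled_placeholder": lambda t: bool(re.search(r"\{\{.*?\}\}|\[[A-Z_ ]+\]", t)),
    "ai_self_reference": lambda t: "as an ai" in t.lower(),
    "too_long": lambda t: len(t.split()) > 200,
}


def detect_issues(output: str) -> List[str]:
    """Return the names of every check that fires for this output."""
    return [name for name, check in CHECKS.items() if check(output)]


print(detect_issues("Hi {{first_name}}, as an AI I can help."))
# ['unfilled_placeholder', 'ai_self_reference']
print(detect_issues("Hi Sam, quick question about your roadmap."))
# []
```

Logging the fired check names per prompt version gives a cheap signal to review during manual QA.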

Slide 11

Slide 11 text

Pitfalls
● More context doesn't always mean better output
● Negative clauses can confuse models; consider emphasizing them with special formatting
● Over-prompting can lead to overfitting and worse results
● Few-shot prompting may reduce accuracy by introducing noise or biasing the model
● Log probabilities aren't reliable confidence indicators as they don't reflect real-world accuracy (they measure linguistic probability, not factuality)

Slide 12

Slide 12 text

Evals
● Combine multiple evaluation methods:
  ○ Self-grading and human evaluations
  ○ Metric-based evaluation (ROUGE for summarization, BLEU for translations, etc.)
  ○ Behavioral testing (test models against predefined rules and edge cases)
  ○ Ground truth comparison (automated comparison against known correct answers)
  ○ A/B testing with end users
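Two of the methods above can be sketched in a few lines. `exact_match_accuracy` is a simple ground truth comparison, and `unigram_recall` is a simplified ROUGE-1-style recall written by hand for illustration; it is not the official ROUGE implementation. The `model` argument is an assumed callable wrapping whatever LLM client is in use.

```python
from collections import Counter
from typing import Callable, List, Tuple


def exact_match_accuracy(
    model: Callable[[str], str], cases: List[Tuple[str, str]]
) -> float:
    """Ground truth comparison: share of cases where the output equals the expected answer."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(
        1
        for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(cases)


def unigram_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style recall: fraction of reference words found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        raise ValueError("reference must not be empty")
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


print(unigram_recall("the cat sat", "the cat sat down"))  # 0.75
```

For real metric-based evaluation an established ROUGE or BLEU package is the safer choice; the point here is only how little code a first automated check needs.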

Slide 13

Slide 13 text

Questions?
Happy to answer your prompting questions!
📱 @AnandChowdhary
📨 [email protected]
🌏 anandchowdhary.com
🪴 chowdhary.org
🤝 chowdhary.co

Slide 14

Slide 14 text

Backup slides

Slide 15

Slide 15 text

LLMs need tokens to think
1. Practical Applications of "Thinking Space":
   a. Breaking Down Complex Tasks
      i. Instead of asking for direct answers
      ii. Let the model work through steps
      iii. Helps prevent jumping to conclusions
   b. Self-Verification
      i. Model can check its own reasoning
      ii. Can catch errors before giving final answer
      iii. More transparent process for debugging
   c. Better Problem-Solving
      i. Similar to how humans solve problems
      ii. Reduces likelihood of shortcuts
      iii. Creates more reliable outputs
2. Implementation examples
   a. “thoughts” JSON property before “category”
   b. “Reasoning” JSON property before “result”
   c. Claude likes tags
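Implementation example 2a could look roughly like the sketch below: the prompt asks for a "thoughts" property before "category", and the parser checks that order so the reasoning tokens are produced before the answer. The category labels and prompt wording are hypothetical, and the model call itself is left out.

```python
import json
from typing import Tuple

# Hypothetical labels for an inbound sales email classifier.
CATEGORIES = ("interested", "not_interested", "out_of_office")


def build_prompt(email: str) -> str:
    """Ask for reasoning first, answer second, in a single JSON object."""
    return (
        "Classify the email below.\n"
        'Respond with a JSON object whose first key is "thoughts" '
        "(your step-by-step reasoning) and whose second key is "
        '"category" (one of: ' + ", ".join(CATEGORIES) + ").\n\n"
        "Email:\n" + email
    )


def parse_response(raw: str) -> Tuple[str, str]:
    """Validate key order and category; json.loads keeps keys in document order."""
    data = json.loads(raw)
    if list(data)[:2] != ["thoughts", "category"]:
        raise ValueError("expected 'thoughts' before 'category'")
    if data["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    return data["thoughts"], data["category"]


raw = '{"thoughts": "Sender asks for a call next week.", "category": "interested"}'
print(parse_response(raw)[1])  # interested
```

Rejecting responses where "category" comes first matters because a model that has already emitted its answer cannot use the reasoning that follows.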

Slide 16

Slide 16 text

Model selection strategy (1/2)
1. Basic Prompting
   a. Best for: Low volume, simple tasks
   b. When to use: If you can achieve good results with just a prompt
   c. Advantage: Easy to upgrade to new models when they're released
2. Retrieval Augmented Generation (RAG)
   a. Best for:
      i. Changing information needs
      ii. Large corpus of data
      iii. Third-party source integration
      iv. Need for citations/traceability
   b. Advantages:
      i. Fast updates to information
      ii. Easy model upgrades
      iii. Good for business use cases needing accountability
      iv. Works well with ticketing, emails, etc.
3. Fine-tuned Smaller Models
   a. Key insight: "small/dumb model + fine-tune ≈ big/smart model for a specific task"
   b. When to consider:
      i. High volume specific tasks (around 20M requests/month)
      ii. Need for consistent style
      iii. Cost optimization for specific use cases

Slide 17

Slide 17 text

Model selection strategy (2/2)
1. Practical Considerations:
   a. Volume Threshold
      i. Low volume: Larger general models might be more cost-effective
      ii. High volume: Fine-tuned smaller models can be more efficient
2. Technical Constraints
   a. GPU sharing challenges with fine-tuned models
   b. Loading adapters in GPU memory can be tricky
   c. May need dedicated instances for fine-tuned models
3. Business Requirements
   a. Need for citations? RAG might be better than fine-tuning
   b. Need for style consistency? Fine-tuning might be necessary
   c. Need for frequent updates? RAG would be more suitable
4. Cost-Benefit Analysis
   a. Consider waiting for newer models vs. investing in fine-tuning
   b. Factor in development and maintenance costs
   c. Consider the ROI timeline for model improvements
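The volume threshold in point 1 is a break-even calculation: a fine-tuned small model adds a fixed monthly cost (dedicated instances, maintenance) but lowers the cost per request. A minimal sketch follows; every price in it is a made-up placeholder chosen for illustration, not a figure from the talk or from any provider.

```python
def monthly_cost(requests: int, cost_per_request: float, fixed_cost: float = 0.0) -> float:
    """Total monthly cost for a given request volume."""
    return fixed_cost + requests * cost_per_request


def break_even_requests(
    large_cost_per_request: float,
    small_cost_per_request: float,
    small_fixed_cost: float,
) -> float:
    """Monthly volume above which the fine-tuned small model becomes cheaper."""
    saving = large_cost_per_request - small_cost_per_request
    if saving <= 0:
        raise ValueError("small model must be cheaper per request to ever break even")
    return small_fixed_cost / saving


# Placeholder numbers only: swap in your own measured costs.
threshold = break_even_requests(
    large_cost_per_request=0.002,
    small_cost_per_request=0.0005,
    small_fixed_cost=3000.0,
)
print(f"Fine-tuning pays off above roughly {threshold:,.0f} requests/month")
```

Below the threshold the larger general model is cheaper overall, which matches the low volume versus high volume split above.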

Slide 18

Slide 18 text

Prompt changes might improve one aspect but hurt others
1. Core approaches:
   a. Set up systematic evaluation framework
   b. Test prompts across diverse test cases
   c. Compare results against baseline metrics
2. Best Practices:
   a. Build UI for quick prompt experimentation
   b. Create comprehensive test suites
   c. Use metrics beyond just LLM evaluation
   d. Regular manual QA with domain experts
   e. Consider A/B testing with real users
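"Compare results against baseline metrics" can be made concrete with a per-metric regression check, so that a gain in one metric does not hide a loss in another. The metric names, scores, and tolerance below are invented for the example.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Names of metrics where the candidate prompt scores worse than baseline minus tolerance.

    A metric missing from the candidate counts as 0.0, so it is flagged.
    """
    return sorted(
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    )


# Hypothetical scores for the current prompt and a proposed change.
baseline = {"accuracy": 0.90, "tone": 0.80, "format_valid": 0.99}
candidate = {"accuracy": 0.94, "tone": 0.70, "format_valid": 0.99}
print(find_regressions(baseline, candidate))  # ['tone']
```

Here the new prompt improves accuracy but regresses on tone, which is exactly the case the slide title warns about.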

Slide 19

Slide 19 text

The "Death Valley" problem in fine-tuning
Unable to collect real-world data without deployment, but can't deploy without good data
1. Options:
   a. Wait for next-gen models (might solve problem in 6 months)
   b. Use synthetic data (but currently "finicky")
   c. Pay for professional data labeling (ROI challenges)
   d. Use smaller models like Mistral (due to licensing)
2. Key consideration:
   a. Need to make ROI back before next model generation makes fine-tuning unnecessary
   b. Creative Prompt Engineering Techniques

Slide 20

Slide 20 text

Creative prompt engineering techniques
1. Atomic Reasoning Types
   a. Break complex tasks into basic operations:
      i. Categorization
      ii. Summarization
      iii. Knowledge recall
      iv. Translation
   b. Why it works: These are core training tasks for LLMs
   c. Benefit: Reduces hallucinations, creates audit trail
2. Token-Based Length Specification
   a. Specifying length in tokens works better than words/characters
   b. Alternative approach: Generate outline first, then fill in details
   c. Trade-off: JSON responses may affect length adherence
3. Self-Grading
   a. Have model evaluate its own accuracy
   b. Can filter out low-confidence results
   c. Allows sorting/ranking of generated content
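The self-grading idea in point 3 can be sketched as grade, filter, then sort. The `grade` callable stands in for a second model call that returns a score; its 1 to 10 scale and the cutoff of 7 are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Graded:
    text: str
    score: int  # assumed 1-10 self-assigned score


def rank_by_self_grade(
    candidates: List[str],
    grade: Callable[[str], int],
    min_score: int = 7,
) -> List[Graded]:
    """Grade every candidate, drop low scores, and return the rest best-first."""
    graded = [Graded(text, grade(text)) for text in candidates]
    kept = [g for g in graded if g.score >= min_score]
    return sorted(kept, key=lambda g: g.score, reverse=True)


# Stub grader for demonstration; in practice this would prompt the model
# to rate its own output and parse the number from the response.
def stub_grade(text: str) -> int:
    return min(10, len(text))


for item in rank_by_self_grade(["aaaaaaaa", "bbb", "cccccccccc"], stub_grade):
    print(item.score, item.text)
# 10 cccccccccc
# 8 aaaaaaaa
```

Since the Pitfalls slide notes that log probabilities are not reliable confidence indicators, an explicit grading step like this is one alternative source of a ranking signal.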

Slide 21

Slide 21 text

Self-grading
Key considerations:
1. Early experimentation tools (e.g., nat.dev)
2. Provider-specific playgrounds
3. Testing environments
Infrastructure needs vary by approach:
1. RAG: Vector databases, chunking pipeline, embedding storage
2. Fine-tuning: GPU resources, adapter management
3. Basic prompting: Simple API integration
Future-Proofing Prompts
Strategy components:
   a. Design for Upgradability
      i. Keep prompts model-agnostic
      ii. Focus on task description over model-specific tricks
      iii. Use structured outputs when possible
   b. Investment Decisions
      i. When to optimize current prompts
      ii. When to wait for better models
      iii. Balance between immediate needs and future improvements
   c. Maintenance Approach
      i. Regular testing with new model versions
      ii. Document prompt design decisions
      iii. Monitor performance metrics over time
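The maintenance approach (regular testing with new model versions, documented design decisions) could be sketched as a prompt record that carries its own rationale plus a helper that runs one test suite against several models. This is an illustrative structure, not how FirstQuadrant stores prompts; the model names and callables are placeholders for real client wrappers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PromptVersion:
    version: int
    template: str   # must contain an {input} placeholder
    rationale: str  # documents why this version exists


def pass_rates(
    prompt: PromptVersion,
    models: Dict[str, Callable[[str], str]],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Run the same test cases against each model and report the share that pass."""
    rates: Dict[str, float] = {}
    for name, call in models.items():
        passed = sum(
            1
            for text, expected in cases
            if call(prompt.template.format(input=text)).strip() == expected
        )
        rates[name] = passed / len(cases)
    return rates


prompt = PromptVersion(
    version=2,
    template="Classify: {input}",
    rationale="Dropped few-shot examples; they biased the label distribution.",
)
# Stub models standing in for an old and a new model version.
models = {
    "model-a": lambda p: "spam",
    "model-b": lambda p: "spam" if "win" in p else "ham",
}
cases = [("win money", "spam"), ("hello", "ham")]
print(pass_rates(prompt, models, cases))  # {'model-a': 0.5, 'model-b': 1.0}
```

Keeping the template free of model-specific tricks is what lets the same `PromptVersion` be re-run unchanged when a new model is released.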