
Lessons learned building a GenAI powered app

Mete Atamel

May 13, 2024

Transcript

  1. Lessons learned building a GenAI powered app
     Mete Atamel, Developer Advocate at Google, @meteatamel, atamel.dev, speakerdeck.com/meteatamel
     Marc Cohen, Developer Advocate at Google, [email protected]
  2. Initial problems
     • Limited list of topics
     • Limited questions and answers
     • Limited format: multiple choice with 4 answers
     • English only
     • No images
     • Expanding quiz content is difficult
  3. March 2023: Is it possible to have a more dynamic quiz app with infinite content using GenAI?
  4. API

  5. Quiz Generators
     Name                | Type   | Format
     OpenTrivia          | static | multiple choice
     PaLM                | genAI  | multiple choice (possible: free-form)
     Gemini (Pro, Ultra) | genAI  | multiple choice (possible: free-form)
  6. Image Generator
     Name            | Type  | Description
     Imagen (v1, v2) | genAI | Uses the Imagen model to generate images for quizzes
  7. Revisit: Initial problems
     • Limited ⇒ Unlimited list of topics
     • Limited ⇒ Unlimited questions and answers
     • Limited ⇒ Unlimited format
     • English only ⇒ Any language
     • No images ⇒ Unlimited images
     • Expanding quiz content is difficult ⇒ easy with GenAI
  8. New problems with GenAI
     • Learning curve with GenAI
     • Inconsistent or no outputs from LLMs
     • Slow LLM calls
     • Hallucinations
     • Hard to check the accuracy and quality of LLM outputs
     • Fast-changing landscape (models, APIs, libraries, etc.)
  9. 🎓 General
     Surprisingly easy to do hard things with GenAI
     • Quiz/image generation with a single API call (see the sketch below)
     Hard to do things well and consistently
     • Good results require prompt engineering
     • You will get inconsistent outputs
     • Hard to measure the output quality
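To make the "single API call" point concrete, here is a minimal sketch using the Vertex AI Python SDK; the project ID, model name, and prompt wording are illustrative placeholders rather than the app's actual code:

```python
# Minimal sketch: one API call generates an entire quiz.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project

model = GenerativeModel("gemini-1.0-pro")
response = model.generate_content(
    "Generate a multiple-choice quiz with 5 questions and 4 answers each "
    "about the history of Rome. Return the result as JSON."
)
print(response.text)
```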
  10. 🎓 General
     Accept uncertainty of LLMs
     • Same prompt, same model ⇒ different output
     • Same prompt, same model gets updated ⇒ different output
     • Same prompt, different model ⇒ different output
  11. 🎓 General
     Free upgrades with new/updated models
     • PaLM ⇒ Gemini-Pro: better quizzes
     • Gemini-Pro ⇒ Gemini-Ultra: even better quizzes
     • Imagen v1 ⇒ v2: better images
     • Little or no code changes
  12. 🎓 General
     Do you even need an LLM?
     • Grading free-form answers ⇒ LLM vs. the TheFuzz library (see the sketch below)
     • Image of the app ⇒ Imagen vs. a good old photo editor
     • Sometimes you don't need an expensive LLM call
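For the free-form grading case, a fuzzy string comparison can replace the LLM call entirely; a sketch using TheFuzz (the similarity threshold is an assumption, not a value from the talk):

```python
# Sketch: grade a free-form answer with fuzzy string matching instead of an LLM call.
from thefuzz import fuzz  # pip install thefuzz

def is_correct(user_answer: str, expected_answer: str, threshold: int = 80) -> bool:
    """Accept answers that are 'close enough' to the expected one (threshold is illustrative)."""
    score = fuzz.ratio(user_answer.strip().lower(), expected_answer.strip().lower())
    return score >= threshold

print(is_correct("george washington", "George Washington"))  # True
print(is_correct("Thomas Jefferson", "George Washington"))    # False
```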
  13. 🎓 Prompting
     Be specific and clear with prompts
     More detailed prompts != better results
     Manage prompts like code (see the sketch below)
     • Version prompts for safe iteration
     • Prompt + output parsers go hand-in-hand
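A minimal sketch of "manage prompts like code": the template is versioned and lives next to the parser that expects its output, so the two can be iterated on together (the template text and JSON schema are assumptions):

```python
# Sketch: a versioned prompt template paired with the parser that matches it.
import json

QUIZ_PROMPT_V2 = (
    "Generate a quiz about {topic} with {num_questions} questions. "
    "Each question must have exactly 4 answers, one of them correct. "
    "Respond with JSON only: a list of objects with the keys "
    "'question', 'answers', 'correct_answer'."
)

def build_quiz_prompt(topic: str, num_questions: int = 5) -> str:
    return QUIZ_PROMPT_V2.format(topic=topic, num_questions=num_questions)

def parse_quiz_output_v2(raw: str) -> list[dict]:
    """Parser tied to QUIZ_PROMPT_V2: change the prompt and this parser together."""
    quiz = json.loads(raw)
    for item in quiz:
        if not {"question", "answers", "correct_answer"} <= item.keys():
            raise ValueError(f"Unexpected quiz item: {item}")
    return quiz
```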
  14. 🎓 Coding with LLMs
     Code defensively (see the sketch below)
     • An LLM call can fail ⇒ Retry and keep the user informed
     • The LLM can give you malformed JSON ⇒ Can you still parse the JSON somehow?
     • The LLM can return empty results ⇒ Can you live with no quizzes or no image?
     • The LLM can be too cautious ⇒ Do you need to change the safety settings?
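A sketch of the defensive pattern: retry failed calls and try to salvage JSON when the model wraps it in extra text; the retry policy and helper names are assumptions, not the app's real code:

```python
# Sketch: retry an LLM call and recover JSON even when the model adds extra text around it.
import json
import re
import time

def call_llm_with_retry(call, max_attempts: int = 3, delay_seconds: float = 2.0):
    """`call` is any zero-argument function that performs the actual LLM request."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            if attempt == max_attempts:
                raise
            print(f"LLM call failed ({error}); retrying {attempt}/{max_attempts}...")
            time.sleep(delay_seconds)

def extract_json(raw: str):
    """Strict parse first; then fall back to the first {...} or [...] block in the text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"(\[.*\]|\{.*\})", raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass
    return None  # caller decides whether the app can live without a quiz or image
```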
  15. 🎓 Coding with LLMs
     Pin model versions (see the sketch below)
     • gemini-1.0-pro refers to the latest version and can silently change (gemini-1.0-pro@001, gemini-1.0-pro@002, …)
     • Use a specific version such as gemini-1.0-pro@001
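In code, pinning is just a matter of which model string you pass; a sketch with the Vertex AI SDK (the exact suffix format for pinned versions varies by platform and over time, so check the current model list rather than copying the string below):

```python
# Sketch: prefer a pinned model version over the floating alias.
from vertexai.generative_models import GenerativeModel

# Floating alias: may silently move to a newer version and change behavior.
model = GenerativeModel("gemini-1.0-pro")

# Pinned version: behavior only changes when you change this string.
# (The slide writes the version as gemini-1.0-pro@001; the suffix format depends on the API surface.)
model = GenerativeModel("gemini-1.0-pro-001")
```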
  16. 🎓 Coding with LLMs
     Consider using a higher-level library like LangChain (see the sketch below)
     • You can use Gemini from Google AI Studio and Vertex AI, but each has a different library
     • In Vertex AI, the libraries for PaLM and Gemini are different
     • Other, non-Google models have their own libraries
     • LangChain can help abstract all of this away
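A sketch of what that abstraction buys you: only the constructor changes per backend, while the rest of the app talks to one chat-model interface. The package and class names are the LangChain integrations as of this writing and may shift in this fast-moving landscape:

```python
# Sketch: swap the model backend behind LangChain's common chat interface.
from langchain_google_vertexai import ChatVertexAI            # Gemini via Vertex AI
# from langchain_google_genai import ChatGoogleGenerativeAI   # Gemini via Google AI Studio
# from langchain_openai import ChatOpenAI                     # a non-Google model

llm = ChatVertexAI(model_name="gemini-1.0-pro")
# llm = ChatGoogleGenerativeAI(model="gemini-1.0-pro")
# llm = ChatOpenAI(model="gpt-4")

# Application code below stays the same regardless of which backend was chosen above.
response = llm.invoke("Generate a 3-question multiple-choice quiz about the Roman Empire.")
print(response.content)
```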
  17. 🎓 Coding with LLMs
     Good old software engineering tricks (see the sketch below)
     • Minimize LLM calls by batching prompts
     • Use parallel calls (e.g. quiz and image generation run in parallel)
     • Cache common responses
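A sketch of those tricks applied to LLM calls: independent generations run in parallel and repeated prompts are served from a cache. generate_quiz and generate_image are placeholders for the app's real generators:

```python
# Sketch: run quiz and image generation in parallel and cache repeated prompts.
import functools
from concurrent.futures import ThreadPoolExecutor

@functools.lru_cache(maxsize=256)
def generate_quiz(topic: str) -> str:
    ...  # placeholder: call the quiz-generation model here

def generate_image(topic: str) -> bytes:
    ...  # placeholder: call the image-generation model here

def generate_quiz_page(topic: str):
    # The two calls are independent, so they can run concurrently.
    with ThreadPoolExecutor() as executor:
        quiz_future = executor.submit(generate_quiz, topic)
        image_future = executor.submit(generate_image, topic)
        return quiz_future.result(), image_future.result()
```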
  18. 🎓 Testing and Validation
     Unit/functional tests are as important as ever
     Easy to check existence or format (see the sketch below)
     • Is this a quiz with 5 questions and 4 answers?
     • Is the image generated or not?
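Format checks like these are ordinary unit tests; a sketch assuming the generated quiz is parsed into a list of dicts, with a hypothetical import standing in for the app's real generator:

```python
# Sketch: structural checks on a generated quiz, independent of content quality.
from myapp.quiz import generate_quiz  # hypothetical import: the app's real generator

def check_quiz_structure(quiz: list[dict]) -> None:
    assert len(quiz) == 5, "expected 5 questions"
    for item in quiz:
        assert item["question"].strip(), "question text must not be empty"
        assert len(item["answers"]) == 4, "expected 4 answers per question"
        assert item["correct_answer"] in item["answers"]

def test_generated_quiz_structure():
    check_quiz_structure(generate_quiz("history"))
```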
  19. 🎓 Testing and Validation
     Testing quality and accuracy is more difficult
     • Is the quiz actually on the topic of history?
     • Is the answer actually correct?
     • Is the generated image appropriate for the quiz? (still an open question)
     Need a way to measure LLM outputs
     • Automate it, and use it as a benchmark to work towards
  20. 🎓 Testing and Validation
     How do you know if the validator works?
     • Use OpenTrivia as a corpus of accurate quizzes (see the sketch below)
     • See how the validator performs against OpenTrivia
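A sketch of building that corpus, assuming "OpenTrivia" refers to the public Open Trivia Database API; the request parameters and field names follow its documented JSON response, but treat them as an assumption and verify against the current docs:

```python
# Sketch: fetch known-correct multiple-choice questions from the Open Trivia Database
# to use as a benchmark corpus for the validator.
import html
import requests

def fetch_opentrivia_questions(amount: int = 10) -> list[dict]:
    response = requests.get(
        "https://opentdb.com/api.php",
        params={"amount": amount, "type": "multiple"},
        timeout=10,
    )
    response.raise_for_status()
    questions = []
    for item in response.json()["results"]:
        questions.append({
            "question": html.unescape(item["question"]),
            "correct_answer": html.unescape(item["correct_answer"]),
            "incorrect_answers": [html.unescape(a) for a in item["incorrect_answers"]],
        })
    return questions
```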
  21. Every multiple-choice question can be decomposed into four assertions of the form:
     Q: question A: answer
     For example, "Who was the first US president? A. Thomas Jefferson, B. Alexander Hamilton, C. George Washington, D. Bill Clinton" can be decomposed into these four assertions:
     • Q: Who was the first US president? A: Thomas Jefferson is False
     • Q: Who was the first US president? A: Alexander Hamilton is False
     • Q: Who was the first US president? A: George Washington is True
     • Q: Who was the first US president? A: Bill Clinton is False
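The decomposition itself is mechanical; a short sketch, with data shapes chosen to match the example above:

```python
# Sketch: turn one multiple-choice question into one true/false assertion per answer option.
def decompose(question: str, answers: list[str], correct_answer: str) -> list[dict]:
    return [
        {"question": question, "answer": answer, "expected": answer == correct_answer}
        for answer in answers
    ]

assertions = decompose(
    "Who was the first US president?",
    ["Thomas Jefferson", "Alexander Hamilton", "George Washington", "Bill Clinton"],
    "George Washington",
)
# Four assertions; only the George Washington one has expected=True.
```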
  22. Evaluation
     Prompt: “In one (and only one) word, are the following assertions true or false?”
     Q: Who was the first US president? A: Thomas Jefferson
     Q: Who was the first US president? A: Alexander Hamilton
     Q: Who was the first US president? A: George Washington
     Q: Who was the first US president? A: Bill Clinton
     LLM: False False True False
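A sketch of that evaluation step: batch the assertions into one prompt and compare the model's one-word verdicts against the known truth values. ask_llm is a hypothetical helper standing in for whatever model call the validator uses:

```python
# Sketch: build the batched evaluation prompt and score the LLM's True/False verdicts.
def build_evaluation_prompt(assertions: list[dict]) -> str:
    header = "In one (and only one) word, are the following assertions true or false?\n"
    lines = [f"Q: {a['question']} A: {a['answer']}" for a in assertions]
    return header + "\n".join(lines)

def score_verdicts(llm_output: str, assertions: list[dict]) -> float:
    """Fraction of assertions where the LLM's verdict matches the known answer."""
    verdicts = [word.strip().lower() == "true" for word in llm_output.split()]
    matches = sum(v == a["expected"] for v, a in zip(verdicts, assertions))
    return matches / len(assertions)

# Usage, with `assertions` from the decomposition sketch above:
# prompt = build_evaluation_prompt(assertions)
# accuracy = score_verdicts(ask_llm(prompt), assertions)  # ask_llm is hypothetical
```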
  23. 🎓 Testing and Validation
     Ultimately, you need grounding for more accuracy (e.g. grounding with Google Search)
  24. Is it possible to have a more dynamic and richer quiz app with the help of GenAI?
     7 years ⇒ 7 weeks
  25. Thank you
     Marc Cohen, Developer Advocate at Google, [email protected]
     Mete Atamel, Developer Advocate at Google, @meteatamel, atamel.dev, speakerdeck.com/meteatamel