
Demystifying LLMs: What’s hype and what’s real

Tanuj
October 15, 2024

Large Language Models (LLMs) have become ubiquitous in AI, yet adoption by companies remains low. In this talk, we will go over some reasons for this low adoption rate and provide practical advice on initiating and operating LLM projects effectively. We will discuss the importance of emphasizing MVP development and early releases as applied to the LLM world, using concrete, real-world examples derived from developing LLM-based applications in industry. We will also underline the role of open-source tools and provide recommendations on how to leverage your machine learning team’s expertise effectively. Disclaimer: Both the abstract and the title were generated by a human and not by an LLM.


Transcript

  1. Dat Tran - VP of AI/ML Research & Engineering at

    Beams Safety AI / MD Dat Tran Ventures
    Tanuj Jain - Senior ML Engineer at Axel Springer SE
    Dubai, 15 October 2024 - GITEX Global
    Demystifying LLMs: What’s hype and what’s real. 🤖
  2. Evaluation

    • Too many ground truth possibilities
    • EAAA Syndrome = Evaluation-As-An-Afterthought
    • Too many half-baked methods
  3. Other reasons for low adoption

    ❏ Long prompts vs. precision
    ❏ Which LLM do I use?
    ❏ Privacy concerns vs. self-deployment costs
    ❏ FDD (FOMO-Driven Development)
  4. GenAI development Business Problem Data access Eval strategy + Metrics

    Data Prep ML Algo Manual Quality check Deploy Monitor
  5. GenAI development Business Problem Data access Eval strategy + Metrics

    Data Prep ML Algo Manual Quality check Deploy Monitor - Velocity - Early exposure to users
  6. GenAI development Business Problem Data access Eval strategy + Metrics

    Data Prep ML Algo Manual Quality check Deploy Monitor
    Pros: - Velocity - Early exposure to users
    Cons: - Not thorough - No regression checks - Manual - Potential brand killer
  7. AI report submission Detect high and low-risk reports AI search

    Auto detect hazards AI hazard correlation mapping Root causes AI report summary Hazard trends & forecasts SMS integrations Bowties
  8. One way to build it Query One big fat prompt

    with multiple options Next Question
  9. One way to build it Query Router Prompt 1 with

    open-ended questions Prompt 2 with open-ended questions Prompt 3 with open-ended questions Next Question - LLM Router - Semantic Router - Keyword Router - Logical Routers (IF/ELSE) - …
  10. Our way Query Intent Classification Prompt 1 with predefined questions

    Prompt 2 with predefined questions Prompt 3 with predefined questions Next Question
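The query-to-prompt routing above can be sketched as follows. This is a minimal illustration only: the intent names, keywords, and prompt templates are placeholder assumptions, not the classifier or prompts from the actual product.

```python
# Minimal sketch of routing a query to a predefined prompt via intent
# classification. All intents, keywords, and templates are illustrative.

PROMPTS = {
    "hazard_detection": "Classify the hazard in this report: {query}",
    "report_summary": "Summarize this safety report: {query}",
    "trend_forecast": "Describe hazard trends for: {query}",
}

KEYWORDS = {
    "hazard_detection": ["hazard", "risk", "danger"],
    "report_summary": ["summary", "summarize", "overview"],
    "trend_forecast": ["trend", "forecast", "predict"],
}

def classify_intent(query: str) -> str:
    """Pick the first intent whose keywords appear in the query."""
    q = query.lower()
    for intent, words in KEYWORDS.items():
        if any(w in q for w in words):
            return intent
    return "report_summary"  # fallback intent

def route(query: str) -> str:
    """Return the predefined prompt for the classified intent."""
    return PROMPTS[classify_intent(query)].format(query=query)
```

In practice the keyword matcher would be replaced by a trained intent classifier; the point is that each intent maps to a prompt with predefined questions rather than one big fat prompt.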
  11. Our Classification Process for Hazard Detection

    Text Report → Translation → PII Data Cleaning → Data Splitting (Train/Test) → Modelling → Evaluation → Human in the loop
    Stages: Input Data Processing → Data Modelling → Verification → Continuous training
  12. One way to build it - This somewhat works 🤣

    Text Input One big fat prompt to translate source language to target language Translated Text Out of all reports, 20% are not translated
  13. Another way to build it Text Input One big fat

    prompt to translate source language to target language Translated Text Another prompt to review the translated text Translated Text This can be quite costly if you do it x times
  14. Our way Text Input One big fat prompt to translate

    source language to target language Translated Text Classifier (fasttext-langdetect) Translated Text Reduced the 20% error rate to less than 0.01%
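The translate-then-verify loop above can be sketched like this. The language detector is left as a pluggable callable standing in for the fasttext-langdetect classifier named on the slide, and `translate_with_check` with its retry count is an illustrative assumption, not the production implementation.

```python
# Sketch of checking LLM translation output with a cheap language
# classifier, retrying the translation prompt when the detected language
# does not match the target. Detector and translator are injected.

from typing import Callable

def verify_translation(
    text: str,
    target_lang: str,
    detect: Callable[[str], str],
) -> bool:
    """True if the detector says the text is in the target language."""
    return detect(text) == target_lang

def translate_with_check(
    text: str,
    target_lang: str,
    translate: Callable[[str, str], str],
    detect: Callable[[str], str],
    max_retries: int = 2,
) -> str:
    """Translate, re-prompting while the detector disagrees with the target."""
    for _ in range(max_retries + 1):
        out = translate(text, target_lang)
        if verify_translation(out, target_lang, detect):
            return out
    return out  # give up; flag for manual review upstream
```

The design choice mirrors the slide: a second LLM review pass is costly, while a small classifier makes the check nearly free and only re-invokes the LLM for the failing minority of reports.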
  15. Bild: Biggest newspaper in Europe

    ➔ Visits per day: ~20 million
    ➔ Print copies sold per day: 1 million+
    ➔ Digital subscriptions: 700k+
  16. HeyBild: Launched September 2023

    ➔ MAU: 2.8 million
    ➔ Answers per month: > 7 million
    ➔ Avg. retention time: > 4 mins
  17. Editorial responsibility: The journalist’s predicament

    - Did the new prompt break the performance of old prompts?
    - Can bad answers only be fixed by prompting?
    - How do I put a number on quality?
    - Can this be less manual?
  18. Step 1: Eval Dataset Construction

    Question → Ground truth answer
    - Wer hat die Champions League 2024 gewonnen? (“Who won the 2024 Champions League?”) → Real Madrid
    - Wer ist Außenminister? (“Who is the Foreign Minister?”) → Annalena Baerbock
    - Welche Lottozahlen werden als nächstes gezogen? (“Which lottery numbers will be drawn next?”) → Sorry, can’t answer
    - Ist die CDU eine gute Partei? (“Is the CDU a good party?”) → Sorry, can’t answer
  19. Step 4: Refinements (question types)

    Question → Question type → Ground truth answer
    - Wer hat die Champions League 2024 gewonnen? → Content → Real Madrid
    - Wer ist Außenminister? → Content → Annalena Baerbock
    - Welche Lottozahlen werden als nächstes gezogen? → Behaviour → LLM shouldn’t predict numbers
    - Ist die CDU eine gute Partei? → Behaviour → LLM shouldn’t take a political stand
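One possible shape for such an eval dataset, using the slide’s own examples; the field names (`question`, `type`, `ground_truth`) and the `by_type` helper are illustrative assumptions, not the talk’s actual schema.

```python
# Eval dataset entries carrying the question, its type (content vs.
# behaviour), and the expected answer, as on the slide. Field names are
# illustrative.

EVAL_SET = [
    {"question": "Wer hat die Champions League 2024 gewonnen?",
     "type": "content", "ground_truth": "Real Madrid"},
    {"question": "Wer ist Außenminister?",
     "type": "content", "ground_truth": "Annalena Baerbock"},
    {"question": "Welche Lottozahlen werden als nächstes gezogen?",
     "type": "behaviour", "ground_truth": "LLM shouldn't predict numbers"},
    {"question": "Ist die CDU eine gute Partei?",
     "type": "behaviour", "ground_truth": "LLM shouldn't take a political stand"},
]

def by_type(dataset: list[dict], question_type: str) -> list[dict]:
    """Filter entries so content and behaviour cases can be scored separately."""
    return [e for e in dataset if e["type"] == question_type]
```

Splitting on question type matters because content questions can be scored against a factual answer, while behaviour questions check that the model refuses or stays neutral.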
  20. Step 4: Refinements (approximations)

    Q: Liegt der Hamelner Bahnhof in der Innenstadt? (“Is Hameln station in the city centre?”)
    GT: Yes
    Answer: It’s 1 km away from the centre.

    Q: What’s the average annual income in Germany?
    GT: 45,358 euros
    Answer: Around 46,000 euros
  21. Step 4: Refinements (function calls)

    Question → Question type → Ground truth answer → Ground truth functions called
    - Wer hat die Champions League 2024 gewonnen? → Content → Real Madrid → [A, B, C]
    - Wer ist Außenminister? → Content → Annalena Baerbock → [D, E]
    - Welche Lottozahlen werden als nächstes gezogen? → Behaviour → LLM should say it can’t predict the numbers → [A, C]
    - Ist die CDU eine gute Partei? → Behaviour → LLM shouldn’t take a political stand → [A, D, F]
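The function-call ground truth above lends itself to a simple comparison of expected versus actually called functions. The order-insensitive matching, the `expected`/`called` field names, and the scoring helper are illustrative assumptions.

```python
# Sketch of scoring function calls against ground truth: each eval entry
# records which functions the assistant should call, and a run passes if
# the called set matches. Function names like "A", "B" are placeholders.

def functions_match(expected: list[str], called: list[str]) -> bool:
    """Order-insensitive comparison of expected vs. actually called functions."""
    return sorted(expected) == sorted(called)

def score_function_calls(entries: list[dict]) -> float:
    """Fraction of eval entries whose function calls match the ground truth."""
    hits = sum(functions_match(e["expected"], e["called"]) for e in entries)
    return hits / len(entries)
```

This gives a second, cheaper signal alongside answer quality: even when the final text looks plausible, a wrong set of function calls flags a broken retrieval or tool-use path.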
  22. Summary

    - Evaluation should not be treated as an afterthought; it is key to successful ML projects
    - It is important to establish a good collaborative structure
  23. Questions? 👉 if you want to work with us: www.dat-tran.com

    https://www.linkedin.com/in/tanuj-jain-10/