• Books and articles: Digitized texts from open libraries or licensed publishers.
• Code repositories: For coding models (e.g., GitHub, open-source datasets).
out toxic, biased, or unsafe language. Normalize text (e.g., fixing encoding issues, removing special symbols). Exclude personal or sensitive information.
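As a rough illustration of the normalization and exclusion steps, here is a minimal Python sketch. The function name `clean`, the `[EMAIL]` placeholder, and the email pattern are assumptions made for this example. Real pipelines use far more thorough filters, and this sketch does not attempt toxicity or bias filtering at all.

```python
import re
import unicodedata

# Simplified email pattern, for illustration only (not a full validator).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def clean(text):
    # Normalize: fold compatibility characters (ligatures, non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    # Drop non-printable control characters, keeping newlines and tabs.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Exclude one kind of personal information: email addresses.
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "Contact\u00a0me:  \ufb01rst@example.com\x00"
cleaned = clean(sample)
print(cleaned)
```

Email redaction stands in here for the broader "exclude personal or sensitive information" step; names, phone numbers, and addresses need their own handling.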
chunks called tokens
“Cybersecurity” → ["Cyber", "security"] or sometimes → ["Cy", "ber", "security"]
Are single letters stored as tokens, for typos or rare words?
Yes, kind of. If the tokenizer's vocabulary doesn’t contain a full word, it breaks the word into smaller known pieces, even single letters.
Example: “cybeersecurity” (typo) → ["cy", "bee", "r", "security"]
So every letter can be a fallback token if needed.
Yes, because tokenization looks only at the characters, not the meaning.
Example: “bank” (money) and “bank” (river) → same token.
Meaning comes later, during training, when the model learns context.
same token? No.
“see” and “sea” → different tokens.
“car” and “automobile” → different tokens.
Even when words sound alike or mean the same thing, tokenization doesn’t know that. Meaning is learned during training, inside the transformer.
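The splitting behaviour described above can be sketched with a toy tokenizer. This is a simplified greedy longest-match over a made-up vocabulary, not the algorithm any particular LLM uses (production tokenizers such as BPE learn their pieces from data). The vocabulary contents and the `tokenize` name are assumptions for illustration, and the sketch lowercases its input for simplicity.

```python
import string

# Toy vocabulary, invented for this example. Single letters act as fallbacks.
VOCAB = {"cyber", "security", "cy", "bee", "bank"} | set(string.ascii_lowercase)

def tokenize(word):
    word = word.lower()
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest piece starting at position i first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            # Character outside the toy vocabulary: keep it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("Cybersecurity"))   # ['cyber', 'security']
print(tokenize("cybeersecurity"))  # ['cy', 'bee', 'r', 'security']
print(tokenize("bank"))            # ['bank'], whichever meaning was intended
```

Note that the tokenizer never sees meaning: “bank” yields the same token for money and for rivers, and the typo simply falls back to smaller pieces.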
room.
Transformers
• Transformers are the core architecture behind modern LLMs like GPT, LLaMA, Claude, and Gemini.
• They were introduced in 2017 by Google researchers in the paper “Attention Is All You Need.”
• The paper's central claim: attention mechanisms alone are enough to model relationships between words.
meaningful
• cat: high (“it” most likely refers to “cat”)
• chased: low (action, not reference)
• the: low (neutral)
• dog: medium (somewhat related)
• because: low (connector word)
The cat chased the dog because it was scared.
They’re useful. But not the full picture.
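The high/medium/low levels listed for the example sentence can be pictured as a softmax over relevance scores for the word “it”. The sketch below uses made-up scores chosen only to mirror those levels. A real transformer computes its scores from learned query and key vectors, which this example does not model.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores.values())
    exps = {word: math.exp(s - m) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical raw scores for how relevant each word is to "it".
scores = {"The": 0.1, "cat": 3.0, "chased": 0.5,
          "the": 0.1, "dog": 1.8, "because": 0.3}

weights = softmax(scores)
for word, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {w:.2f}")
```

The weights sum to 1, so raising the score of one word necessarily lowers the share given to the others.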
you need, chat, code, text analysis, or reasoning: Each task might need a different model.
2. Check data alignment: Does the model understand your domain (medical, legal, IoT, etc.)?
3. Balance performance & cost: Choose the smallest model that meets your needs efficiently.
4. Privacy & compliance: If your data is sensitive, prefer on-prem or private LLM options.
5. Evaluate accuracy: Test with your own prompts and metrics. Hallucinations are real!
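For the “evaluate accuracy” step, a tiny harness along these lines is one possible starting point. Everything here is an assumption for illustration: `stub_model` stands in for whatever model call you actually use, the test cases are invented, and substring matching is only a crude metric.

```python
def evaluate(model_fn, cases):
    """Return the fraction of cases whose answer contains the expected text."""
    hits = 0
    for prompt, expected in cases:
        answer = model_fn(prompt)
        if expected.lower() in answer.lower():
            hits += 1
    return hits / len(cases)

def stub_model(prompt):
    # Placeholder for a real model call; returns canned answers.
    canned = {
        "Capital of France?": "The capital of France is Paris.",
        "2 + 2 = ?": "5",  # deliberately wrong, to show a failing case
    }
    return canned.get(prompt, "I don't know.")

cases = [("Capital of France?", "Paris"), ("2 + 2 = ?", "4")]
score = evaluate(stub_model, cases)
print(f"accuracy: {score:.0%}")
```

Running the same cases against two candidate models gives a like-for-like comparison, which also helps with picking the smallest model that is good enough.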
lawyer used ChatGPT to find legal cases for a lawsuit against Avianca Airlines.
• ChatGPT invented fake court cases that sounded real.
• The lawyer didn’t verify them and gave them to the judge.
Microsoft travel article about Ottawa mistakenly listed the Ottawa Food Bank as a tourist attraction.
• It even said, “Consider going into it on an empty stomach.”
• Microsoft removed the article and said the error was caused by poor human oversight in the content process.
LLM suggests or uses a software package/library name that doesn’t actually exist.
• An attacker pre-registers that made-up name on a public package repository (e.g., PyPI, npm) and fills it with malicious code.
• A developer using the AI suggestion installs the package, unknowingly introducing malware or a supply-chain vulnerability.
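One possible defence is to vet AI-suggested dependency names against a list your team has already reviewed before installing anything. The sketch below assumes such an allowlist exists. The `APPROVED` set, the `vet` function, and the package name `fast-json-utilz` are hypothetical. Checking only whether a name exists on the repository is not enough, since the attack works by making the fake name exist.

```python
import re

# Hypothetical allowlist of dependencies the team has already reviewed.
APPROVED = {"requests", "numpy", "pandas"}

def normalize(name):
    # PyPI-style name normalization: lowercase, runs of "-", "_", "." become "-".
    return re.sub(r"[-_.]+", "-", name).lower()

def vet(suggested):
    """Split suggested package names into approved and needs-review lists."""
    approved = {normalize(n) for n in APPROVED}
    ok, review = [], []
    for name in suggested:
        (ok if normalize(name) in approved else review).append(name)
    return ok, review

# "fast-json-utilz" is an invented name standing in for a hallucinated package.
ok, review = vet(["Requests", "fast-json-utilz"])
print("install:", ok)
print("check manually first:", review)
```

Anything in the review list deserves a manual look at the publisher, release history, and source before it goes near a build.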
• Lots of log details
• Broken access issues
• Write detailed, step-by-step prompts.
• Review the code after each fix or new feature.
• Think about infrastructure.
keeps giving the wrong answer.”
“It doesn’t say what I mean.”
What’s really happening
It’s not that the LLM “doesn’t understand.” It’s that the prompt doesn’t guide it clearly enough.
LLMs don’t read minds; they follow patterns.
as if you’re trying to inspire people to protect it.
→ Specific prompts give specific emotions.
Tone Control
Describe the ocean like a scientist.
→ Tone words (“like a poet,” “like a scientist,” “like a child”) change the personality.
ocean to a 5-year-old.
→ Audience decides vocabulary and depth.
And More…
• Add a perspective: “Describe the ocean from a sailor’s point of view.”
• Add a limit: “Describe the ocean in under 10 words.”
• Add a format: “Write it as a tweet.”
• Add a contrast: “Describe the ocean and the desert in one paragraph.”
• Add a creative twist: “Describe the ocean as if it were an AI learning emotions.”
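The prompt modifiers above (tone, audience, perspective, limit, format) can be combined programmatically. Here is a minimal sketch; the `build_prompt` helper and its parameter names are invented for this example, and it covers only some of the modifiers.

```python
def build_prompt(task, tone=None, audience=None, perspective=None,
                 limit=None, fmt=None):
    """Assemble a prompt from a base task plus optional modifiers."""
    parts = [task.rstrip(".")]
    if tone:
        parts.append(f"like {tone}")
    if perspective:
        parts.append(f"from {perspective}'s point of view")
    if audience:
        parts.append(f"for {audience}")
    prompt = " ".join(parts) + "."
    if limit:
        prompt += f" Use under {limit} words."
    if fmt:
        prompt += f" Write it as {fmt}."
    return prompt

print(build_prompt("Describe the ocean"))
print(build_prompt("Describe the ocean", tone="a scientist",
                   limit=10, fmt="a tweet"))
```

Changing one modifier at a time and comparing the answers is a quick way to see which part of a prompt is actually steering the output.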
AI Answer
• Don’t trust tone → check the facts
• Ask: “How do you know?”
• Ask for sources or reasoning
• Rephrase your question and compare
• If it sounds too confident, ask it to slow down and explain
perfect grammar or symmetry
• Generic phrases (e.g. “In today’s fast-paced world…”)
• Overuse of transitions (“Moreover”, “In conclusion”)
• Repeated patterns, smooth but emotionless tone
• No real “voice” or perspective