How Large Language Models Are Trained

1. Data Cleaning: Web pages, code repos, books, papers → deduplicated, filtered, token-balanced
2. Tokenization: Raw text → sub-word tokens (e.g. Byte-Pair Encoding; sketched below)
3. Self-Supervised Pre-Training: A Transformer learns to predict the next token on 100B+ tokens across thousands of GPUs (see the training-step sketch after this list)
4. Alignment & Fine-Tuning: Turns the raw next-token predictor into a helpful, policy-aligned assistant
5. Evaluation & Benchmarking: Benchmarks (MMLU, HumanEval, SWE-bench) + adversarial tests
6. Deployment & Continuous Refresh: Prompt filters, safety layers, telemetry for quality & drift
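
To make the tokenization step concrete, here is a minimal Byte-Pair Encoding sketch in plain Python. The toy corpus, the number of merges, and the `</w>` end-of-word marker are illustrative assumptions, not the vocabulary or settings of any production tokenizer.

```python
# Toy Byte-Pair Encoding: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the (word -> frequency) table."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical toy corpus: words split into characters, with an end-of-word marker.
words = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merges = []
for _ in range(10):                      # number of merges is a toy setting
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]    # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges[:3])   # e.g. [('e', 's'), ('es', 't'), ('est', '</w>')]
```

On this corpus the loop first merges ('e', 's'), then ('es', 't'), so frequent substrings become single tokens. A real tokenizer learns tens of thousands of merges over far larger corpora, but the mechanism is the same.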
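The pre-training step boils down to minimizing cross-entropy on next-token prediction. The sketch below shows one such training step with a tiny PyTorch model; the layer sizes, batch shape, and random token IDs are placeholder assumptions standing in for a real Transformer and a real data pipeline, not any particular LLM's configuration.

```python
# One next-token-prediction training step on a tiny stand-in Transformer.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 32, 4

# Embedding -> Transformer encoder with a causal mask -> projection to vocab logits.
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # placeholder token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]           # shift by one: predict next token

causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))
hidden = encoder(embed(inputs), mask=causal_mask)          # mask blocks attention to future tokens
logits = head(hidden)                                      # (batch, seq_len - 1, vocab_size)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Production pre-training repeats this same objective over 100B+ tokens, with billions of parameters, sharded optimizers, and data/tensor/pipeline parallelism spread across thousands of GPUs, as the outline above notes.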