
Democratizing Large Language Models

Lothar Wieske
May 16, 2024

Generative AI is on everyone's lips. Enthusiastic praise from early users for OpenAI's ChatGPT (175 billion trainable parameters), released in November 2022, quickly led relevant analysts to project productivity gains for companies and the global economy. The stock market has gone wild as well, with the daily gains and losses of major players sometimes reaching double-digit percentages.

The talk is divided into four parts. In the first part, we look at the developments of the past year driven by the Attention and Diffusion architectural styles, with an increasing shift toward smaller and more open models. In the second part, we take a closer look at where the parameters in Transformer (Attention) architectures actually sit and what this means for choosing the number and size of CPUs and GPUs for training and inference. In the third part, we trace the technological development of graphics cards, or accelerators in general, at the top dog NVIDIA (in particular the brand-new Blackwell announced at GTC 2024), but also at AMD and the large cloud providers. And in the fourth part, a small practical demo shows how democratization through smaller and more open models makes the possibilities of Generative AI easier, faster and more widely available.


Transcript

  1. Eugène Delacroix - La Liberté guidant le peuple / Wikipedia

    Democratizing Large Language Models / Lothar Wieske
  2. 2024 AI Index Report – Top Takeaways

    • AI beats humans on some tasks, but not on all
    • Industry continues to dominate frontier AI research
    • Frontier models get way more expensive
    • The United States leads China, the EU, and the U.K. as the leading source of top AI models
    • Robust and standardized evaluations for LLM responsibility are seriously lacking
    • Generative AI investment skyrockets
    • The data is in: AI makes workers more productive and leads to higher quality work
    • Scientific progress accelerates even further, thanks to AI
    • The number of AI regulations in the United States sharply increases
    • People across the globe are more cognizant of AI's potential impact, and more nervous
  3. [Chart] Training compute of notable machine learning models by domain (language, vision, multimodal), 2012–23; training compute in petaFLOP, log scale. Models shown range from AlexNet and the original Transformer up to BERT-Large, RoBERTa Large, GPT-3 175B (davinci), Megatron-Turing NLG 530B, PaLM (540B), Llama 2-70B, GPT-4, Claude 2 and Gemini Ultra. Source: Epoch, 2023 | 2024 AI Index Report, Figure 1.3.7
  4. [Chart] Foundation models (% of total) by access type, 2019–23. 2023 shares: 65.77% open, 18.79% no access, 15.44% limited. Source: Bommasani et al., 2023 | 2024 AI Index Report, Figure 1.3.14
  5. [Chart] Estimated training cost of select AI models, 2017–23 (in U.S. dollars): Transformer $930; BERT-Large $3,288; RoBERTa Large $160,018; GPT-3 175B (davinci) $4,324,883; Megatron-Turing NLG 530B $6,405,653; LaMDA $1,319,586; PaLM (540B) $12,389,056; GPT-4 $78,352,034; Llama 2 70B $3,931,897; Gemini Ultra $191,400,000. Source: Epoch, 2023 | 2024 AI Index Report, Figure 1.3.21
  6. https://unsplash.com/de/fotos/TOzgRFJ0JxY

    "Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot." (On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?)
  7. Hoffmann Scaling Laws / Chinchilla / 2022

    1.4T tokens to train an LLM of size 70B parameters: around 20 text tokens per parameter. (https://unsplash.com/de/fotos/foto-von-bibliothekssaal-dsvJgiBJTOs)
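A back-of-the-envelope sketch of that rule of thumb, assuming the common approximations D ≈ 20·N for the compute-optimal token count and C ≈ 6·N·D for total training compute (neither formula is code from the talk):

```python
# Chinchilla-style estimate (sketch; D ~ 20*N and C ~ 6*N*D are common heuristics,
# not the talk's own code).

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal number of training tokens for n_params parameters."""
    return tokens_per_param * n_params

def training_compute_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs via the C ~ 6*N*D heuristic."""
    return 6.0 * n_params * n_tokens

n = 70e9                                  # 70B parameters, as on the slide
d = chinchilla_optimal_tokens(n)          # ~1.4e12 tokens (1.4T)
print(f"tokens: {d:.2e}  compute: {training_compute_flops(n, d):.2e} FLOPs")
```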
  8. <<# training tokens>> : <<# model parameters>>, on a scale from undertrained (2:1) through roughly compute-optimal (20:1) to overtrained (200:1 and beyond)

    GPT-3 / 175B: 1.7:1
    Llama / 65B: 22:1
    Llama-2 / 70B: 29:1
    Llama-3 / 70B: 214:1
    Phi / 65B: 22:1
    Phi-2 / 2.7B: 519:1
    Gemma / 7B: 857:1
    Mistral / 7B: 1143:1
    Phi-3-mini / 3.8B: 868:1
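Several of these ratios can be reproduced with a few lines; the token counts used below are the published figures that also appear in the Llama table later in the deck, plus GPT-3's 300B tokens, and the 20:1 threshold is the Chinchilla heuristic from the previous slide:

```python
# Tokens-per-parameter ratios (sketch, using published token counts).
MODELS = {
    "GPT-3 175B":  (175e9, 300e9),
    "Llama 65B":   (65e9,  1.4e12),
    "Llama-2 70B": (70e9,  2e12),
    "Llama-3 70B": (70e9,  15e12),
}
CHINCHILLA_RATIO = 20.0   # ~20 tokens per parameter, from the previous slide

for name, (params, tokens) in MODELS.items():
    ratio = tokens / params
    label = "undertrained" if ratio < CHINCHILLA_RATIO else "trained beyond the Chinchilla-optimal point"
    print(f"{name:12s} {ratio:6.1f}:1  {label}")
```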
  9. Blackwell GPU (speedups relative to Hopper)

    • FP8: 2.0 PFLOPS (2.5X)
    • new FP6: 2.0 PFLOPS (2.5X)
    • new FP4: 4.0 PFLOPS (5X)
    • HBM model size: 740B params (6X)
    • HBM bandwidth: 34T params/sec (5X)
    • NVLINK: 7.2 TB/s (4X)
  10. Llama family: training tokens, parameters, context length, attention

    Model     Training  Params  Context  Attention
    Llama 1   1.0T      7B      2K       MHA
    Llama 1   1.0T      13B     2K       MHA
    Llama 1   1.4T      33B     2K       MHA
    Llama 1   1.4T      65B     2K       MHA
    Llama 2   2T        7B      4K       MHA
    Llama 2   2T        13B     4K       MHA
    Llama 2   2T        34B     4K       GQA
    Llama 2   2T        70B     4K       GQA
    Llama 3   15T       8B      8K       GQA
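The GQA entries refer to grouped-query attention: several query heads share one key/value head, which shrinks the KV cache (one KV head for all query heads gives MQA). A minimal PyTorch sketch of just that head-sharing step, not Llama's actual implementation (causal mask and rotary embeddings omitted):

```python
import torch

# Grouped-query attention reduced to the head-sharing step (illustrative sketch only).
def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads          # query heads sharing one KV head
    k = k.repeat_interleave(group_size, dim=1)    # expand KV heads to match query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

b, s, d = 1, 16, 64
q = torch.randn(b, 8, s, d)   # 8 query heads
k = torch.randn(b, 2, s, d)   # 2 KV heads: groups of 4 query heads share one KV head (GQA)
v = torch.randn(b, 2, s, d)   # n_kv_heads == 1 would be MQA; n_kv_heads == 8 plain MHA
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 8, 16, 64])
```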
  11. Transformers predict the next token given some tokens: causal language modelling predicts the probability distribution of the next token.
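As an illustration of that next-token distribution, a short sketch using the Hugging Face transformers library; the choice of GPT-2 is mine, purely because it is small:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Causal LM sketch: score the probability distribution of the next token given a prefix.
tokenizer = AutoTokenizer.from_pretrained("gpt2")          # small model chosen for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Large language models predict the next", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # (batch, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the vocabulary
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r:>12}  {p.item():.3f}")
```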
  12. Floating-point formats (exponent bits + mantissa bits, sign bit omitted)

    fp32:     8 exponent bits, 23 mantissa bits
    tf32:     8 exponent bits, 10 mantissa bits
    bfloat16: 8 exponent bits, 7 mantissa bits
    fp16:     5 exponent bits, 10 mantissa bits
    fp8/e4m3: 4 exponent bits, 3 mantissa bits
    fp8/e5m2: 5 exponent bits, 2 mantissa bits
    fp4:      2 exponent bits, 1 mantissa bit
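A small sketch of what those exponent/mantissa splits mean for dynamic range and precision. The figures are rough IEEE-style approximations that ignore format-specific special values, and the e2m1 split for fp4 is an assumption, since the slide only names the format:

```python
# Exponent/mantissa bit splits per format (sign bit omitted); range/precision are
# rough IEEE-style approximations, and fp4's e2m1 split is assumed.
FORMATS = {
    "fp32":     (8, 23),
    "tf32":     (8, 10),
    "bfloat16": (8, 7),
    "fp16":     (5, 10),
    "fp8/e4m3": (4, 3),
    "fp8/e5m2": (5, 2),
    "fp4":      (2, 1),
}

for name, (e_bits, m_bits) in FORMATS.items():
    bias = 2 ** (e_bits - 1) - 1              # IEEE-style exponent bias
    max_magnitude = 2.0 ** (bias + 1)         # rough order of the largest representable value
    epsilon = 2.0 ** (-m_bits)                # spacing between 1.0 and the next value up
    print(f"{name:9s}  exp={e_bits}  man={m_bits}  ~max={max_magnitude:g}  eps={epsilon:g}")
```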
  13. https://unsplash.com/de/fotos/lavendelfeld-c1Jp-fo53U8

    The mistral constantly disperses clouds and brings dry air, making the south of France famous for its cloudless weather. The low clouds formed before the mistral are carried away by the current, and no new clouds form in the dry air.
  14. https://unsplash.com/de/fotos/weisses-segelboot-auf-dem-gewasser-bei-sonnenuntergang-U45J8RWFWvA

    Running on an NVIDIA A100 GPU with 80 GB:
    • a Mamba model with 1.4 billion parameters produced 1,446 tokens per second
    • a transformer with 1.3 billion parameters produced 344 tokens per second
  15. % ollama serve
    <<…>>
    [GIN-debug] POST   /api/pull
    [GIN-debug] POST   /api/generate
    [GIN-debug] POST   /api/chat
    [GIN-debug] POST   /api/embeddings
    [GIN-debug] POST   /api/create
    [GIN-debug] POST   /api/push
    [GIN-debug] POST   /api/copy
    [GIN-debug] DELETE /api/delete
    [GIN-debug] POST   /api/show
    [GIN-debug] POST   /api/blobs/:digest
    [GIN-debug] HEAD   /api/blobs/:digest
    [GIN-debug] POST   /v1/chat/completions
    [GIN-debug] GET    /
    [GIN-debug] GET    /api/tags
    [GIN-debug] GET    /api/version
    [GIN-debug] HEAD   /
    [GIN-debug] HEAD   /api/tags
    [GIN-debug] HEAD   /api/version
    <<…>>
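Those GIN routes are Ollama's local HTTP API; a minimal sketch of calling /api/generate from Python, assuming `ollama serve` is running on the default port 11434 and the llama3 model has been pulled:

```python
import json
import urllib.request

# Minimal client for Ollama's /api/generate endpoint (sketch; prompt is just an example).
payload = {
    "model": "llama3",
    "prompt": "What is the Ackermann function?",
    "stream": False,          # single JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```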
  16. % ollama run llama3 'What is the Ackermann function?'
    <…>
    A fascinating topic! The Ackermann function, named after Wilfried Ackermann, is a mathematical function that grows extremely rapidly. In fact, it's one of the fastest-growing functions known in mathematics. The Ackermann function is defined recursively as follows:
    1. `A(0, 0) = 0`
    2. `A(m, 0) = m + 1` for any positive integer `m`
    3. `A(m, n + 1) = A(A(m - 1, n), n)` if `m > 0`
    <<…>>
    As you can see, the function grows rapidly: `A(m, 0)` becomes larger and larger as `m` increases. In fact, it's much faster than many other well-known functions, like Ackermann's function itself! The Ackermann function has some remarkable properties:
    1. It's a non-principal recursive function, meaning that its growth rate is not bounded by any fixed power of `n`.
    2. It grows much faster than the Ackermann function itself (which is already quite rapid).
    3. The function can be shown to be uncomputable in the sense that it cannot be computed exactly using standard algorithms.
    The Ackermann function has been studied extensively in mathematics, particularly in the areas of computability theory and mathematical logic. It's an important example of a rapidly growing function that challenges our understanding of computation and mathematical structures.
    %
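For comparison with the model's answer, which credits "Wilfried" rather than Wilhelm Ackermann and states a non-standard recursion, here is the usual Ackermann–Péter definition as a plain Python sketch:

```python
import sys

# Standard Ackermann–Péter function, for comparison with the llama3 output above:
#   A(0, n) = n + 1
#   A(m, 0) = A(m - 1, 1)              for m > 0
#   A(m, n) = A(m - 1, A(m, n - 1))    for m > 0, n > 0
def ackermann(m: int, n: int) -> int:
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

sys.setrecursionlimit(100_000)   # slightly larger inputs already recurse very deeply
print(ackermann(2, 3))           # 9
print(ackermann(3, 3))           # 61
```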
  17. LM Studio

    write the ackermann function in lisp …
    write the ackermann function non-recursive in rust …
    write the ackermann function in COBOL …
  18. The Ackermann function is a mathematical function that is recursively defined. It grows very rapidly, and is an example of a function that is not primitive recursive. It's often used to show the limits of computable functions in the context of theoretical computer science.
  19. https://unsplash.com/de/fotos/silber-iphone-6-auf-blauer-oberflache-Wzs4-QEmCUQ

    OpenELM (Open Source Efficient LLM) / Apple
    #params: 270M / 450M / 1.08B / 3.04B
    Llama Tokenizer, RMSNorm, RoPE, GQA, SwiGLU, Flash Attention, Layer-wise scaling
  20. https://unsplash.com/de/fotos/graue-felsinsel-wU-DD2LB2fA

    OpenELM: "The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. … our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors."
  21. https://www.semianalysis.com/p/google-we-have-no-moat-and-neither

    "Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI" (semianalysis.com, May 04, 2023): "… we aren't positioned to win this arms race and neither is OpenAI … Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B."
  22. https://unsplash.com/de/fotos/person-die-mit-brauner-ledertasche-spazieren-geht-6dW3xyQvcYE

    … small and open is the new black …
    … context length is the new currency …
    … iterate from simple to complex prompting …
    … iterate from one-shot to few-shot learning …
    … seldom: finetune a model …
    … consider applying an agentic workflow …
    … get familiar with 7/8B models …
    … monitor closely phone models …
    … look at new ways of scaling transformers …