
Democratizing Large Language Models

Lothar Wieske
May 16, 2024

Generative AI is on everyone's lips. Enthusiastic praise from early users for OpenAI's ChatGPT (175 billion trainable parameters), released in November 2022, quickly led relevant analysts to project productivity gains for companies and the global economy. The stock market has gone wild as well, with the daily gains and losses of major players sometimes reaching double-digit percentages.

The talk is divided into four parts. In the first part, we look at the developments of the past year driven by the Attention and Diffusion architectural styles, with an increasing shift toward smaller and more open models. In the second part, we take a closer look at where the parameters in Transformer (Attention) architectures actually sit and what this means for choosing the number and size of CPUs and GPUs for training and inference. In the third part, we trace the technological development of graphics cards, or accelerators in general, at the top dog NVIDIA (in particular the brand-new Blackwell announced at GTC 2024), but also at AMD and the large cloud providers. And in the fourth part, a small practical demo shows how democratization through smaller and more open models makes the possibilities of Generative AI easier, faster and more widely available.


Transcript

  1. Eugène Delacroix - La Liberté guidant le peuple / Wikipedia

    Democratizing Large Language Models / Lothar Wieske
  2. 2024 AI Index Report – Top Takeaways

    • AI beats humans on some tasks, but not on all
    • Industry continues to dominate frontier AI research
    • Frontier models get way more expensive
    • The United States leads China, the EU, and the U.K. as the leading source of top AI models
    • Robust and standardized evaluations for LLM responsibility are seriously lacking
    • Generative AI investment skyrockets
    • The data is in: AI makes workers more productive and leads to higher quality work
    • Scientific progress accelerates even further, thanks to AI
    • The number of AI regulations in the United States sharply increases
    • People across the globe are more cognizant of AI's potential impact, and more nervous
  3. [Chart] Training compute of notable machine learning models by domain (language, vision, multimodal), 2012–23; training compute in petaFLOP, log scale. Models shown range from AlexNet and the original Transformer up to BERT-Large, RoBERTa Large, GPT-3 175B (davinci), Megatron-Turing NLG 530B, PaLM (540B), Llama 2-70B, GPT-4, Claude 2 and Gemini Ultra. Source: Epoch, 2023 | 2024 AI Index Report, Figure 1.3.7
  4. [Chart] Foundation models (% of total) by access type, 2019–23. 2023 shares: 65.77% open, 18.79% no access, 15.44% limited. Source: Bommasani et al., 2023 | 2024 AI Index Report, Figure 1.3.14
  5. [Chart] Estimated training cost of select AI models, 2017–23 (in U.S. dollars): Transformer $930; BERT-Large $3,288; RoBERTa Large $160,018; GPT-3 175B (davinci) $4,324,883; Megatron-Turing NLG 530B $6,405,653; LaMDA $1,319,586; PaLM (540B) $12,389,056; GPT-4 $78,352,034; Llama 2 70B $3,931,897; Gemini Ultra $191,400,000. Source: Epoch, 2023 | 2024 AI Index Report, Figure 1.3.21
  6. https://unsplash.com/de/fotos/TOzgRFJ0JxY

    "Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot." (On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?)
  7. Hoffmann Scaling Laws / Chinchilla / 2022

    1.4T tokens to train an LLM of size 70B parameters: around 20 text tokens per parameter. (https://unsplash.com/de/fotos/foto-von-bibliothekssaal-dsvJgiBJTOs)
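A back-of-the-envelope sketch of that rule of thumb, assuming the common approximations D ≈ 20·N for the compute-optimal token count and C ≈ 6·N·D for total training compute (neither formula is code from the talk):

```python
# Chinchilla-style estimate (sketch; D ~ 20*N and C ~ 6*N*D are common heuristics,
# not the talk's own code).

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal number of training tokens for n_params parameters."""
    return tokens_per_param * n_params

def training_compute_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs via the C ~ 6*N*D heuristic."""
    return 6.0 * n_params * n_tokens

n = 70e9                                  # 70B parameters, as on the slide
d = chinchilla_optimal_tokens(n)          # ~1.4e12 tokens (1.4T)
print(f"tokens: {d:.2e}  compute: {training_compute_flops(n, d):.2e} FLOPs")
```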
  8. <<# training tokens>> : <<# model parameters>>, on a scale from undertrained (2:1) through roughly compute-optimal (20:1) to overtrained (200:1 and beyond)

    GPT-3 / 175B: 1.7:1
    Llama / 65B: 22:1
    Llama-2 / 70B: 29:1
    Llama-3 / 70B: 214:1
    Phi / 65B: 22:1
    Phi-2 / 2.7B: 519:1
    Gemma / 7B: 857:1
    Mistral / 7B: 1143:1
    Phi-3-mini / 3.8B: 868:1
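Several of these ratios can be reproduced with a few lines; the token counts used below are the published figures that also appear in the Llama table later in the deck, plus GPT-3's 300B tokens, and the 20:1 threshold is the Chinchilla heuristic from the previous slide:

```python
# Tokens-per-parameter ratios (sketch, using published token counts).
MODELS = {
    "GPT-3 175B":  (175e9, 300e9),
    "Llama 65B":   (65e9,  1.4e12),
    "Llama-2 70B": (70e9,  2e12),
    "Llama-3 70B": (70e9,  15e12),
}
CHINCHILLA_RATIO = 20.0   # ~20 tokens per parameter, from the previous slide

for name, (params, tokens) in MODELS.items():
    ratio = tokens / params
    label = "undertrained" if ratio < CHINCHILLA_RATIO else "trained beyond the Chinchilla-optimal point"
    print(f"{name:12s} {ratio:6.1f}:1  {label}")
```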
  9. Blackwell GPU (speedups relative to Hopper)

    • FP8: 2.0 PFLOPS (2.5X)
    • new FP6: 2.0 PFLOPS (2.5X)
    • new FP4: 4.0 PFLOPS (5X)
    • HBM model size: 740B params (6X)
    • HBM bandwidth: 34T params/sec (5X)
    • NVLINK: 7.2 TB/s (4X)
  10. Llama family: training tokens, parameters, context length, attention

    Model     Training  Params  Context  Attention
    Llama 1   1.0T      7B      2K       MHA
    Llama 1   1.0T      13B     2K       MHA
    Llama 1   1.4T      33B     2K       MHA
    Llama 1   1.4T      65B     2K       MHA
    Llama 2   2T        7B      4K       MHA
    Llama 2   2T        13B     4K       MHA
    Llama 2   2T        34B     4K       GQA
    Llama 2   2T        70B     4K       GQA
    Llama 3   15T       8B      8K       GQA
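The GQA entries refer to grouped-query attention: several query heads share one key/value head, which shrinks the KV cache (one KV head for all query heads gives MQA). A minimal PyTorch sketch of just that head-sharing step, not Llama's actual implementation (causal mask and rotary embeddings omitted):

```python
import torch

# Grouped-query attention reduced to the head-sharing step (illustrative sketch only).
def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads          # query heads sharing one KV head
    k = k.repeat_interleave(group_size, dim=1)    # expand KV heads to match query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

b, s, d = 1, 16, 64
q = torch.randn(b, 8, s, d)   # 8 query heads
k = torch.randn(b, 2, s, d)   # 2 KV heads: groups of 4 query heads share one KV head (GQA)
v = torch.randn(b, 2, s, d)   # n_kv_heads == 1 would be MQA; n_kv_heads == 8 plain MHA
print(grouped_query_attention(q, k, v).shape)   # torch.Size([1, 8, 16, 64])
```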
  11. Transformers predict the next token given some tokens: causal language modelling predicts the probability distribution of the next token.
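As an illustration of that next-token distribution, a short sketch using the Hugging Face transformers library; the choice of GPT-2 is mine, purely because it is small:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Causal LM sketch: score the probability distribution of the next token given a prefix.
tokenizer = AutoTokenizer.from_pretrained("gpt2")          # small model chosen for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Large language models predict the next", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # (batch, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the vocabulary
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r:>12}  {p.item():.3f}")
```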
  12. Floating-point formats (exponent bits + mantissa bits, sign bit omitted)

    fp32:     8 exponent bits, 23 mantissa bits
    tf32:     8 exponent bits, 10 mantissa bits
    bfloat16: 8 exponent bits, 7 mantissa bits
    fp16:     5 exponent bits, 10 mantissa bits
    fp8/e4m3: 4 exponent bits, 3 mantissa bits
    fp8/e5m2: 5 exponent bits, 2 mantissa bits
    fp4:      2 exponent bits, 1 mantissa bit
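A small sketch of what those exponent/mantissa splits mean for dynamic range and precision. The figures are rough IEEE-style approximations that ignore format-specific special values, and the e2m1 split for fp4 is an assumption, since the slide only names the format:

```python
# Exponent/mantissa bit splits per format (sign bit omitted); range/precision are
# rough IEEE-style approximations, and fp4's e2m1 split is assumed.
FORMATS = {
    "fp32":     (8, 23),
    "tf32":     (8, 10),
    "bfloat16": (8, 7),
    "fp16":     (5, 10),
    "fp8/e4m3": (4, 3),
    "fp8/e5m2": (5, 2),
    "fp4":      (2, 1),
}

for name, (e_bits, m_bits) in FORMATS.items():
    bias = 2 ** (e_bits - 1) - 1              # IEEE-style exponent bias
    max_magnitude = 2.0 ** (bias + 1)         # rough order of the largest representable value
    epsilon = 2.0 ** (-m_bits)                # spacing between 1.0 and the next value up
    print(f"{name:9s}  exp={e_bits}  man={m_bits}  ~max={max_magnitude:g}  eps={epsilon:g}")
```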
  13. https://unsplash.com/de/fotos/lavendelfeld-c1Jp-fo53U8

    The mistral constantly disperses clouds and brings dry air, making the south of France famous for its cloudless weather. The low clouds formed before the mistral are carried away by the current, and no new clouds form in the dry air.
  14. https://unsplash.com/de/fotos/weisses-segelboot-auf-dem-gewasser-bei-sonnenuntergang-U45J8RWFWvA

    Running on an NVIDIA A100 GPU with 80 GB:
    • a Mamba model with 1.4 billion parameters produced 1,446 tokens per second
    • a transformer with 1.3 billion parameters produced 344 tokens per second
  15. % ollama serve
    <<…>>
    [GIN-debug] POST   /api/pull
    [GIN-debug] POST   /api/generate
    [GIN-debug] POST   /api/chat
    [GIN-debug] POST   /api/embeddings
    [GIN-debug] POST   /api/create
    [GIN-debug] POST   /api/push
    [GIN-debug] POST   /api/copy
    [GIN-debug] DELETE /api/delete
    [GIN-debug] POST   /api/show
    [GIN-debug] POST   /api/blobs/:digest
    [GIN-debug] HEAD   /api/blobs/:digest
    [GIN-debug] POST   /v1/chat/completions
    [GIN-debug] GET    /
    [GIN-debug] GET    /api/tags
    [GIN-debug] GET    /api/version
    [GIN-debug] HEAD   /
    [GIN-debug] HEAD   /api/tags
    [GIN-debug] HEAD   /api/version
    <<…>>
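Those GIN routes are Ollama's local HTTP API; a minimal sketch of calling /api/generate from Python, assuming `ollama serve` is running on the default port 11434 and the llama3 model has been pulled:

```python
import json
import urllib.request

# Minimal client for Ollama's /api/generate endpoint (sketch; prompt is just an example).
payload = {
    "model": "llama3",
    "prompt": "What is the Ackermann function?",
    "stream": False,          # single JSON response instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```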
  16. % ollama run llama3 'What is the Ackermann function?'
    <…>
    A fascinating topic! The Ackermann function, named after Wilfried Ackermann, is a mathematical function that grows extremely rapidly. In fact, it's one of the fastest-growing functions known in mathematics. The Ackermann function is defined recursively as follows:
    1. `A(0, 0) = 0`
    2. `A(m, 0) = m + 1` for any positive integer `m`
    3. `A(m, n + 1) = A(A(m - 1, n), n)` if `m > 0`
    <<…>>
    As you can see, the function grows rapidly: `A(m, 0)` becomes larger and larger as `m` increases. In fact, it's much faster than many other well-known functions, like Ackermann's function itself! The Ackermann function has some remarkable properties:
    1. It's a non-principal recursive function, meaning that its growth rate is not bounded by any fixed power of `n`.
    2. It grows much faster than the Ackermann function itself (which is already quite rapid).
    3. The function can be shown to be uncomputable in the sense that it cannot be computed exactly using standard algorithms.
    The Ackermann function has been studied extensively in mathematics, particularly in the areas of computability theory and mathematical logic. It's an important example of a rapidly growing function that challenges our understanding of computation and mathematical structures.
    %
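For comparison with the model's answer, which credits "Wilfried" rather than Wilhelm Ackermann and states a non-standard recursion, here is the usual Ackermann–Péter definition as a plain Python sketch:

```python
import sys

# Standard Ackermann–Péter function, for comparison with the llama3 output above:
#   A(0, n) = n + 1
#   A(m, 0) = A(m - 1, 1)              for m > 0
#   A(m, n) = A(m - 1, A(m, n - 1))    for m > 0, n > 0
def ackermann(m: int, n: int) -> int:
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

sys.setrecursionlimit(100_000)   # slightly larger inputs already recurse very deeply
print(ackermann(2, 3))           # 9
print(ackermann(3, 3))           # 61
```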
  17. LM Studio

    write the ackermann function in lisp …
    write the ackermann function non-recursive in rust …
    write the ackermann function in COBOL …
  18. The Ackermann function is a mathematical function that is recursively defined. It grows very rapidly, and is an example of a function that is not primitive recursive. It's often used to show the limits of computable functions in the context of theoretical computer science.
  19. https://unsplash.com/de/fotos/silber-iphone-6-auf-blauer-oberflache-Wzs4-QEmCUQ

    OpenELM (Open Source Efficient LLM) / Apple
    #params: 270M / 450M / 1.08B / 3.04B
    Llama Tokenizer, RMSNorm, RoPE, GQA, SwiGLU, Flash Attention, Layer-wise scaling
  20. https://unsplash.com/de/fotos/graue-felsinsel-wU-DD2LB2fA

    OpenELM: "The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. … our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors."
  21. https://www.semianalysis.com/p/google-we-have-no-moat-and-neither

    "Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI" (semianalysis.com, May 04, 2023): "… we aren't positioned to win this arms race and neither is OpenAI … Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B."
  22. https://unsplash.com/de/fotos/person-die-mit-brauner-ledertasche-spazieren-geht-6dW3xyQvcYE

    … small and open is the new black …
    … context length is the new currency …
    … iterate from simple to complex prompting …
    … iterate from one-shot to few-shot learning …
    … seldom: finetune a model …
    … consider applying an agentic workflow …
    … get familiar with 7/8B models …
    … monitor closely phone models …
    … look at new ways of scaling transformers …