Slide 3
1. Compute requirements
Llama3 70B training requires more than 1.2 TB of VRAM.
Scaling law [1]:
FLOPS ≈ 6 × Parameters × Tokens
Chinchilla law [2]:
Models need to be trained with ~20 × (num. parameters) tokens
Parameters     FLOPS       Tokens
1 Billion      1.21e+20    20.2 Billion
10 Billion     1.23e+23    205.1 Billion
175 Billion    3.85e+24    3.7 Trillion
1 Trillion     1.27e+26    21.2 Trillion
10 Trillion    1.30e+28    216.2 Trillion
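A minimal sketch of the two rules of thumb above: the FLOPS ≈ 6 × N × D estimate [1] and the ~20 tokens-per-parameter heuristic [2]. Note the exact token counts in the table are the fitted estimates from Hoffmann et al., so this 20× approximation only reproduces them roughly; the function names are illustrative, not from any library.

```python
def compute_optimal_tokens(num_params: float) -> float:
    """Chinchilla rule of thumb: roughly 20 training tokens per parameter [2]."""
    return 20 * num_params

def training_flops(num_params: float, num_tokens: float) -> float:
    """Kaplan et al. approximation: roughly 6 FLOPs per parameter per token [1]."""
    return 6 * num_params * num_tokens

# Reproduce the table approximately for 1B .. 10T parameters.
for n in (1e9, 10e9, 175e9, 1e12, 10e12):
    d = compute_optimal_tokens(n)
    print(f"{n:9.2e} params -> {d:9.2e} tokens, {training_flops(n, d):9.2e} FLOPs")
```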
VRAM consumption for Llama3 70B (without activations etc.):

Component               Precision    Memory
Parameters              FP32/BF16    420 GB
Gradients               FP32         280 GB
Adam optimizer states   FP32         560 GB
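A minimal sketch of the per-parameter accounting behind the 1.2 TB+ figure, assuming BF16/FP32 mixed-precision training with Adam: a BF16 working copy plus an FP32 master copy of the weights, FP32 gradients, and two FP32 Adam moments. Activations, buffers, and fragmentation are ignored, as on the slide.

```python
NUM_PARAMS = 70e9  # Llama3 70B
GB = 1e9

params_bytes    = NUM_PARAMS * (2 + 4)  # BF16 copy + FP32 master copy -> 420 GB
gradients_bytes = NUM_PARAMS * 4        # FP32 gradients               -> 280 GB
optimizer_bytes = NUM_PARAMS * (4 + 4)  # Adam first + second moments  -> 560 GB

total = params_bytes + gradients_bytes + optimizer_bytes
print(f"Parameters: {params_bytes / GB:.0f} GB")
print(f"Gradients:  {gradients_bytes / GB:.0f} GB")
print(f"Optimizer:  {optimizer_bytes / GB:.0f} GB")
print(f"Total:      {total / GB:.0f} GB (> 1.2 TB, before activations)")
```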
[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.