1. Compute requirements

Training Llama 3 70B requires more than 1.2 TB of VRAM.

Scaling law [1]:

    FLOPS ≈ 6 × Parameters × Tokens

Chinchilla law [2]: models should be trained on roughly 20 × (number of parameters) tokens.

| Parameters  | FLOPS    | Tokens         |
|-------------|----------|----------------|
| 1 Billion   | 1.21e+20 | 20.2 Billion   |
| 10 Billion  | 1.23e+23 | 205.1 Billion  |
| 175 Billion | 3.85e+24 | 3.7 Trillion   |
| 1 Trillion  | 1.27e+26 | 21.2 Trillion  |
| 10 Trillion | 1.30e+28 | 216.2 Trillion |

VRAM consumption for Llama 3 70B (without activations etc.):

| Component                    | VRAM   |
|------------------------------|--------|
| Parameters (FP32/BF16)       | 420 GB |
| Gradients (FP32)             | 280 GB |
| Adam optimizer states (FP32) | 560 GB |

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
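As a rough illustration of the rules of thumb above, the following Python sketch recomputes the compute table and the Llama 3 70B memory breakdown. It uses the plain 20-tokens-per-parameter ratio (the table's token counts come from the exact Chinchilla fit and are slightly larger), decimal gigabytes (1e9 bytes) to match the figures above, and one plausible reading of the "FP32/BF16" line: an FP32 master copy plus a BF16 working copy of the weights (6 bytes per parameter, i.e. 420 GB for 70B parameters). The function names are illustrative, not from any library.

```python
# Back-of-the-envelope estimates for training compute and optimizer memory.
# Assumptions (see lead-in): FLOPS ~ 6 * params * tokens, tokens ~ 20 * params,
# and a mixed-precision Adam setup with an FP32 master copy plus a BF16 working
# copy of the weights, FP32 gradients, and two FP32 Adam moments per parameter.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPS: 6 * parameters * tokens."""
    return 6 * params * tokens


def chinchilla_tokens(params: float) -> float:
    """Compute-optimal token count: roughly 20 tokens per parameter."""
    return 20 * params


def adam_vram_bytes(params: float) -> dict:
    """Per-component training memory in bytes, excluding activations."""
    return {
        "parameters (FP32 master + BF16 copy)": params * (4 + 2),
        "gradients (FP32)": params * 4,
        "Adam optimizer states (2 x FP32)": params * 8,
    }


if __name__ == "__main__":
    # Reproduce the compute table (approximately).
    for p in (1e9, 10e9, 175e9, 1e12, 10e12):
        t = chinchilla_tokens(p)
        print(f"{p:.0e} params -> {t:.1e} tokens, {training_flops(p, t):.2e} FLOPS")

    # Reproduce the Llama 3 70B memory breakdown (~1.26 TB in total).
    breakdown = adam_vram_bytes(70e9)
    for name, size in breakdown.items():
        print(f"{name}: {size / 1e9:.0f} GB")  # decimal GB to match the table
    print(f"total: {sum(breakdown.values()) / 1e9:.0f} GB")
```

The totals come out to roughly 1,260 GB for 70B parameters, which is where the "more than 1.2 TB of VRAM" figure comes from before activations, temporary buffers, and fragmentation are even counted.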