

Oussama Zekri
February 04, 2025

Large Language Models as Markov Chains - LC2 Seminar Imperial

To enjoy the animations in the slides, I recommend downloading the file and reading it with an appropriate reader (e.g., Acrobat Reader)!

Paper:

https://arxiv.org/abs/2410.02724

Abstract:

Large language models (LLMs) are remarkably efficient across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the LLMs' generalization capabilities remains elusive. In our paper, we approach this task by drawing an equivalence between autoregressive transformer-based language models and Markov chains defined on a finite state space. This allows us to study the multi-step inference mechanism of LLMs from first principles. We relate the obtained results to the pathological behavior observed with LLMs such as repetitions and incoherent replies with high temperature. Finally, we leverage the proposed formalization to derive pre-training and in-context learning generalization bounds for LLMs under realistic data and model assumptions. Experiments with the most recent Llama and Gemma herds of models show that our theory correctly captures their behavior in practice.


Transcript

  1. Large Language Models as Markov Chains. Oussama Zekri¹, LC2 Seminar, Imperial College London, January 23, 2025.
     ¹ ENS Paris-Saclay and Imperial College London, oussama.zekri@ens-paris-saclay.fr
  2. Co-authors: Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko,
     from Huawei Noah's Ark Lab, Inria, and Imperial College London.
  5. Emerging capabilities of pretrained LLMs
     ✓ Emerging capabilities! [Brown et al. (2020), Gruver et al. (2023), Liu et al. (2024), ...]
     × Poorly understood theoretically, with many open problems remaining in the literature.
     Goal: derive theoretical insights.
  8. Background on Autoregressive Language Modeling
     Goal: predict the next word based on previous ones.
     Autoregressive property: each word depends only on past words.
     Example: "I am the" (previous words, i.e. the context) → "danger" (word being predicted).
     Model: the probability of a sequence $(x_1, x_2, \dots, x_N)$ factorizes as
     $P(x_1, x_2, \dots, x_N) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_N \mid x_1, \dots, x_{N-1}) = \prod_{n=1}^{N} P(x_n \mid x_1, \dots, x_{n-1}).$
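A minimal sketch of this chain-rule factorization, with a hypothetical `next_token_probs` standing in for an LLM's softmax output (toy values only, not the paper's model):

```python
import numpy as np

T = 5  # toy vocabulary size

def next_token_probs(context):
    """Toy stand-in for an LLM head: a deterministic distribution over T tokens given the context."""
    seed = hash(tuple(context)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=T)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sequence_log_prob(tokens):
    """log P(x_1, ..., x_N), accumulated one conditional P(x_n | x_1, ..., x_{n-1}) at a time."""
    log_p = 0.0
    for n in range(len(tokens)):
        p = next_token_probs(tokens[:n])  # distribution over the n-th token given the prefix
        log_p += np.log(p[tokens[n]])
    return log_p

print(sequence_log_prob([0, 2, 2, 4]))
```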
  10. Background on Autoregressive Models
     Best models so far: generative Transformers for autoregressive modeling, $f_\Theta^{T,K}$.
     Vocabulary size $T$, context window $K$, parameter set $\Theta$.
     GPT-3: $T = 50257$, $K = 2048$ and $|\Theta| \sim 175$B.
  11. Context Window K
     Figure: a context window of size $K = 7$ shown in navy blue. Top: a sequence of length $N = 4$ ($x_1, \dots, x_4$). Bottom: a sequence of length $N = 10$ ($x_1, \dots, x_{10}$).
  14. Background on Markov Chains
     $\Omega$ is a discrete finite state space, $\Omega = \{z_1, \dots, z_{|\Omega|}\}$.
     Markov chain: a sequence of random variables where each observation depends only on the previous one.
     Mathematical formulation: a process $(Z_n)_{n \ge 0}$ supported on $\Omega$ is a Markov chain if
     $P(Z_{n+1} \mid Z_n, Z_{n-1}, \dots, Z_0) = P(Z_{n+1} \mid Z_n).$
     This is called the Markov property.
     Transition matrix: $Q$ is a square matrix of size $|\Omega|$ defined by $Q(x, y) = P(Z_{n+1} = y \mid Z_n = x)$ for all $x, y \in \Omega$.
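A minimal illustration of these definitions on a toy 3-state chain (made-up transition probabilities, not taken from the paper):

```python
import numpy as np

# Transition matrix Q with Q[x, y] = P(Z_{n+1} = y | Z_n = x); values are illustrative.
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
assert np.allclose(Q.sum(axis=1), 1.0)  # each row is a probability distribution

def simulate(Q, z0, n_steps, rng):
    """Sample a trajectory; the next state depends only on the current one (Markov property)."""
    states = [z0]
    for _ in range(n_steps):
        states.append(rng.choice(len(Q), p=Q[states[-1]]))
    return states

print(simulate(Q, z0=0, n_steps=10, rng=np.random.default_rng(0)))
```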
  19. Are LLMs really Markov chains?
     State space: $\Omega = \{z_1, \dots, z_{|\Omega|}\}$. Markov chain: $P(Z_{n+1} \mid Z_n, Z_{n-1}, \dots, Z_0) = P(Z_{n+1} \mid Z_n)$.
     Example: "I am the" (previous words, i.e. the context) → "danger" (word being predicted).
     × $P(\text{"danger"} \mid \text{"the"})$? LLMs are clearly not Markov chains at the token level ($|\Omega| = T$).
     ✓ $P(\text{"I am the danger"} \mid \text{"I am the"})$? Take the whole sequence as a state... ($|\Omega| = \,?$).
  20. Large Language Models as Markov Chains
  23. Correct State Space
     Vocabulary space $V$ of size $T$.
     ▶ $\Omega = V^*_K$ is the set of all sequences consisting of elements from $V$ with up to $K$ elements.
     Figure: block structure of the transition matrix, with a zero block, a transient class (sequences of length $< K$) and a recurrent class (sequences of length $K$).
     ▶ $|\Omega| = T\,\dfrac{T^K - 1}{T - 1}$. This is $\approx 10^{9628}$ for GPT-3.
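A quick back-of-the-envelope check of this count for GPT-3-scale values (a sketch in log scale; only the order of magnitude matters):

```python
from math import log10

# |Omega| = T (T^K - 1) / (T - 1), the number of sequences of at most K tokens.
T, K = 50257, 2048  # GPT-3 vocabulary size and context window

# log10 |Omega| ~ (K + 1) * log10(T) - log10(T - 1), using T^K - 1 ~ T^K
log10_size = (K + 1) * log10(T) - log10(T - 1)
print(f"|Omega| ~ 10^{log10_size:.0f}")  # roughly 10^9628
```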
  26. Is this Markov chain point of view useful?
     × $|\Omega| = T\,\dfrac{T^K - 1}{T - 1}$ grows exponentially with $K$; $|\Omega| \approx 10^{9628}$ for GPT-3, so $Q_f$ cannot be stored.
     ∼ Model weights, a few GPUs and a single forward pass are all you need to access the row you want in the matrix!
     ✓ Connection to the rich theory of finite Markov chains → insight into the dynamics of LLMs.
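A minimal sketch of that "one forward pass = one row of $Q_f$" observation, assuming the Hugging Face `transformers` API and using `gpt2` purely as an illustrative stand-in for the larger models discussed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM works; chosen only because it is small
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

context = "I am the"
input_ids = tokenizer(context, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]      # logits for the next token only
row = torch.softmax(logits, dim=-1)              # P(next token | context): the non-zero part of a row of Q_f

top = torch.topk(row, k=5)
print([(tokenizer.decode(int(i)), float(p)) for i, p in zip(top.indices, top.values)])
```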
  27. Toy example
     Toy example on a "baby" LLM with $V = \{0, 1\}$ and $|\Theta| = 12688$: $T = 2$, $K = 3$, $|\Omega| = 14$.
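A quick check of the state-space formula for this toy setting:

```latex
|\Omega| \;=\; T\,\frac{T^{K}-1}{T-1} \;=\; 2\cdot\frac{2^{3}-1}{2-1} \;=\; 2\cdot 7 \;=\; 14.
```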
  31. Stationary distribution
     ▶ A stationary distribution $\pi$ represents the long-term behavior of a Markov chain.
     ▶ It satisfies $\pi Q_f = \pi$, and each row of $Q_f^n$ tends to $\pi$ as $n \to \infty$.
     ▶ A finite-state unichain has a unique stationary distribution.
     Figure: convergence speed to the stationary distribution.
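A minimal numerical illustration of these two properties on a toy chain (made-up transition matrix; power iteration is just one convenient way to compute $\pi$):

```python
import numpy as np

Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

# Power iteration: pi_{k+1} = pi_k Q converges to the stationary distribution (pi Q = pi).
pi = np.full(len(Q), 1.0 / len(Q))
for _ in range(200):
    pi = pi @ Q

print("pi            :", np.round(pi, 4))
print("rows of Q^200 :", np.round(np.linalg.matrix_power(Q, 200), 4))  # each row is ~ pi
```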
  35. Speed of convergence
     ▶ Convergence speed to the stationary distribution.
     Proposition. For all $n \ge K$, $\;|(Q_f^n)_{i,j} - (e\pi)_{i,j}| \le (1 - 2\varepsilon)^{\lfloor n/K \rfloor - 1}$, where $\varepsilon = \min_{(i,j) \in \mathcal{R}^2} (Q_f^K)_{i,j} > 0$ and $e\pi$ denotes the matrix with every row equal to $\pi$.
     ▶ Impact of the temperature.
     Figure: convergence at temperatures 2, 1 and 0.2.
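A sketch that checks this bound numerically on a toy chain with $K = 1$, so that $\varepsilon$ is simply the smallest entry of $Q$ (toy values; note that a sharper softmax, e.g. from a lower temperature, shrinks $\varepsilon$ and loosens the rate):

```python
import numpy as np

Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
K = 1
eps = np.linalg.matrix_power(Q, K).min()   # epsilon = min entry of Q^K

# Stationary distribution via a long power iteration.
pi = np.full(len(Q), 1.0 / len(Q))
for _ in range(1000):
    pi = pi @ Q

for n in [2, 5, 10, 20]:
    gap = np.abs(np.linalg.matrix_power(Q, n) - pi).max()   # max_ij |(Q^n)_ij - pi_j|
    bound = (1 - 2 * eps) ** (n // K - 1)                   # right-hand side of the proposition
    print(f"n={n:2d}  gap={gap:.2e}  bound={bound:.2e}")
```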
  39. Sample complexity
     Question: how much training data do I need for $Q_f$ to be close to $Q^*$?
     ✓ Number of sequences that an LLM requires such that $Q_f$ is $\epsilon$-close to $Q^*$.
     ✓ Dependency on model parameters.
     Sample complexity. Let $\epsilon > 0$. If $N_{\text{train}} \ge N^* := O(1/\epsilon^2)$, then with high probability, $d(Q^*, Q_f) \le \epsilon$.
  44. Setup
     Question: how far is our matrix $Q_f$ from the reference matrix $Q^*$?
     ▶ For GPT-3, only $5 \times 10^{11}$ training tokens, but $T^{K+1} \approx 10^{9632}$ non-zero elements in $Q_f$.
     ▶ Generalization capacity of the model?
     Generalization problem. Training error: $\hat{R}(\Theta)$. Test error: $R(\Theta)$. The generalization problem consists of bounding the generalization error $G := R(\Theta) - \hat{R}(\Theta)$.
  47. Pre-training generalization bound
     Assumptions:
     ✓ The pre-training data $S = (S_1, \dots, S_{N_{\text{train}}})$ is a sequence of dependent random variables with a Marton coupling matrix $\Gamma$.
     ✓ The dependence on the Transformer's layer norms, embedding dimension, and number of heads is captured through a term $B_{\text{model}}$.
     Theorem (Z. et al., 2024). With high probability, $G_{\text{pre}} \le O\!\left(\|\Gamma\|\, B_{\text{model}} / \sqrt{N_{\text{train}}}\right)$.
  48. Sample Complexity: Numerically?
     Figure: numerical verification with GPT-3. Sample complexity $N^*$ as a function of the approximation tolerance $\epsilon$, to be compared to the real training size $\approx 10^{11}$.
  51. In Context Learning
     In-context learning: the model's ability to learn and adapt to patterns, without updating its internal parameters.
     Figure (from [Transformers as Algorithms, Li et al. 2023]): example input prompts and outputs for natural language processing (e.g. "apple, pomme, pear, poire, cherry" → "cerise"), supervised learning ($y_i = f(x_i) + \text{noise}$), and dynamical systems ($x_{i+1} = f(x_i) + \text{noise}$).
     ▶ Not as costly as model pre-training.
     ▶ Access to a ground-truth matrix $Q^*$.
  54. In Context Learning of Markov chains
     Setup:
     ▶ A $d$-state Markov chain $X = (X_1, \dots, X_{N_{\text{icl}}})$; the sequence of its first $n$ terms is denoted by $S = (S_1, \dots, S_n)$.
     ▶ Mixing time of $S$, denoted $t_{\min}$.
     Theorem (Z. et al., 2024). With high probability,
     $\text{Test error} \;\le\; \text{Distribution shift} \;+\; O\!\left(\sqrt{\frac{t_{\min} \log(d)}{N_{\text{icl}}}}\right).$
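A minimal sketch of this setup: draw a random ground-truth chain, sample an in-context trajectory, and serialize it as a prompt. The variable names and the integer serialization are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_icl = 3, 50

Q_star = rng.dirichlet(np.ones(d), size=d)       # random d-state transition matrix (each row sums to 1)
states = [int(rng.integers(d))]
for _ in range(n_icl - 1):
    states.append(int(rng.choice(d, p=Q_star[states[-1]])))

prompt = " ".join(str(s) for s in states)        # e.g. "2 0 1 1 2 ...", fed to the LLM as context
print(prompt[:60], "...")
print("ground-truth next-state distribution:", Q_star[states[-1]].round(3))
```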
  58. Experiments: In-Context Scaling Laws
     Figure: in-context scaling laws. $R_{\text{icl}}$ as a function of the context length $N_{\text{icl}}$, with 95% confidence intervals, for Llama2 7B, Llama2 13B, Mistral 7B v0.1 and Gemma 2B, against the reference rate $O(N_{\text{icl}}^{-1/2})$.
     ✓ Randomly generated data: not seen during training.
     ✓ The $N_{\text{icl}}$ dependence is in line with the theoretical result.
     ✓ The most recent models stay much closer to the theoretical result.
  60. Experiments: Influence of t_min
     Figure: influence of $t_{\min}$. $R_{\text{icl}}$ as a function of $N_{\text{icl}}$ and of the ratio $N_{\text{icl}}/t_{\min}$, for $t_{\min} \in \{4.6, 15.2, 32.6, 43.7\}$, with 95% confidence intervals, against the reference rate $O(\sqrt{t_{\min}/N_{\text{icl}}})$.
     ✓ The $t_{\min}$ dependence is in line with the theoretical result.
  63. Experiments: Number of states
     Figure: impact of the number of states. Left: random 3-state Markov chain (frequentist baseline vs. Gemma 2B, against $O(N_{\text{icl}}^{-1/2})$). Right: Brownian motion discretized as a 700-state Markov chain.
     ✓ Frequentist bound $O(\sqrt{d/N_{\text{icl}}})$ vs. LLM bound $O(\sqrt{\log(d)/N_{\text{icl}}})$.
     ✓ As $d$ grows, the frequentist method struggles.
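A sketch of the frequentist baseline referred to here: estimate the transition matrix by empirical counts from the in-context trajectory (the Laplace smoothing and the error metric below are illustrative choices, not necessarily those of the paper):

```python
import numpy as np

def frequentist_estimate(states, d):
    """Empirical transition matrix from observed transitions, with Laplace smoothing."""
    counts = np.ones((d, d))
    for x, y in zip(states[:-1], states[1:]):
        counts[x, y] += 1
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
d = 3
Q_star = rng.dirichlet(np.ones(d), size=d)       # ground-truth chain
states = [0]
for _ in range(500):
    states.append(int(rng.choice(d, p=Q_star[states[-1]])))

Q_hat = frequentist_estimate(states, d)
err = 0.5 * np.abs(Q_hat[states[-1]] - Q_star[states[-1]]).sum()   # TV error on the current state
print(f"TV error after {len(states) - 1} in-context transitions: {err:.3f}")
```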
  64. Take home message
  67. Take home message
     ✓ Explicit characterization of the inference mechanism in LLMs through an equivalent finite-state Markov chain.
     ✓ Existence and uniqueness of a stationary distribution. Generalization bounds on the pre-training and in-context learning (ICL) phases.
     ✓ Experiments validate our theory with Llama2 7B & 13B, Gemma 2B, Mistral 7B, and even Llama 3.2.
  68. Thank you for your attention! You can follow me on social networks. My website: www.oussamazekri.fr
  69. Formalization
     ▶ $\Omega = V^*_K$: set of sequences of at most $K$ elements from $V$.
     ▶ Autoregressive LLM $f_\Theta^{T,K}$ $\Longleftrightarrow$ finite Markov chain $M_{T,K}$.
     ▶ $M_{T,K}$ has a sparse transition matrix $Q_f \in \mathbb{R}^{|V^*_K| \times |V^*_K|}$.
     Let $v_i, v_j \in V^*_K$ be two sequences of up to $K$ tokens. We have:
     1. $Q_f(v_i, v_j) = 0$ if $v_j$ is not a completion of $v_i$, i.e., $\exists\, l \in \{1, \dots, |v_i| - 1\}$ such that $(v_i)_{l+1} \ne (v_j)_l$;
     2. $Q_f(v_i, v_j) = \{f_\Theta^{T,K}(v_i)\}_j$ otherwise → this is the probability of predicting $(v_j)_{|v_i|}$ as the next token.
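A sketch of the resulting state space and allowed transitions for the "baby" LLM above ($T = 2$, $K = 3$), assuming the append-then-slide completion structure suggested by the transient/recurrent block picture (states of length $< K$ are extended by one token, states of length $K$ slide their window):

```python
from itertools import product

V, K = [0, 1], 3
# Omega = all sequences of length 1..K over V
omega = [seq for k in range(1, K + 1) for seq in product(V, repeat=k)]
print(len(omega))  # 2 + 4 + 8 = 14 states

def successors(state):
    """States reachable in one step (assumed completion rule)."""
    if len(state) < K:
        return [state + (t,) for t in V]      # append a token: transient class grows toward length K
    return [state[1:] + (t,) for t in V]      # slide the window: recurrent class of length-K states

for s in omega:
    assert all(nxt in omega for nxt in successors(s))   # every completion is again a state
print(successors((0, 1)), successors((0, 1, 1)))
```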
  70. Setup
     Question: how far is our matrix $Q_f$ from the reference matrix $Q^*$?
     ▶ For GPT-3, only $5 \times 10^{11}$ training tokens, but $T^{K+1} \approx 10^{9632}$ non-zero elements in $Q_f$.
     ▶ Generalization capacity of the model?
     Generalization problem:
     $R(\Theta) := \mathbb{E}[\hat{R}(\Theta)], \qquad \hat{R}(\Theta) := \frac{1}{N} \sum_{n=1}^{N} d_{\mathrm{TV}}\big(Q^*(S_n, \cdot),\, Q_f(S_n, \cdot)\big). \quad (1)$
     The generalization problem consists of bounding the difference $R(\Theta) - \hat{R}(\Theta)$.
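A minimal sketch of the empirical risk in (1), with `q_star_rows` and `q_f_rows` as hypothetical arrays holding the rows $Q^*(S_n, \cdot)$ and $Q_f(S_n, \cdot)$ for the $N$ training contexts:

```python
import numpy as np

def empirical_risk(q_star_rows, q_f_rows):
    """(1/N) * sum_n d_TV(Q*(S_n, .), Q_f(S_n, .)), with d_TV(p, q) = 0.5 * ||p - q||_1."""
    tv = 0.5 * np.abs(q_star_rows - q_f_rows).sum(axis=1)
    return tv.mean()

# Toy illustration with N = 2 contexts over a 3-token vocabulary.
q_star_rows = np.array([[0.5, 0.3, 0.2],
                        [0.1, 0.1, 0.8]])
q_f_rows    = np.array([[0.4, 0.4, 0.2],
                        [0.2, 0.1, 0.7]])
print(empirical_risk(q_star_rows, q_f_rows))   # average total-variation error
```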
  71. Pre-training generalization bound
     Assumptions:
     ✓ The pre-training data $S = (S_1, \dots, S_{N_{\text{train}}})$ is a sequence of dependent random variables with a Marton coupling matrix $\Gamma$.
     ✓ Assumption only on the last Transformer layer: bounded unembedding matrix, i.e. $\|W_U^\top\|_{2,1} \le B_U$.
     Theorem. Let $0 < \delta < 1$. Then, with probability at least $1 - \delta$,
     $R_{\text{pre}}(\Theta) \;\le\; \hat{R}_{\text{pre}}(\Theta) + \bar{B}\,\sqrt{\frac{\log(2/\delta)}{N_{\text{train}}}},$
     where $\bar{B} = 2\|\Gamma\|\, \max\{\log(T) + 2B_U/\tau,\ \log(1/c_0)\}^{1/2}$.
  72. Fine-grained bound
     More fine-grained bound with an additional assumption:
     $\mathcal{W} = \big\{\Theta \ \big|\ \forall \ell \in [L]:\ \|W_V^{(\ell)}\|_\infty \le B_V,\ \|W_O^{(\ell)}\|_\infty \le B_O,\ \|W_1^{(\ell)}\|_\infty \le B_1,\ \|W_2^{(\ell)}\|_\infty \le B_2,\ \|W_U^\top\|_{2,1} \le B_U \big\}.$
     Corollary. Let $0 < \delta < 1$. Then, with probability at least $1 - \delta$,
     $R_{\text{pre}}(\Theta) \;\le\; \hat{R}_{\text{pre}}(\Theta) + \bar{B}\,\sqrt{\frac{\log(2/\delta)}{N_{\text{train}}}},$
     where $\bar{B} = 2\|\Gamma\|\, \max\{\log(T) + 2(B_\Theta)^L/\tau,\ \log(1/c_0)\}^{1/2}$ and $B_\Theta = \big[(1 + r m B_1 B_2)(1 + r^3 H B_O B_V)\big](B_{\text{tok}} B_U)^{1/L}$.
  73. Sample complexity
     Question: how much training data do I need for $Q_f$ to be close to $Q^*$?
     ✓ Number of sequences that an LLM requires such that $Q_f$ is $\epsilon$-close to $Q^*$.
     ✓ Dependency on model parameters.
     Corollary. Let $\delta \in [0, 1]$ and let $\epsilon > 0$. If $N_{\text{train}} \ge N^* := \big\lceil \tfrac{4\bar{B}^2}{\epsilon^2} \log\tfrac{2}{\delta} \big\rceil$ and if we assume a perfect pre-training error for $f_\Theta$, then with probability at least $1 - \delta$,
     $\mathbb{E}_{S \sim P_L}\, \|Q^*(S, \cdot) - Q_f(S, \cdot)\|_1 \le \epsilon.$
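A sketch that evaluates this threshold for a few tolerances, with a placeholder value for the model-dependent constant $\bar{B}$ (the true $\bar{B}$ depends on $\|\Gamma\|$, $B_U$, $\tau$ and $c_0$ as above):

```python
from math import ceil, log

def n_star(eps, delta, b_bar):
    """N* = ceil(4 * B_bar**2 / eps**2 * log(2 / delta)) from the corollary."""
    return ceil(4 * b_bar**2 / eps**2 * log(2 / delta))

b_bar = 10.0   # placeholder for the model-dependent constant B_bar
for eps in [1e-3, 2e-3, 5e-3]:
    print(eps, n_star(eps, delta=0.05, b_bar=b_bar))
```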
  74. In Context Learning of Markov chains
     Setup:
     ▶ A $d$-state Markov chain $X = (X_1, \dots, X_{N_{\text{icl}}})$; the sequence of its first $n$ terms is denoted by $S = (S_1, \dots, S_n)$.
     ▶ Mixing time of $S$, denoted $t_{\text{mix}}(\varepsilon)$.
     ▶ Almost-distance $\mathcal{K}(\Theta_1, \Theta_2) := \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{S_n}\big[d_{\mathrm{TV}}\big(P_{\Theta_1}(\cdot \mid S_n),\, P_{\Theta_2}(\cdot \mid S_n)\big)\big]$.
     Theorem. Let $\delta > 0$. Then, with probability at least $1 - \delta$,
     $R_{\text{icl}}(\Theta) \;\le\; \inf_{\vartheta \in \mathcal{W}_{\text{mc}}} \big\{R_{\text{icl}}(\vartheta) + \mathcal{K}(\vartheta, \Theta)\big\} + \bar{B}\,\sqrt{\frac{t_{\min}}{N_{\text{icl}}} \log\frac{2}{\delta}}, \quad (2)$
     where $\bar{B} = 2 \max\{\log(d) + 2B_U/\tau,\ \log(1/p_{\min})\}^{1/2}$.