Slide 1

Slide 1 text

Large Language Models as Markov Chains. Oussama Zekri¹. LC2 Seminar - Imperial College London. ¹ ENS Paris-Saclay and Imperial College London, oussama.zekri@ens-paris-saclay.fr. January 23, 2025

Slide 2

Slide 2 text

Co-authors: Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, and Ievgen Redko, from Huawei Noah’s Ark Lab, Inria, and Imperial College London. O. Zekri – Large Language models as Markov Chains 2

Slide 3

Slide 3 text

Introduction O. Zekri – Large Language models as Markov Chains 3

Slide 4

Slide 4 text

Emerging capabilities of pretrained LLMs O. Zekri – Large Language models as Markov Chains 4

Slide 5

Slide 5 text

Emerging capabilities of pretrained LLMs O. Zekri – Large Language models as Markov Chains 4

Slide 6

Slide 6 text

Emerging capabilities of pretrained LLMs ✓ ✓ ✓ Emerging capabilities ! [Brown et al. (2020), Gruver et al. (2023), Liu et al. (2024), ...] O. Zekri – Large Language models as Markov Chains 4

Slide 7

Slide 7 text

Emerging capabilities of pretrained LLMs ✓ ✓ ✓ Emerging capabilities ! [Brown et al. (2020), Gruver et al. (2023), Liu et al. (2024), ...] × × × Poorly understood theoretically, with many open problems remaining in the literature. O. Zekri – Large Language models as Markov Chains 4

Slide 8

Slide 8 text

Emerging capabilities of pretrained LLMs ✓ ✓ ✓ Emerging capabilities ! [Brown et al. (2020), Gruver et al. (2023), Liu et al. (2024), ...] × × × Poorly understood theoretically, with many open problems remaining in the literature. Goal: Derive theoretical insights. O. Zekri – Large Language models as Markov Chains 4

Slide 9

Slide 9 text

Background on Autoregressive Language Modeling Goal: Predict the next word based on previous ones. O. Zekri – Large Language models as Markov Chains 5

Slide 10

Slide 10 text

Background on Autoregressive Language Modeling Goal: Predict the next word based on previous ones. Autoregressive Property: Each word depends only on past words. I am the Previous words (context) danger Word being predicted O. Zekri – Large Language models as Markov Chains 5

Slide 11

Slide 11 text

Background on Autoregressive Language Modeling Goal: Predict the next word based on previous ones. Autoregressive Property: Each word depends only on past words. I am the Previous words (context) danger Word being predicted Modeling: Probability of a sequence (x1, x2, . . . , xN): P(x1, x2, . . . , xN) = P(x1) P(x2 | x1) · · · P(xN | x1, x2, . . . , xN−1) = ∏_{n=1}^{N} P(xn | x1, x2, . . . , xn−1). O. Zekri – Large Language models as Markov Chains 5
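As a minimal illustration of this factorization (the toy conditional probabilities below are made up for illustration and are not taken from the slides), the probability of a whole sequence is obtained by multiplying the model's next-word conditionals:

import math

def next_word_probs(context):
    # A hypothetical conditional distribution P(. | context) over a tiny vocabulary.
    table = {
        (): {"I": 0.7, "am": 0.1, "the": 0.1, "danger": 0.1},
        ("I",): {"I": 0.05, "am": 0.8, "the": 0.1, "danger": 0.05},
        ("I", "am"): {"I": 0.05, "am": 0.05, "the": 0.6, "danger": 0.3},
        ("I", "am", "the"): {"I": 0.05, "am": 0.05, "the": 0.1, "danger": 0.8},
    }
    return table[tuple(context)]

def sequence_prob(words):
    # P(x1, ..., xN) = prod_n P(xn | x1, ..., x_{n-1})
    p = 1.0
    for n, w in enumerate(words):
        p *= next_word_probs(words[:n])[w]
    return p

print(sequence_prob(["I", "am", "the", "danger"]))  # 0.7 * 0.8 * 0.6 * 0.8 = 0.2688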

Slide 12

Slide 12 text

Background on Autoregressive Models Best models so far: Generative Transformers for Autoregressive Modeling, fΘ^{T,K}. O. Zekri – Large Language models as Markov Chains 6

Slide 13

Slide 13 text

Background on Autoregressive Models Best models so far: Generative Transformers for Autoregressive Modeling, fΘ^{T,K}. Vocabulary size T. Context window K. Parameter set Θ. GPT-3: T = 50257, K = 2048 and |Θ| ∼ 175B. O. Zekri – Large Language models as Markov Chains 6

Slide 14

Slide 14 text

Context Window K Context window K = 7 shown in navy blue. Top: a sequence x1, x2, x3, x4 of length N = 4. Bottom: a sequence x1, x2, . . . , x10 of length N = 10. O. Zekri – Large Language models as Markov Chains 7

Slide 15

Slide 15 text

Background on Markov Chains Ω discrete finite state-space. Ω = {z1, ..., z|Ω| }. Markov Chain: A sequence of random variables where each observation depends only on the previous one. O. Zekri – Large Language models as Markov Chains 8

Slide 16

Slide 16 text

Background on Markov Chains Ω discrete finite state-space. Ω = {z1, ..., z|Ω| }. Markov Chain: A sequence of random variables where each observation depends only on the previous one. Mathematical Formulation: A process (Zn)n≥0 supported on Ω is a Markov chain if P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) This is called the Markov property. O. Zekri – Large Language models as Markov Chains 8

Slide 17

Slide 17 text

Background on Markov Chains Ω discrete finite state-space. Ω = {z1, ..., z|Ω| }. Markov Chain: A sequence of random variables where each observation depends only on the previous one. Mathematical Formulation: A process (Zn)n≥0 supported on Ω is a Markov chain if P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) This is called the Markov property. Transition matrix: Q is a square matrix of size |Ω| defined as ∀x, y ∈ Ω, Q(x, y) = P(Zn+1 = y | Zn = x) O. Zekri – Large Language models as Markov Chains 8
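As a small sketch (a hypothetical 3-state chain, not part of the slides), the transition matrix is all that is needed to simulate the process: the row Q(x, ·) is the distribution of the next state given the current state x.

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 3-state transition matrix; each row sums to 1.
Q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
assert np.allclose(Q.sum(axis=1), 1.0)

def simulate(Q, z0, n_steps):
    # Sample Z_{n+1} ~ Q(Z_n, .) repeatedly: the Markov property in action.
    states = [z0]
    for _ in range(n_steps):
        states.append(rng.choice(len(Q), p=Q[states[-1]]))
    return states

print(simulate(Q, z0=0, n_steps=10))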

Slide 18

Slide 18 text

Are LLMs really Markov chains? State space: Ω = {z1, ..., z|Ω| } Markov Chain: P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) I am the Previous words (context) danger Word being predicted O. Zekri – Large Language models as Markov Chains 9

Slide 19

Slide 19 text

Are LLMs really Markov chains? State space: Ω = {z1, ..., z|Ω| } Markov Chain: P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) I am the Previous words (context) danger Word being predicted × × × P(“danger” | “the”)? O. Zekri – Large Language models as Markov Chains 9

Slide 20

Slide 20 text

Are LLMs really Markov chains? State space: Ω = {z1, ..., z|Ω| } Markov Chain: P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) I am the Previous words (context) danger Word being predicted × × × P(“danger” | “the”)? LLMs are clearly not Markov chains at the token level (|Ω| = T). O. Zekri – Large Language models as Markov Chains 9

Slide 21

Slide 21 text

Are LLMs really Markov chains? State space: Ω = {z1, ..., z|Ω| } Markov Chain: P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) I am the Previous words (context) danger Word being predicted × × × P(“danger” | “the”)? LLMs are clearly not Markov chains at the token level (|Ω| = T). ✓ ✓ ✓ P(“I am the danger” | “I am the”)? O. Zekri – Large Language models as Markov Chains 9

Slide 22

Slide 22 text

Are LLMs really Markov chains? State space: Ω = {z1, ..., z|Ω| } Markov Chain: P(Zn+1 | Zn, Zn−1, . . . , Z0) = P(Zn+1 | Zn) I am the Previous words (context) danger Word being predicted × × × P(“danger” | “the”)? LLMs are clearly not Markov chains at the token level (|Ω| = T). ✓ ✓ ✓ P(“I am the danger” | “I am the”)? Whole sequence as a state... (|Ω| = ?). O. Zekri – Large Language models as Markov Chains 9

Slide 23

Slide 23 text

Large Language Models as Markov Chains O. Zekri – Large Language models as Markov Chains 10

Slide 24

Slide 24 text

Correct State Space Vocabulary space V of size T. ▶ Ω = V*_K is the set of all sequences consisting of elements from V with up to K elements. O. Zekri – Large Language models as Markov Chains 11

Slide 25

Slide 25 text

Correct State Space Vocabulary space V of size T. ▶ Ω = V*_K is the set of all sequences consisting of elements from V with up to K elements. Figure: Zero, Transient class, Recurrent class. O. Zekri – Large Language models as Markov Chains 11

Slide 26

Slide 26 text

Correct State Space Vocabulary space V of size T. ▶ Ω = V*_K is the set of all sequences consisting of elements from V with up to K elements. Figure: Zero, Transient class, Recurrent class. ▶ |Ω| = T (T^K − 1) / (T − 1). This is ≈ 10^9628 for GPT-3. O. Zekri – Large Language models as Markov Chains 11
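A quick sanity check of this count (a sketch that only reproduces the arithmetic; the GPT-3 values T = 50257 and K = 2048 are those given earlier in the deck):

from math import log10

def log10_num_states(T, K):
    # |Omega| = T + T^2 + ... + T^K = T (T^K - 1) / (T - 1).
    # Work in log-space, since T^K overflows floats for GPT-3-sized values.
    # For large T^K, |Omega| ~ T^K * T / (T - 1), hence the approximation below.
    return K * log10(T) + log10(T / (T - 1))

print(log10_num_states(T=50257, K=2048))     # ~9628, i.e. |Omega| ~ 10^9628 for GPT-3
print(sum(2 ** k for k in range(1, 3 + 1)))  # exact count for T=2, K=3: 2 + 4 + 8 = 14 states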

Slide 27

Slide 27 text

Is this Markov Chain point of view useful? × × × |Ω| = T (T^K − 1) / (T − 1) grows exponentially with K. |Ω| ≈ 10^9628 for GPT-3. Qf cannot be stored. O. Zekri – Large Language models as Markov Chains 12

Slide 28

Slide 28 text

Is this Markov Chain point of view useful? × × × |Ω| = T (T^K − 1) / (T − 1) grows exponentially with K. |Ω| ≈ 10^9628 for GPT-3. Qf cannot be stored. ∼ ∼ ∼ Model weights, a few GPUs and a single forward pass are all you need to access the row you want in the matrix! O. Zekri – Large Language models as Markov Chains 12

Slide 29

Slide 29 text

Is this Markov Chain point of view useful? × × × |Ω| = T (T^K − 1) / (T − 1) grows exponentially with K. |Ω| ≈ 10^9628 for GPT-3. Qf cannot be stored. ∼ ∼ ∼ Model weights, a few GPUs and a single forward pass are all you need to access the row you want in the matrix! ✓ ✓ ✓ Connection to the rich theory of finite Markov chains → insight into the dynamics of LLMs. O. Zekri – Large Language models as Markov Chains 12
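A sketch of that last point with a Hugging Face-style causal language model (the choice of GPT-2 and the exact API usage are assumptions for illustration, not part of the slides): a single forward pass on a context returns the next-token distribution, which is exactly the nonzero part of the corresponding row of Qf.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM would do; "gpt2" is just a small publicly available example.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "I am the"
inputs = tok(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, seq_len, vocab_size)

# Softmax over the last position = next-token distribution = the nonzero part
# of the row of Qf indexed by the state "I am the".
row = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(row, k=5)
print([(tok.decode([int(i)]), float(p)) for i, p in zip(top.indices, top.values)])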

Slide 30

Slide 30 text

Toy example Toy example on a "Baby" LLM with V = {0, 1} and |Θ| = 12688. T = 2, K = 3, |Ω| = 14 O. Zekri – Large Language models as Markov Chains 13

Slide 31

Slide 31 text

Stationary distribution ▶ A stationary distribution π represents the long-term behavior of a Markov chain. O. Zekri – Large Language models as Markov Chains 14

Slide 32

Slide 32 text

Stationary distribution ▶ A stationary distribution π represents the long-term behavior of a Markov chain. ▶ It satisfies πQf = π, and each row of Qf^n tends to π as n → ∞. O. Zekri – Large Language models as Markov Chains 14

Slide 33

Slide 33 text

Stationary distribution ▶ A stationary distribution π represents the long-term behavior of a Markov chain. ▶ It satisfies πQf = π, and each row of Qf^n tends to π as n → ∞. ▶ A finite-state unichain has a unique stationary distribution. O. Zekri – Large Language models as Markov Chains 14

Slide 34

Slide 34 text

Stationary distribution ▶ A stationary distribution π represents the long-term behavior of a Markov chain. ▶ It satisfies πQf = π, and each row of Qf^n tends to π as n → ∞. ▶ A finite-state unichain has a unique stationary distribution. Figure: convergence speed to the stationary distribution. O. Zekri – Large Language models as Markov Chains 14
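A small numerical sketch on a toy 3-state chain (not the LLM chain itself): powering up the transition matrix makes every row converge to π, which can then be checked against πQ = π.

import numpy as np

# Hypothetical 3-state transition matrix with strictly positive entries.
Q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Power iteration: every row of Q^n converges to the stationary distribution.
Qn = np.linalg.matrix_power(Q, 50)
pi = Qn[0]
print(pi)                                                 # approximately the stationary distribution
print(np.allclose(pi @ Q, pi, atol=1e-8))                 # pi Q = pi
print(np.allclose(Qn, np.ones((3, 1)) * pi, atol=1e-8))   # all rows of Q^n are ~ pi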

Slide 35

Slide 35 text

Speed of convergence ▶ Convergence speed to the stationary distribution. O. Zekri – Large Language models as Markov Chains 15

Slide 36

Slide 36 text

Speed of convergence ▶ Convergence speed to the stationary distribution. Proposition: For all n ≥ K, |(Qf^n)i,j − (eπ)i,j| ≤ (1 − 2ε)^(⌊n/K⌋−1), where ε = min_{(i,j) ∈ R²} (Qf^K)i,j > 0. O. Zekri – Large Language models as Markov Chains 15

Slide 37

Slide 37 text

Speed of convergence ▶ Convergence speed to the stationary distribution. Proposition: For all n ≥ K, |(Qf^n)i,j − (eπ)i,j| ≤ (1 − 2ε)^(⌊n/K⌋−1), where ε = min_{(i,j) ∈ R²} (Qf^K)i,j > 0. ▶ Impact of the temperature. O. Zekri – Large Language models as Markov Chains 15

Slide 38

Slide 38 text

Speed of convergence ▶ Convergence speed to the stationary distribution. Proposition: For all n ≥ K, |(Qf^n)i,j − (eπ)i,j| ≤ (1 − 2ε)^(⌊n/K⌋−1), where ε = min_{(i,j) ∈ R²} (Qf^K)i,j > 0. ▶ Impact of the temperature. Figure: convergence curves for temperatures 2, 1, and 0.2. O. Zekri – Large Language models as Markov Chains 15
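A numerical sanity check of a bound of this form in the simplest case K = 1, on a toy chain with strictly positive entries (so that ε is simply the minimum entry of the matrix); this only illustrates the geometric decay rate, not the LLM setting itself.

import numpy as np

Q = np.array([[0.6, 0.3, 0.1],   # hypothetical chain, all entries > 0, K = 1
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
eps = Q.min()                               # epsilon = min_{i,j} Q(i, j)
pi = np.linalg.matrix_power(Q, 200)[0]      # numerically exact stationary distribution

for n in (1, 2, 5, 10, 20):
    gap = np.abs(np.linalg.matrix_power(Q, n) - pi).max()   # max_{i,j} |(Q^n)_{i,j} - pi_j|
    bound = (1 - 2 * eps) ** (n - 1)                         # (1 - 2 eps)^(floor(n/K) - 1) with K = 1
    print(n, gap <= bound, round(gap, 5), round(bound, 5))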

Slide 39

Slide 39 text

Generalization bounds on pre-training O. Zekri – Large Language models as Markov Chains 16

Slide 40

Slide 40 text

Sample complexity Question: How much training data do I need for Qf to be close to Q∗? O. Zekri – Large Language models as Markov Chains 17

Slide 41

Slide 41 text

Sample complexity Question: How much training data do I need for Qf to be close to Q∗? ✓ ✓ ✓ Number of sequences that an LLM requires such that Qf is ε-close to Q∗. O. Zekri – Large Language models as Markov Chains 17

Slide 42

Slide 42 text

Sample complexity Question: How much training data do I need for Qf to be close to Q∗? ✓ ✓ ✓ Number of sequences that an LLM requires such that Qf is ε-close to Q∗. ✓ ✓ ✓ Dependency on model parameters. O. Zekri – Large Language models as Markov Chains 17

Slide 43

Slide 43 text

Sample complexity Question: How much training data do I need for Qf to be close to Q∗? ✓ ✓ ✓ Number of sequences that an LLM requires such that Qf is ε-close to Q∗. ✓ ✓ ✓ Dependency on model parameters. Sample complexity: Let ϵ > 0. If Ntrain ≥ N∗ := O(1/ϵ²), then we have with high probability, d(Q∗, Qf) ≤ ϵ. O. Zekri – Large Language models as Markov Chains 17

Slide 44

Slide 44 text

Setup Question: How far is our matrix Qf from the reference matrix Q∗? O. Zekri – Large Language models as Markov Chains 18

Slide 45

Slide 45 text

Setup Question: How far is our matrix Qf from the reference matrix Q∗? ▶ For GPT-3, only 5 × 10^11 training tokens, but T^(K+1) ≈ 10^9632 nonzero elements in Qf. O. Zekri – Large Language models as Markov Chains 18

Slide 46

Slide 46 text

Setup Question: How far is our matrix Qf from the reference matrix Q∗? ▶ For GPT-3, only 5 × 10^11 training tokens, but T^(K+1) ≈ 10^9632 nonzero elements in Qf. ▶ Generalization capacity of the model? O. Zekri – Large Language models as Markov Chains 18

Slide 47

Slide 47 text

Setup Question: How far is our matrix Qf from the reference matrix Q∗? ▶ For GPT-3, only 5 × 10^11 training tokens, but T^(K+1) ≈ 10^9632 nonzero elements in Qf. ▶ Generalization capacity of the model? Generalization problem Training error: R̂(Θ). Test error: R(Θ). O. Zekri – Large Language models as Markov Chains 18

Slide 48

Slide 48 text

Setup Question: How far is our matrix Qf from the reference matrix Q∗? ▶ For GPT-3, only 5 × 10^11 training tokens, but T^(K+1) ≈ 10^9632 nonzero elements in Qf. ▶ Generalization capacity of the model? Generalization problem Training error: R̂(Θ). Test error: R(Θ). The generalization problem consists of bounding the generalization error G := R(Θ) − R̂(Θ). O. Zekri – Large Language models as Markov Chains 18

Slide 49

Slide 49 text

Pre-training generalization bound Assumptions ✓ ✓ ✓ Pre-training data S = (S1, . . . , SNtrain ) a sequence of dependent random variables with a Marton coupling matrix Γ. O. Zekri – Large Language models as Markov Chains 19

Slide 50

Slide 50 text

Pre-training generalization bound Assumptions ✓ ✓ ✓ Pre-training data S = (S1, . . . , SNtrain ) a sequence of dependent random variables with a Marton coupling matrix Γ. ✓ ✓ ✓ Dependence on Transformer’s layer norms, embedding dimension, and number of heads is captured through a term Bmodel. O. Zekri – Large Language models as Markov Chains 19

Slide 51

Slide 51 text

Pre-training generalization bound Assumptions ✓ ✓ ✓ Pre-training data S = (S1, . . . , SNtrain), a sequence of dependent random variables with a Marton coupling matrix Γ. ✓ ✓ ✓ Dependence on the Transformer's layer norms, embedding dimension, and number of heads is captured through a term Bmodel. Theorem (Z. et al., 2024): With high probability, Gpre ≤ O(∥Γ∥ Bmodel / √Ntrain). O. Zekri – Large Language models as Markov Chains 19

Slide 52

Slide 52 text

Sample Complexity: Numerically? Numerical verification with GPT-3. Figure: sample complexity N∗ versus the approximation tolerance (from 10^-5 to 5 × 10^-3), with N∗ ranging from ≈ 10^10 to 10^15. To be compared to the real training size ≈ 10^11. O. Zekri – Large Language models as Markov Chains 20

Slide 53

Slide 53 text

In-Context Learning and experiments O. Zekri – Large Language models as Markov Chains 21

Slide 54

Slide 54 text

In Context Learning In-Context Learning: the model’s ability to learn and adapt to patterns, without updating its internal parameters. Figure (from [Transformers as Algorithms, Li et al. 2023]): examples of in-context learning. Natural language processing: input prompt "apple, pomme, pear, poire, cherry" → output "cerise"; input prompt "Turkey, baklava, France, croissant, Japan" → output "mochi". Supervised learning (yi = f(xi) + noise): input prompt x1, y1, x2, . . . , xn−1, yn−1, xn → output f(xn). Dynamical system (xi+1 = f(xi) + noise): input prompt x1, x2, x3, . . . , xn−2, xn−1, xn → output f(xn). O. Zekri – Large Language models as Markov Chains 22

Slide 55

Slide 55 text

In Context Learning In-Context Learning: the model’s ability to learn and adapt to patterns, without updating its internal parameters. Figure (from [Transformers as Algorithms, Li et al. 2023]): examples of in-context learning. Natural language processing: input prompt "apple, pomme, pear, poire, cherry" → output "cerise"; input prompt "Turkey, baklava, France, croissant, Japan" → output "mochi". Supervised learning (yi = f(xi) + noise): input prompt x1, y1, x2, . . . , xn−1, yn−1, xn → output f(xn). Dynamical system (xi+1 = f(xi) + noise): input prompt x1, x2, x3, . . . , xn−2, xn−1, xn → output f(xn). ▶ Not as costly as model pre-training. O. Zekri – Large Language models as Markov Chains 22

Slide 56

Slide 56 text

In Context Learning In-Context Learning: the model’s ability to learn and adapt to patterns, without updating its internal parameters. Figure (from [Transformers as Algorithms, Li et al. 2023]): examples of in-context learning. Natural language processing: input prompt "apple, pomme, pear, poire, cherry" → output "cerise"; input prompt "Turkey, baklava, France, croissant, Japan" → output "mochi". Supervised learning (yi = f(xi) + noise): input prompt x1, y1, x2, . . . , xn−1, yn−1, xn → output f(xn). Dynamical system (xi+1 = f(xi) + noise): input prompt x1, x2, x3, . . . , xn−2, xn−1, xn → output f(xn). ▶ Not as costly as model pre-training. ▶ Access to a ground truth matrix Q∗. O. Zekri – Large Language models as Markov Chains 22

Slide 57

Slide 57 text

Frequentist method O. Zekri – Large Language models as Markov Chains 23
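The slide itself only shows an illustration; as a sketch of what a frequentist baseline for this task typically looks like (my reading, with optional add-one smoothing), one counts the observed transitions and normalizes each row:

import numpy as np

def frequentist_estimate(trajectory, d, alpha=1.0):
    # Count observed transitions s_t -> s_{t+1} and normalize each row.
    # alpha > 0 adds Laplace smoothing so that unseen transitions keep some mass.
    counts = np.full((d, d), alpha, dtype=float)
    for s, s_next in zip(trajectory[:-1], trajectory[1:]):
        counts[s, s_next] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Example: estimate a 3-state chain from a short observed trajectory.
traj = [0, 1, 1, 2, 0, 1, 2, 2, 0, 0, 1, 2]
print(frequentist_estimate(traj, d=3, alpha=1.0))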

Slide 58

Slide 58 text

LLM ICL-based method O. Zekri – Large Language models as Markov Chains 24
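A sketch of the ICL-based alternative (the Hugging Face-style usage, the choice of GPT-2 and the space-separated prompt format are my assumptions, not the slides'): serialize the observed trajectory into the prompt, run one forward pass, and read the next-token distribution restricted to the d state symbols.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # any causal LM; gpt2 is just small
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def icl_next_state_probs(trajectory, d):
    # Serialize the trajectory as space-separated state symbols "0 1 2 ...".
    prompt = " ".join(str(s) for s in trajectory)
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    # Keep only the logits of the d state tokens and renormalize.
    state_ids = [tok.encode(f" {s}")[0] for s in range(d)]
    return torch.softmax(logits[state_ids], dim=-1)

print(icl_next_state_probs([0, 1, 1, 2, 0, 1, 2, 2, 0, 0, 1], d=3))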

Slide 59

Slide 59 text

In Context Learning of Markov chains Setup ▶ A d-state Markov chain X = (X1, . . . , XNicl), and the sequence of its first n terms is denoted by S = (S1, . . . , Sn). O. Zekri – Large Language models as Markov Chains 25

Slide 60

Slide 60 text

In Context Learning of Markov chains Setup ▶ A d-state Markov chain X = (X1, . . . , XNicl), and the sequence of its first n terms is denoted by S = (S1, . . . , Sn). ▶ Mixing time of S, denoted as tmin. O. Zekri – Large Language models as Markov Chains 25

Slide 61

Slide 61 text

In Context Learning of Markov chains Setup ▶ A d-state Markov chain X = (X1, . . . , XNicl), and the sequence of its first n terms is denoted by S = (S1, . . . , Sn). ▶ Mixing time of S, denoted as tmin. Theorem (Z. et al., 2024): With high probability, Test error ≤ Distribution shift + O(√(tmin log(d) / Nicl)). O. Zekri – Large Language models as Markov Chains 25
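For intuition about the mixing time appearing in the bound, here is a brute-force computation of tmix(ε) on a small toy chain (a sketch, not the procedure used in the experiments): it is the first n such that every row of Q^n is within total-variation distance ε of π.

import numpy as np

def mixing_time(Q, eps=0.25, n_max=10_000):
    # Smallest n with max_x d_TV(Q^n(x, .), pi) <= eps (brute force).
    pi = np.linalg.matrix_power(Q, 10_000)[0]   # numerically exact stationary distribution
    Qn = np.eye(len(Q))
    for n in range(1, n_max + 1):
        Qn = Qn @ Q
        if 0.5 * np.abs(Qn - pi).sum(axis=1).max() <= eps:
            return n
    return None

Q = np.array([[0.9, 0.1, 0.0],   # a slowly mixing hypothetical 3-state chain
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
print(mixing_time(Q, eps=0.25))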

Slide 62

Slide 62 text

Experiments: In-Context Scaling Laws Figure: In-context scaling laws. ICL error Ricl as a function of the context length Nicl (log-log scale) for Llama2 7B, Llama2 13B, Mistral 7B v0.1 and Gemma 2B, with 95% confidence intervals, compared to the reference rate O(Nicl^(−1/2)). O. Zekri – Large Language models as Markov Chains 26

Slide 63

Slide 63 text

Experiments: In-Context Scaling Laws Figure: In-context scaling laws. ICL error Ricl as a function of the context length Nicl (log-log scale) for Llama2 7B, Llama2 13B, Mistral 7B v0.1 and Gemma 2B, with 95% confidence intervals, compared to the reference rate O(Nicl^(−1/2)). ✓ ✓ ✓ Randomly generated data: not seen during training. O. Zekri – Large Language models as Markov Chains 26

Slide 64

Slide 64 text

Experiments: In-Context Scaling Laws Figure: In-context scaling laws. ICL error Ricl as a function of the context length Nicl (log-log scale) for Llama2 7B, Llama2 13B, Mistral 7B v0.1 and Gemma 2B, with 95% confidence intervals, compared to the reference rate O(Nicl^(−1/2)). ✓ ✓ ✓ Randomly generated data: not seen during training. ✓ ✓ ✓ Nicl dependence in line with the theoretical result. O. Zekri – Large Language models as Markov Chains 26

Slide 65

Slide 65 text

Experiments: In-Context Scaling Laws Figure: In-context scaling laws. ICL error Ricl as a function of the context length Nicl (log-log scale) for Llama2 7B, Llama2 13B, Mistral 7B v0.1 and Gemma 2B, with 95% confidence intervals, compared to the reference rate O(Nicl^(−1/2)). ✓ ✓ ✓ Randomly generated data: not seen during training. ✓ ✓ ✓ Nicl dependence in line with the theoretical result. ✓ ✓ ✓ The most recent models stay much closer to the theoretical result. O. Zekri – Large Language models as Markov Chains 26

Slide 66

Slide 66 text

Experiments: Influence of tmin Figure: Influence of tmin. ICL error Ricl as a function of Nicl (left) and of the ratio Nicl/tmin (right) for chains with tmin = 43.7, 32.6, 15.2 and 4.6, with 95% confidence intervals; the left panel marks a small-Nicl regime and a scaling-law regime, and the right panel is compared to the reference rate O(√(tmin/Nicl)). O. Zekri – Large Language models as Markov Chains 27

Slide 67

Slide 67 text

Experiments: Influence of tmin Figure: Influence of tmin. ICL error Ricl as a function of Nicl (left) and of the ratio Nicl/tmin (right) for chains with tmin = 43.7, 32.6, 15.2 and 4.6, with 95% confidence intervals; the left panel marks a small-Nicl regime and a scaling-law regime, and the right panel is compared to the reference rate O(√(tmin/Nicl)). ✓ ✓ ✓ tmin dependence in line with the theoretical result. O. Zekri – Large Language models as Markov Chains 27

Slide 68

Slide 68 text

Experiments: Number of states Figure: Impact of the number of states. ICL error as a function of Nicl for the frequentist method and Gemma 2B, compared to the reference rate O(Nicl^(−1/2)). Left: random 3-state Markov chain. Right: Brownian motion discretized as a 700-state Markov chain. O. Zekri – Large Language models as Markov Chains 28

Slide 69

Slide 69 text

Experiments: Number of states Figure: Impact of the number of states. ICL error as a function of Nicl for the frequentist method and Gemma 2B, compared to the reference rate O(Nicl^(−1/2)). Left: random 3-state Markov chain. Right: Brownian motion discretized as a 700-state Markov chain. ✓ ✓ ✓ Frequentist’s bound O(√(d/Nicl)) vs. LLM’s bound O(√(log(d)/Nicl)). O. Zekri – Large Language models as Markov Chains 28

Slide 70

Slide 70 text

Experiments: Number of states Figure: Impact of the number of states. ICL error as a function of Nicl for the frequentist method and Gemma 2B, compared to the reference rate O(Nicl^(−1/2)). Left: random 3-state Markov chain. Right: Brownian motion discretized as a 700-state Markov chain. ✓ ✓ ✓ Frequentist’s bound O(√(d/Nicl)) vs. LLM’s bound O(√(log(d)/Nicl)). ✓ ✓ ✓ As d grows, the frequentist method struggles. O. Zekri – Large Language models as Markov Chains 28

Slide 71

Slide 71 text

Take home message O. Zekri – Large Language models as Markov Chains 29

Slide 72

Slide 72 text

Take home message ✓ ✓ ✓ Explicit characterization of the inference mechanism in LLMs through an equivalent finite-state Markov chain. O. Zekri – Large Language models as Markov Chains 30

Slide 73

Slide 73 text

Take home message ✓ ✓ ✓ Explicit characterization of the inference mechanism in LLMs through an equivalent finite-state Markov chain. ✓ ✓ ✓ Existence and uniqueness of a stationary distribution. Generalization bounds on pre-training and in-context learning (ICL) phases. O. Zekri – Large Language models as Markov Chains 30

Slide 74

Slide 74 text

Take home message ✓ ✓ ✓ Explicit characterization of the inference mechanism in LLMs through an equivalent finite-state Markov chain. ✓ ✓ ✓ Existence and uniqueness of a stationary distribution. Generalization bounds on pre-training and in-context learning (ICL) phases. ✓ ✓ ✓ Experiments validate our theory with Llama2 7B & 13B, Gemma 2B, Mistral 7B, and even Llama 3.2. O. Zekri – Large Language models as Markov Chains 30

Slide 75

Slide 75 text

Thank you for your attention! You can follow me on social networks! My website: www.oussamazekri.fr O. Zekri – Large Language models as Markov Chains 31

Slide 76

Slide 76 text

Appendix O. Zekri – Large Language models as Markov Chains 1

Slide 77

Slide 77 text

Formalization ▶ Ω = V*_K: set of sequences of at most K elements from V. ▶ Autoregressive LLM fΘ^{T,K} ⇐⇒ finite Markov chain MT,K. ▶ MT,K has a sparse transition matrix Qf ∈ R^(|V*_K| × |V*_K|). Let vi, vj ∈ V*_K be two sequences of up to K tokens. We have: 1. Qf(vi, vj) = 0 if vj is not a completion of vi, i.e., ∃ l ∈ {1, . . . , |vi| − 1} s.t. (vi)_{l+1} ≠ (vj)_l; 2. Qf(vi, vj) = {fΘ^{T,K}(vi)}_j otherwise → this is the probability of predicting (vj)_{|vi|} as the next token. O. Zekri – Large Language models as Markov Chains 2
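A sketch that makes this construction concrete in the toy setting T = 2, K = 3 from the main part (the next-token model below is a made-up stand-in for the "baby" LLM, and the append-then-slide dynamics is one natural reading of the completion rule above): enumerate the 14 states, fill Qf with the two rules, and check that it is row-stochastic and sparse.

import itertools
import numpy as np

T, K = 2, 3
V = list(range(T))

# All sequences of length 1..K over V: the 14 states of the toy example.
states = [seq for length in range(1, K + 1) for seq in itertools.product(V, repeat=length)]
index = {s: i for i, s in enumerate(states)}

def next_token_probs(seq):
    # Stand-in for the "baby" LLM f(seq): any distribution over V works here.
    p1 = (1 + sum(seq)) / (2 + len(seq))   # arbitrary numbers, purely for illustration
    return {0: 1 - p1, 1: p1}

Qf = np.zeros((len(states), len(states)))
for vi in states:
    probs = next_token_probs(vi)
    for t in V:
        # Append while the context is shorter than K, then slide the window.
        vj = vi + (t,) if len(vi) < K else vi[1:] + (t,)
        Qf[index[vi], index[vj]] = probs[t]

print(Qf.shape)                                  # (14, 14)
print(np.allclose(Qf.sum(axis=1), 1.0))          # each row is a probability vector
print(int((Qf > 0).sum()), "nonzero entries")    # at most T per row: Qf is sparse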

Slide 78

Slide 78 text

Setup Question: How far is our matrix Qf from the reference matrix Q∗? ▶ For GPT-3, only 5 × 10^11 training tokens, but T^(K+1) ≈ 10^9632 nonzero elements in Qf. ▶ Generalization capacity of the model? Generalization problem: R(Θ) := E[R̂(Θ)], R̂(Θ) := (1/N) ∑_{n=1}^{N} dTV(Q∗(Sn, ·), Qf(Sn, ·)). (1) The generalization problem consists of bounding the difference R(Θ) − R̂(Θ). O. Zekri – Large Language models as Markov Chains 3

Slide 79

Slide 79 text

Pre-training generalization bound Assumptions ✓ ✓ ✓ Pre-training data S = (S1, . . . , SNtrain), a sequence of dependent random variables with a Marton coupling matrix Γ. ✓ ✓ ✓ Assumption only on the last transformer layer: bounded unembedding matrix, i.e. ∥WU^⊤∥_{2,1} ≤ BU. Theorem: Let 0 < δ < 1; then with probability at least 1 − δ, Rpre(Θ) ≤ R̂pre(Θ) + (B̄ / √Ntrain) √(log(2/δ)), where B̄ = 2∥Γ∥ max{log(T) + 2BU/τ, log(1/c0)}^(1/2). O. Zekri – Large Language models as Markov Chains 4

Slide 80

Slide 80 text

Fine-grained bound More fine-grained bound with an additional assumption: W = {Θ | ∀ℓ ∈ [L], ∥WV^(ℓ)∥∞ ≤ BV, ∥WO^(ℓ)∥∞ ≤ BO, ∥W1^(ℓ)∥∞ ≤ B1, ∥W2^(ℓ)∥∞ ≤ B2, ∥WU^⊤∥_{2,1} ≤ BU}. Corollary: Let 0 < δ < 1; then with probability at least 1 − δ, Rpre(Θ) ≤ R̂pre(Θ) + (B̄ / √Ntrain) √(log(2/δ)), where B̄ = 2∥Γ∥ max{log(T) + 2(BΘ)^L/τ, log(1/c0)}^(1/2), and BΘ = [(1 + rm B1 B2)(1 + r^3 H BO BV)] (Btok BU)^(1/L). O. Zekri – Large Language models as Markov Chains 5

Slide 81

Slide 81 text

Sample complexity Question: How much training data do I need for Qf to be close to Q∗? ✓ ✓ ✓ Number of sequences that an LLM requires such that Qf is ε-close to Q∗. ✓ ✓ ✓ Dependency on model parameters. Corollary: Let δ ∈ [0, 1] and let ϵ > 0. If Ntrain ≥ N∗ := ⌈(4B̄²/ϵ²) log(2/δ)⌉ and if we assume a perfect pre-training error for fΘ, then we have with probability at least 1 − δ, E_{S∼PL} ∥Q∗(S, ·) − Qf(S, ·)∥1 ≤ ϵ. O. Zekri – Large Language models as Markov Chains 6
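A trivial numerical reading of this corollary (the value of B̄ below is a hypothetical placeholder; the slides do not give its magnitude): N∗ grows as 1/ϵ² in the target accuracy.

from math import ceil, log

def n_star(B_bar, eps, delta=0.05):
    # N* = ceil(4 * B_bar^2 / eps^2 * log(2 / delta))
    return ceil(4 * B_bar ** 2 / eps ** 2 * log(2 / delta))

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, n_star(B_bar=10.0, eps=eps))   # B_bar = 10 is a made-up value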

Slide 82

Slide 82 text

In Context Learning of Markov chains Setup ▶ A d-state Markov chain X = (X1, . . . , XNicl), and the sequence of its first n terms is denoted by S = (S1, . . . , Sn). ▶ Mixing time of S, denoted as tmix(ε). ▶ Almost distance K(Θ1, Θ2) := (1/N) ∑_{n=1}^{N} E_{Sn}[dTV(PΘ1(· | Sn), PΘ2(· | Sn))]. Theorem: Let δ > 0. Then, with probability at least 1 − δ, Ricl(Θ) ≤ inf_{ϑ∈Wmc} {Ricl(ϑ) + K(ϑ, Θ)} + B̄ √((tmin/Nicl) log(2/δ)), (2) where B̄ = 2 max{log(d) + 2BU/τ, log(1/pmin)}^(1/2). O. Zekri – Large Language models as Markov Chains 7