

Oussama Zekri
February 04, 2025

Large Language Models as Markov Chains - LC2 Seminar Imperial

To enjoy the animations in the slides, I recommend downloading the file and reading it with an appropriate reader (e.g., Acrobat Reader)!

Paper:

https://arxiv.org/abs/2410.02724

Abstract:

Large language models (LLMs) are remarkably efficient across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the LLMs' generalization capabilities remains elusive. In our paper, we approach this task by drawing an equivalence between autoregressive transformer-based language models and Markov chains defined on a finite state space. This allows us to study the multi-step inference mechanism of LLMs from first principles. We relate the obtained results to the pathological behavior observed with LLMs such as repetitions and incoherent replies with high temperature. Finally, we leverage the proposed formalization to derive pre-training and in-context learning generalization bounds for LLMs under realistic data and model assumptions. Experiments with the most recent Llama and Gemma herds of models show that our theory correctly captures their behavior in practice.


Transcript

  1. Large Language Models as Markov Chains. Oussama Zekri¹, LC2 Seminar, Imperial College London, January 23, 2025.
     ¹ ENS Paris-Saclay and Imperial College London, oussama.zekri@ens-paris-saclay.fr
  2. Co-authors: Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko,
     from Huawei Noah's Ark Lab, Inria, and Imperial College London.
  5. Emerging capabilities of pretrained LLMs
     ✓ Emerging capabilities! [Brown et al. (2020), Gruver et al. (2023), Liu et al. (2024), ...]
     × Poorly understood theoretically, with many open problems remaining in the literature.
     Goal: derive theoretical insights.
  8. Background on Autoregressive Language Modeling
     Goal: predict the next word based on previous ones.
     Autoregressive property: each word depends only on past words.
     Example: "I am the" (previous words, i.e. the context) → "danger" (word being predicted).
     Model: the probability of a sequence $(x_1, x_2, \dots, x_N)$ factorizes as
     $P(x_1, x_2, \dots, x_N) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_N \mid x_1, \dots, x_{N-1}) = \prod_{n=1}^{N} P(x_n \mid x_1, \dots, x_{n-1}).$
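A minimal sketch of this chain-rule factorization, with a hypothetical `next_token_probs` standing in for an LLM's softmax output (toy values only, not the paper's model):

```python
import numpy as np

T = 5  # toy vocabulary size

def next_token_probs(context):
    """Toy stand-in for an LLM head: a deterministic distribution over T tokens given the context."""
    seed = hash(tuple(context)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=T)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sequence_log_prob(tokens):
    """log P(x_1, ..., x_N), accumulated one conditional P(x_n | x_1, ..., x_{n-1}) at a time."""
    log_p = 0.0
    for n in range(len(tokens)):
        p = next_token_probs(tokens[:n])  # distribution over the n-th token given the prefix
        log_p += np.log(p[tokens[n]])
    return log_p

print(sequence_log_prob([0, 2, 2, 4]))
```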
  10. Background on Autoregressive Models
     Best models so far: generative Transformers for autoregressive modeling, $f_\Theta^{T,K}$.
     Vocabulary size $T$, context window $K$, parameter set $\Theta$.
     GPT-3: $T = 50257$, $K = 2048$ and $|\Theta| \sim 175$B.
  11. Context Window K
     Figure: a context window of size $K = 7$ shown in navy blue. Top: a sequence of length $N = 4$ ($x_1, \dots, x_4$). Bottom: a sequence of length $N = 10$ ($x_1, \dots, x_{10}$).
  14. Background on Markov Chains
     $\Omega$ is a discrete finite state space, $\Omega = \{z_1, \dots, z_{|\Omega|}\}$.
     Markov chain: a sequence of random variables where each observation depends only on the previous one.
     Mathematical formulation: a process $(Z_n)_{n \ge 0}$ supported on $\Omega$ is a Markov chain if
     $P(Z_{n+1} \mid Z_n, Z_{n-1}, \dots, Z_0) = P(Z_{n+1} \mid Z_n).$
     This is called the Markov property.
     Transition matrix: $Q$ is a square matrix of size $|\Omega|$ defined by $Q(x, y) = P(Z_{n+1} = y \mid Z_n = x)$ for all $x, y \in \Omega$.
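A minimal illustration of these definitions on a toy 3-state chain (made-up transition probabilities, not taken from the paper):

```python
import numpy as np

# Transition matrix Q with Q[x, y] = P(Z_{n+1} = y | Z_n = x); values are illustrative.
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
assert np.allclose(Q.sum(axis=1), 1.0)  # each row is a probability distribution

def simulate(Q, z0, n_steps, rng):
    """Sample a trajectory; the next state depends only on the current one (Markov property)."""
    states = [z0]
    for _ in range(n_steps):
        states.append(rng.choice(len(Q), p=Q[states[-1]]))
    return states

print(simulate(Q, z0=0, n_steps=10, rng=np.random.default_rng(0)))
```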
  19. Are LLMs really Markov chains?
     State space: $\Omega = \{z_1, \dots, z_{|\Omega|}\}$. Markov chain: $P(Z_{n+1} \mid Z_n, Z_{n-1}, \dots, Z_0) = P(Z_{n+1} \mid Z_n)$.
     Example: "I am the" (previous words, i.e. the context) → "danger" (word being predicted).
     × $P(\text{"danger"} \mid \text{"the"})$? LLMs are clearly not Markov chains at the token level ($|\Omega| = T$).
     ✓ $P(\text{"I am the danger"} \mid \text{"I am the"})$? Take the whole sequence as a state... ($|\Omega| = \,?$).
  20. Large Language Models as Markov Chains
  23. Correct State Space
     Vocabulary space $V$ of size $T$.
     ▶ $\Omega = V^*_K$ is the set of all sequences consisting of elements from $V$ with up to $K$ elements.
     Figure: block structure of the transition matrix, with a zero block, a transient class (sequences of length $< K$) and a recurrent class (sequences of length $K$).
     ▶ $|\Omega| = T\,\dfrac{T^K - 1}{T - 1}$. This is $\approx 10^{9628}$ for GPT-3.
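A quick back-of-the-envelope check of this count for GPT-3-scale values (a sketch in log scale; only the order of magnitude matters):

```python
from math import log10

# |Omega| = T (T^K - 1) / (T - 1), the number of sequences of at most K tokens.
T, K = 50257, 2048  # GPT-3 vocabulary size and context window

# log10 |Omega| ~ (K + 1) * log10(T) - log10(T - 1), using T^K - 1 ~ T^K
log10_size = (K + 1) * log10(T) - log10(T - 1)
print(f"|Omega| ~ 10^{log10_size:.0f}")  # roughly 10^9628
```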
  26. Is this Markov chain point of view useful?
     × $|\Omega| = T\,\dfrac{T^K - 1}{T - 1}$ grows exponentially with $K$; $|\Omega| \approx 10^{9628}$ for GPT-3, so $Q_f$ cannot be stored.
     ∼ Model weights, a few GPUs and a single forward pass are all you need to access the row you want in the matrix!
     ✓ Connection to the rich theory of finite Markov chains → insight into the dynamics of LLMs.
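A minimal sketch of that "one forward pass = one row of $Q_f$" observation, assuming the Hugging Face `transformers` API and using `gpt2` purely as an illustrative stand-in for the larger models discussed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM works; chosen only because it is small
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

context = "I am the"
input_ids = tokenizer(context, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]      # logits for the next token only
row = torch.softmax(logits, dim=-1)              # P(next token | context): the non-zero part of a row of Q_f

top = torch.topk(row, k=5)
print([(tokenizer.decode(int(i)), float(p)) for i, p in zip(top.indices, top.values)])
```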
  27. Toy example
     Toy example on a "baby" LLM with $V = \{0, 1\}$ and $|\Theta| = 12688$: $T = 2$, $K = 3$, $|\Omega| = 14$.
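A quick check of the state-space formula for this toy setting:

```latex
|\Omega| \;=\; T\,\frac{T^{K}-1}{T-1} \;=\; 2\cdot\frac{2^{3}-1}{2-1} \;=\; 2\cdot 7 \;=\; 14.
```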
  31. Stationary distribution
     ▶ A stationary distribution $\pi$ represents the long-term behavior of a Markov chain.
     ▶ It satisfies $\pi Q_f = \pi$, and each row of $Q_f^n$ tends to $\pi$ as $n \to \infty$.
     ▶ A finite-state unichain has a unique stationary distribution.
     Figure: convergence speed to the stationary distribution.
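A minimal numerical illustration of these two properties on a toy chain (made-up transition matrix; power iteration is just one convenient way to compute $\pi$):

```python
import numpy as np

Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

# Power iteration: pi_{k+1} = pi_k Q converges to the stationary distribution (pi Q = pi).
pi = np.full(len(Q), 1.0 / len(Q))
for _ in range(200):
    pi = pi @ Q

print("pi            :", np.round(pi, 4))
print("rows of Q^200 :", np.round(np.linalg.matrix_power(Q, 200), 4))  # each row is ~ pi
```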
  35. Speed of convergence
     ▶ Convergence speed to the stationary distribution.
     Proposition. For all $n \ge K$, $\;|(Q_f^n)_{i,j} - (e\pi)_{i,j}| \le (1 - 2\varepsilon)^{\lfloor n/K \rfloor - 1}$, where $\varepsilon = \min_{(i,j) \in \mathcal{R}^2} (Q_f^K)_{i,j} > 0$ and $e\pi$ denotes the matrix with every row equal to $\pi$.
     ▶ Impact of the temperature.
     Figure: convergence at temperatures 2, 1 and 0.2.
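A sketch that checks this bound numerically on a toy chain with $K = 1$, so that $\varepsilon$ is simply the smallest entry of $Q$ (toy values; note that a sharper softmax, e.g. from a lower temperature, shrinks $\varepsilon$ and loosens the rate):

```python
import numpy as np

Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
K = 1
eps = np.linalg.matrix_power(Q, K).min()   # epsilon = min entry of Q^K

# Stationary distribution via a long power iteration.
pi = np.full(len(Q), 1.0 / len(Q))
for _ in range(1000):
    pi = pi @ Q

for n in [2, 5, 10, 20]:
    gap = np.abs(np.linalg.matrix_power(Q, n) - pi).max()   # max_ij |(Q^n)_ij - pi_j|
    bound = (1 - 2 * eps) ** (n // K - 1)                   # right-hand side of the proposition
    print(f"n={n:2d}  gap={gap:.2e}  bound={bound:.2e}")
```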
  39. Sample complexity
     Question: how much training data do I need for $Q_f$ to be close to $Q^*$?
     ✓ Number of sequences that an LLM requires such that $Q_f$ is $\epsilon$-close to $Q^*$.
     ✓ Dependency on model parameters.
     Sample complexity. Let $\epsilon > 0$. If $N_{\text{train}} \ge N^* := O(1/\epsilon^2)$, then with high probability, $d(Q^*, Q_f) \le \epsilon$.
  44. Setup
     Question: how far is our matrix $Q_f$ from the reference matrix $Q^*$?
     ▶ For GPT-3, only $5 \times 10^{11}$ training tokens, but $T^{K+1} \approx 10^{9632}$ non-zero elements in $Q_f$.
     ▶ Generalization capacity of the model?
     Generalization problem. Training error: $\hat{R}(\Theta)$. Test error: $R(\Theta)$. The generalization problem consists of bounding the generalization error $G := R(\Theta) - \hat{R}(\Theta)$.
  47. Pre-training generalization bound
     Assumptions:
     ✓ The pre-training data $S = (S_1, \dots, S_{N_{\text{train}}})$ is a sequence of dependent random variables with a Marton coupling matrix $\Gamma$.
     ✓ The dependence on the Transformer's layer norms, embedding dimension, and number of heads is captured through a term $B_{\text{model}}$.
     Theorem (Z. et al., 2024). With high probability, $G_{\text{pre}} \le O\!\left(\|\Gamma\|\, B_{\text{model}} / \sqrt{N_{\text{train}}}\right)$.
  48. Sample Complexity: Numerically?
     Figure: numerical verification with GPT-3. Sample complexity $N^*$ as a function of the approximation tolerance $\epsilon$, to be compared to the real training size $\approx 10^{11}$.
  51. In Context Learning
     In-context learning: the model's ability to learn and adapt to patterns, without updating its internal parameters.
     Figure (from [Transformers as Algorithms, Li et al. 2023]): example input prompts and outputs for natural language processing (e.g. "apple, pomme, pear, poire, cherry" → "cerise"), supervised learning ($y_i = f(x_i) + \text{noise}$), and dynamical systems ($x_{i+1} = f(x_i) + \text{noise}$).
     ▶ Not as costly as model pre-training.
     ▶ Access to a ground-truth matrix $Q^*$.
  54. In Context Learning of Markov chains
     Setup:
     ▶ A $d$-state Markov chain $X = (X_1, \dots, X_{N_{\text{icl}}})$; the sequence of its first $n$ terms is denoted by $S = (S_1, \dots, S_n)$.
     ▶ Mixing time of $S$, denoted $t_{\min}$.
     Theorem (Z. et al., 2024). With high probability,
     $\text{Test error} \;\le\; \text{Distribution shift} \;+\; O\!\left(\sqrt{\frac{t_{\min} \log(d)}{N_{\text{icl}}}}\right).$
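A minimal sketch of this setup: draw a random ground-truth chain, sample an in-context trajectory, and serialize it as a prompt. The variable names and the integer serialization are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_icl = 3, 50

Q_star = rng.dirichlet(np.ones(d), size=d)       # random d-state transition matrix (each row sums to 1)
states = [int(rng.integers(d))]
for _ in range(n_icl - 1):
    states.append(int(rng.choice(d, p=Q_star[states[-1]])))

prompt = " ".join(str(s) for s in states)        # e.g. "2 0 1 1 2 ...", fed to the LLM as context
print(prompt[:60], "...")
print("ground-truth next-state distribution:", Q_star[states[-1]].round(3))
```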
  58. Experiments: In-Context Scaling Laws
     Figure: in-context scaling laws. $R_{\text{icl}}$ as a function of the context length $N_{\text{icl}}$, with 95% confidence intervals, for Llama2 7B, Llama2 13B, Mistral 7B v0.1 and Gemma 2B, against the reference rate $O(N_{\text{icl}}^{-1/2})$.
     ✓ Randomly generated data: not seen during training.
     ✓ The $N_{\text{icl}}$ dependence is in line with the theoretical result.
     ✓ The most recent models stay much closer to the theoretical result.
  60. Experiments: Influence of t_min
     Figure: influence of $t_{\min}$. $R_{\text{icl}}$ as a function of $N_{\text{icl}}$ and of the ratio $N_{\text{icl}}/t_{\min}$, for $t_{\min} \in \{4.6, 15.2, 32.6, 43.7\}$, with 95% confidence intervals, against the reference rate $O(\sqrt{t_{\min}/N_{\text{icl}}})$.
     ✓ The $t_{\min}$ dependence is in line with the theoretical result.
  63. Experiments: Number of states
     Figure: impact of the number of states. Left: random 3-state Markov chain (frequentist baseline vs. Gemma 2B, against $O(N_{\text{icl}}^{-1/2})$). Right: Brownian motion discretized as a 700-state Markov chain.
     ✓ Frequentist bound $O(\sqrt{d/N_{\text{icl}}})$ vs. LLM bound $O(\sqrt{\log(d)/N_{\text{icl}}})$.
     ✓ As $d$ grows, the frequentist method struggles.
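A sketch of the frequentist baseline referred to here: estimate the transition matrix by empirical counts from the in-context trajectory (the Laplace smoothing and the error metric below are illustrative choices, not necessarily those of the paper):

```python
import numpy as np

def frequentist_estimate(states, d):
    """Empirical transition matrix from observed transitions, with Laplace smoothing."""
    counts = np.ones((d, d))
    for x, y in zip(states[:-1], states[1:]):
        counts[x, y] += 1
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
d = 3
Q_star = rng.dirichlet(np.ones(d), size=d)       # ground-truth chain
states = [0]
for _ in range(500):
    states.append(int(rng.choice(d, p=Q_star[states[-1]])))

Q_hat = frequentist_estimate(states, d)
err = 0.5 * np.abs(Q_hat[states[-1]] - Q_star[states[-1]]).sum()   # TV error on the current state
print(f"TV error after {len(states) - 1} in-context transitions: {err:.3f}")
```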
  64. Take home message
  67. Take home message
     ✓ Explicit characterization of the inference mechanism in LLMs through an equivalent finite-state Markov chain.
     ✓ Existence and uniqueness of a stationary distribution. Generalization bounds on the pre-training and in-context learning (ICL) phases.
     ✓ Experiments validate our theory with Llama2 7B & 13B, Gemma 2B, Mistral 7B, and even Llama 3.2.
  68. Thank you for your attention! You can follow me on social networks. My website: www.oussamazekri.fr
  69. Formalization
     ▶ $\Omega = V^*_K$: set of sequences of at most $K$ elements from $V$.
     ▶ Autoregressive LLM $f_\Theta^{T,K}$ $\Longleftrightarrow$ finite Markov chain $M_{T,K}$.
     ▶ $M_{T,K}$ has a sparse transition matrix $Q_f \in \mathbb{R}^{|V^*_K| \times |V^*_K|}$.
     Let $v_i, v_j \in V^*_K$ be two sequences of up to $K$ tokens. We have:
     1. $Q_f(v_i, v_j) = 0$ if $v_j$ is not a completion of $v_i$, i.e., $\exists\, l \in \{1, \dots, |v_i| - 1\}$ such that $(v_i)_{l+1} \ne (v_j)_l$;
     2. $Q_f(v_i, v_j) = \{f_\Theta^{T,K}(v_i)\}_j$ otherwise → this is the probability of predicting $(v_j)_{|v_i|}$ as the next token.
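A sketch of the resulting state space and allowed transitions for the "baby" LLM above ($T = 2$, $K = 3$), assuming the append-then-slide completion structure suggested by the transient/recurrent block picture (states of length $< K$ are extended by one token, states of length $K$ slide their window):

```python
from itertools import product

V, K = [0, 1], 3
# Omega = all sequences of length 1..K over V
omega = [seq for k in range(1, K + 1) for seq in product(V, repeat=k)]
print(len(omega))  # 2 + 4 + 8 = 14 states

def successors(state):
    """States reachable in one step (assumed completion rule)."""
    if len(state) < K:
        return [state + (t,) for t in V]      # append a token: transient class grows toward length K
    return [state[1:] + (t,) for t in V]      # slide the window: recurrent class of length-K states

for s in omega:
    assert all(nxt in omega for nxt in successors(s))   # every completion is again a state
print(successors((0, 1)), successors((0, 1, 1)))
```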
  70. Setup
     Question: how far is our matrix $Q_f$ from the reference matrix $Q^*$?
     ▶ For GPT-3, only $5 \times 10^{11}$ training tokens, but $T^{K+1} \approx 10^{9632}$ non-zero elements in $Q_f$.
     ▶ Generalization capacity of the model?
     Generalization problem:
     $R(\Theta) := \mathbb{E}[\hat{R}(\Theta)], \qquad \hat{R}(\Theta) := \frac{1}{N} \sum_{n=1}^{N} d_{\mathrm{TV}}\big(Q^*(S_n, \cdot),\, Q_f(S_n, \cdot)\big). \quad (1)$
     The generalization problem consists of bounding the difference $R(\Theta) - \hat{R}(\Theta)$.
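A minimal sketch of the empirical risk in (1), with `q_star_rows` and `q_f_rows` as hypothetical arrays holding the rows $Q^*(S_n, \cdot)$ and $Q_f(S_n, \cdot)$ for the $N$ training contexts:

```python
import numpy as np

def empirical_risk(q_star_rows, q_f_rows):
    """(1/N) * sum_n d_TV(Q*(S_n, .), Q_f(S_n, .)), with d_TV(p, q) = 0.5 * ||p - q||_1."""
    tv = 0.5 * np.abs(q_star_rows - q_f_rows).sum(axis=1)
    return tv.mean()

# Toy illustration with N = 2 contexts over a 3-token vocabulary.
q_star_rows = np.array([[0.5, 0.3, 0.2],
                        [0.1, 0.1, 0.8]])
q_f_rows    = np.array([[0.4, 0.4, 0.2],
                        [0.2, 0.1, 0.7]])
print(empirical_risk(q_star_rows, q_f_rows))   # average total-variation error
```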
  71. Pre-training generalization bound
     Assumptions:
     ✓ The pre-training data $S = (S_1, \dots, S_{N_{\text{train}}})$ is a sequence of dependent random variables with a Marton coupling matrix $\Gamma$.
     ✓ Assumption only on the last Transformer layer: bounded unembedding matrix, i.e. $\|W_U^\top\|_{2,1} \le B_U$.
     Theorem. Let $0 < \delta < 1$. Then, with probability at least $1 - \delta$,
     $R_{\text{pre}}(\Theta) \;\le\; \hat{R}_{\text{pre}}(\Theta) + \bar{B}\,\sqrt{\frac{\log(2/\delta)}{N_{\text{train}}}},$
     where $\bar{B} = 2\|\Gamma\|\, \max\{\log(T) + 2B_U/\tau,\ \log(1/c_0)\}^{1/2}$.
  72. Fine-grained bound
     More fine-grained bound with an additional assumption:
     $\mathcal{W} = \big\{\Theta \ \big|\ \forall \ell \in [L]:\ \|W_V^{(\ell)}\|_\infty \le B_V,\ \|W_O^{(\ell)}\|_\infty \le B_O,\ \|W_1^{(\ell)}\|_\infty \le B_1,\ \|W_2^{(\ell)}\|_\infty \le B_2,\ \|W_U^\top\|_{2,1} \le B_U \big\}.$
     Corollary. Let $0 < \delta < 1$. Then, with probability at least $1 - \delta$,
     $R_{\text{pre}}(\Theta) \;\le\; \hat{R}_{\text{pre}}(\Theta) + \bar{B}\,\sqrt{\frac{\log(2/\delta)}{N_{\text{train}}}},$
     where $\bar{B} = 2\|\Gamma\|\, \max\{\log(T) + 2(B_\Theta)^L/\tau,\ \log(1/c_0)\}^{1/2}$ and $B_\Theta = \big[(1 + r m B_1 B_2)(1 + r^3 H B_O B_V)\big](B_{\text{tok}} B_U)^{1/L}$.
  73. Sample complexity
     Question: how much training data do I need for $Q_f$ to be close to $Q^*$?
     ✓ Number of sequences that an LLM requires such that $Q_f$ is $\epsilon$-close to $Q^*$.
     ✓ Dependency on model parameters.
     Corollary. Let $\delta \in [0, 1]$ and let $\epsilon > 0$. If $N_{\text{train}} \ge N^* := \big\lceil \tfrac{4\bar{B}^2}{\epsilon^2} \log\tfrac{2}{\delta} \big\rceil$ and if we assume a perfect pre-training error for $f_\Theta$, then with probability at least $1 - \delta$,
     $\mathbb{E}_{S \sim P_L}\, \|Q^*(S, \cdot) - Q_f(S, \cdot)\|_1 \le \epsilon.$
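A sketch that evaluates this threshold for a few tolerances, with a placeholder value for the model-dependent constant $\bar{B}$ (the true $\bar{B}$ depends on $\|\Gamma\|$, $B_U$, $\tau$ and $c_0$ as above):

```python
from math import ceil, log

def n_star(eps, delta, b_bar):
    """N* = ceil(4 * B_bar**2 / eps**2 * log(2 / delta)) from the corollary."""
    return ceil(4 * b_bar**2 / eps**2 * log(2 / delta))

b_bar = 10.0   # placeholder for the model-dependent constant B_bar
for eps in [1e-3, 2e-3, 5e-3]:
    print(eps, n_star(eps, delta=0.05, b_bar=b_bar))
```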
  74. In Context Learning of Markov chains
     Setup:
     ▶ A $d$-state Markov chain $X = (X_1, \dots, X_{N_{\text{icl}}})$; the sequence of its first $n$ terms is denoted by $S = (S_1, \dots, S_n)$.
     ▶ Mixing time of $S$, denoted $t_{\text{mix}}(\varepsilon)$.
     ▶ Almost-distance $\mathcal{K}(\Theta_1, \Theta_2) := \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{S_n}\big[d_{\mathrm{TV}}\big(P_{\Theta_1}(\cdot \mid S_n),\, P_{\Theta_2}(\cdot \mid S_n)\big)\big]$.
     Theorem. Let $\delta > 0$. Then, with probability at least $1 - \delta$,
     $R_{\text{icl}}(\Theta) \;\le\; \inf_{\vartheta \in \mathcal{W}_{\text{mc}}} \big\{R_{\text{icl}}(\vartheta) + \mathcal{K}(\vartheta, \Theta)\big\} + \bar{B}\,\sqrt{\frac{t_{\min}}{N_{\text{icl}}} \log\frac{2}{\delta}}, \quad (2)$
     where $\bar{B} = 2 \max\{\log(d) + 2B_U/\tau,\ \log(1/p_{\min})\}^{1/2}$.