Online Structured Prediction with Fenchel–Young Losses and Improved Surrogate Regret for Online Multiclass Classification with Logistic Loss

Shinsaku Sakaue

July 23, 2024

Transcript

  1. Online Structured Prediction with Fenchel–Young Losses and Improved Surrogate Regret
    for Online Multiclass Classification with Logistic Loss. Shinsaku Sakaue¹, Han Bao², Taira Tsuchiya¹, Taihei Oki¹ (¹The University of Tokyo, ²Kyoto University). AFSA-MI-CS Joint Seminar @ UTokyo.
  2. Self-introduction
    Education and employment history:
    - Apr. 2014 – Mar. 2016: The University of Tokyo (Master; supervised by Prof. Takeda)
    - Apr. 2016 – now: NTT Communication Science Laboratories
    - Oct. 2018 – Mar. 2020: Kyoto University (Doctor; supervised by Prof. Minato)
    - Apr. 2020 – now: The University of Tokyo (ERATO Assist. Prof. from NTT)
    Research interests:
    - Optimization (discrete and continuous)
    - Data structures (BDD, ZDD, etc.)
    - Learning theory (online learning, sample complexity)
  3. Overview
    Online structured prediction with a surrogate loss. Target loss $L: \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$; surrogate loss $S: \mathbb{R}^d\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$.
    Protocol: the adversary 😈 selects $(x_t, y_t)\in\mathcal{X}\times\mathcal{Y}$ without knowledge of $\hat{y}_t$ and reveals $x_t$; the learner 🤔 plays $\hat{y}_t$; the adversary reveals $y_t$.
    Main result: a finite surrogate regret bound,
    $\sum_{t=1}^{T}\mathbb{E}[L_t(\hat{y}_t)] = \sum_{t=1}^{T} S_t(U) + R_T$,
    where the left-hand side is the learner's target loss, the first term on the right is the best possible for the learner, and $R_T$ is the surrogate regret (cf. the Perceptron's finite mistake bound).
  4. Contents
    Introduction to structured prediction
    - Surrogate loss framework
    - Fenchel–Young loss
    Online structured prediction
    - Surrogate regret
    - Existing results on online multiclass classification
    Our results
    - Randomized decoding
    - Exploiting the surrogate gap
  5. 🤔 Which shop should I go to for lunch today, 🍜 or 🍣? The chosen shop is open 😋.
  6. 🤔 Which shop should I go to for lunch today, 🍜 or 🍣? The chosen shop is closed ✖ 😭.
  7. $\mathcal{X} = \{\text{weekday}, \text{holiday}\} = \{(1,0)^\top, (0,1)^\top\}$, $\mathcal{Y} = \{🍜, 🍣\} = \{(1,0)^\top, (0,1)^\top\}$.
  8. Same setting as above. Prediction model $h:\mathcal{X}\to\mathcal{Y}$; target loss $L(\hat{y}; y) = \mathbb{1}(\hat{y}\neq y)$.
  9. Same setting as above. Observe $x = (1,0)^\top$ (weekday).
  10. Same setting as above. Observe $x = (1,0)^\top$; predict $\hat{y} = (1,0)^\top$ (🍜).
  11. Same setting as above. Observe $x = (1,0)^\top$; predict $\hat{y} = (1,0)^\top$ (🍜). 🍣 is closed ✖, so this prediction incurs no loss 😋.
  12. Same setting as above. Observe $x = (1,0)^\top$; predict $\hat{y} = (0,1)^\top$ (🍣).
  13. Same setting as above. Observe $x = (1,0)^\top$; predict $\hat{y} = (0,1)^\top$ (🍣). 🍣 is closed ✖, so this prediction incurs a loss of 1 😭.
  14. Structured Prediction (supervised learning with structured outputs)
    $\mathcal{X}$: input vector space ($\mathcal{X}$ contained in the unit ball $\mathbb{B}(1)$ for simplicity). $\mathcal{Y}$: finite set of structured outputs. $L:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$: target loss.
    Goal: learn $h:\mathcal{X}\to\mathcal{Y}$ to minimize $\mathbb{E}_{x,y}[L(h(x), y)]$.
  15. Structured Prediction (supervised learning with structured outputs)
    $\mathcal{X}$: input vector space ($\mathcal{X}\subseteq\mathbb{B}(1)$ for simplicity). $\mathcal{Y}$: finite set of structured outputs. $L:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$: target loss. Goal: learn $h:\mathcal{X}\to\mathcal{Y}$ to minimize $\mathbb{E}_{x,y}[L(h(x), y)]$.
    Examples:
    - Multiclass classification: $\mathcal{Y} = \{🍜, 🍣, 🍖\}$, $L(\hat{y}; y) = \mathbb{1}(\hat{y}\neq y)$.
    - Multilabel classification: $\mathcal{Y} = \{\emptyset, \{🍜\}, \dots, \{🍜, 🍣, 🍖\}\}$, $L(\hat{y}; y) = \sum_i \mathbb{1}(\hat{y}_i\neq y_i)$.
    - Matching: $\mathcal{Y} = \{\text{all bipartite matchings}\}$, $L(\hat{y}; y) = \#\text{mismatches}$. (Figure: two bipartite matchings $\hat{y}$ and $y$ of three members to Labs 1–3, with $L(\hat{y}; y) = 2$.)
  16. Assumptions
    1. $\mathcal{Y}$ is embedded in $\mathbb{R}^d$.
    (Figures: the probability simplex for multiclass classification, the unit hypercube for multilabel classification, and the Birkhoff polytope for matching.)
  17. Assumptions
    1. $\mathcal{Y}$ is embedded in $\mathbb{R}^d$.
    2. $L(\hat{y}; y)$ is extended as an affine function of $\hat{y}\in\mathrm{conv}(\mathcal{Y})$.
    The 0-1 loss $\mathbb{1}(\hat{y}\neq e_i)$ can be extended as an affine function of $\hat{y}\in\triangle^d$:
    $L(\hat{y}; e_i) = \hat{y}^\top(\mathbf{1}\mathbf{1}^\top - I)\,e_i = \sum_{j\neq i}\hat{y}_j = 1 - \hat{y}_i$.
    The (generalized) structure encoding loss function (SELF), including the 0-1 and Hamming losses, is affine: $L(\hat{y}; y) = \langle\hat{y}, Ay + b\rangle + c(y)$. A small numerical check follows below.
    Blondel. Structured prediction with projection oracles. NeurIPS, 2019. Ciliberto et al. A general framework for consistent structured prediction with implicit loss embeddings. JMLR, 2020.
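A minimal sketch, assuming NumPy, that checks the affine extension of the 0-1 loss displayed above on the probability simplex; the variable names are illustrative, not from the talk.

```python
import numpy as np

# Check L(y_hat; e_i) = y_hat^T (11^T - I) e_i = sum_{j != i} y_hat_j = 1 - y_hat_i
# for points y_hat in conv(Y) = probability simplex.
d = 4
A = np.ones((d, d)) - np.eye(d)          # zero diagonal, ones off the diagonal

rng = np.random.default_rng(0)
for _ in range(100):
    y_hat = rng.dirichlet(np.ones(d))    # a point in the simplex
    i = int(rng.integers(d))
    e_i = np.eye(d)[i]
    assert np.isclose(y_hat @ A @ e_i, 1.0 - y_hat[i])

# On the vertices, the extension recovers the 0-1 loss exactly.
assert all(np.isclose(np.eye(d)[j] @ A @ np.eye(d)[i], float(j != i))
           for i in range(d) for j in range(d))
```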
  18. $\mathcal{X} = \{(1,0)^\top, (0,1)^\top\}$ (weekday, holiday), $\mathcal{Y} = \{🍜, 🍣\} = \{(1,0)^\top, (0,1)^\top\}$.
    Learning $h:\mathcal{X}\to\mathcal{Y}$ for a discrete $\mathcal{Y}$ is not easy… Try logistic regression! (Figure: a weight table with columns weekday/holiday and rows 🍜/🍣.)
  19. Same setting as above. Linear estimator $W:\mathcal{X}\to\mathbb{R}^d$, logistic loss $S:\mathbb{R}^d\times\mathcal{Y}\to\mathbb{R}$, decoding function $\psi:\mathbb{R}^d\to\mathcal{Y}$.
  20. Same setting as above. Observe $x = (1,0)^\top$; estimate the score $\theta = Wx$; predict $\hat{y} = \psi(\theta)$. (Figure: score coordinates $\theta_1, \theta_2$.)
  21. Same setting as above. Observe $x = (1,0)^\top$; estimate $\theta = Wx$; predict $\hat{y} = \psi(\theta)$. The prediction incurs no loss 😋 (✖ marks the closed shop, which was avoided).
  22. Same setting as above. After predicting $\hat{y} = \psi(\theta)$, observe $y = (1,0)^\top$ and update $W$ based on $\nabla S(\theta; y)$ to encourage the current trend.
  23. Same setting as above. Observe $x = (0,1)^\top$ (holiday); estimate the score $\theta = Wx$; predict $\hat{y} = \psi(\theta)$.
  24. Same setting as above. Observe $x = (0,1)^\top$; estimate $\theta = Wx$; predict $\hat{y} = \psi(\theta)$. This time the prediction incurs a loss 😭 (the closed shop ✖ was chosen).
  25. Same setting as above. After predicting $\hat{y} = \psi(\theta)$, observe $y = (1,0)^\top$ and update $W$ based on $\nabla S(\theta; y)$ to discourage the current trend.
  26. Surrogate Loss Framework
    $S(\theta; y)$ measures how well $\theta = Wx$ is aligned with $y$. (Diagram: $\mathcal{X}\xrightarrow{W}\mathbb{R}^d\xrightarrow{\psi}\mathcal{Y}$, with $x\mapsto\theta\mapsto\hat{y}$ compared against $y$.)
    Define the score space $\mathbb{R}^d$ between $\mathcal{X}$ and $\mathcal{Y}$. Focus on a linear estimator $W:\mathcal{X}\ni x\mapsto\theta = Wx\in\mathbb{R}^d$. Learn $W$ by minimizing a surrogate loss $S:\mathbb{R}^d\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$.
    - Logistic loss: $S(\theta; e_i) = -\log_2\frac{\exp\theta_i}{\sum_{j=1}^{d}\exp\theta_j}$.
    The decoding function $\psi$ maps a score $\theta\in\mathbb{R}^d$ to a prediction $\hat{y}\in\mathcal{Y}$. (A code sketch of this pipeline follows below.)
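To make the framework concrete, here is a minimal sketch of one prediction step for multiclass classification, assuming NumPy; the function names (`score`, `logistic_loss`, `argmax_decode`) are illustrative, not from the talk.

```python
import numpy as np

def score(W, x):
    """Linear estimator: theta = W x in the score space R^d."""
    return W @ x

def logistic_loss(theta, i):
    """S(theta; e_i) = -log2( exp(theta_i) / sum_j exp(theta_j) )."""
    z = theta - theta.max()                       # stabilized log-sum-exp
    return (np.log(np.exp(z).sum()) - z[i]) / np.log(2)

def argmax_decode(theta):
    """Deterministic decoding: return the class with the highest score."""
    return int(np.argmax(theta))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))                   # d = 3 classes, 5 features
x = rng.standard_normal(5)
theta = score(W, x)
print(argmax_decode(theta), logistic_loss(theta, i=0))
```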
  27. Definition: Fenchel–Young Loss (Blondel et al. 2020)
    Let $\Omega:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ be a convex regularizer with $\mathcal{Y}\subseteq\mathrm{dom}(\Omega)$. The Fenchel–Young loss $S_\Omega:\mathbb{R}^d\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$ generated by $\Omega$ is defined as
    $S_\Omega(\theta; y) := \Omega(y) + \Omega^*(\theta) - \langle\theta, y\rangle$,
    where $\Omega^*(\theta) := \sup_{y\in\mathbb{R}^d}\{\langle\theta, y\rangle - \Omega(y)\}$ is the convex conjugate.
    cf. the Fenchel–Young inequality: $\Omega(y) + \Omega^*(\theta) \ge \langle\theta, y\rangle$ for all $(y, \theta)\in\mathrm{dom}(\Omega)\times\mathrm{dom}(\Omega^*)$.
    The Fenchel–Young loss quantifies the discrepancy between $\theta$ and $y$ as the gap in the Fenchel–Young inequality.
    Blondel et al. Learning with Fenchel–Young losses. JMLR, 2020.
  28. Definition: Regularized Prediction
    Let $\Omega:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ be a convex regularizer with $\mathcal{Y}\subseteq\mathrm{dom}(\Omega)$. The regularized prediction is defined as
    $\hat{y}_\Omega(\theta) \in \mathrm{argmax}\{\langle\theta, y\rangle - \Omega(y) : y\in\mathrm{dom}(\Omega)\}$.
    (Figure: $\theta$ mapped into $\mathrm{conv}(\mathcal{Y})$, giving $\hat{y}_\Omega(\theta)$.)
    - If $\Omega(y) = \sum_{i=1}^{d} y_i\ln y_i + \mathbb{I}_{\triangle^d}(y)$ (negative Shannon entropy), then $\hat{y}_\Omega(\theta)_i = \frac{\exp\theta_i}{\sum_{j=1}^{d}\exp\theta_j}$ (softmax).
    - If $\Omega(y) = \frac{1}{2}\|y\|_2^2 + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}(y)$ (squared $\ell_2$), then $\hat{y}_\Omega(\theta) = \mathrm{argmin}\{\|y - \theta\|_2^2 : y\in\mathrm{conv}(\mathcal{Y})\}$.
  29. Logistic Loss is a Fenchel–Young Loss
    If $\Omega(y) = \sum_{i=1}^{d} y_i\ln y_i + \mathbb{I}_{\triangle^d}(y)$, then, writing $p_j = \frac{\exp\theta_j}{\sum_k\exp\theta_k}$,
    $S_\Omega(\theta; e_i) = \Omega(e_i) + \Omega^*(\theta) - \langle\theta, e_i\rangle$
    $= 0 + \max\{\langle\theta, y\rangle - \textstyle\sum_{j=1}^{d} y_j\ln y_j : y\in\triangle^d\} - \theta_i$
    $= \textstyle\sum_j \theta_j p_j - \sum_j p_j\ln p_j - \theta_i$
    $= \textstyle\sum_j \theta_j p_j - \sum_j p_j(\theta_j - \ln\sum_k\exp\theta_k) - \theta_i$
    $= \ln\textstyle\sum_j\exp\theta_j - \theta_i$
    $= -\ln\frac{\exp\theta_i}{\sum_j\exp\theta_j} = \frac{1}{\log_2 e}\left(-\log_2\frac{\exp\theta_i}{\sum_j\exp\theta_j}\right)$: the logistic loss (up to the factor $\ln 2$). A numerical check follows below.
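A quick numerical check of the derivation above, assuming NumPy: with $\Omega(y) = \sum_i y_i\ln y_i$ on the simplex, $\Omega^*(\theta) = \ln\sum_j\exp\theta_j$, so the FY loss at $(\theta, e_i)$ should equal the natural-log logistic loss $-\ln\mathrm{softmax}_i(\theta)$.

```python
import numpy as np

def conjugate(theta):
    """Omega*(theta) = log-sum-exp(theta) for the negative-entropy regularizer."""
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum())

rng = np.random.default_rng(0)
d = 5
theta = rng.standard_normal(d)
for i in range(d):
    fy = 0.0 + conjugate(theta) - theta[i]            # Omega(e_i) = 0
    nll = -np.log(np.exp(theta[i]) / np.exp(theta).sum())
    assert np.isclose(fy, nll)
# The base-2 logistic loss used on the slides is this value divided by ln 2.
```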
  30. SparseMAP (Niculae et al. 2018) is a Fenchel–Young Loss
    If $\Omega(y) = \frac{1}{2}\|y\|_2^2 + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}(y)$, then
    $S_\Omega(\theta; y) = \Omega(y) + \Omega^*(\theta) - \langle\theta, y\rangle$
    $= \frac{1}{2}\|y\|_2^2 + \max\{\langle\theta, y'\rangle - \frac{1}{2}\|y'\|_2^2 : y'\in\mathrm{conv}(\mathcal{Y})\} - \langle\theta, y\rangle$
    $= \frac{1}{2}\|y - \theta\|_2^2 - \frac{1}{2}\min\{\|y' - \theta\|_2^2 : y'\in\mathrm{conv}(\mathcal{Y})\}$.
    Niculae et al. SparseMAP: Differentiable Sparse Structured Inference. ICML, 2018. (Figure 1 in Niculae et al. 2018.)
  31. Properties of Fenchel–Young Loss
    Assumptions on $\Omega$: $\Omega$ is expressed as $\Psi + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}$, where
    1. $\mathrm{conv}(\mathcal{Y})\subseteq\mathrm{dom}(\Psi)$ and $\mathrm{dom}(\Psi^*) = \mathbb{R}^d$,
    2. $\Psi$ is $\lambda$-strongly convex with respect to $\|\cdot\|$ ($\ell_1$ or $\ell_2$), and
    3. $\Psi$ is Legendre-type, i.e., $\|\nabla\Psi(y_i)\|_2\to+\infty$ whenever $y_1, y_2, \dots$ converges to the boundary of $\mathrm{int}(\mathrm{dom}(\Psi))$.
    Proposition (Blondel et al. 2020). The FY loss $S_\Omega(\theta; y) := \Omega(y) + \Omega^*(\theta) - \langle\theta, y\rangle$ generated by the above $\Omega$ satisfies
    $\nabla S_\Omega(\theta; y) = \hat{y}_\Omega(\theta) - y$ and $\|\nabla S_\Omega(\theta; y)\|^2 \le \frac{2}{\lambda} S_\Omega(\theta; y)$.
    Blondel et al. Learning with Fenchel–Young losses. JMLR, 2020.
  32. Properties of Fenchel–Young Loss: $\nabla S_\Omega(\theta; y) = \hat{y}_\Omega(\theta) - y$
    $\nabla S_\Omega(\theta; y) := \nabla_\theta\left(\Omega(y) + \Omega^*(\theta) - \langle\theta, y\rangle\right) = \nabla\Omega^*(\theta) - y$, where
    $\nabla\Omega^*(\theta) = \nabla\max\{\langle\theta, y\rangle - \Omega(y) : y\in\mathrm{conv}(\mathcal{Y})\}$ (since $\mathrm{dom}(\Omega) = \mathrm{dom}(\Psi + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}) = \mathrm{conv}(\mathcal{Y})$)
    $= \mathrm{argmax}\{\langle\theta, y\rangle - \Omega(y) : y\in\mathrm{conv}(\mathcal{Y})\}$ (Danskin's theorem)
    $= \hat{y}_\Omega(\theta)$.
    The gradient of the FY loss is the residual. (Figure: $-\nabla S_\Omega(\theta; y)$ points from $\hat{y}_\Omega(\theta)$ to $y$ in $\mathrm{conv}(\mathcal{Y})$.)
  33. Properties of Fenchel–Young Loss: $\|\nabla S_\Omega(\theta; y)\|^2 \le \frac{2}{\lambda} S_\Omega(\theta; y)$
    Define the Bregman divergence $B(y\,\|\,y') := \Psi(y) - \Psi(y') - \langle\nabla\Psi(y'), y - y'\rangle$. Then
    $S_\Omega(\theta; y) = \Omega(y) + \Omega^*(\theta) - \langle\theta, y\rangle$
    $= \Psi(y) - \langle\theta, y\rangle + \langle\theta, \hat{y}_\Omega(\theta)\rangle - \Psi(\hat{y}_\Omega(\theta))$ (since $\Omega = \Psi + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}$ and $\hat{y}_\Omega(\theta)$ attains the max)
    $= B(y\,\|\,\nabla\Psi^*(\theta)) - B(\hat{y}_\Omega(\theta)\,\|\,\nabla\Psi^*(\theta))$,
    where we use $\nabla\Psi(\nabla\Psi^*(\theta)) = \theta$ for Legendre-type $\Psi$ with $\mathrm{dom}(\Psi^*) = \mathbb{R}^d$, hence $B(y\,\|\,\nabla\Psi^*(\theta)) = \Psi(y) - \Psi(\nabla\Psi^*(\theta)) - \langle\theta, y - \nabla\Psi^*(\theta)\rangle$. Therefore,
    $S_\Omega(\theta; y) \ge B(y\,\|\,\hat{y}_\Omega(\theta))$ (Pythagorean theorem for Bregman divergences)
    $\ge \frac{\lambda}{2}\|y - \hat{y}_\Omega(\theta)\|^2$ ($\lambda$-strong convexity of $\Psi$)
    $= \frac{\lambda}{2}\|\nabla S_\Omega(\theta; y)\|^2$ (since $\hat{y}_\Omega(\theta) - y = \nabla S_\Omega(\theta; y)$).
  34. Online Structured Prediction
    Learn the linear estimator $W_t$ interactively for $t = 1, \dots, T$. $W_t: x_t\mapsto W_t x_t\in\mathbb{R}^d$; decoding function $\psi:\mathbb{R}^d\to\mathcal{Y}$; surrogate loss $S:\mathbb{R}^d\times\mathcal{Y}\to\mathbb{R}_{\ge 0}$.
    Learner 🤔 vs. adversary 😈, who selects $(x_t, y_t)\in\mathcal{X}\times\mathcal{Y}$ without knowledge of $\hat{y}_t$.
  35. Online Structured Prediction
    Same setup. Each round: the adversary reveals $x_t$.
  36. Online Structured Prediction
    Same setup. Each round: the adversary reveals $x_t$; the learner plays $\hat{y}_t = \psi(W_t x_t)$.
  37. Online Structured Prediction
    Same setup. Each round: the adversary reveals $x_t$; the learner plays $\hat{y}_t = \psi(W_t x_t)$; the adversary reveals $y_t$.
  38. Online Structured Prediction
    Same setup. Each round: the adversary reveals $x_t$; the learner plays $\hat{y}_t = \psi(W_t x_t)$; the adversary reveals $y_t$; the learner incurs $L(\hat{y}_t; y_t)$ and updates $W_t$ based on $S_t(W) := S(Wx_t; y_t)$.
    - E.g., OGD: $W_{t+1} \leftarrow W_t - \eta\nabla S_t(W_t)$.
  39. Surrogate Regret
    Let $L_t(\hat{y}) := L(\hat{y}; y_t)$ and $S_t(W) := S(Wx_t; y_t)$ for $t = 1, \dots, T$. The surrogate regret $R_T$ is defined via
    $\sum_{t=1}^{T}\mathbb{E}[L_t(\hat{y}_t)] = \sum_{t=1}^{T} S_t(U) + R_T$.
  40. Surrogate Regret
    Let $L_t(\hat{y}) := L(\hat{y}; y_t)$ and $S_t(W) := S(Wx_t; y_t)$ for $t = 1, \dots, T$, and define $R_T$ via $\sum_{t=1}^{T}\mathbb{E}[L_t(\hat{y}_t)] = \sum_{t=1}^{T} S_t(U) + R_T$.
    The surrogate regret $R_T$ is the extra target loss compared to the best possible in the surrogate loss framework: the target loss of playing $\hat{y}_1, \dots, \hat{y}_T$ minus the surrogate loss of the best linear model $U$.
  41. Previous Results for Online Classification
    $\sum_{t=1}^{T}\mathbb{E}[\mathbb{1}(\hat{y}_t\neq y_t)] = \sum_{t=1}^{T} S_t(U) + R_T$.
    - Perceptron (Rosenblatt 1958; OGD with $S_t$ = hinge loss): $R_T = O(\|U\|_F^2)$ if $d = 2$ and the data are separable ($S_t(U) = 0$).
    - Gaptron (Van der Hoeven 2020; $S_t$ = logistic, hinge, or smooth hinge): $R_T = O(d\|U\|_F^2)$.
    Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. Van der Hoeven. Exploiting the surrogate gap in online multiclass classification. NeurIPS, 2020.
  42. Previous Results for Online Classification
    Same comparison as above. Gaptron's surrogate regret $R_T = O(d\|U\|_F^2)$ is always finite! Idea: exploiting the surrogate gap.
  43. Previous Results for Online Classification
    Same comparison as above. Question: can we achieve finite surrogate regret in online structured prediction?
  44. Assumptions
    1. $L(y'; y) \le \gamma\|y' - y\|$ for $(y', y)\in\mathrm{conv}(\mathcal{Y})\times\mathcal{Y}$.
    Example: the 0-1 loss $\mathbb{1}(y'\neq e_i)$ can be extended to be affine in $y'\in\triangle^d$ as
    $L(y'; e_i) = y'^\top(\mathbf{1}\mathbf{1}^\top - I)\,e_i = \sum_{j\neq i} y'_j = 1 - y'_i = \frac{1}{2}\left(\sum_{j\neq i} y'_j + 1 - y'_i\right) = \frac{1}{2}\|y' - e_i\|_1$.
    Hence $\gamma = \frac{1}{2}$ for $\|\cdot\| = \|\cdot\|_1$.
  45. Assumptions
    1. $L(y'; y) \le \gamma\|y' - y\|$ for $(y', y)\in\mathrm{conv}(\mathcal{Y})\times\mathcal{Y}$.
    2. $\|y' - y\| \ge \nu$ for $y, y'\in\mathcal{Y}$ with $y\neq y'$: two distinct structures are easy to distinguish in terms of $\|\cdot\|$.
    Example: in multiclass classification, $\|e_{i'} - e_i\|_1 = 2$ for $i'\neq i$, so $\nu = 2$ for $\|\cdot\| = \|\cdot\|_1$. (Figure: simplex vertices $(1,0,0)$, $(0,1,0)$, $(0,0,1)$.)
  46. Assumptions
    1. $L(y'; y) \le \gamma\|y' - y\|$ for $(y', y)\in\mathrm{conv}(\mathcal{Y})\times\mathcal{Y}$.
    2. $\|y' - y\| \ge \nu$ for $y, y'\in\mathcal{Y}$ with $y\neq y'$.
    3. $S(\theta; y)$ satisfies $\|\hat{y}_\Omega(\theta) - y\|^2 \le \frac{2}{\lambda} S(\theta; y)$ for some $\lambda > \frac{4\gamma}{\nu}$, and $\|\nabla S(\theta; y)\|_2^2 \le b\,S(\theta; y)$.
    An FY loss with $\Omega = \Psi + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}$, where $\Psi$ is $\lambda$-strongly convex with $\lambda > \frac{4\gamma}{\nu}$, satisfies assumption 3 (with $b = \frac{2}{\lambda}$).
  47. Assumptions
    1. $L(y'; y) \le \gamma\|y' - y\|$ for $(y', y)\in\mathrm{conv}(\mathcal{Y})\times\mathcal{Y}$.
    2. $\|y' - y\| \ge \nu$ for $y, y'\in\mathcal{Y}$ with $y\neq y'$.
    3. $S(\theta; y)$ satisfies $\|\hat{y}_\Omega(\theta) - y\|^2 \le \frac{2}{\lambda} S(\theta; y)$ for some $\lambda > \frac{4\gamma}{\nu}$, and $\|\nabla S(\theta; y)\|_2^2 \le b\,S(\theta; y)$.
    An FY loss with $\Omega = \Psi + \mathbb{I}_{\mathrm{conv}(\mathcal{Y})}$, where $\Psi$ is $\lambda$-strongly convex with $\lambda > \frac{4\gamma}{\nu}$, satisfies assumption 3 (with $b = \frac{2}{\lambda}$).
    If $\Psi(y) = \sum_{i=1}^{d} y_i\ln y_i$ (1-strongly convex in $\ell_1$) and $S(\theta; e_i) = \log_2 e\cdot S_\Omega(\theta; e_i)$ (logistic loss), then
    $\|\hat{y}_\Omega(\theta) - e_i\|_1^2 = \|\nabla S_\Omega(\theta; e_i)\|_1^2 \le 2 S_\Omega(\theta; e_i) \le \frac{2}{\log_2 e} S(\theta; e_i)$ (so $\lambda = \log_2 e$), and
    $\|\nabla S(\theta; e_i)\|_2^2 \le (\log_2 e)^2\|\nabla S_\Omega(\theta; e_i)\|_1^2 \le (\log_2 e)^2\cdot 2 S_\Omega(\theta; e_i) = 2\log_2 e\cdot S(\theta; e_i)$ (so $b = 2\log_2 e$). A numerical check follows below.
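A numerical sanity check of these two constants, assuming NumPy. It verifies $\|\hat{y}_\Omega(\theta) - e_i\|_1^2 \le \frac{2}{\log_2 e}S(\theta; e_i)$ and $\|\nabla S(\theta; e_i)\|_2^2 \le 2\log_2 e\,S(\theta; e_i)$ on random scores; it is a sketch, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
log2e = 1.0 / np.log(2.0)

for _ in range(1000):
    d = int(rng.integers(2, 8))
    theta = 5.0 * rng.standard_normal(d)
    i = int(rng.integers(d))
    e_i = np.eye(d)[i]
    p = np.exp(theta - theta.max()); p /= p.sum()     # y_hat_Omega(theta) = softmax
    S = -np.log2(p[i])                                # base-2 logistic loss
    grad = log2e * (p - e_i)                          # gradient of S w.r.t. theta
    assert np.abs(p - e_i).sum() ** 2 <= (2.0 / log2e) * S + 1e-9   # lambda = log2(e)
    assert (grad ** 2).sum() <= 2.0 * log2e * S + 1e-9              # b = 2 log2(e)
```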
  48. Main Result
    Finite surrogate regret for online structured prediction: under the assumptions below, we can achieve
    $\sum_{t=1}^{T}\mathbb{E}[L_t(\hat{y}_t)] \le \sum_{t=1}^{T} S_t(U) + \frac{b\|U\|_F^2}{4\left(1 - \frac{4\gamma}{\lambda\nu}\right)}$.
    The additive term is finite (independent of $T$).
    1. $L(y'; y) \le \gamma\|y' - y\|$ for $(y', y)\in\mathrm{conv}(\mathcal{Y})\times\mathcal{Y}$.
    2. $\|y' - y\| \ge \nu$ for $y, y'\in\mathcal{Y}$ with $y\neq y'$.
    3. $S(\theta; y)$ satisfies $\|\hat{y}_\Omega(\theta) - y\|^2 \le \frac{2}{\lambda} S(\theta; y)$ for some $\lambda > \frac{4\gamma}{\nu}$, and $\|\nabla S(\theta; y)\|_2^2 \le b\,S(\theta; y)$.
  49. Application to Multiclass Classification
    Improved surrogate regret for online multiclass classification with the logistic loss. Applying the main result gives
    $\sum_{t=1}^{T}\mathbb{E}[\mathbb{1}(\hat{y}_t\neq y_t)] \le \sum_{t=1}^{T} S_t(U) + \frac{\log_2 e}{2(1 - \ln 2)}\|U\|_F^2$,
    which improves upon $O(d\|U\|_F^2)$ (Van der Hoeven 2020) by a factor of $d$, the number of classes.
    1. The 0-1 loss equals $\frac{1}{2}\|y' - e_i\|_1$ for $y'\in\triangle^d$, hence $\gamma = \frac{1}{2}$.
    2. $\|e_{i'} - e_i\|_1 = 2$ for $i'\neq i$, hence $\nu = 2$.
    3. The logistic loss satisfies $\|\hat{y}_\Omega(\theta) - e_i\|_1^2 \le \frac{2}{\log_2 e} S(\theta; e_i)$ and $\|\nabla S(\theta; e_i)\|_2^2 \le 2\log_2 e\cdot S(\theta; e_i)$.
  50. Learner's Strategy: OGD with Randomized Decoding
    Set $W_1$ to the all-zero matrix. For $t = 1, \dots, T$:
    - Observe $x_t$ and compute $\theta_t = W_t x_t$.
    - Select $\hat{y}_t = \psi(\theta_t)$, where $\psi$ is randomized decoding (next slide).
    - Incur $L_t(\hat{y}_t)$ and observe $y_t$.
    - $W_{t+1} \leftarrow W_t - \eta\nabla S_t(W_t)$.
    Randomized-decoding lemma: for any $(\theta, y)\in\mathbb{R}^d\times\mathcal{Y}$, it holds that $\mathbb{E}[L(\psi(\theta); y)] \le \frac{4\gamma}{\lambda\nu} S(\theta; y)$. (A code sketch of the loop follows below.)
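A minimal sketch of this strategy for multiclass classification with the logistic loss, assuming NumPy. The decoder $\psi$ is passed in as a function; a sketch of the randomized decoder itself appears after the randomized-decoding slides further below. The helper names (`ogd_with_decoding`, `grad_logistic`) are illustrative, not from the talk.

```python
import numpy as np

def grad_logistic(theta, i):
    """Gradient of S(theta; e_i) = -log2 softmax_i(theta) with respect to theta."""
    p = np.exp(theta - theta.max()); p /= p.sum()
    return (p - np.eye(len(theta))[i]) / np.log(2)

def ogd_with_decoding(stream, num_classes, num_features, eta, decode, rng):
    """Online loop: play psi(W_t x_t), observe y_t, take one OGD step on S_t."""
    W = np.zeros((num_classes, num_features))             # W_1 = 0
    total_target_loss = 0.0
    for x, y in stream:                                   # x_t revealed first
        theta = W @ x                                     # theta_t = W_t x_t
        y_hat = decode(theta, rng)                        # y_hat_t = psi(theta_t)
        total_target_loss += float(y_hat != y)            # 0-1 target loss
        W -= eta * np.outer(grad_logistic(theta, y), x)   # OGD: W - eta * grad S_t
    return W, total_target_loss
```

The analysis on the following slides suggests the step size $\eta = 2a/b$.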
  51. Exploiting the Surrogate Gap in Online Structured Prediction
    We have
    (i) $\mathbb{E}[L_t(\psi(W_t x_t))] \le (1 - a) S_t(W_t)$ and (ii) $\|\nabla S_t(W_t)\|_F^2 \le b\,S_t(W_t)$.
    Randomized decoding $\psi$ ensures (i) with $a = 1 - \frac{4\gamma}{\lambda\nu}\in(0, 1)$; (ii) follows from the assumption on the surrogate, $\|\nabla S(\theta; y)\|_2^2 \le b\,S(\theta; y)$.
  52. Exploiting the Surrogate Gap in Online Structured Prediction
    We have (i) $\mathbb{E}[L_t(\psi(W_t x_t))] \le (1 - a) S_t(W_t)$ and (ii) $\|\nabla S_t(W_t)\|_F^2 \le b\,S_t(W_t)$, where randomized decoding ensures (i) and the assumption on the surrogate gives (ii).
    From OGD's regret bound and (ii),
    $\sum_{t=1}^{T}\left(S_t(W_t) - S_t(U)\right) \le \frac{\|U\|_F^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|\nabla S_t(W_t)\|_F^2 \le \frac{\|U\|_F^2}{2\eta} + \frac{\eta b}{2}\sum_{t=1}^{T} S_t(W_t)$.
  53. Exploiting the Surrogate Gap in Online Structured Prediction
    From (i) and the above regret bound, the surrogate regret is bounded as
    $\sum_{t=1}^{T}\left(\mathbb{E}[L_t(\psi(W_t x_t))] - S_t(U)\right) \le \sum_{t=1}^{T}\left(S_t(W_t) - S_t(U)\right) - a\sum_{t=1}^{T} S_t(W_t) \le \frac{\|U\|_F^2}{2\eta} - \left(a - \frac{\eta b}{2}\right)\sum_{t=1}^{T} S_t(W_t)$.
    Setting $\eta = \frac{2a}{b}$ makes the coefficient $a - \frac{\eta b}{2}$ vanish and gives $\frac{\|U\|_F^2}{2\eta} = \frac{b\|U\|_F^2}{4a}$, i.e., a finite bound of $\frac{b\|U\|_F^2}{4a}$, where $a = 1 - \frac{4\gamma}{\lambda\nu}$.
  54. Randomized Decoding
    1. Compute the regularized prediction $\hat{y}_\Omega(\theta) := \mathrm{argmax}_{y'\in\mathrm{conv}(\mathcal{Y})}\{\langle\theta, y'\rangle - \Omega(y')\}$. (Figure: $\hat{y}_\Omega(\theta)$ inside $\mathrm{conv}(\mathcal{Y})$.)
  55. Randomized Decoding
    1. Compute the regularized prediction $\hat{y}_\Omega(\theta) := \mathrm{argmax}_{y'\in\mathrm{conv}(\mathcal{Y})}\{\langle\theta, y'\rangle - \Omega(y')\}$.
    2. Let $\Delta^* := \min_{y^*\in\mathcal{Y}}\|y^* - \hat{y}_\Omega(\theta)\|$, i.e., the distance to the closest $y^*\in\mathcal{Y}$. (Figure: $\hat{y}_\Omega(\theta)$, the closest vertex $y^*$, and $\Delta^*$.)
  56. Randomized Decoding
    1. Compute the regularized prediction $\hat{y}_\Omega(\theta) := \mathrm{argmax}_{y'\in\mathrm{conv}(\mathcal{Y})}\{\langle\theta, y'\rangle - \Omega(y')\}$.
    2. Let $\Delta^* := \min_{y^*\in\mathcal{Y}}\|y^* - \hat{y}_\Omega(\theta)\|$, i.e., the distance to the closest $y^*\in\mathcal{Y}$.
    3. Set $p = \min\{1, 2\Delta^*/\nu\}$ and return $\hat{y} = \psi(\theta)\in\mathcal{Y}$ as follows: $\hat{y} = y^*$ with probability $1 - p$, and $\hat{y} = \tilde{y}$ with probability $p$, where $\mathbb{E}[\tilde{y}] = \hat{y}_\Omega(\theta)$.
    A smaller $p$ (or $\Delta^*$) means higher confidence in $y^*$. (A code sketch for the multiclass case follows below.)
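A minimal sketch of randomized decoding for the multiclass case, assuming NumPy, the entropic regularizer (so $\hat{y}_\Omega(\theta)$ is the softmax point), and $\|\cdot\| = \|\cdot\|_1$ with $\nu = 2$; sampling a class from the softmax distribution realizes $\mathbb{E}[\tilde{y}] = \hat{y}_\Omega(\theta)$. The name `randomized_decode` is illustrative, not from the talk.

```python
import numpy as np

def randomized_decode(theta, rng, nu=2.0):
    p_soft = np.exp(theta - theta.max()); p_soft /= p_soft.sum()     # step 1: softmax
    y_star = int(np.argmax(p_soft))                     # closest vertex in ell_1
    delta_star = np.abs(np.eye(len(theta))[y_star] - p_soft).sum()   # step 2
    p = min(1.0, 2.0 * delta_star / nu)                 # step 3
    if rng.random() < p:
        return int(rng.choice(len(theta), p=p_soft))    # y_tilde: E[e_y] = softmax
    return y_star
```

Plugging this decoder into the OGD loop sketched after slide 50 gives the full learner.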
  57. Randomized Decoding
    Procedure: (1) compute $\hat{y}_\Omega(\theta)$; (2) let $\Delta^* := \min_{y^*\in\mathcal{Y}}\|y^* - \hat{y}_\Omega(\theta)\|$; (3) set $p = \min\{1, 2\Delta^*/\nu\}$ and return $y^*$ with probability $1 - p$ or $\tilde{y}$ with probability $p$, where $\mathbb{E}[\tilde{y}] = \hat{y}_\Omega(\theta)$. A smaller $p$ (or $\Delta^*$) means higher confidence in $y^*$.
    Goal: for all $(\theta, y)\in\mathbb{R}^d\times\mathcal{Y}$, $\mathbb{E}[L(\psi(\theta); y)] \le \frac{4\gamma}{\lambda\nu} S(\theta; y)$.
  58. Proof
    Goal: for all $(\theta, y)\in\mathbb{R}^d\times\mathcal{Y}$, $\mathbb{E}[L(\psi(\theta); y)] \le \frac{4\gamma}{\lambda\nu} S(\theta; y)$.
    By the assumption on the surrogate loss, $\frac{2}{\lambda} S(\theta; y) \ge \|\hat{y}_\Omega(\theta) - y\|^2 =: \Delta^2$. (Figure: $\hat{y}_\Omega(\theta)$, the closest vertex $y^*$ at distance $\Delta^*$, and the true $y$ at distance $\Delta$.)
  59. Proof
    By the assumption on the surrogate loss, $\frac{2}{\lambda} S(\theta; y) \ge \|\hat{y}_\Omega(\theta) - y\|^2 =: \Delta^2$, so the goal $\mathbb{E}[L(\psi(\theta); y)] \le \frac{4\gamma}{\lambda\nu} S(\theta; y)$ follows once we establish:
    Goal: for all $(\theta, y)\in\mathbb{R}^d\times\mathcal{Y}$, $\mathbb{E}[L(\psi(\theta); y)] \le \frac{2\gamma}{\nu}\Delta^2$.
  60. Proof
    Goal: for all $(\theta, y)\in\mathbb{R}^d\times\mathcal{Y}$, $\mathbb{E}[L(\psi(\theta); y)] \le \frac{2\gamma}{\nu}\Delta^2$.
    Procedure: (1) compute $\hat{y}_\Omega(\theta)$; (2) let $\Delta^* := \min_{y^*\in\mathcal{Y}}\|y^* - \hat{y}_\Omega(\theta)\|$; (3) set $p = \min\{1, 2\Delta^*/\nu\}$ and return $y^*$ with probability $1 - p$ or $\tilde{y}$ with probability $p$, where $\mathbb{E}[\tilde{y}] = \hat{y}_\Omega(\theta)$.
    The expected target loss is $\mathbb{E}[L(\psi(\theta); y)] = (1 - p)L(y^*; y) + pL(\hat{y}_\Omega(\theta); y)$, using $\mathbb{E}[L(\tilde{y}; y)] = L(\hat{y}_\Omega(\theta); y)$ since $L(\cdot\,; y)$ is affine.
  61. Proof
    Goal: for all $(\theta, y)\in\mathbb{R}^d\times\mathcal{Y}$, $\mathbb{E}[L(\psi(\theta); y)] \le \frac{2\gamma}{\nu}\Delta^2$.
    The expected target loss $\mathbb{E}[L(\psi(\theta); y)] = (1 - p)L(y^*; y) + pL(\hat{y}_\Omega(\theta); y)$ equals
    - $pL(\hat{y}_\Omega(\theta); y)$ if $\Delta^* \ge \nu/2$ or $y^* = y$ (case (i): either $p = 1$ or $L(y^*; y) = 0$), and
    - $(1 - p)L(y^*; y) + pL(\hat{y}_\Omega(\theta); y)$ if $\Delta^* < \nu/2$ and $y^* \ne y$ (case (ii)).
  62. Case (i): $\Delta^* \ge \nu/2$ or $y^* = y$
    Goal: $\mathbb{E}[L(\psi(\theta); y)] \le \frac{2\gamma}{\nu}\Delta^2$.
    $\mathbb{E}[L(\psi(\theta); y)] = pL(\hat{y}_\Omega(\theta); y)$
    $\le \frac{2}{\nu}\Delta^*\cdot L(\hat{y}_\Omega(\theta); y)$ (since $p = \min\{1, \frac{2}{\nu}\Delta^*\} \le \frac{2}{\nu}\Delta^*$)
    $\le \frac{2}{\nu}\Delta\cdot L(\hat{y}_\Omega(\theta); y)$ (since $\Delta^* \le \Delta$)
    $\le \frac{2}{\nu}\Delta\cdot\gamma\Delta = \frac{2\gamma}{\nu}\Delta^2$ (since $L(\hat{y}_\Omega(\theta); y) \le \gamma\|\hat{y}_\Omega(\theta) - y\| = \gamma\Delta$).
  63. Case (ii): $\Delta^* < \nu/2$ and $y^* \ne y$
    Goal: $\mathbb{E}[L(\psi(\theta); y)] \le \frac{2\gamma}{\nu}\Delta^2$.
    $\mathbb{E}[L(\psi(\theta); y)] = (1 - p)L(y^*; y) + pL(\hat{y}_\Omega(\theta); y)$
    $= \left(1 - \frac{2}{\nu}\Delta^*\right)L(y^*; y) + \frac{2}{\nu}\Delta^*\cdot L(\hat{y}_\Omega(\theta); y)$ (since $p = \min\{1, \frac{2}{\nu}\Delta^*\} = \frac{2}{\nu}\Delta^*$ here)
    $\le \left(1 - \frac{2}{\nu}\Delta^*\right)\gamma\|y^* - y\| + \frac{2}{\nu}\Delta^*\cdot\gamma\|\hat{y}_\Omega(\theta) - y\|$ (since $L(y'; y) \le \gamma\|y' - y\|$)
    $\le \left(1 - \frac{2}{\nu}\Delta^*\right)\gamma\left(\|y^* - \hat{y}_\Omega(\theta)\| + \|\hat{y}_\Omega(\theta) - y\|\right) + \frac{2}{\nu}\Delta^*\cdot\gamma\|\hat{y}_\Omega(\theta) - y\|$ (triangle inequality)
    $= \left(1 - \frac{2}{\nu}\Delta^*\right)\gamma(\Delta^* + \Delta) + \frac{2}{\nu}\Delta^*\cdot\gamma\Delta$.
  64. Case (ii): $\Delta^* < \nu/2$ and $y^* \ne y$
    We have confirmed $\mathbb{E}[L(\psi(\theta); y)] \le \left(1 - \frac{2}{\nu}\Delta^*\right)\gamma(\Delta^* + \Delta) + \frac{2}{\nu}\Delta^*\cdot\gamma\Delta$. It remains to show that the right-hand side is at most $\frac{2\gamma}{\nu}\Delta^2$.
  65. Case (ii): $\Delta^* < \nu/2$ and $y^* \ne y$
    It remains to show $\left(1 - \frac{2}{\nu}\Delta^*\right)\gamma(\Delta^* + \Delta) + \frac{2}{\nu}\Delta^*\cdot\gamma\Delta \le \frac{2\gamma}{\nu}\Delta^2$.
    Dividing by $\gamma\nu$ and letting $u = \Delta^*/\nu$ and $v = \Delta/\nu$, the desired inequality is
    $(1 - 2u)(u + v) + 2uv \le 2v^2 \iff 2u^2 + 2v^2 - u - v \ge 0$,
    where $u + v \ge 1$ and $0 \le u < 1/2 < v$, because $u + v = (\Delta^* + \Delta)/\nu \ge \|y^* - y\|/\nu \ge 1$ (using $\|y^* - y\| \ge \nu$) and $u = \Delta^*/\nu < 1/2$ (hence $v \ge 1 - u > 1/2$).
  66. Case (ii): $\Delta^* < \nu/2$ and $y^* \ne y$
    From $u + v \ge 1$ and $0 \le u < 1/2 < v$,
    $2u^2 + 2v^2 - u - v = (u + v - 1)(2u + 2v - 1) + (1 - 2u)(2v - 1) \ge 0$,
    since every factor is nonnegative. Hence $\mathbb{E}[L(\psi(\theta); y)] \le \left(1 - \frac{2}{\nu}\Delta^*\right)\gamma(\Delta^* + \Delta) + \frac{2}{\nu}\Delta^*\cdot\gamma\Delta \le \frac{2\gamma}{\nu}\Delta^2$, as desired. (A quick numerical check follows below.)
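A quick numerical check of the key inequality on its feasible region, assuming NumPy; it is a sketch, not part of the proof.

```python
import numpy as np

# Check 2u^2 + 2v^2 - u - v >= 0 on {u + v >= 1, 0 <= u < 1/2 < v} over a grid.
us = np.linspace(0.0, 0.5, 501)[:-1]        # 0 <= u < 1/2
vs = np.linspace(0.5, 5.0, 2001)[1:]        # 1/2 < v <= 5
U, V = np.meshgrid(us, vs)
mask = U + V >= 1.0
expr = 2 * U**2 + 2 * V**2 - U - V
assert (expr[mask] >= -1e-12).all()
```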
  67. Conclusion
    - A finite surrogate regret bound for online structured prediction with FY losses.
    - An improved bound of $O(\|U\|_F^2)$ for online multiclass classification with the logistic loss.
    - Key idea: exploiting the surrogate gap with randomized decoding.
    Additional results:
    - A high-probability bound.
    - Online-to-batch conversion.
    - Making the bound vanish as $L/S \to 0$ (thanks to a suggestion by a COLT reviewer).
    Future directions:
    - Extension to the bandit feedback setting.
    - Combining with the Fitzpatrick loss (Rakotomandimby et al. 2024).
    Rakotomandimby et al. Learning with Fitzpatrick Losses. arXiv:2405.14574, 2024. https://arxiv.org/abs/2405.14574.
  68. Supplementary: Regret Bound of OGD
    Algorithm: 1: fix $W_1 = 0$; 2: for $t = 1, \dots, T$: 3: observe $S_t$ (or $x_t$ and $y_t$); 4: $W_{t+1} \leftarrow W_t - \eta\nabla S_t(W_t)$.
    From the update rule, $\|W_{t+1} - U\|_F^2 = \|W_t - U\|_F^2 + \eta^2\|\nabla S_t(W_t)\|_F^2 - 2\eta\langle\nabla S_t(W_t), W_t - U\rangle$.
    From the convexity of $S_t$,
    $S_t(W_t) - S_t(U) \le \langle\nabla S_t(W_t), W_t - U\rangle = \frac{\|W_t - U\|_F^2 - \|W_{t+1} - U\|_F^2}{2\eta} + \frac{\eta}{2}\|\nabla S_t(W_t)\|_F^2$.
    By telescoping, ignoring $-\|W_{T+1} - U\|_F^2 \le 0$, and using $\|W_1 - U\|_F^2 = \|U\|_F^2$ (since $W_1 = 0$),
    $\sum_{t=1}^{T}\left(S_t(W_t) - S_t(U)\right) \le \frac{\|U\|_F^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|\nabla S_t(W_t)\|_F^2$.
    Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003. Orabona. A modern introduction to online learning. arXiv:1912.13213, 2023. https://arxiv.org/abs/1912.13213v6.