Online Structured Prediction with Fenchel–Young Losses and Improved Surrogate Regret for Online Multiclass Classification with Logistic Loss

Shinsaku Sakaue

July 23, 2024

Transcript

  1. Online Structured Prediction with Fenchel–Young Losses and Improved Surrogate Regret
     for Online Multiclass Classification with Logistic Loss. Shinsaku Sakaue¹, Han Bao², Taira Tsuchiya¹, Taihei Oki¹ (¹The University of Tokyo, ²Kyoto University). AFSA-MI-CS Joint Seminar @ UTokyo.
  2. Self-introduction 1. Research interests: optimization (discrete and continuous); data structures (BDD, ZDD, etc.); learning theory (online learning, sample complexity).
     Education and employment history: Apr. 2014 – Mar. 2016: The University of Tokyo (Master; supervised by Prof. Takeda). Apr. 2016 – Now: NTT Communication Science Laboratories. Oct. 2018 – Mar. 2020: Kyoto University (Doctor; supervised by Prof. Minato). Apr. 2020 – Now: The University of Tokyo (ERATO Assist. Prof. from NTT).
  3. Overview 2. Online structured prediction with a surrogate loss: target loss L: 𝒴×𝒴 → ℝ_{≥0}, surrogate loss S: ℝ^d×𝒴 → ℝ_{≥0}. The adversary 😈 selects (x_t, y_t) ∈ 𝒳×𝒴 without knowledge of ŷ_t and reveals x_t; the learner 🤔 plays ŷ_t; the adversary reveals y_t.
     Main result: a finite surrogate regret bound. ∑_{t=1}^T E[L_t(ŷ_t)] = ∑_{t=1}^T S_t(U) + R_T, where the left-hand side is the learner's target loss, the first term on the right is the best possible for the learner, and R_T is the surrogate regret (cf. the Perceptron's finite mistake bound).
  4. Contents 3. Introduction to structured prediction: surrogate loss framework; Fenchel–Young loss.
     Online structured prediction: surrogate regret; existing results on online multiclass classification. Our results: randomized decoding; exploiting the surrogate gap.
  5. 7 🤔 🍜 Which shop should I go to for

    lunch today? 🍣 😋 Open
  6. 8 🤔 🍜 Which shop should I go to for

    lunch today? 🍣 ✖ 😭 Closed
  7. 11 🤔 𝒳 = {weekday, holiday} = {(1,0)ᵀ, (0,1)ᵀ}. 𝒴 = {🍜, 🍣} = {(1,0)ᵀ, (0,1)ᵀ}.
  8. 12 🤔 𝒳 = {weekday, holiday} = {(1,0)ᵀ, (0,1)ᵀ}. 𝒴 = {🍜, 🍣} = {(1,0)ᵀ, (0,1)ᵀ}. Prediction model h: 𝒳 → 𝒴. Target loss L(ŷ; y) = 𝟙(ŷ ≠ y).
  9. 13 🤔 𝒳 = {weekday, holiday} = {(1,0)ᵀ, (0,1)ᵀ}. 𝒴 = {🍜, 🍣} = {(1,0)ᵀ, (0,1)ᵀ}. Prediction model h: 𝒳 → 𝒴. Target loss L(ŷ; y) = 𝟙(ŷ ≠ y). Observe x = (1,0)ᵀ.
  10. 14 🤔 Prediction model h: 𝒳 → 𝒴. Target loss L(ŷ; y) = 𝟙(ŷ ≠ y). Observe x = (1,0)ᵀ. Predict ŷ = (1,0)ᵀ.
  11. 15 🤔 Prediction model h: 𝒳 → 𝒴. Target loss L(ŷ; y) = 𝟙(ŷ ≠ y). Observe x = (1,0)ᵀ. Predict ŷ = (1,0)ᵀ. ✖ Closed. 😋 No loss.
  12. 16 🤔 Prediction model h: 𝒳 → 𝒴. Target loss L(ŷ; y) = 𝟙(ŷ ≠ y). Observe x = (1,0)ᵀ. Predict ŷ = (0,1)ᵀ.
  13. 17 🤔 Prediction model h: 𝒳 → 𝒴. Target loss L(ŷ; y) = 𝟙(ŷ ≠ y). Observe x = (1,0)ᵀ. Predict ŷ = (0,1)ᵀ. ✖ Closed. 😭 Incurs a loss of 1.
  14. Structured Prediction (supervised learning with structured outputs) 18. 𝒳:
      input vector space (𝒳 contained in the unit ℓ₂ ball, for simplicity). 𝒴: finite set of structured outputs. L: 𝒴×𝒴 → ℝ_{≥0}: target loss. Goal: learn h: 𝒳 → 𝒴 to minimize E_{x,y}[L(h(x), y)].
  15. Structured Prediction (supervised learning with structured outputs) 19. 𝒳: input
      vector space (𝒳 contained in the unit ℓ₂ ball, for simplicity). 𝒴: finite set of structured outputs. L: 𝒴×𝒴 → ℝ_{≥0}: target loss. Multiclass classification: 𝒴 = {🍜, 🍣, 🍖}, L(ŷ; y) = 𝟙(ŷ ≠ y). Multilabel classification: 𝒴 = {∅, {🍜}, …, {🍜, 🍣, 🍖}}, L(ŷ; y) = ∑_i 𝟙(ŷ_i ≠ y_i). Matching: 𝒴 = {all bipartite matchings}, L(ŷ; y) = #mismatches. [Figure: two bipartite matchings of 👦 👩🦰 🧑🦰 to Lab. 1–3, the prediction ŷ and the truth y, with L(ŷ; y) = 2.] Goal: learn h: 𝒳 → 𝒴 to minimize E_{x,y}[L(h(x), y)].
  16. Assumptions 20. 1. 𝒴 is embedded in ℝ^d. [Figure: the Birkhoff polytope (doubly stochastic matrices) for matching; the probability simplex with vertices (1,0,0), (0,1,0), (0,0,1) for multiclass classification; the unit hypercube with vertices (0,0,0), …, (1,1,1) for multilabel classification.]
  17. Assumptions 21. 1. 𝒴 is embedded in ℝ^d. 2. L(ŷ; y) is extended as an affine function of
      ŷ ∈ conv(𝒴). The 0-1 loss 𝟙(ŷ ≠ e_i) can be extended as an affine function of ŷ ∈ △^d: L(ŷ; e_i) = ŷᵀ(𝟏𝟏ᵀ − I)e_i = ∑_{j≠i} ŷ_j = 1 − ŷ_i. The (generalized) structure encoding loss function (SELF), which includes the 0-1 and Hamming losses, is affine: L(ŷ; y) = ⟨ŷ, Ay + b⟩ + c(y). Blondel. Structured prediction with projection oracles. NeurIPS, 2019. Ciliberto et al. A general framework for consistent structured prediction with implicit loss embeddings. JMLR, 2020.
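As a quick sanity check of this affine extension (not from the slides; the helper name and test values below are mine), the matrix 𝟏𝟏ᵀ − I reproduces the 0-1 loss on the vertices of △^d and is affine on the rest of the simplex:

```python
import numpy as np

d = 4
A = np.ones((d, d)) - np.eye(d)   # matrix with 0 on the diagonal and 1 elsewhere

def zero_one_affine(y_hat, i):
    """Affine extension of the 0-1 loss: L(y_hat; e_i) = y_hat^T A e_i = 1 - y_hat[i]."""
    return y_hat @ A[:, i]

# On the vertices e_j of the simplex, the extension coincides with 1(j != i).
for i in range(d):
    for j in range(d):
        assert np.isclose(zero_one_affine(np.eye(d)[j], i), float(j != i))

# On any point of the simplex it is affine in y_hat: L(y_hat; e_i) = 1 - y_hat[i].
y_hat = np.random.default_rng(0).dirichlet(np.ones(d))
assert np.isclose(zero_one_affine(y_hat, 2), 1.0 - y_hat[2])
print("affine extension of the 0-1 loss verified")
```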
  18. 22 🤔 𝒳 = {(1,0)ᵀ, (0,1)ᵀ} (weekday, holiday). 𝒴 = {🍜, 🍣} = {(1,0)ᵀ, (0,1)ᵀ}. Learning h: 𝒳 → 𝒴 for discrete 𝒴 is not easy… Try logistic regression!
  19. 23 🤔 𝒳 = {(1,0)ᵀ, (0,1)ᵀ}, 𝒴 = {🍜, 🍣} = {(1,0)ᵀ, (0,1)ᵀ}. Linear estimator W: 𝒳 → ℝ^d. Logistic loss S: ℝ^d×𝒴 → ℝ. Decoding function ψ: ℝ^d → 𝒴.
  20. 24 🤔 Linear estimator W: 𝒳 → ℝ^d, logistic loss S: ℝ^d×𝒴 → ℝ, decoding function ψ: ℝ^d → 𝒴. Observe x = (1,0)ᵀ. Estimate the score θ = Wx. Predict ŷ = ψ(θ). [Figure: score space with axes θ₁, θ₂; classes 🍜, 🍣; inputs weekday, holiday.]
  21. 25 🤔 Observe x = (1,0)ᵀ. Estimate the score θ = Wx. Predict ŷ = ψ(θ). ✖ 😋
  22. 26 🤔 Observe x = (1,0)ᵀ. Estimate the score θ = Wx. Predict ŷ = ψ(θ). ✖ 😋 Observe y = (1,0)ᵀ. Update W based on ∇S(θ; y) to encourage the current trend.
  23. 27 🤔 Observe x = (0,1)ᵀ. Estimate the score θ = Wx. Predict ŷ = ψ(θ).
  24. 28 🤔 Observe x = (0,1)ᵀ. Estimate the score θ = Wx. Predict ŷ = ψ(θ). ✖ 😭
  25. 29 🤔 Observe x = (0,1)ᵀ. Estimate the score θ = Wx. Predict ŷ = ψ(θ). ✖ 😭 Observe y = (1,0)ᵀ. Update W based on ∇S(θ; y) to discourage the current trend.
  26. Surrogate Loss Framework 30. S(θ; y) measures how well θ
      = Wx is aligned with y. [Diagram: 𝒳 →(W) ℝ^d →(ψ) 𝒴, mapping x ↦ θ ↦ ŷ, compared with y.] Define a score space ℝ^d between 𝒳 and 𝒴. Focus on the linear estimator W: 𝒳 ∋ x ↦ θ = Wx ∈ ℝ^d. Learn W by minimizing a surrogate loss S: ℝ^d×𝒴 → ℝ_{≥0}. Logistic loss: S(θ; e_i) = −log₂(exp θ_i / ∑_{j=1}^d exp θ_j). The decoding function ψ maps a score θ ∈ ℝ^d to a prediction ŷ ∈ 𝒴.
  27. Definition (Fenchel–Young loss; Blondel et al. 2020) 31. Let Ω:
      ℝ^d → ℝ ∪ {+∞} be a convex regularizer with 𝒴 ⊆ dom(Ω). The Fenchel–Young loss S_Ω: ℝ^d×𝒴 → ℝ_{≥0} generated by Ω is defined as S_Ω(θ; y) := Ω(y) + Ω*(θ) − ⟨θ, y⟩, where Ω*(θ) := sup_{y∈ℝ^d} {⟨θ, y⟩ − Ω(y)} is the convex conjugate. cf. the Fenchel–Young inequality: Ω(y) + Ω*(θ) ≥ ⟨θ, y⟩ for all (y, θ) ∈ dom(Ω)×dom(Ω*). The Fenchel–Young loss quantifies the discrepancy between θ and y as the gap in the Fenchel–Young inequality. Blondel et al. Learning with Fenchel–Young losses. JMLR, 2020.
  28. Definition (Regularized prediction) 32. Let Ω: ℝ^d → ℝ ∪
      {+∞} be a convex regularizer with 𝒴 ⊆ dom(Ω). The regularized prediction is defined as ŷ_Ω(θ) ∈ argmax {⟨θ, y⟩ − Ω(y) : y ∈ dom(Ω)}. [Figure: a score θ outside conv(𝒴) and its regularized prediction ŷ_Ω(θ) inside conv(𝒴).] If Ω(y) = ∑_{i=1}^d y_i ln y_i + 𝕀_{△^d}(y) (Shannon entropy), then ŷ_Ω(θ)_i = exp θ_i / ∑_{j=1}^d exp θ_j (softmax). If Ω(y) = ½‖y‖₂² + 𝕀_{conv(𝒴)}(y) (squared ℓ₂), then ŷ_Ω(θ) = argmin {‖y − θ‖₂² : y ∈ conv(𝒴)}.
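A minimal NumPy sketch of these two regularized predictions for the multiclass case, where conv(𝒴) = △^d (the function names are mine, and the sort-based simplex projection is a standard routine rather than something from the talk):

```python
import numpy as np

def softmax(theta):
    """Regularized prediction for the Shannon-entropy regularizer on the simplex."""
    z = np.exp(theta - theta.max())  # shift for numerical stability
    return z / z.sum()

def project_simplex(theta):
    """Regularized prediction for the squared-l2 regularizer on the simplex:
    argmin_{y in simplex} ||y - theta||_2^2, via the sort-based algorithm."""
    u = np.sort(theta)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(theta) + 1)
    cond = u - css / idx > 0
    rho = idx[cond][-1]
    tau = css[cond][-1] / rho
    return np.maximum(theta - tau, 0.0)

theta = np.array([1.2, -0.3, 0.5])
print(softmax(theta))          # dense: every coordinate is positive
print(project_simplex(theta))  # can be sparse: some coordinates exactly 0
```

The entropy case always returns a dense distribution, while the Euclidean projection can put exactly zero mass on some outputs.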
  29. Logistic Loss is a Fenchel–Young Loss 33. If Ω(y)
      = ∑_{i=1}^d y_i ln y_i + 𝕀_{△^d}(y), then
      S_Ω(θ; e_i) = Ω(e_i) + Ω*(θ) − ⟨θ, e_i⟩
      = 0 + max {⟨θ, y⟩ − ∑_{j=1}^d y_j ln y_j : y ∈ △^d} − θ_i
      = ∑_j θ_j (exp θ_j / ∑_k exp θ_k) − ∑_j (exp θ_j / ∑_k exp θ_k) ln(exp θ_j / ∑_k exp θ_k) − θ_i
      = ∑_j θ_j (exp θ_j / ∑_k exp θ_k) − ∑_j (exp θ_j / ∑_k exp θ_k)(θ_j − ln ∑_k exp θ_k) − θ_i
      = ∑_j (exp θ_j / ∑_k exp θ_k) ln(∑_k exp θ_k) − θ_i = ln ∑_j exp θ_j − θ_i
      = −ln(exp θ_i / ∑_j exp θ_j) = (1/log₂e) (−log₂(exp θ_i / ∑_j exp θ_j)).   (logistic loss)
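A small numeric check of this chain of equalities (the toy values are mine; the maximizer softmax(θ) from the previous slide is plugged into Ω* directly):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def fy_entropy(theta, i):
    """S_Omega(theta; e_i) = Omega(e_i) + Omega*(theta) - theta_i with
    Omega(y) = sum_j y_j ln y_j on the simplex (so Omega(e_i) = 0)."""
    p = softmax(theta)                      # maximizer of <theta, y> - Omega(y)
    omega_star = theta @ p - np.sum(p * np.log(p))
    return 0.0 + omega_star - theta[i]

theta = np.random.default_rng(0).normal(size=5)
i = 2
logistic = -np.log2(softmax(theta)[i])      # logistic loss with log base 2
assert np.isclose(logistic, np.log2(np.e) * fy_entropy(theta, i))
assert np.isclose(fy_entropy(theta, i), np.log(np.exp(theta).sum()) - theta[i])
print("logistic loss = log2(e) * Fenchel-Young loss with the entropy regularizer")
```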
  30. SparseMAP (Niculae et al. 2018) is a Fenchel–Young Loss 34.
      If Ω(y) = ½‖y‖₂² + 𝕀_{conv(𝒴)}(y), then
      S_Ω(θ; y) = Ω(y) + Ω*(θ) − ⟨θ, y⟩
      = ½‖y‖₂² + max {⟨θ, y′⟩ − ½‖y′‖₂² : y′ ∈ conv(𝒴)} − ⟨θ, y⟩
      = ½‖y − θ‖₂² − ½ min {‖y′ − θ‖₂² : y′ ∈ conv(𝒴)}.
      Niculae et al. SparseMAP: Differentiable Sparse Structured Inference. ICML, 2018. [Figure 1 in Niculae et al. (2018).]
  31. Properties of the Fenchel–Young Loss 35. Assumptions on Ω: Ω is expressed as Ψ
      + 𝕀_{conv(𝒴)}, where 1. conv(𝒴) ⊆ dom(Ψ) and dom(Ψ*) = ℝ^d, 2. Ψ is λ-strongly convex with respect to ‖⋅‖ (ℓ₁ or ℓ₂), and 3. Ψ is Legendre-type, i.e., ‖∇Ψ(y_i)‖₂ → +∞ whenever y₁, y₂, … converges to the boundary of int(dom Ψ).
      Proposition (Blondel et al. 2020). The FY loss S_Ω(θ; y) := Ω(y) + Ω*(θ) − ⟨θ, y⟩ generated by the above Ω satisfies ∇S_Ω(θ; y) = ŷ_Ω(θ) − y and ‖∇S_Ω(θ; y)‖² ≤ (2/λ) S_Ω(θ; y). Blondel et al. Learning with Fenchel–Young losses. JMLR, 2020.
  32. Properties of the Fenchel–Young Loss: ∇S_Ω(θ; y) = ŷ_Ω(θ) − y 36.
      ∇S_Ω(θ; y) := ∇_θ (Ω(y) + Ω*(θ) − ⟨θ, y⟩) = ∇Ω*(θ) − y, where
      ∇Ω*(θ) = ∇ max {⟨θ, y⟩ − Ω(y) : y ∈ conv(𝒴)}   (∵ dom(Ω) = dom(Ψ + 𝕀_{conv(𝒴)}) = conv(𝒴))
      = argmax {⟨θ, y⟩ − Ω(y) : y ∈ conv(𝒴)}   (∵ Danskin's theorem)
      = ŷ_Ω(θ).
      [Figure: −∇S_Ω(θ; y) points from ŷ_Ω(θ) to y inside conv(𝒴).] The gradient of the FY loss is the residual.
  33. Properties of the Fenchel–Young Loss: ‖∇S_Ω(θ; y)‖² ≤ (2/λ) S_Ω(θ; y) 37.
      Define the Bregman divergence as B(y ‖ y′) := Ψ(y) − Ψ(y′) − ⟨∇Ψ(y′), y − y′⟩. Then
      S_Ω(θ; y) = Ω(y) + Ω*(θ) − ⟨θ, y⟩
      = Ψ(y) − ⟨θ, y⟩ + ⟨θ, ŷ_Ω(θ)⟩ − Ψ(ŷ_Ω(θ))   (∵ Ω = Ψ + 𝕀_{conv(𝒴)} and ŷ_Ω(θ) attains the max)
      = B(y ‖ ∇Ψ*(θ)) − B(ŷ_Ω(θ) ‖ ∇Ψ*(θ)).
      Here we use ∇Ψ(∇Ψ*(θ)) = θ for Legendre-type Ψ with dom(Ψ*) = ℝ^d, hence B(y ‖ ∇Ψ*(θ)) = Ψ(y) − Ψ(∇Ψ*(θ)) − ⟨θ, y − ∇Ψ*(θ)⟩. Therefore,
      S_Ω(θ; y) ≥ B(y ‖ ŷ_Ω(θ))   (∵ Pythagorean theorem for the Bregman divergence)
      ≥ (λ/2) ‖y − ŷ_Ω(θ)‖²   (∵ λ-strong convexity of Ψ)
      = (λ/2) ‖∇S_Ω(θ; y)‖².   (∵ ŷ_Ω(θ) − y = ∇S_Ω(θ; y))
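An illustrative check of the two properties for the entropy regularizer, which is 1-strongly convex w.r.t. ℓ₁ on △^d (so λ = 1); the finite-difference gradient and the test values below are my own additions:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def fy_loss(theta, i):
    """S_Omega(theta; e_i) for the entropy regularizer: logsumexp(theta) - theta_i."""
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum()) - theta[i]

rng = np.random.default_rng(1)
theta, i = rng.normal(size=6), 3
e_i = np.eye(6)[i]

# Property 1: the gradient is the residual, grad S_Omega = softmax(theta) - e_i.
eps = 1e-6
num_grad = np.array([(fy_loss(theta + eps * np.eye(6)[k], i) - fy_loss(theta, i)) / eps
                     for k in range(6)])
assert np.allclose(num_grad, softmax(theta) - e_i, atol=1e-4)

# Property 2: ||grad S_Omega||_1^2 <= (2/lambda) S_Omega with lambda = 1 w.r.t. l1
# (strong convexity of the entropy in l1, i.e., Pinsker's inequality).
assert np.linalg.norm(softmax(theta) - e_i, ord=1) ** 2 <= 2 * fy_loss(theta, i) + 1e-12
print("gradient-residual and self-bounding properties hold for this sample")
```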
  34. Online Structured Prediction 39. Learn the linear estimator W_t interactively for
      t = 1, …, T. Learner 🤔: W_t: x_t ↦ W_t x_t ∈ ℝ^d, decoding function ψ: ℝ^d → 𝒴, surrogate loss S: ℝ^d×𝒴 → ℝ_{≥0}. Adversary 😈: selects (x_t, y_t) ∈ 𝒳×𝒴 without knowledge of ŷ_t.
  35. Online Structured Prediction 40. Learn the linear estimator W_t interactively for
      t = 1, …, T. The adversary selects (x_t, y_t) ∈ 𝒳×𝒴 without knowledge of ŷ_t and reveals x_t.
  36. Online Structured Prediction 41. Learn the linear estimator W_t interactively for
      t = 1, …, T. The adversary reveals x_t; the learner plays ŷ_t = ψ(W_t x_t).
  37. Online Structured Prediction 42. Learn the linear estimator W_t interactively for
      t = 1, …, T. The adversary reveals x_t; the learner plays ŷ_t = ψ(W_t x_t); the adversary reveals y_t.
  38. Online Structured Prediction 43. Learn the linear estimator W_t interactively for
      t = 1, …, T. The adversary reveals x_t; the learner plays ŷ_t = ψ(W_t x_t); the adversary reveals y_t; the learner incurs L(ŷ_t; y_t) and updates W_t. The learner updates W_t based on S_t(W) := S(Wx_t; y_t), e.g., OGD: W_{t+1} ← W_t − η∇S_t(W_t).
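Concretely, the chain rule gives ∇_W S_t(W) = ∇S(Wx_t; y_t) x_tᵀ. A minimal sketch of one OGD update with the base-2 logistic surrogate (the function name and toy sizes are mine):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def ogd_step(W, x, y, eta):
    """One OGD update for S_t(W) = S(W x; y) with the base-2 logistic loss.
    grad_W S_t(W) = grad_theta S(theta; y) x^T  (chain rule, theta = W x),
    and grad_theta S(theta; y) = log2(e) * (softmax(theta) - y)."""
    theta = W @ x
    grad_theta = np.log2(np.e) * (softmax(theta) - y)
    return W - eta * np.outer(grad_theta, x)

d, n = 3, 5                      # number of classes, input dimension
rng = np.random.default_rng(0)
W = np.zeros((d, n))
x, y = rng.normal(size=n), np.eye(d)[1]
W = ogd_step(W, x, y, eta=0.5)
print(W.shape)                   # (3, 5)
```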
  39. Surrogate Regret 44. Let L_t(ŷ) := L(ŷ;
      y_t) and S_t(W) := S(Wx_t; y_t) for t = 1, …, T. ∑_{t=1}^T E[L_t(ŷ_t)] = ∑_{t=1}^T S_t(U) + R_T, where R_T is the surrogate regret.
  40. Surrogate Regret 45. The surrogate regret is the extra target loss compared to the best
      possible in the surrogate loss framework: R_T = (target loss of playing ŷ_1, …, ŷ_T) − (surrogate loss of the best linear model U). Let L_t(ŷ) := L(ŷ; y_t) and S_t(W) := S(Wx_t; y_t) for t = 1, …, T. ∑_{t=1}^T E[L_t(ŷ_t)] = ∑_{t=1}^T S_t(U) + R_T.
  41. Previous Results for Online Classification 46. ∑_{t=1}^T E[𝟙(ŷ_t
      ≠ y_t)] = ∑_{t=1}^T S_t(U) + R_T. Perceptron (Rosenblatt 1958; OGD with S_t = hinge loss): R_T = O(‖U‖_F²) if d = 2 and separable (S_t(U) = 0). Gaptron (Van der Hoeven 2020; S_t = logistic, hinge, smooth hinge): R_T = O(d‖U‖_F²). Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. Van der Hoeven. Exploiting the surrogate gap in online multiclass classification. NeurIPS, 2020.
  42. Previous Results for Online Classification 47. ∑_{t=1}^T E[𝟙(ŷ_t
      ≠ y_t)] = ∑_{t=1}^T S_t(U) + R_T. Perceptron (Rosenblatt 1958; OGD with S_t = hinge loss): R_T = O(‖U‖_F²) if d = 2 and separable (S_t(U) = 0). Gaptron (Van der Hoeven 2020; S_t = logistic, hinge, smooth hinge): R_T = O(d‖U‖_F²); the surrogate regret is always finite! Idea: exploiting the surrogate gap. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. Van der Hoeven. Exploiting the surrogate gap in online multiclass classification. NeurIPS, 2020.
  43. Previous Results for Online Classification 48. ∑_{t=1}^T E[𝟙(ŷ_t
      ≠ y_t)] = ∑_{t=1}^T S_t(U) + R_T. Perceptron (Rosenblatt 1958; OGD with S_t = hinge loss): R_T = O(‖U‖_F²) if d = 2 and separable (S_t(U) = 0). Gaptron (Van der Hoeven 2020; S_t = logistic, hinge, smooth hinge): R_T = O(d‖U‖_F²); the surrogate regret is always finite! Idea: exploiting the surrogate gap. Can we achieve finite surrogate regret in online structured prediction? Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. Van der Hoeven. Exploiting the surrogate gap in online multiclass classification. NeurIPS, 2020.
  44. Assumptions 50. 1. L(y′; y) ≤ γ‖y′ − y‖ for (y′, y) ∈ conv(𝒴)×𝒴.
      Example: the 0-1 loss 𝟙(y′ ≠ e_i) can be extended to be affine in y′ ∈ △^d as L(y′; e_i) = y′ᵀ(𝟏𝟏ᵀ − I)e_i = ∑_{j≠i} y′_j = 1 − y′_i = ½(∑_{j≠i} y′_j + 1 − y′_i) = ½‖y′ − e_i‖₁. Hence γ = ½ for ‖⋅‖ = ‖⋅‖₁.
  45. Assumptions 51. 1. L(y′; y) ≤ γ‖y′ − y‖ for (y′, y) ∈ conv(𝒴)×𝒴.
      2. ‖y′ − y‖ ≥ ν for y, y′ ∈ 𝒴 with y ≠ y′: two distinct structures are easy to distinguish in terms of ‖⋅‖. [Figure: the simplex with vertices (1,0,0), (0,1,0), (0,0,1); ‖e_{i′} − e_i‖₁ = 2.] Example: in the case of multiclass classification, we have ν = 2 for ‖⋅‖ = ‖⋅‖₁.
  46. Assumptions 52. 1. L(y′; y) ≤ γ‖y′ − y‖ for (y′, y) ∈ conv(𝒴)×𝒴.
      2. ‖y′ − y‖ ≥ ν for y, y′ ∈ 𝒴 with y ≠ y′.
      3. S(θ; y) satisfies ‖ŷ_Ω(θ) − y‖² ≤ (2/λ) S(θ; y) for some λ > 4γ/ν, and ‖∇S(θ; y)‖₂² ≤ b S(θ; y).
      The FY loss with Ω = Ψ + 𝕀_{conv(𝒴)}, where Ψ has strong-convexity constant λ > 4γ/ν, satisfies assumption 3 (with b = 2/λ).
  47. Assumptions 53. 1. L(y′; y) ≤ γ‖y′ − y‖ for (y′, y) ∈ conv(𝒴)×𝒴.
      2. ‖y′ − y‖ ≥ ν for y, y′ ∈ 𝒴 with y ≠ y′.
      3. S(θ; y) satisfies ‖ŷ_Ω(θ) − y‖² ≤ (2/λ) S(θ; y) for some λ > 4γ/ν, and ‖∇S(θ; y)‖₂² ≤ b S(θ; y).
      The FY loss with Ω = Ψ + 𝕀_{conv(𝒴)}, where Ψ has strong-convexity constant λ > 4γ/ν, satisfies assumption 3 (with b = 2/λ). If Ψ(y) = ∑_{i=1}^d y_i ln y_i (1-strongly convex in ℓ₁) and S(θ; e_i) = log₂e ⋅ S_Ω(θ; e_i) (logistic), then
      ‖ŷ_Ω(θ) − y‖₁² = ‖∇S_Ω(θ; y)‖₁² ≤ 2 S_Ω(θ; y) ≤ (2/log₂e) S(θ; y)   (λ = log₂e), and
      ‖∇S(θ; y)‖₂² ≤ (log₂e)² ‖∇S_Ω(θ; y)‖₁² ≤ (log₂e)² ⋅ 2 S_Ω(θ; y) = 2 log₂e ⋅ S(θ; e_i)   (b = 2 log₂e).
  48. Main Result 54. Finite surrogate regret for online structured prediction.
      Under the assumptions, we can achieve ∑_{t=1}^T E[L_t(ŷ_t)] ≤ ∑_{t=1}^T S_t(U) + b‖U‖_F² / (4(1 − 4γ/(λν))). Finite!
      1. L(y′; y) ≤ γ‖y′ − y‖ for (y′, y) ∈ conv(𝒴)×𝒴.
      2. ‖y′ − y‖ ≥ ν for y, y′ ∈ 𝒴 with y ≠ y′.
      3. S(θ; y) satisfies ‖ŷ_Ω(θ) − y‖² ≤ (2/λ) S(θ; y) for some λ > 4γ/ν, and ‖∇S(θ; y)‖₂² ≤ b S(θ; y).
  49. Application to Multiclass Classification 55. Improved surrogate regret for online
      multiclass classification with logistic loss. By applying the main result to online multiclass classification, we obtain ∑_{t=1}^T E[𝟙(ŷ_t ≠ y_t)] ≤ ∑_{t=1}^T S_t(U) + ‖U‖_F² / (2 ln 2 (1 − ln 2)). This improves upon O(d‖U‖_F²) (Van der Hoeven 2020) by a factor of d, the number of classes.
      1. The 0-1 loss equals ½‖y′ − e_i‖₁ for y′ ∈ △^d, hence γ = ½.
      2. ‖e_{i′} − e_i‖₁ = 2 for i′ ≠ i, hence ν = 2.
      3. The logistic loss satisfies ‖ŷ_Ω(θ) − e_i‖₁² ≤ (2/log₂e) S(θ; e_i) and ‖∇S(θ; y)‖₂² ≤ 2 log₂e ⋅ S(θ; e_i).
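For reference, plugging these constants into the b‖U‖_F²/(4a) form of the main bound (my arithmetic, consistent with the statement above):

```latex
\[
\gamma=\tfrac12,\quad \nu=2,\quad \lambda=\log_2 e,\quad b=2\log_2 e
\;\Longrightarrow\;
a = 1-\frac{4\gamma}{\lambda\nu} = 1-\frac{1}{\log_2 e} = 1-\ln 2,
\qquad
\frac{b\,\|U\|_F^2}{4a}
=\frac{2\log_2 e}{4(1-\ln 2)}\,\|U\|_F^2
=\frac{\|U\|_F^2}{2\ln 2\,(1-\ln 2)}
\approx 2.35\,\|U\|_F^2 .
\]
```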
  50. Learner's Strategy: OGD with Randomized Decoding 56. Set W_1 to
      the all-zero matrix. For t = 1, …, T: observe x_t; compute θ_t = W_t x_t; select ŷ_t = ψ(θ_t), where ψ is randomized decoding (next slide); incur L_t(ŷ_t) and observe y_t; update W_{t+1} ← W_t − η∇S_t(W_t).
      Randomized-decoding lemma. For any (θ, y) ∈ ℝ^d×𝒴, it holds that E[L(ψ(θ); y)] ≤ (4γ/(λν)) S(θ; y).
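A self-contained Python sketch of this strategy for online multiclass classification with the entropy regularizer (base-2 logistic loss). This is a toy illustration under my own choices: the helper names, the synthetic data, and the step size are mine, and the ỹ branch samples a vertex e_j with probability softmax(θ)_j, which is one way to satisfy E[ỹ] = ŷ_Ω(θ).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def randomized_decoding(theta, nu=2.0):
    """Randomized decoding for multiclass: y_Omega(theta) = softmax(theta),
    y* = the closest vertex (the argmax class), Delta* = ||e_{i*} - softmax(theta)||_1,
    and the y-tilde branch samples a vertex e_j with prob. softmax(theta)_j,
    so that E[y_tilde] = softmax(theta)."""
    d = len(theta)
    p = softmax(theta)
    i_star = int(np.argmax(p))
    delta_star = 2.0 * (1.0 - p[i_star])        # l1 distance to the closest vertex
    prob = min(1.0, 2.0 * delta_star / nu)
    if rng.random() < prob:
        j = rng.choice(d, p=p)                  # y_tilde: random vertex with mean softmax(theta)
        return np.eye(d)[j]
    return np.eye(d)[i_star]                    # y*: confident prediction

def run_learner(xs, ys, eta):
    """OGD with randomized decoding for online multiclass classification."""
    d, n = ys.shape[1], xs.shape[1]
    W = np.zeros((d, n))                         # W_1 = all-zero matrix
    mistakes = 0
    for x, y in zip(xs, ys):
        theta = W @ x
        y_hat = randomized_decoding(theta)
        mistakes += int(not np.array_equal(y_hat, y))       # incur the 0-1 loss
        grad_theta = np.log2(np.e) * (softmax(theta) - y)   # base-2 logistic loss
        W -= eta * np.outer(grad_theta, x)                   # OGD update
    return mistakes

# toy run on synthetic data; eta = 0.5 is arbitrary (the analysis uses eta = 2a/b)
T, d, n = 200, 3, 5
U_true = rng.normal(size=(d, n))
xs = rng.normal(size=(T, n)); xs /= np.linalg.norm(xs, axis=1, keepdims=True)
ys = np.eye(d)[np.argmax(xs @ U_true.T, axis=1)]
print("mistakes:", run_learner(xs, ys, eta=0.5))
```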
  51. Exploiting the Surrogate Gap in Online Structured Prediction 57. We have
      (i) E[L_t(ψ(W_t x_t))] ≤ (1 − a) S_t(W_t) and (ii) ‖∇S_t(W_t)‖_F² ≤ b S_t(W_t). Randomized decoding ψ ensures (i) with a = 1 − 4γ/(λν) ∈ (0,1). (ii) follows from the assumption on the surrogate, ‖∇S(θ; y)‖₂² ≤ b S(θ; y), since ∇S_t(W) = ∇S(Wx_t; y_t) x_tᵀ and ‖x_t‖₂ ≤ 1.
  52. Exploiting the Surrogate Gap in Online Structured Prediction 58. We have
      (i) E[L_t(ψ(W_t x_t))] ≤ (1 − a) S_t(W_t) and (ii) ‖∇S_t(W_t)‖_F² ≤ b S_t(W_t), where randomized decoding ψ ensures (i) with a = 1 − 4γ/(λν) ∈ (0,1) and (ii) follows from the assumption ‖∇S(θ; y)‖₂² ≤ b S(θ; y). From OGD's regret bound and (ii),
      ∑_{t=1}^T (S_t(W_t) − S_t(U)) ≤ ‖U‖_F²/(2η) + (η/2) ∑_{t=1}^T ‖∇S_t(W_t)‖_F² ≤ ‖U‖_F²/(2η) + (ηb/2) ∑_{t=1}^T S_t(W_t).
  53. Exploiting the Surrogate Gap in Online Structured Prediction 59. We have
      (i) E[L_t(ψ(W_t x_t))] ≤ (1 − a) S_t(W_t) and (ii) ‖∇S_t(W_t)‖_F² ≤ b S_t(W_t), where randomized decoding ψ ensures (i) with a = 1 − 4γ/(λν) ∈ (0,1) and (ii) follows from the assumption ‖∇S(θ; y)‖₂² ≤ b S(θ; y). From OGD's regret bound and (ii),
      ∑_{t=1}^T (S_t(W_t) − S_t(U)) ≤ ‖U‖_F²/(2η) + (η/2) ∑_{t=1}^T ‖∇S_t(W_t)‖_F² ≤ ‖U‖_F²/(2η) + (ηb/2) ∑_{t=1}^T S_t(W_t).
      From (i) and the above regret bound, the surrogate regret is bounded as
      ∑_{t=1}^T (E[L_t(ψ(W_t x_t))] − S_t(U)) ≤ ∑_{t=1}^T (S_t(W_t) − S_t(U)) − a ∑_{t=1}^T S_t(W_t) ≤ ‖U‖_F²/(2η) − (a − ηb/2) ∑_{t=1}^T S_t(W_t).
      Setting η = 2a/b yields a finite bound of b‖U‖_F²/(4a), where a = 1 − 4γ/(λν).
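Written out in one line (my transcription of the algebra above):

```latex
\[
\sum_{t=1}^{T}\Bigl(\mathbb{E}\bigl[L_t(\psi(W_t x_t))\bigr]-S_t(U)\Bigr)
\;\le\; \frac{\|U\|_F^2}{2\eta}-\Bigl(a-\frac{\eta b}{2}\Bigr)\sum_{t=1}^{T}S_t(W_t)
\;\overset{\eta=2a/b}{=}\; \frac{\|U\|_F^2}{2\eta}
\;=\; \frac{b\,\|U\|_F^2}{4a},
\qquad a=1-\frac{4\gamma}{\lambda\nu}.
\]
```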
  54. Randomized Decoding 60. 1. Compute the regularized prediction ŷ_Ω(θ)
      := argmax_{y′∈conv(𝒴)} {⟨θ, y′⟩ − Ω(y′)}. [Figure: ŷ_Ω(θ) in conv(𝒴).]
  55. Randomized Decoding 61. 1. Compute the regularized prediction ŷ_Ω(θ)
      := argmax_{y′∈conv(𝒴)} {⟨θ, y′⟩ − Ω(y′)}.
      2. Let Δ* := min_{y*∈𝒴} ‖y* − ŷ_Ω(θ)‖, i.e., the distance to the closest y* ∈ 𝒴. [Figure: ŷ_Ω(θ), the closest vertex y*, and Δ*.]
  56. Randomized Decoding 62. 1. Compute the regularized prediction ŷ_Ω(θ)
      := argmax_{y′∈conv(𝒴)} {⟨θ, y′⟩ − Ω(y′)}.
      2. Let Δ* := min_{y*∈𝒴} ‖y* − ŷ_Ω(θ)‖, i.e., the distance to the closest y* ∈ 𝒴.
      3. Set p = min{1, 2Δ*/ν} and return ŷ ∈ 𝒴 as follows: ŷ = ψ(θ) = y* with probability 1 − p and ỹ with probability p, where E[ỹ] = ŷ_Ω(θ). A smaller p (or Δ*) means higher confidence in y*.
  57. Randomized Decoding 63. 1. Compute the regularized prediction ŷ_Ω(θ)
      := argmax_{y′∈conv(𝒴)} {⟨θ, y′⟩ − Ω(y′)}.
      2. Let Δ* := min_{y*∈𝒴} ‖y* − ŷ_Ω(θ)‖, i.e., the distance to the closest y* ∈ 𝒴.
      3. Set p = min{1, 2Δ*/ν} and return ŷ ∈ 𝒴 as follows: ŷ = ψ(θ) = y* with probability 1 − p and ỹ with probability p, where E[ỹ] = ŷ_Ω(θ). A smaller p (or Δ*) means higher confidence in y*.
      Goal: for all (θ, y) ∈ ℝ^d×𝒴, E[L(ψ(θ); y)] ≤ (4γ/(λν)) S(θ; y).
  58. Proof 64. By the assumption
      on the surrogate loss, (2/λ) S(θ; y) ≥ ‖ŷ_Ω(θ) − y‖² =: Δ². [Figure: ŷ_Ω(θ), the closest vertex y* at distance Δ*, and the true y at distance Δ.]
      Goal: for all (θ, y) ∈ ℝ^d×𝒴, E[L(ψ(θ); y)] ≤ (4γ/(λν)) S(θ; y).
  59. Proof 65. By the assumption
      on the surrogate loss, (2/λ) S(θ; y) ≥ ‖ŷ_Ω(θ) − y‖² =: Δ². Hence it suffices to show the stronger goal: for all (θ, y) ∈ ℝ^d×𝒴, E[L(ψ(θ); y)] ≤ (2γ/ν) Δ².
  60. Proof 66. Goal:
      for all (θ, y) ∈ ℝ^d×𝒴, E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². Procedure: 1. compute the regularized prediction ŷ_Ω(θ) := argmax_{y′∈conv(𝒴)} {⟨θ, y′⟩ − Ω(y′)}; 2. let Δ* := min_{y*∈𝒴} ‖y* − ŷ_Ω(θ)‖; 3. set p = min{1, 2Δ*/ν} and return ŷ = ψ(θ) = y* with probability 1 − p and ỹ with probability p, where E[ỹ] = ŷ_Ω(θ). The expected target loss is
      E[L(ψ(θ); y)] = (1 − p) L(y*; y) + p L(ŷ_Ω(θ); y), since E[L(ỹ; y)] = L(ŷ_Ω(θ); y) by the affineness of L(⋅; y).
  61. Proof 67. Goal: E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². The expected target loss is
      E[L(ψ(θ); y)] = (1 − p) L(y*; y) + p L(ŷ_Ω(θ); y)   (since E[L(ỹ; y)] = L(ŷ_Ω(θ); y) by the affineness of L(⋅; y))
      = p L(ŷ_Ω(θ); y) if Δ* ≥ ν/2 or y* = y --- (i),
      = (1 − p) L(y*; y) + p L(ŷ_Ω(θ); y) if Δ* < ν/2 and y* ≠ y --- (ii).
  62. Case (i): Δ* ≥ ν/2 or y* = y 68.
      Goal: E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². In case (i),
      E[L(ψ(θ); y)] = p L(ŷ_Ω(θ); y)
      ≤ (2/ν) Δ* ⋅ L(ŷ_Ω(θ); y)   (∵ p = min{1, (2/ν)Δ*} ≤ (2/ν)Δ*)
      ≤ (2/ν) Δ ⋅ L(ŷ_Ω(θ); y)   (∵ Δ* ≤ Δ)
      ≤ (2/ν) Δ ⋅ γΔ = (2γ/ν) Δ².   (∵ L(ŷ_Ω(θ); y) ≤ γ‖ŷ_Ω(θ) − y‖ = γΔ)
  63. Case (ii): Δ* < ν/2 and y* ≠ y 69.
      Goal: E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². In case (ii), p = min{1, (2/ν)Δ*} = (2/ν)Δ*, so
      E[L(ψ(θ); y)] = (1 − p) L(y*; y) + p L(ŷ_Ω(θ); y)
      ≤ (1 − (2/ν)Δ*) γ‖y* − y‖ + (2/ν)Δ* ⋅ γ‖ŷ_Ω(θ) − y‖   (∵ L(y′; y) ≤ γ‖y′ − y‖)
      ≤ (1 − (2/ν)Δ*) γ(‖y* − ŷ_Ω(θ)‖ + ‖ŷ_Ω(θ) − y‖) + (2/ν)Δ* ⋅ γ‖ŷ_Ω(θ) − y‖   (∵ triangle inequality)
      ≤ (1 − (2/ν)Δ*) γ(Δ* + Δ) + (2/ν)Δ* ⋅ γΔ.
  64. Case (ii): Δ* < ν/2 and y* ≠ y 70.
      Goal: E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². We have confirmed E[L(ψ(θ); y)] ≤ (1 − (2/ν)Δ*) γ(Δ* + Δ) + (2/ν)Δ* ⋅ γΔ. It remains to show that the right-hand side is at most (2γ/ν) Δ².
  65. Case (ii): Δ* < ν/2 and y* ≠ y 71.
      Goal: E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². We have confirmed E[L(ψ(θ); y)] ≤ (1 − (2/ν)Δ*) γ(Δ* + Δ) + (2/ν)Δ* ⋅ γΔ; it remains to show that the right-hand side is at most (2γ/ν) Δ². Dividing by γν and letting u = Δ*/ν and v = Δ/ν, the desired inequality is
      (1 − 2u)(u + v) + 2uv ≤ 2v² ⟺ 2u² + 2v² − u − v ≥ 0, where u + v ≥ 1 and 0 ≤ u < 1/2 < v,
      because u + v = (Δ* + Δ)/ν ≥ ‖y* − y‖/ν ≥ 1 and u = Δ*/ν < 1/2 (hence v ≥ 1 − u > 1/2). [Figure: ‖y* − y‖ ≥ ν.]
  66. Case (ii): Δ* < ν/2 and y* ≠ y 72.
      Goal: E[L(ψ(θ); y)] ≤ (2γ/ν) Δ². From u + v ≥ 1 and 0 ≤ u < 1/2 < v,
      2u² + 2v² − u − v = 2u² + 2v² − 3u − 3v + 4uv + 1 + (1 − 2u)(2v − 1)
      = (u + v − 1)(2u + 2v − 1) + (1 − 2u)(2v − 1) ≥ 0,
      obtaining E[L(ψ(θ); y)] ≤ (1 − (2/ν)Δ*) γ(Δ* + Δ) + (2/ν)Δ* ⋅ γΔ ≤ (2γ/ν) Δ², as desired.
  67. Conclusion 73. Finite surrogate regret bound for online structured prediction
      with FY losses. Improved bound of O(‖U‖_F²) for online multiclass classification with logistic loss. Key idea: exploiting the surrogate gap with randomized decoding. Additional results: a high-probability bound; online-to-batch conversion; making the bound vanish as L/S → 0 (thanks to a suggestion by a COLT reviewer). Future directions: extension to the bandit feedback setting; combining with the Fitzpatrick loss (Rakotomandimby et al. 2024). Rakotomandimby et al. Learning with Fitzpatrick Losses. arXiv:2405.14574, 2024. https://arxiv.org/abs/2405.14574.
  68. Supplementary: Regret Bound of OGD 74. OGD: 1: Fix W_1 = 0. 2: For t = 1, …, T: 3: observe S_t (or x_t and y_t); 4: W_{t+1} ← W_t − η∇S_t(W_t).
      From the update rule, ‖W_{t+1} − U‖_F² = ‖W_t − U‖_F² + η²‖∇S_t(W_t)‖_F² − 2η⟨∇S_t(W_t), W_t − U⟩. From the convexity of S_t,
      S_t(W_t) − S_t(U) ≤ ⟨∇S_t(W_t), W_t − U⟩ = (‖W_t − U‖_F² − ‖W_{t+1} − U‖_F²)/(2η) + (η/2)‖∇S_t(W_t)‖_F².
      By telescoping and ignoring −‖W_{T+1} − U‖_F² ≤ 0,
      ∑_{t=1}^T (S_t(W_t) − S_t(U)) ≤ ‖U‖_F²/(2η) + (η/2) ∑_{t=1}^T ‖∇S_t(W_t)‖_F².
      Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003. Orabona. A modern introduction to online learning. arXiv:1912.13213, 2023. https://arxiv.org/abs/1912.13213v6.
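A quick numeric sanity check of this telescoped bound, using the base-2 logistic surrogate and an arbitrary fixed comparator U (the toy data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def logistic2(theta, y):
    """Base-2 logistic loss S(theta; y) = -log2 softmax(theta)[class of y]."""
    return -np.log2(softmax(theta)[np.argmax(y)])

T, d, n, eta = 300, 3, 4, 0.3
xs = rng.normal(size=(T, n)); xs /= np.linalg.norm(xs, axis=1, keepdims=True)
ys = np.eye(d)[rng.integers(d, size=T)]
U = rng.normal(size=(d, n))                      # any fixed comparator

W = np.zeros((d, n))
lhs, grad_sq = 0.0, 0.0
for x, y in zip(xs, ys):
    theta = W @ x
    grad = np.outer(np.log2(np.e) * (softmax(theta) - y), x)   # grad_W S_t(W_t)
    lhs += logistic2(theta, y) - logistic2(U @ x, y)            # S_t(W_t) - S_t(U)
    grad_sq += np.sum(grad ** 2)                                # ||grad||_F^2
    W -= eta * grad                                             # OGD update
rhs = np.sum(U ** 2) / (2 * eta) + eta / 2 * grad_sq
print(lhs <= rhs + 1e-9)   # True: the OGD regret bound holds for this run
```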