Matt P. Dziubinski
July 01, 2015

# Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen (useR! 2015)

Algorithmic Differentiation (AD) enables automatic computation of the derivatives of a function implemented as a computer program. It is distinct from numerical differentiation methods (such as finite differences) in that it is exact to the maximum extent allowed by machine precision. At the same time, it is free from the limitations of symbolic differentiation, since it works with actual computer programs (with branches, loops, allocations, and mutation) and not only with pure algebraic expressions.

AD is useful in a wide range of applications – in particular, fitting models via extremum estimators (i.e., estimators obtained as extrema of given cost functions). This includes calibrating financial models, training machine learning models, and estimating statistical (or econometric) models. The numerical optimization algorithms needed for model fitting benefit greatly from the high precision of the gradient obtained using AD – with a direct impact on the precision of the final results (more precise and stable parameter estimates, standard errors, and confidence intervals).

This talk shows how to use AD from R, making use of RcppEigen. We shall motivate the problem (including the issues with finite differences), introduce AD, and demonstrate its advantages over numerical approximations with a maximum likelihood estimation example. We end by speeding up the gradient computation via parallelization.


## Transcript

1. ### Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen

Matt P. Dziubinski, useR! 2015 in Aalborg, Denmark. [email protected] // @matt_dz. Department of Mathematical Sciences, Aalborg University; CREATES (Center for Research in Econometric Analysis of Time Series)

2. ### References

7. ### Motivation: Parametric Models

• Parametric model: parameter vector $\theta \in \Theta \subseteq \mathbb{R}^p$, $\Theta$ — parameter space
• An estimator $\hat{\theta}$ is an extremum estimator if $\exists Q : \Theta \to \mathbb{R}$ s.t. $\hat{\theta} = \operatorname{argmax}_{\theta \in \Theta} Q(\theta)$
• can be obtained as an extremum (maximum or minimum) of a scalar objective function
• the class includes OLS (Ordinary Least Squares), NLS (Nonlinear Least Squares), GMM (Generalized Method of Moments), and QMLE (Quasi Maximum Likelihood Estimation)
12. ### Motivation: Practice

• Nowadays one of the most widely applied approaches for estimation (econometrics, statistics), calibration (finance), and training ((supervised) machine learning)
• Example: (Q)MLE:
• Likelihood: $L(\theta; y) = f(y; \theta)$, $f$ — data PDF
• Maximum-likelihood estimator (MLE): $\hat{\theta}_{MLE} = \operatorname{argmax}_{\theta \in \Theta} L(\theta; y)$
• In practice: generally no closed-form expressions are available — estimation relies on numerical optimization

18. ### Numerical Optimization: Algorithms

• Numerical optimization algorithms fall into two broad categories:
• derivative-free — do not rely on knowledge of the objective function's gradient in the optimization process
• gradient-based — need the gradient of the objective function
• iteration form: initial iterate $\theta_0$; new iterate $\theta_k = \theta_{k-1} + s_k$, iteration $k \geq 1$; step $s_k = \alpha_k p_k$; length: scalar $\alpha_k > 0$; direction: vector $p_k$ s.t. $H_k p_k = \nabla Q(\theta_k)$
• $H_k$:
• $H_k = I$ (Steepest Ascent) — many variants, including SGD (Stochastic Gradient Descent)
• $H_k = \nabla^2 Q(\theta_k)$ (Newton)
• $H_k$ = Hessian approximation satisfying the secant equation $H_{k+1} s_k = \nabla Q_{k+1} - \nabla Q_k$ (Quasi-Newton)
23. ### Numerical Optimization: Gradient

• Gradient-based algorithms often exhibit superior convergence rates (superlinear or even quadratic)
• However: this property depends on the gradient either being exact — or at minimum sufficiently accurate in the limit sense, Nocedal and Wright (2006, Chapter 3)
• For many modern models analytical gradient formulas are not available in closed form — or are non-trivial to derive
• This often leads to the use of numerical approximations — in particular, finite difference methods (FDMs), Nocedal and Wright (2006, Chapter 8)
• Inference: covariance matrix estimators also involve the use of derivative information
33. ### Numerical Optimization: Algorithmic Differentiation

• Need a way to obtain the gradient — preferably: fast and accurate
• This is where Algorithmic Differentiation (AD) comes in:
• fast
• and accurate
• and simple
• Essentially:
• an automatic computational differentiation method
• based on the systematic application of the chain rule
• exact to the maximum extent allowed by the machine precision
• How does it compare to other computational differentiation methods?

38. ### Calculating Derivatives

• Main approaches:
• Finite Differencing
• Algorithmic Differentiation (AD), a.k.a. Automatic Differentiation
• Symbolic Differentiation

44. ### Finite Differencing

• Recall $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
• forward-difference approximation: $\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x)}{h}$
• central-difference approximation: $\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}$
• Computational cost:
• forward-difference formula: $p + 1$ function evaluations
• central-difference formula: $2p$ function evaluations
50. ### Finite Differencing — Truncation Error

• Truncation Error Analysis — Taylor's theorem
• Consider a univariate function $f$ and the forward-difference formula: $f'(x) \approx \frac{f(x+h) - f(x)}{h}$
• Recall Taylor's formula: $f(x+h) = f(x) + h f'(x) + \frac{1}{2} h^2 f''(x) + \frac{1}{6} h^3 f'''(x) + O(h^4)$
• Hence $f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{2} h f''(x) - \frac{1}{6} h^2 f'''(x) - O(h^3)$
• Truncation error: $c_1 h + c_2 h^2 + \ldots$
• Decreases with smaller $h$
58. ### Floating Point Arithmetic

• Floating Point Arithmetic: NO associativity (or distributivity) of addition (and multiplication)
• Example:
• Input: a = 1234.567; b = 45.678; c = 0.0004, then evaluate a + b + c; c + b + a; (a + b + c) - (c + b + a)
• Output:
• 1280.2454
• 1280.2454
• -2.273737e-13
• "Many a serious mathematician has attempted to give rigorous analyses of a sequence of floating point operations, but has found the task to be so formidable that he has tried to content himself with plausibility arguments instead." — Donald E. Knuth, TAOCP2

66. ### Floating Point Representation

• Floating point representation: inspired by scientific notation
• The IEEE Standard for Floating-Point Arithmetic (IEEE 754)
• Form: $\text{significand} \times \text{base}^{\text{exponent}}$
• Machine epsilon: $\epsilon_M := \inf\{\epsilon > 0 : 1.0 + \epsilon \neq 1.0\}$
• the difference between 1.0 and the next representable floating-point value $1.0 + \epsilon$
• IEEE 754 binary32 (single precision): $2^{-23} \approx$ 1.19209e-07 (1 sign bit, 8 exponent bits, leaving 23 bits for the significand)
• IEEE 754 binary64 (double precision): $2^{-52} \approx$ 2.22045e-16 (1 sign bit, 11 exponent bits, leaving 52 bits for the significand)
• floating point number representation: $fl(x) = x(1 + \delta)$, where $|\delta| < \epsilon_M$
78. ### Floating Point Representation Properties

• floating point number representation: $fl(x) = x(1 + \delta)$, where $|\delta| < \epsilon_M$
• representation (round-off) error: $x - fl(x)$
• absolute round-off error: $|x - fl(x)| < \epsilon_M |x|$
• relative round-off error: $|x - fl(x)| / |x| < \epsilon_M$
• $\epsilon_M$ a.k.a. unit round-off, $u = \text{ULP}(1.0)$
• ULP — Unit in the Last Place
• the distance between the two closest straddling floating-point numbers $a$ and $b$ (i.e., those with $a \leq x \leq b$ and $a \neq b$)
• a measure of precision in numeric calculations. In base $b$, if $x$ has exponent $E$, then $\text{ULP}(x) = \epsilon_M \cdot b^E$
86. ### Finite Differencing — Round-off Error I

• Consider $f'(x) \approx \frac{f(x+h) - f(x)}{h}$
• Suppose that $x$ and $x + h$ are exact
• Round-off errors when evaluating $f$: $\frac{f(x+h)(1+\delta_1) - f(x)(1+\delta_2)}{h} = \frac{f(x+h) - f(x)}{h} + \frac{\delta_1 f(x+h) - \delta_2 f(x)}{h}$
• We have $|\delta_i| \leq \epsilon_M$ for $i \in \{1, 2\}$
• $\implies$ round-off error bound: $\epsilon_M \frac{|f(x+h)| + |f(x)|}{h}$
• $\implies$ round-off error bound order: $O(\frac{\epsilon_M}{h})$
• round-off error — increases with smaller $h$
• error contribution inversely proportional to $h$, with proportionality constant $O(\epsilon_M)$
91. ### Finite Differencing — Round-off Error II

• $\epsilon_M \neq 0 \implies \exists \hat{h} : x + \hat{h} = x$
• i.e., $x$ and $x + h$ are not guaranteed to be exactly representable, either
• $\implies f(x) = f(x + \hat{h})$
• $\implies \frac{f(x + \hat{h}) - f(x)}{\hat{h}} = 0$
• $\implies$ no correct digits in the result at (or below) step-size $\hat{h}$
104. ### Finite Differencing — Truncation—Round-off Trade-off

• Total error: $O(h) + O(\frac{\epsilon_M}{h})$ (forward-difference formula)
• Best accuracy when the contributions are approximately equal
• $h \approx \frac{\epsilon_M}{h}$
• $h^2 \approx \epsilon_M$
• $h \approx \sqrt{\epsilon_M}$
• binary64: 1.490116e-08
• total error's minimum value $\in O(\sqrt{\epsilon_M})$
• binary64: 1.490116e-08
• Total error: $O(h^2) + O(\frac{\epsilon_M}{h})$ (central-difference formula)
• $h \approx \epsilon_M^{1/3}$
• binary64: 6.055454e-06
• total error's minimum value $\in O(\epsilon_M^{2/3})$
• binary64: 3.666853e-11
111. ### Finite Differencing — Central-Difference Formula

• Subtracting the backward-difference approximation from the forward-difference approximation:
• (central-difference) $f'(x) = \frac{f(x+h) - f(x-h)}{2h} - O(h^2)$
• Truncation error $\in O(h^2)$ (second-order approximation)
• Round-off error $\in O(\epsilon_M / h)$
• Balanced trade-off:
• $h \approx \epsilon_M^{1/3}$
• total error's minimum value $\in O(\epsilon_M^{2/3})$
117. ### Finite Difference Approximation Error
Example: f(x) = cos(sin(x) · cos(x)), f′(x) = −cos(2x) · sin(sin(x) · cos(x))
• Total error: O(h) + O(ϵM/h) (forward-difference formula)
• h ≈ √ϵM — binary64: 1.490116e-08
• total error's minimum value ∈ O(√ϵM) — binary64: 1.490116e-08
• Total error: O(h²) + O(ϵM/h) (central-difference formula)
• h ≈ ϵM^(1/3) — binary64: 6.055454e-06
• total error's minimum value ∈ O(ϵM^(2/3)) — binary64: 3.666853e-11 21
122. ### Symbolic Differentiation
• Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient
• Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics.
• Disadvantages:
• narrow applicability
• expression swell 22
131. ### Numerical Objective Function & Algorithmic Differentiation I
• Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions
• Input: function specified by an algebraic formula or closed-form expression
• Objective function in numerical optimization — a computer program
• How are computer programs different?
• named intermediates
• program branches
• joint allocations / mutation / overwriting
• iterative loops
• Do we have to resort to finite differences? 23

135. ### Numerical Objective Function & Algorithmic Differentiation II
• Algorithmic Differentiation (AD) offers another solution.
• AD — a computational differentiation method with roots in the late 1960s, Griewank (2012)
• Machine Learning — Neural Networks — Backpropagation: Bryson, Denham, Dreyfus (1963), Werbos (1974)
• AD widely known and applied in the natural sciences — Naumann (2012) — and in computational finance — Giles and Glasserman (2006), Homescu (2011) 24

144. ### Algorithmic Differentiation — Idea I
• Idea:
• computer code for evaluating the function can be broken down
• into a composition of elementary arithmetic operations
• to which the chain rule (one of the basic rules of calculus) can be applied
• e.g., given an objective function Q(θ) = f(g(θ)), we have Q′ = (f ∘ g)′ = (f′ ∘ g) · g′
• Relies on the systematic application of the chain rule by a computer program
• operating at the source code level
• performed automatically
• with minimal or no user intervention 25

148. ### Algorithmic Differentiation — Idea II
• Mathematical operations performed exactly
• to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error
• truncation error: none
• round-off error: equivalent to using source code analogously implementing the analytical derivatives 26
155. ### Algorithmic Differentiation — Idea III
• AD similar to symbolic differentiation
• major difference:
• SD restricted to symbolic manipulation of algebraic expressions
• AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006, Section 8.2)
• Note: less than required by SD, more than required by finite differences (which can operate on black-box implementations)
• AD input: (source code of) a computer program for evaluating a vector function
• AD output: (source code of) a computer program implementing derivative(s) thereof 27

162. ### Algorithmic Differentiation — Implementations
• Two implementation styles, Naumann (2012):
• source code transformation (SCT)
• e.g., ADIFOR — produces new code that calculates both function and derivative values
• operator overloading
• e.g., ADOL-C — keeps track of the elementary computations during the function evaluation, produces the derivatives at given inputs
• Two modes: forward mode and reverse mode
• Example: Eigen's AutoDiff module 28
167. ### Algorithmic Differentiation — Example MLE
• a more concrete illustration: an example in the context of statistical estimation
• evaluation and differentiation of the log-likelihood function of a normally distributed sample
• we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ²), with µ ∈ R and σ ∈ R+
• our parameter vector is θ = (µ, σ²)
• the likelihood function is L(θ; x) = f(x; θ) = (σ√(2π))^(−T) exp[−(1/2) ∑_{t=1}^{T} (xt − µ)²/σ²] 29
171. ### Algorithmic Differentiation — Evaluation Trace
• objective function: the log-likelihood function, ℓ(θ) = log L(θ; x) = ∑_{t=1}^{T} ℓt(θ)
• with each contribution given by ℓt(θ) = −½ log(2π) − ½ log(σ²) − ½ (xt − µ)²/σ²
• goal: compute the value of ℓt corresponding to the arguments xt = 6, µ = 6, and σ² = 4
• A computer program may execute the sequence of operations represented by an evaluation trace, Griewank and Walther (2008) 30
172. ### Algorithmic Differentiation — Evaluation Trace

| Variable | Expression | Value |
|---|---|---|
| v−1 | µ | 6 |
| v0 | σ² | 4 |
| v1 | ϕ1(v0) = log(v0) | 1.386294 |
| v2 | ϕ2(v1) = −½ log(2π) − ½ v1 | −1.612086 |
| v3 | ϕ3(v−1) = xt − v−1 | 0 |
| v4 | ϕ4(v3) = v3² | 0 |
| v5 | ϕ5(v0, v4) = v4/v0 | 0 |
| v6 | ϕ6(v5) = −½ v5 | 0 |
| v7 | ϕ7(v2, v6) = v2 + v6 | −1.612086 |
| ℓt | v7 | −1.612086 |

31

179. ### Algorithmic Differentiation — Evaluation Trace
• evaluation of ℓt broken down
• into a composition of intermediate subexpressions vi, with i ∈ {1, . . . , 7}
• intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives
• note: similar to the static single assignment (SSA) form used in compiler design and implementation
• We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7
• Recall: two modes of propagating the intermediate information: the forward mode and the reverse mode
• forward mode: Eigen's AutoDiff module used here 32
184. ### Algorithmic Differentiation — Forward Mode
• goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable)
• idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way
• for each variable vi: introduce another variable, v̇i = ∂vi/∂µ
• then: apply the chain rule mechanically to each intermediate subexpression
• assigning the resulting numerical value to each v̇i 33

190. ### Algorithmic Differentiation — Forward Mode
• in our case:
• start with v̇−1 = 1.0 (active differentiation variable µ)
• v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and thus treat it as a constant)
• then, given that v1 = ϕ1(v0) = log(v0), we obtain v̇1 = (∂ϕ1(v0)/∂v0) v̇0 = v̇0/v0 = 0.0
• analogously, v2 = ϕ2(v1) = −½ log(2π) − ½ v1, hence v̇2 = (∂ϕ2(v1)/∂v1) v̇1 = −½ v̇1 = 0.0
• by continuing this procedure we ultimately obtain an augmented, forward-derived, evaluation trace 34
191. ### Algorithmic Differentiation — Augmented Evaluation Trace

| Variable | Expression | Value |
|---|---|---|
| v−1 | µ | 6 |
| v̇−1 | ∂v−1/∂v−1 | 1 |
| v0 | σ² | 4 |
| v̇0 | ∂v0/∂v−1 | 0 |
| v1 | ϕ1(v0) = log(v0) | 1.386294 |
| v̇1 | (∂ϕ1/∂v0) v̇0 = v̇0/v0 | 0 |
| ... | ... | ... |
| v7 | ϕ7(v2, v6) = v2 + v6 | −1.612086 |
| v̇7 | (∂ϕ7/∂v2) v̇2 + (∂ϕ7/∂v6) v̇6 = v̇2 + v̇6 | 0 |
| ℓt | v7 | −1.612086 |
| ℓ̇t | v̇7 | 0 |

35
194. ### Algorithmic Differentiation — Augmented Evaluation Trace
• The basic forward mode of AD — called "forward" because the derivative values v̇i are carried along simultaneously with the values vi themselves
• The problem addressed by the implementations:
• "transform a program with a particular evaluation trace into an augmented program whose evaluation trace also contains exactly the same extra variables and additional lines as in the derived evaluation trace," Griewank and Walther (2008) 36
198. ### AD — Reverse Mode
• An alternative to the forward approach — the reverse or adjoint mode.
• Griewank and Walther (2008): "Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables."
• This talk: we shall not treat it in any further detail and refer anyone interested to Griewank and Walther (2008).
• Worth noting: the choice between the modes depends on the computational cost. 37
203. ### AD — Forward Mode vs. Reverse Mode
• In general, for f : Rn → Rm
• rough CostAD,Forward(f) ∈ n · O(Cost(f))
• rough CostAD,Reverse(f) ∈ m · O(Cost(f))
• Which is better? Depends: e.g., gradient vs. Jacobian
• Forward mode performance: can establish a worst-case performance bound for AD when used for gradient computation 38
204. ### Illustration: Simulation Study — Setup
The Model: GARCH(1, 1)
The spot asset price S follows the following process under the physical probability measure P:

log(St+1/St) ≡ rt+1 = µ + ϵt+1 (1)
ϵt+1 = √(vt+1) · wt+1 (2)
vt+1 = ω + a ϵt² + b vt (3)
w ∼ GWN(0, 1) under P (4)

39
215. ### Illustration: Simulation Study — Setup — Notation
• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r)
• pv = (a + b) — the mean-reversion rate or the persistence rate
• when pv < 1, weak stationarity; the conditional variance reverts to the unconditional variance at a geometric rate of pv
• smaller pv — less persistent sensitivity of the volatility expectation to past market shocks
• pv = 1 — integrated GARCH (IGARCH) process 40
215. ### Illustration: Simulation Study — Setup — Notation • θ =

(µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω 1−a−b — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks • pv = 1 — integrated GARCH (IGARCH) process 40
216. ### Illustration: Simulation Study — Setup — Notation

• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• m_v = ω / (1 − a − b) — the unconditional mean of v (unconditional variance of r)
• p_v = a + b — the mean-reversion rate or the persistence rate
• when p_v < 1 (weak stationarity), the conditional variance reverts to the unconditional variance at a geometric rate of p_v
• smaller p_v — less persistent sensitivity of the volatility expectation to past market shocks
• p_v = 1 — integrated GARCH (IGARCH) process
• common finding: p_v close to 1 in financial data, a "near-integrated" process, Bollerslev and Engle (1993)
221. ### Illustration: Monte Carlo Simulation

• DGP (data-generating process): near-integrated scenario, θ_DGP: µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85
• M = 10,000 (MC estimation experiments / replications count)
• N = 2,000 … 16,000 (sample size)
• 2,000 observations / 252 business days — roughly 8 years of data
• S&P 500 from January 3rd, 1950 — over 16,000 observations

230. ### Illustration: Findings — Reliability

• Reliability of estimation:
  • using AD: successful convergence in the vast majority of Monte Carlo experiments
  • using FD: very high convergence failure ratio
• Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs:
  • L-BFGS(AD): average = 0.811, median = 0.99999903
  • L-BFGS(FD): average = 0.446, median = 0.02893254 (sic)
  • Truncated Newton with restarting (TNR) (FD): average = 0.708, median = 0.99856261
233. ### Illustration: Findings

• Overall: when using the fastest successful algorithm-gradient pair for both AD and FD:
  • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD),
  • while achieving 4 times higher accuracy and a 1.1× speedup compared to TNR(FD).

235. ### Example in R I

```r
# Gaussian model
# 2 parameters: mu, sigma^2
#
# parameters = vector(mode = "numeric", length = 2L)
# names(parameters) = c("mu", "sigma^2")
example_parameters = function() {
  parameters = vector(mode = "numeric", length = 2L)
  names(parameters) = c("mu", "sigma^2")
  parameters["mu"] = 6
  parameters["sigma^2"] = 4
  parameters
}
```
236. ### Example in R II

```r
# `l_t`: objective function contribution
l_t = function(xt, parameters) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  -.5 * log(2. * pi) - .5 * log(s2) - .5 * (xt - mu)^2 / s2
}
```
237. ### Example in R III

```r
# evaluation trace of `l_t`
evaluation_trace = function(xt, parameters) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  v_1 = mu; v0 = s2
  v1 = log(v0)
  v2 = -.5 * log(2. * pi) - .5 * v1
  v3 = xt - v_1
  v4 = v3^2
  v5 = v4 / v0
  v6 = -.5 * v5
  v7 = v2 + v6
  v7
}
```
238. ### Example in R IV

```r
require("testthat")
describe("evaluation_trace", {
  it("matches the objective function", {
    parameters = example_parameters()
    objective_given = function(xt) l_t(xt, parameters)
    trace_given = function(xt) evaluation_trace(xt, parameters)
    expect_equivalent(expected = objective_given(3), object = trace_given(3))
    expect_equivalent(expected = objective_given(6), object = trace_given(6))
    expect_equivalent(expected = objective_given(9), object = trace_given(9))
  })
})
```
239. ### Example in R V

```r
# analytical derivative w.r.t. `mu`
dl_t_dmu = function(xt, parameters) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  (xt - mu) / s2
}

# analytical derivative w.r.t. `sigma^2`
dl_t_ds2 = function(xt, parameters) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  (xt - mu)^2 / (2 * s2^2) - 1 / (2 * s2)
}
```
240. ### Example in R VI

```r
# forward-difference approximation of derivative w.r.t. `mu`
dl_t_dmu_FD = function(xt, parameters, h) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  parameters_FD = parameters
  parameters_FD["mu"] = mu + h
  (l_t(xt, parameters_FD) - l_t(xt, parameters)) / h
}
```
241. ### Example in R VII

```r
# central-difference approximation of derivative w.r.t. `mu`
dl_t_dmu_CD = function(xt, parameters, h) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  parameters_CD_forward = parameters
  parameters_CD_forward["mu"] = mu + h
  parameters_CD_backward = parameters
  parameters_CD_backward["mu"] = mu - h
  (l_t(xt, parameters_CD_forward) - l_t(xt, parameters_CD_backward)) / (2. * h)
}
```
242. ### Example in R VIII

```r
# step too large
xs = c(3, 6, 9)
h = 1e-1
expect_equivalent(expected = dl_t_dmu(xs, fixed_parameters),
                  object = dl_t_dmu_FD(xs, fixed_parameters, h))
```

```
Error: dl_t_dmu_FD(xs, fixed_parameters, h) not equal to expected
3/3 mismatches (average diff: 0.0125). First 3:
 pos     x       y   diff
   1 -0.75 -0.7625 0.0125
   2  0.00 -0.0125 0.0125
   3  0.75  0.7375 0.0125
```
243. ### Example in R IX

```r
# step too small
xs = c(3, 6, 9)
h = 1e-10
expect_equivalent(expected = dl_t_dmu(xs, fixed_parameters),
                  object = dl_t_dmu_FD(xs, fixed_parameters, h))
```

```
Error: dl_t_dmu_FD(xs, fixed_parameters, h) not equal to expected
2/3 mismatches (average diff: 6.21e-08). First 2:
 pos     x     y      diff
   1 -0.75 -0.75  6.21e-08
   3  0.75  0.75 -6.21e-08
```
244. ### Example in R X

```r
# optimal step size
xs = c(3, 6, 9)
h = sqrt(.Machine$double.eps)
expect_equivalent(expected = dl_t_dmu(xs, fixed_parameters),
                  object = dl_t_dmu_FD(xs, fixed_parameters, h))
# test passes
```
245. ### Example in R XI

```r
# truncation - round-off trade-off
xs = c(3, 6, 9)
hs = 10 ^ -(1:16)
for (h in hs) {
  grad_analytical = dl_t_dmu(xs, fixed_parameters)
  grad_FD = dl_t_dmu_FD(xt = xs, fixed_parameters, h = h)
  err = grad_analytical - grad_FD
  err_2 = sqrt(sum(err^2))
  err_max = max(abs(err))
  cat(h, "\t", err_2, "\t", err_max, "\n")
}
```
246. ### Example in R XII

```
h       err_2         err_max
0.1     0.02165064    0.0125
0.01    0.002165064   0.00125
0.001   0.0002165064  0.000125
1e-04   2.165064e-05  1.25e-05
1e-05   2.165038e-06  1.250006e-06
1e-06   2.165256e-07  1.251159e-07
1e-07   2.307615e-08  1.343989e-08
1e-08   2.495572e-08  1.764636e-08
1e-09   8.775942e-08  6.205528e-08
1e-10   8.775942e-08  6.205528e-08
1e-11   3.148961e-05  2.226652e-05
1e-12   9.429331e-05  6.667544e-05
1e-13   0.0007223303  0.0005107646
1e-14   0.0070027     0.004951657
1e-15   0.4326232     0.3059108
1e-16   1.06066       0.75
```
247. ### Example in R XIII

```r
# Manual Differentiation
# Manually Forward-Derived Evaluation Trace
# of The Normal Loglikelihood Contribution
forward_evaluation_trace = function(xt, parameters) {
  mu = parameters["mu"]
  s2 = parameters["sigma^2"]
  v_1 = mu
  v_1. = 1.0 # active differentiation variable
  v0 = s2
  v0. = 0.0 # passive differentiation variable
  v1 = log(v0)
  dv1_dv0 = 1 / v0
  v1. = dv1_dv0 * v0.
```
248. ### Example in R XIV

```r
  v2 = -.5 * log(2. * pi) - .5 * v1
  dv2_dv1 = -.5
  v2. = dv2_dv1 * v1.
  v3 = xt - v_1
  dv3_dv_1 = -1.
  v3. = dv3_dv_1 * v_1.
  v4 = v3^2
  dv4_dv3 = 2. * v3
  v4. = dv4_dv3 * v3.
  v5 = v4 / v0
  dv5_dv0 = - v4 / (v0^2)
  dv5_dv4 = 1 / v0
  v5. = dv5_dv0 * v0. + dv5_dv4 * v4.
```
249. ### Example in R XV

```r
  v6 = -.5 * v5
  dv6_dv5 = -.5
  v6. = dv6_dv5 * v5.
  v7 = v2 + v6
  dv7_dv2 = 1.
  dv7_dv6 = 1.
  v7. = dv7_dv2 * v2. + dv7_dv6 * v6.
  data.frame(f = v7, g = v7.)
}

dl_t_dmu_AD = function(xt, parameters) {
  forward_evaluation_trace(xt, parameters)$g
}
```
250. ### Example in R XVI

```r
# NO step size
# NO truncation - round-off trade-off
# NO problem :-)
xs = c(3, 6, 9)
expect_equivalent(expected = dl_t_dmu(xt = xs, fixed_parameters),
                  object = dl_t_dmu_AD(xt = xs, fixed_parameters))
```
251. ### Example in R XVII

```r
N = 10000L # sample size
xs = rnorm(N)
err_AD = dl_t_dmu(xs, fixed_parameters) - dl_t_dmu_AD(xs, fixed_parameters)
err_FD = dl_t_dmu(xs, fixed_parameters) -
  dl_t_dmu_FD(xs, fixed_parameters, sqrt(.Machine$double.eps))
err_CD = dl_t_dmu(xs, fixed_parameters) -
  dl_t_dmu_CD(xs, fixed_parameters, .Machine$double.eps^(1/3))
```
252. ### Example in R XVIII

```
> head(err_AD)
[1] 0 0 0 0 0 0
> head(err_FD)
[1] 7.520500e-08 -3.242974e-08 4.690289e-08 6.174995e-08 -1.4761
> head(err_CD)
[1] 1.116505e-10 1.861267e-11 -4.390932e-11 1.224840e-10 3.1222
> sum(err_AD)
[1] 0
> sum(err_FD)
[1] 2.248303e-05
> sum(err_CD)
[1] 3.520195e-07
```
253. ### Example using Rcpp

```r
Rcpp::sourceCpp('ExampleGaussianRcpp.cpp')
describe("objective function in Rcpp", {
  it("matches the objective function in R", {
    parameters = example_parameters()
    objectiveR_at = function(xt) l_t(xt, parameters)
    objectiveRcpp_at = function(xt) l_t_cpp(xt, parameters)
    expect_equivalent(expected = objectiveR_at(3), object = objectiveRcpp_at(3))
    expect_equivalent(expected = objectiveR_at(6), object = objectiveRcpp_at(6))
    expect_equivalent(expected = objectiveR_at(9), object = objectiveRcpp_at(9))
  })
})
```
254. ### ExampleGaussianRcpp.cpp I

```cpp
// [[Rcpp::plugins("cpp11")]]
#include <cstddef>

// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

#include <cmath>
```
255. ### ExampleGaussianRcpp.cpp II

```cpp
namespace model {
  // 2 parameters: mu, sigma^2
  enum parameter : std::size_t { mu, s2 };
  constexpr std::size_t parameters_count = 2;
}
```
256. ### ExampleGaussianRcpp.cpp III

```cpp
// [[Rcpp::export]]
double l_t_cpp(double xt, const Eigen::Map<Eigen::VectorXd> parameters) {
  const auto mu = parameters[model::parameter::mu];
  const auto s2 = parameters[model::parameter::s2];
  constexpr auto two_pi = boost::math::constants::two_pi<double>();
  using std::log; using std::pow;
  return -.5 * log(two_pi) - .5 * log(s2) - .5 * pow(xt - mu, 2) / s2;
}
```

258. ### ExampleGaussianRcppEigen.cpp I

```cpp
// [[Rcpp::plugins("cpp11")]]
#include <cstddef>

// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <unsupported/Eigen/AutoDiff>

#include <cmath>
```
259. ### ExampleGaussianRcppEigen.cpp II

```cpp
namespace model {
  // 2 parameters: mu, sigma^2
  enum parameter : std::size_t { mu, s2 };
  constexpr std::size_t parameters_count = 2;

  template <typename ScalarType>
  using parameter_vector = Eigen::Matrix<ScalarType, parameters_count, 1>;
}
```
260. ### ExampleGaussianRcppEigen.cpp III

```cpp
// note: data `xt` -- double-precision number(s), just as before
// only the parameters adjusted to `ScalarType`
template <typename VectorType>
typename VectorType::Scalar l_t_cpp_AD(double xt, const VectorType & parameters) {
  using Scalar = typename VectorType::Scalar;
  const Scalar & mu = parameters[model::parameter::mu];
  const Scalar & s2 = parameters[model::parameter::s2];
  // note: `two_pi` is, as always, a double-precision constant
  constexpr auto two_pi = boost::math::constants::two_pi<double>();
  using std::log; using std::pow;
  return -.5 * log(two_pi) - .5 * log(s2) - .5 * pow(xt - mu, 2) / s2;
}
```
261. ### ExampleGaussianRcppEigen.cpp IV

```cpp
// wrapper over `l_t_cpp_AD`
// (simply dispatching to the existing implementation)
//
// (Rcpp-exportable explicit template instantiations
//  would render this unnecessary)
// [[Rcpp::export]]
double l_t_value_cpp(double xt, const Eigen::Map<Eigen::VectorXd> parameters) {
  return l_t_cpp_AD(xt, parameters);
}
```
262. ### ExampleGaussianRcppEigen.cpp V

```cpp
// objective function together with its gradient
// `xt`: input (data)
// `parameters_input`: input (model parameters)
// `gradient_output`: output (computed gradient)
// returns: computed objective function value
// [[Rcpp::export]]
double l_t_value_gradient_cpp
(
  double xt,
  const Eigen::Map<Eigen::VectorXd> parameters_input,
  Eigen::Map<Eigen::VectorXd> gradient_output
)
{
  using parameter_vector = model::parameter_vector<double>;
  using AD = Eigen::AutoDiffScalar<parameter_vector>;
  using VectorAD = model::parameter_vector<AD>;
  VectorAD parameters_AD = parameters_input.cast<AD>();
  constexpr auto P = model::parameters_count;
  for (std::size_t p = 0; p != P; ++p) {
    // forward mode:
    // active differentiation variable `p`
    // (one parameter at a time)
    parameters_AD(p).derivatives() = parameter_vector::Unit(p);
  }
  const AD & loglikelihood_result = l_t_cpp_AD(xt, parameters_AD);
  gradient_output = loglikelihood_result.derivatives();
  return loglikelihood_result.value();
}
```
264. ### AD Example using RcppEigen I

```r
Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp')
derivatives_match_given = function(data) {
  fixed_parameters = example_parameters()
  gradient = vector(mode = "numeric", length = 2L)
  names(gradient) = c("mu", "sigma^2")
  l_t_value_gradient_cpp(data, fixed_parameters, gradient)
  expect_equivalent(expected = dl_t_dmu(data, fixed_parameters),
                    object = gradient["mu"])
  expect_equivalent(expected = dl_t_ds2(data, fixed_parameters),
                    object = gradient["sigma^2"])
}
```
265. ### AD Example using RcppEigen II

```r
describe("algorithmically differentiated function using RcppEigen", {
  it("matches the analytical derivatives implemented in R", {
    derivatives_match_given(data = 3)
    derivatives_match_given(data = 6)
    derivatives_match_given(data = 9)
  })
})
```
266. ### AD Example using RcppEigen — Performance I

```r
# Recall:
# `dl_t_dmu_AD` uses our
# Manually Forward-Derived Evaluation Trace
# operating on the (whole) data vector
N = 10000L # sample size
xs = rnorm(N)
system.time( {
  err_AD = dl_t_dmu(xs, fixed_parameters) - dl_t_dmu_AD(xs, fixed_parameters)
} )
```
267. ### AD Example using RcppEigen — Performance II

```
# accuracy: R-some :-)
> head(err_AD)
[1] 0 0 0 0 0 0
> sum(err_AD)
[1] 0

# performance: good enough for now ;-)
   user  system elapsed
      0       0       0
```
268. ### AD Example using RcppEigen — Performance III

```r
gradient = vector(mode = "numeric", length = 2L)
names(gradient) = c("mu", "sigma^2")
errs_AD_cpp = vector(mode = "numeric", length = N)
# in contrast: we now have a
# scalar-taking function
system.time( {
  for (t in seq_along(xs)) {
    l_t_value_gradient_cpp(xs[t], fixed_parameters, gradient)
    errs_AD_cpp[t] = dl_t_dmu(xs[t], fixed_parameters) -
      dl_t_dmu_AD(xs[t], fixed_parameters)
  }
} )
```
269. ### AD Example using RcppEigen — Performance IV

```
# accuracy: awesome
> head(errs_AD_cpp)
[1] 0 0 0 0 0 0
> sum(errs_AD_cpp)
[1] 0

# performance: not ideal
   user  system elapsed
   3.56    0.00    3.56
```
270. ### Data Parallel Objective Function I

```cpp
// First: make the data parallelism (DP) explicit
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  for (std::size_t t = 0; t != sample_size; ++t) {
    result[t] = l_t_cpp_AD(xs[t], parameters);
  }
  return result;
}
```
271. ### Data Parallel Objective Function II

```cpp
// Second: Parallelize using OpenMP
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP_OMP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  #pragma omp parallel for default(none) shared(result)
  for (std::size_t t = 0; t < sample_size; ++t) {
    result[t] = l_t_cpp_AD(xs[t], parameters);
  }
  return result;
}
```
272. ### Data Parallel Objective Function Performance I

```
errs_l = l_t(xs, fixed_parameters) - l_t_value_cppDP(xs, fixed_parameters)
> head(errs_l)
[1] 0 0 0 0 0 0
> sum(errs_l)
[1] 0

errs_l = l_t(xs, fixed_parameters) - l_t_value_cppDP_OMP(xs, fixed_parameters)
> head(errs_l)
[1] 0 0 0 0 0 0
> sum(errs_l)
[1] 0
```
273. ### Data Parallel Objective Function Performance II

```
> require("microbenchmark")
> microbenchmark(
+   l_t(xs, fixed_parameters),
+   l_t_value_cppDP(xs, fixed_parameters),
+   l_t_value_cppDP_OMP(xs, fixed_parameters)
+ )
Unit: microseconds
                                      expr  median neval
                 l_t(xs, fixed_parameters) 102.657   100
     l_t_value_cppDP(xs, fixed_parameters) 458.618   100
 l_t_value_cppDP_OMP(xs, fixed_parameters) 213.526   100
```

Note: changing l_t_cpp_AD to l_t_cpp_AD_DP (analogously) can also help, but we're not done, yet...
274. ### Data Parallel Gradient I

```cpp
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters_input,
  Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  for (std::size_t t = 0; t != sample_size; ++t) {
    using parameter_vector = model::parameter_vector<double>;
    using AD = Eigen::AutoDiffScalar<parameter_vector>;
    using VectorAD = model::parameter_vector<AD>;
    VectorAD parameters_AD = parameters_input.cast<AD>();
    constexpr auto P = model::parameters_count;
    for (std::size_t p = 0; p != P; ++p) {
      parameters_AD(p).derivatives() = parameter_vector::Unit(p);
    }
    const AD & loglikelihood_result = l_t_cpp_AD(xs[t], parameters_AD);
    gradient_output.row(t) = loglikelihood_result.derivatives();
    result[t] = loglikelihood_result.value();
  }
  return result;
}
```
276. ### Data Parallel Gradient III

```cpp
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP_OMP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters_input,
  Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  #pragma omp parallel for default(none) \
    shared(result, gradient_output)
  for (std::size_t t = 0; t < sample_size; ++t) {
    using parameter_vector = model::parameter_vector<double>;
    using AD = Eigen::AutoDiffScalar<parameter_vector>;
    using VectorAD = model::parameter_vector<AD>;
    VectorAD parameters_AD = parameters_input.cast<AD>();
    constexpr auto P = model::parameters_count;
    for (std::size_t p = 0; p != P; ++p) {
      parameters_AD(p).derivatives() = parameter_vector::Unit(p);
    }
    const AD & loglikelihood_result = l_t_cpp_AD(xs[t], parameters_AD);
    gradient_output.row(t) = loglikelihood_result.derivatives();
    result[t] = loglikelihood_result.value();
  }
  return result;
}
```
278. ### Data Parallel Gradient Performance I

```r
parameters_count = 2L # mu, sigma^2
N = 10000L # sample size
gradient = matrix(data = 0, nrow = N, ncol = parameters_count)
colnames(gradient) = c("mu", "sigma^2")
l_value = l_t_value_gradient_cppDP(xs, fixed_parameters, gradient)
errs_g = dl_t_dmu_AD(xs, fixed_parameters) - gradient[,"mu"]
```

```
> str(gradient)
 num [1:10000, 1:2] -1.39 -1.32 -1.38 -1.53 -1.88 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "mu" "sigma^2"
> head(errs_g)
[1] 0 0 0 0 0 0
> sum(errs_g)
[1] 0
```


285. ### Data Parallel Gradient Performance — Conclusions

Worth noting:

• dl_t_dmu_AD only computes the derivative w.r.t. µ, whereas l_t_value_gradient_cppDP also computes the derivative w.r.t. σ²
• thus the naive serial version is already effectively twice as fast
• we get an additional speed-up on top of that thanks to OpenMP
• we can do better: a truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, reverse-mode AD
• left as an exercise for the audience! :-)

290. ### Caveats

• Source code spanning multiple programming languages
• Closed-source software (proprietary, third party)
• Reverse mode — memory requirements
• Perverse (but valid) code: `if (x == 1.0) then y = 0.0 else y = x - 1.0`
• derivative at x = 1?