Algorithmic Differentiation C++ & Extremum Estimation
Matt P. Dziubinski
CppCon 2015
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)

Outline
• Why?
• Numerical Optimization
• Calculating Derivatives
• Example
• Source Code
• Resources
• References

Slides
• Latest version: https://speakerdeck.com/mattpd

Why?

Motivation: Data & Models
• Have: data, y (numbers)
• Want: understand, explain, forecast
• Tool: model, f(y; θ)
• Need: connection, θ
• Idea: fitting the model to the data
• How: cost function — minimize / maximize
• Nowadays one of the most widely applied approaches — estimation (econometrics, statistics), calibration (finance), training ((supervised) machine learning)
• In practice: generally no closed-form expressions exist — estimation relies on numerical optimization

Motivation: Parametric Models
• Parametric model: parameter vector θ ∈ Θ ⊆ R^p, Θ — the parameter space
• An estimator θ̂ is an extremum estimator if ∃Q : Θ → R s.t. θ̂ = argmax_{θ∈Θ} Q(θ)
• i.e., it can be obtained as an extremum (maximum or minimum) of a scalar objective function
• the class includes OLS (Ordinary Least Squares), NLS (Nonlinear Least Squares), GMM (Generalized Method of Moments), and QMLE (Quasi-Maximum Likelihood Estimation)

Numerical Optimization

Numerical Optimization: Algorithms
• Numerical optimization algorithms — can be broadly divided into two categories:
  • derivative-free — do not rely on knowledge of the objective function's gradient in the optimization process
  • gradient-based — need the gradient of the objective function
• Iteration form:
  • initial iterate: θ_0
  • new iterate: θ_k = θ_{k−1} + s_k, iteration k ≥ 1
  • step: s_k = α_k p_k
  • length: scalar α_k > 0
  • direction: vector p_k s.t. H_k p_k = ∇Q(θ_k)
• Choices of H_k:
  • H_k = I (Steepest Ascent) — many variants, including SGD (Stochastic Gradient Descent)
  • H_k = ∇²Q(θ_k) (Newton)
  • H_k = a Hessian approximation satisfying the secant equation H_{k+1} s_k = ∇Q_{k+1} − ∇Q_k (Quasi-Newton)

Numerical Optimization: Gradient
• Gradient-based algorithms often exhibit superior convergence rates (superlinear or even quadratic)
• However: this property depends on the gradient either being exact — or at minimum sufficiently accurate in the limit sense, Nocedal and Wright (2006, Chapter 3)
• For many modern models analytical gradient formulas are not available in closed form — or are non-trivial to derive
• This often leads to the use of numerical approximations — in particular, finite difference methods (FDMs), Nocedal and Wright (2006, Chapter 8)
• Inference: covariance matrix estimators also involve the use of derivative information

Numerical Optimization: Algorithmic Differentiation
• Need a way to obtain the gradient — preferably: fast and accurate
• This is where Algorithmic Differentiation (AD) comes in:
  • fast
  • and accurate
  • and simple
• Essentially:
  • an automatic computational differentiation method
  • based on the systematic application of the chain rule
  • exact to the maximum extent allowed by the machine precision
• How does it compare to other computational differentiation methods?

Calculating Derivatives

Calculating Derivatives
• Main approaches:
  • Finite Differencing
  • Algorithmic Differentiation (AD), a.k.a. Automatic Differentiation
  • Symbolic Differentiation

Finite Differencing
• Recall f′(x) = lim_{h→0} (f(x + h) − f(x)) / h
• Forward-difference approximation: ∂f/∂x_i (x) ≈ (f(x + h e_i) − f(x)) / h
• Central-difference approximation: ∂f/∂x_i (x) ≈ (f(x + h e_i) − f(x − h e_i)) / (2h)
• Computational cost (gradient of f with p parameters):
  • forward-difference formula: p + 1 function evaluations
  • central-difference formula: 2p function evaluations

Finite Differencing — Truncation Error
• Truncation error analysis — Taylor's theorem
• Consider a univariate function f and the forward-difference formula: f′(x) ≈ (f(x + h) − f(x)) / h
• Recall Taylor's formula: f(x + h) = f(x) + h f′(x) + (1/2) h² f′′(x) + (1/6) h³ f′′′(x) + O(h⁴)
• Hence f′(x) = (f(x + h) − f(x)) / h − (1/2) h f′′(x) − (1/6) h² f′′′(x) − O(h³)
• Truncation error: c₁ h + c₂ h² + …
• Decreases with smaller h

Floating Point Arithmetic
• Floating point arithmetic: NO associativity (or distributivity) of addition (and multiplication)
• Example:
  • Input:
    a = 1234.567; b = 45.678; c = 0.0004
    a + b + c
    c + b + a
    (a + b + c) - (c + b + a)
  • Output:
    1280.2454
    1280.2454
    -2.273737e-13
• "Many a serious mathematician has attempted to give rigorous analyses of a sequence of floating point operations, but has found the task to be so formidable that he has tried to content himself with plausibility arguments instead." — Donald E. Knuth, TAOCP2

Floating Point Representation
• Floating point representation: inspired by scientific notation
• The IEEE Standard for Floating-Point Arithmetic (IEEE 754)
• Form: significand × base^exponent
• Machine epsilon, ϵ_M := inf{ϵ > 0 : 1.0 + ϵ ≠ 1.0}
  • the difference between 1.0 and the next value representable as a floating-point number, 1.0 + ϵ
  • IEEE 754 binary32 (single precision): 2⁻²³ ≈ 1.19209e-07 (1 sign bit, 8 exponent bits, leaving 23 bits for the significand)
  • IEEE 754 binary64 (double precision): 2⁻⁵² ≈ 2.22045e-16 (1 sign bit, 11 exponent bits, leaving 52 bits for the significand)
• floating point number representation: fl(x) = x(1 + δ), where |δ| < ϵ_M

Floating Point Representation Properties
• floating point number representation: fl(x) = x(1 + δ), where |δ| < ϵ_M
• representation (round-off) error: x − fl(x)
• absolute round-off error: |x − fl(x)| < ϵ_M |x|
• relative round-off error: |x − fl(x)| / |x| < ϵ_M
• ϵ_M a.k.a. unit round-off, u = ULP(1.0)
• ULP — Unit in the Last Place
  • the distance between the two closest straddling floating-point numbers a and b (i.e., those with a ≤ x ≤ b and a ≠ b)
  • a measure of precision in numeric calculations: in base b, if x has exponent E, then ULP(x) = ϵ_M · b^E

Finite Differencing — Round-off Error I
• Consider f′(x) ≈ (f(x + h) − f(x)) / h
• Suppose that x and x + h are exact
• Round-off errors arise when evaluating f:
  (f(x + h)(1 + δ₁) − f(x)(1 + δ₂)) / h = (f(x + h) − f(x)) / h + (δ₁ f(x + h) − δ₂ f(x)) / h
• We have |δᵢ| ≤ ϵ_M for i ∈ {1, 2}
• ⟹ round-off error bound: ϵ_M (|f(x + h)| + |f(x)|) / h
• ⟹ round-off error bound order: O(ϵ_M / h)
• Round-off error — increases with smaller h
  • error contribution inversely proportional to h, proportionality constant O(ϵ_M)

Finite Differencing — Round-off Error II
• ϵ_M ≠ 0 ⟹ ∃ĥ : x + ĥ = x (in floating point arithmetic)
• i.e., x and x + h are not guaranteed to be exactly representable, either
• ⟹ f(x) = f(x + ĥ)
• ⟹ (f(x + ĥ) − f(x)) / ĥ = 0
• ⟹ no correct digits in the result at (or below) step-size ĥ

Finite Differencing — Truncation–Round-off Trade-off
• Total error: O(h) + O(ϵ_M / h) (forward-difference formula)
• Best accuracy when the contributions are approximately equal:
  • h ≈ ϵ_M / h
  • h² ≈ ϵ_M
  • h ≈ √ϵ_M
    • binary64: 1.490116e-08
  • total error's minimum value ∈ O(√ϵ_M)
    • binary64: 1.490116e-08
• Total error: O(h²) + O(ϵ_M / h) (central-difference formula)
  • h ≈ ϵ_M^(1/3)
    • binary64: 6.055454e-06
  • total error's minimum value ∈ O(ϵ_M^(2/3))
    • binary64: 3.666853e-11

Slide 109

Slide 109 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: 21

Slide 110

Slide 110 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²) 21

Slide 111

Slide 111 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²) • Truncation error ∈ O(h²) (second-order approximation) 21

Slide 112

Slide 112 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²) • Truncation error ∈ O(h²) (second-order approximation) • Round-off error ∈ O(ϵM/h) 21

Slide 113

Slide 113 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²) • Truncation error ∈ O(h²) (second-order approximation) • Round-off error ∈ O(ϵM/h) • Balanced trade-off: 21

Slide 114

Slide 114 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²) • Truncation error ∈ O(h²) (second-order approximation) • Round-off error ∈ O(ϵM/h) • Balanced trade-off: • h ≈ ϵM^(1/3) 21

Slide 115

Slide 115 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²) • Truncation error ∈ O(h²) (second-order approximation) • Round-off error ∈ O(ϵM/h) • Balanced trade-off: • h ≈ ϵM^(1/3) • total error’s minimum value ∈ O(ϵM^(2/3)) 21

Slide 116

Slide 116 text

Finite Difference Approximation Error Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) 22

Slide 117

Slide 117 text

Finite Difference Approximation Error Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 22

Slide 118

Slide 118 text

Finite Difference Approximation Error Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error’s minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 22

Slide 119

Slide 119 text

Finite Difference Approximation Error Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error’s minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 • Total error: O(h²) + O(ϵM/h) (central-difference formula) 22

Slide 120

Slide 120 text

Finite Difference Approximation Error Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error’s minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 • Total error: O(h²) + O(ϵM/h) (central-difference formula) • h ≈ ϵM^(1/3) — binary64: 6.055454e-06 22

Slide 121

Slide 121 text

Finite Difference Approximation Error Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error’s minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 • Total error: O(h²) + O(ϵM/h) (central-difference formula) • h ≈ ϵM^(1/3) — binary64: 6.055454e-06 • total error’s minimum value ∈ O(ϵM^(2/3)) — binary64: 3.666853e-11 22

Slide 122

Slide 122 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient 23

Slide 123

Slide 123 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. 23

Slide 124

Slide 124 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. • Disadvantages: 23

Slide 125

Slide 125 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. • Disadvantages: • narrow applicability 23

Slide 126

Slide 126 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. • Disadvantages: • narrow applicability • expression swell 23

Slide 127

Slide 127 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions 24

Slide 128

Slide 128 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression 24

Slide 129

Slide 129 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program 24

Slide 130

Slide 130 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? 24

Slide 131

Slide 131 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates 24

Slide 132

Slide 132 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches 24

Slide 133

Slide 133 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches • joint allocations / mutation / overwriting 24

Slide 134

Slide 134 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches • joint allocations / mutation / overwriting • iterative loops 24

Slide 135

Slide 135 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to ”paper-and-pencil” algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches • joint allocations / mutation / overwriting • iterative loops • Do we have to resort to finite differences? 24

Slide 136

Slide 136 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. 25

Slide 137

Slide 137 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. • AD — a computational differentiation method with roots in the late 1960s, Griewank (2012) 25

Slide 138

Slide 138 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. • AD — a computational differentiation method with roots in the late 1960s, Griewank (2012) • Machine Learning — Neural Networks — Backpropagation: Bryson, Denham, Dreyfus (1963), Werbos (1974) 25

Slide 139

Slide 139 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. • AD — a computational differentiation method with roots in the late 1960s, Griewank (2012) • Machine Learning — Neural Networks — Backpropagation: Bryson, Denham, Dreyfus (1963), Werbos (1974) • AD widely known and applied in natural sciences — Naumann (2012), and computational finance — Giles and Glasserman (2006), Homescu (2011) 25

Slide 140

Slide 140 text

Algorithmic Differentiation — Idea I • Idea: 26

Slide 141

Slide 141 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down 26

Slide 142

Slide 142 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations 26

Slide 143

Slide 143 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied 26

Slide 144

Slide 144 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ 26

Slide 145

Slide 145 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program 26

Slide 146

Slide 146 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program • operating at the source code level 26

Slide 147

Slide 147 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program • operating at the source code level • performed automatically 26

Slide 148

Slide 148 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program • operating at the source code level • performed automatically • with minimal or no user intervention 26

Slide 149

Slide 149 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly 27

Slide 150

Slide 150 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly • to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error 27

Slide 151

Slide 151 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly • to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error • truncation error: none 27

Slide 152

Slide 152 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly • to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error • truncation error: none • round-off error: equivalent to using a source code analogously implementing the analytical derivatives. 27

Slide 153

Slide 153 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation 28

Slide 154

Slide 154 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: 28

Slide 155

Slide 155 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions 28

Slide 156

Slide 156 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006 Section 8.2). 28

Slide 157

Slide 157 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006 Section 8.2). • Note: less than required by SD, more than required by finite differences (which can operate on black box implementations) 28

Slide 158

Slide 158 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006 Section 8.2). • Note: less than required by SD, more than required by finite differences (which can operate on black box implementations) • AD Input: (source code of) a computer program for evaluating a vector function 28

Slide 159

Slide 159 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006 Section 8.2). • Note: less than required by SD, more than required by finite differences (which can operate on black box implementations) • AD Input: (source code of) a computer program for evaluating a vector function • AD Output: (source code of) a computer program implementing derivative(s) thereof 28

Slide 160

Slide 160 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): 29

Slide 161

Slide 161 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) 29

Slide 162

Slide 162 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values 29

Slide 163

Slide 163 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading 29

Slide 164

Slide 164 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs 29

Slide 165

Slide 165 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs • Two modes: forward mode and reverse mode 29

Slide 166

Slide 166 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs • Two modes: forward mode and reverse mode • Example: Eigen’s Auto Diff module 29

Slide 167

Slide 167 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation 30

Slide 168

Slide 168 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample 30

Slide 169

Slide 169 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample • we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ2), with µ ∈ R, and σ ∈ R+ 30

Slide 170

Slide 170 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample • we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ2), with µ ∈ R, and σ ∈ R+ • our parameter vector is θ = (µ, σ2) 30

Slide 171

Slide 171 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample • we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ²), with µ ∈ R, and σ ∈ R+ • our parameter vector is θ = (µ, σ²) • the likelihood function is L(θ; x) = f(x; θ) = (σ√(2π))^(−T) exp[−(1/2) ∑_{t=1}^{T} (xt − µ)²/σ²] 30

Slide 172

Slide 172 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑_{t=1}^{T} ℓt(θ) 31

Slide 173

Slide 173 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑_{t=1}^{T} ℓt(θ) • a contribution given by ℓt(θ) = −(1/2) log(2π) − (1/2) log(σ²) − (1/2)(xt − µ)²/σ². 31

Slide 174

Slide 174 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑_{t=1}^{T} ℓt(θ) • a contribution given by ℓt(θ) = −(1/2) log(2π) − (1/2) log(σ²) − (1/2)(xt − µ)²/σ². • goal: compute the value of ℓt corresponding to the arguments xt = 6, µ = 6, and σ² = 4. 31

Slide 175

Slide 175 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑_{t=1}^{T} ℓt(θ) • a contribution given by ℓt(θ) = −(1/2) log(2π) − (1/2) log(σ²) − (1/2)(xt − µ)²/σ². • goal: compute the value of ℓt corresponding to the arguments xt = 6, µ = 6, and σ² = 4. • A computer program may execute the sequence of operations represented by an evaluation trace, Griewank and Walther (2008) 31

Slide 176

Slide 176 text

Algorithmic Differentiation — Evaluation Trace
Variable — Operation — Expression — Value
v−1 = µ = 6
v0 = σ² = 4
v1 = ϕ1(v0) = log(v0) = 1.386294
v2 = ϕ2(v1) = −(1/2) log(2π) − (1/2) v1 = −1.612086
v3 = ϕ3(v−1) = xt − v−1 = 0
v4 = ϕ4(v3) = v3² = 0
v5 = ϕ5(v0, v4) = v4/v0 = 0
v6 = ϕ6(v5) = −(1/2) v5 = 0
v7 = ϕ7(v2, v6) = v2 + v6 = −1.612086
ℓt = v7 = −1.612086 32

Slide 177

Slide 177 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down 33

Slide 178

Slide 178 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} 33

Slide 179

Slide 179 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives 33

Slide 180

Slide 180 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation 33

Slide 181

Slide 181 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation • We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7 33

Slide 182

Slide 182 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation • We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7 • Recall: Two modes of propagating the intermediate information: the forward mode and the reverse mode 33

Slide 183

Slide 183 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation • We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7 • Recall: Two modes of propagating the intermediate information: the forward mode and the reverse mode • forward mode: Eigen’s Auto Diff module used here 33

Slide 184

Slide 184 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) 34

Slide 185

Slide 185 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. 34

Slide 186

Slide 186 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. • for each variable vi : introduce another variable, v̇i = ∂vi/∂µ 34

Slide 187

Slide 187 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. • for each variable vi : introduce another variable, v̇i = ∂vi/∂µ • then: apply the chain rule mechanically to each intermediate subexpression 34

Slide 188

Slide 188 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. • for each variable vi : introduce another variable, v̇i = ∂vi/∂µ • then: apply the chain rule mechanically to each intermediate subexpression • assigning the resulting numerical value to each v̇i. 34

Slide 189

Slide 189 text

Algorithmic Differentiation — Forward Mode • in our case 35

Slide 190

Slide 190 text

Algorithmic Differentiation — Forward Mode • in our case • start with v̇−1 = 1.0 (active differentiation variable µ), 35

Slide 191

Slide 191 text

Algorithmic Differentiation — Forward Mode • in our case • start with v̇−1 = 1.0 (active differentiation variable µ), • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and thus treat it as a constant) 35

Slide 192

Slide 192 text

Algorithmic Differentiation — Forward Mode • in our case • start with v̇−1 = 1.0 (active differentiation variable µ), • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and thus treat it as a constant) • then, given that v1 = ϕ1(v0) = log(v0), we obtain v̇1 = (∂ϕ1(v0)/∂v0) v̇0 = v̇0/v0 = 0.0 35

Slide 193

Slide 193 text

Algorithmic Differentiation — Forward Mode • in our case • start with v̇−1 = 1.0 (active differentiation variable µ), • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and thus treat it as a constant) • then, given that v1 = ϕ1(v0) = log(v0), we obtain v̇1 = (∂ϕ1(v0)/∂v0) v̇0 = v̇0/v0 = 0.0 • analogously, v2 = ϕ2(v1) = −(1/2) log(2π) − (1/2) v1, hence v̇2 = (∂ϕ2(v1)/∂v1) v̇1 = −(1/2) v̇1 = 0.0 35

Slide 194

Slide 194 text

Algorithmic Differentiation — Forward Mode • in our case • start with v̇−1 = 1.0 (active differentiation variable µ), • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and thus treat it as a constant) • then, given that v1 = ϕ1(v0) = log(v0), we obtain v̇1 = (∂ϕ1(v0)/∂v0) v̇0 = v̇0/v0 = 0.0 • analogously, v2 = ϕ2(v1) = −(1/2) log(2π) − (1/2) v1, hence v̇2 = (∂ϕ2(v1)/∂v1) v̇1 = −(1/2) v̇1 = 0.0 • by continuing this procedure we ultimately obtain an augmented, forward-derived, evaluation trace 35

Slide 195

Slide 195 text

Algorithmic Differentiation — Augmented Evaluation Trace
Variable — Operation — Expression — Value
v−1 = µ = 6
v̇−1 = ∂v−1/∂v−1 = 1
v0 = σ² = 4
v̇0 = ∂v0/∂v−1 = 0
v1 = ϕ1(v0) = log(v0) = 1.386294
v̇1 = (∂ϕ1/∂v0) v̇0 = v̇0/v0 = 0
. . .
v7 = ϕ7(v2, v6) = v2 + v6 = −1.612086
v̇7 = (∂ϕ7/∂v2) v̇2 + (∂ϕ7/∂v6) v̇6 = v̇2 + v̇6 = 0
ℓt = v7 = −1.612086
ℓ̇t = v̇7 = 0 36

Slide 196

Slide 196 text

Algorithmic Differentiation — Augmented Evaluation Trace • The basic forward mode of AD — called “forward” because the derivative values v̇i are carried along simultaneously with the values vi themselves 37

Slide 197

Slide 197 text

Algorithmic Differentiation — Augmented Evaluation Trace • The basic forward mode of AD — called “forward” because the derivative values v̇i are carried along simultaneously with the values vi themselves • The problem addressed by the implementations: 37

Slide 198

Slide 198 text

Algorithmic Differentiation — Augmented Evaluation Trace • The basic forward mode of AD — called “forward” because the derivative values v̇i are carried along simultaneously with the values vi themselves • The problem addressed by the implementations: • ”transform a program with a particular evaluation trace into an augmented program whose evaluation trace also contains exactly the same extra variables and additional lines as in the derived evaluation trace,” Griewank and Walther (2008). 37

Slide 199

Slide 199 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. 38

Slide 200

Slide 200 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. • Griewank and Walther (2008): ”Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables.” 38

Slide 201

Slide 201 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. • Griewank and Walther (2008): ”Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables.” • This talk: we shall not treat it any further detail and refer anyone interested to Griewank and Walther (2008). 38

Slide 202

Slide 202 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. • Griewank and Walther (2008): ”Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables.” • This talk: we shall not treat it any further detail and refer anyone interested to Griewank and Walther (2008). • Worth noting: the choice between the modes depends on the computational cost. 38

Slide 203

Slide 203 text

AD — Forward Mode vs. Reverse Mode • In general, for f : R^n → R^m 39

Slide 204

Slide 204 text

AD — Forward Mode vs. Reverse Mode • In general, for f : R^n → R^m • rough Cost_AD,Forward(f) ∈ n · O(Cost(f)) 39

Slide 205

Slide 205 text

AD — Forward Mode vs. Reverse Mode • In general, for f : R^n → R^m • rough Cost_AD,Forward(f) ∈ n · O(Cost(f)) • rough Cost_AD,Reverse(f) ∈ m · O(Cost(f)) 39

Slide 206

Slide 206 text

AD — Forward Mode vs. Reverse Mode • In general, for f : R^n → R^m • rough Cost_AD,Forward(f) ∈ n · O(Cost(f)) • rough Cost_AD,Reverse(f) ∈ m · O(Cost(f)) • Which is better? Depends: e.g., gradient vs. Jacobian 39

Slide 207

Slide 207 text

AD — Forward Mode vs. Reverse Mode • In general, for f : R^n → R^m • rough Cost_AD,Forward(f) ∈ n · O(Cost(f)) • rough Cost_AD,Reverse(f) ∈ m · O(Cost(f)) • Which is better? Depends: e.g., gradient vs. Jacobian • Forward mode performance: can establish a worst-case performance bound for AD when used for gradient computation 39
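The n factor for forward mode can be seen directly in a minimal dual-number sketch (illustrative, not the talk's code): the gradient of an n-input function needs one seeded forward pass per input.

```cpp
#include <cassert>

// Minimal forward-mode dual number: a value plus one directional derivative.
struct Dual {
    double v;  // value
    double d;  // derivative w.r.t. the currently seeded input
};

Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }

// Example objective: f(x0, x1) = x0 * x1 + x0 * x0
Dual f(Dual x0, Dual x1) { return x0 * x1 + x0 * x0; }

// Gradient of f at (x0, x1): n = 2 forward passes, one per seeded input —
// hence the n factor in the forward-mode cost bound.
void gradient(double x0, double x1, double out[2]) {
    out[0] = f({x0, 1.0}, {x1, 0.0}).d;  // seed x0
    out[1] = f({x0, 0.0}, {x1, 1.0}).d;  // seed x1
}
```

Analytically, ∂f/∂x0 = x1 + 2·x0 and ∂f/∂x1 = x0, which the two seeded passes reproduce.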

Slide 208

Slide 208 text

Illustration: Simulation Study — Setup The Model: GARCH(1, 1) The spot asset price S follows the following process under the physical probability measure P:
log(S_{t+1} / S_t) ≡ r_{t+1} = µ + ϵ_{t+1} (1)
ϵ_{t+1} = √(v_{t+1}) w_{t+1} (2)
v_{t+1} = ω + a ϵ_t² + b v_t (3)
w ∼(P) GWN(0, 1) (4)
40
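The conditional-variance recursion (3) can be sketched as a simple filter over a return series (a sketch with hypothetical names, not the study's implementation; the parameter names mirror θ = (µ, ω, a, b)):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// GARCH(1,1) conditional-variance filter:
//   eps_t    = r_t - mu
//   v_{t+1}  = omega + a * eps_t^2 + b * v_t
// given returns r and an initial variance v0.
std::vector<double> garch_variance(const std::vector<double>& r,
                                   double mu, double omega,
                                   double a, double b, double v0) {
    std::vector<double> v(r.size() + 1);
    v[0] = v0;
    for (std::size_t t = 0; t < r.size(); ++t) {
        const double eps = r[t] - mu;  // demeaned return (innovation)
        v[t + 1] = omega + a * eps * eps + b * v[t];
    }
    return v;
}
```

This filtered variance path is exactly what a Gaussian quasi-likelihood for the model would be built on.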

Slide 209

Slide 209 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ 41

Slide 210

Slide 210 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price 41

Slide 211

Slide 211 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r 41

Slide 212

Slide 212 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r 41

Slide 213

Slide 213 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 41

Slide 214

Slide 214 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise 41

Slide 215

Slide 215 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1−a−b) — the unconditional mean of v (unconditional variance of r) 41

Slide 216

Slide 216 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1−a−b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate 41

Slide 217

Slide 217 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1−a−b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv 41

Slide 218

Slide 218 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1−a−b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks 41

Slide 219

Slide 219 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1−a−b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks • pv = 1 — integrated GARCH (IGARCH) process 41

Slide 220

Slide 220 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1−a−b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks • pv = 1 — integrated GARCH (IGARCH) process • common finding: pv close to 1 in financial data, ”near-integrated” process, Bollerslev and Engle (1993). 41

Slide 221

Slide 221 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e − 6, a = 0.10, b = 0.85 42

Slide 222

Slide 222 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e − 6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) 42

Slide 223

Slide 223 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e − 6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) • N = 2,000 ... 16,000 (sample size) 42

Slide 224

Slide 224 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e − 6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) • N = 2,000 ... 16,000 (sample size) • 2,000 observations / 252 business days — roughly 8 years of data 42

Slide 225

Slide 225 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e − 6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) • N = 2,000 ... 16,000 (sample size) • 2,000 observations / 252 business days — roughly 8 years of data • S&P 500 from January 3rd, 1950 — over 16,000 observations 42
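With the DGP values above (ω = 2e−6, a = 0.10, b = 0.85), the implied persistence and unconditional variance can be checked with two small helpers (names are mine, for illustration):

```cpp
#include <cassert>
#include <cmath>

// Persistence rate p_v = a + b (the DGP is "near-integrated" when close to 1).
double persistence(double a, double b) { return a + b; }

// Unconditional variance m_v = omega / (1 - a - b), defined for p_v < 1.
double unconditional_variance(double omega, double a, double b) {
    return omega / (1.0 - a - b);
}
```

For the DGP this gives p_v = 0.95 and m_v = 2e−6 / 0.05 = 4e−5, consistent with the near-integrated scenario described above.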

Slide 226

Slide 226 text

Illustration: Reliability — AD, L-BFGS 43

Slide 227

Slide 227 text

Illustration: Reliability — FD, TNR 44

Slide 228

Slide 228 text

Illustration: Findings — Reliability • Reliability of estimation: 45

Slide 229

Slide 229 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments 45

Slide 230

Slide 230 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio 45

Slide 231

Slide 231 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: 45

Slide 232

Slide 232 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: • L-BFGS(AD): average = 0.811, median: 0.99999903, 45

Slide 233

Slide 233 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: • L-BFGS(AD): average = 0.811, median: 0.99999903, • L-BFGS(FD): average = 0.446, median: 0.02893254 (sic), 45

Slide 234

Slide 234 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: • L-BFGS(AD): average = 0.811, median: 0.99999903, • L-BFGS(FD): average = 0.446, median: 0.02893254 (sic), • Truncated Newton with restarting (TNR) (FD): average = 0.708, median: 0.99856261. 45

Slide 235

Slide 235 text

Illustration: Findings • Overall: When using the fastest successful algorithm-gradient pair for both AD and FD: 46

Slide 236

Slide 236 text

Illustration: Findings • Overall: When using the fastest successful algorithm-gradient pair for both AD and FD: • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD) 46

Slide 237

Slide 237 text

Illustration: Findings • Overall: When using the fastest successful algorithm-gradient pair for both AD and FD: • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD) • while achieving 4 times higher accuracy and a 1.1× speedup compared to TNR(FD). 46

Slide 238

Slide 238 text

Example Source Code

Slide 239

Slide 239 text

Example using Rcpp Rcpp::sourceCpp('ExampleGaussianRcpp.cpp') • Note: Rcpp not necessary for any of this: 48

Slide 240

Slide 240 text

Example using Rcpp Rcpp::sourceCpp('ExampleGaussianRcpp.cpp') • Note: Rcpp not necessary for any of this: • Feel free to skip over any line with Rcpp:: 48

Slide 241

Slide 241 text

Example using Rcpp Rcpp::sourceCpp('ExampleGaussianRcpp.cpp') • Note: Rcpp not necessary for any of this: • Feel free to skip over any line with Rcpp:: • Add #include <Eigen/Dense> and #include <unsupported/Eigen/AutoDiff> for plain vanilla Eigen 48

Slide 242

Slide 242 text

ExampleGaussianRcpp.cpp I

// [[Rcpp::plugins("cpp11")]]
#include <cmath>
// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <cstddef>

49

Slide 243

Slide 243 text

ExampleGaussianRcpp.cpp II

namespace model {
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;
}

50

Slide 244

Slide 244 text

ExampleGaussianRcpp.cpp III

// [[Rcpp::export]]
double l_t_cpp(double xt, const Eigen::Map<Eigen::VectorXd> parameters)
{
  const auto mu = parameters[model::parameter::mu];
  const auto s2 = parameters[model::parameter::s2];
  constexpr auto two_pi = boost::math::constants::two_pi<double>();
  using std::log; using std::pow;
  return -.5 * log(two_pi) - .5 * log(s2)
    - .5 * pow(xt - mu, 2) / s2;
}

51

Slide 245

Slide 245 text

AD Example using RcppEigen Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp') 52

Slide 246

Slide 246 text

ExampleGaussianRcppEigen.cpp I

// [[Rcpp::plugins("cpp11")]]
#include <cmath>
// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <unsupported/Eigen/AutoDiff>
#include <cstddef>

53

Slide 247

Slide 247 text

ExampleGaussianRcppEigen.cpp II

namespace model {
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;
template <typename ScalarType>
using parameter_vector = Eigen::Matrix<ScalarType, parameters_count, 1>;
}

54

Slide 248

Slide 248 text

ExampleGaussianRcppEigen.cpp III

// note: data `xt` -- double-precision number(s), just as before
// only the parameters adjusted to `ScalarType`
template <typename VectorType>
typename VectorType::Scalar
l_t_cpp_AD(double xt, const VectorType & parameters)
{
  using Scalar = typename VectorType::Scalar;
  const Scalar & mu = parameters[model::parameter::mu];
  const Scalar & s2 = parameters[model::parameter::s2];
  // note: `two_pi` is, as always, a double-precision constant
  constexpr auto two_pi = boost::math::constants::two_pi<double>();
  using std::log; using std::pow;
  return -.5 * log(two_pi) - .5 * log(s2)
    - .5 * pow(xt - mu, 2) / s2;
}

55

Slide 249

Slide 249 text

ExampleGaussianRcppEigen.cpp IV

// wrapper over `l_t_cpp_AD`
// (simply dispatching to the existing implementation)
//
// (Rcpp-exportable explicit template instantiations
// would render this unnecessary)
// [[Rcpp::export]]
double l_t_value_cpp(double xt,
  const Eigen::Map<Eigen::VectorXd> parameters)
{
  return l_t_cpp_AD(xt, parameters);
}

56

Slide 250

Slide 250 text

ExampleGaussianRcppEigen.cpp V

// objective function together with its gradient
// `xt`: input (data)
// `parameters_input`: input (model parameters)
// `gradient_output`: output (computed gradient)
// returns: computed objective function value
// [[Rcpp::export]]
double l_t_value_gradient_cpp
(
  double xt,
  const Eigen::Map<Eigen::VectorXd> parameters_input,
  Eigen::Map<Eigen::VectorXd> gradient_output
)
{
  using parameter_vector = model::parameter_vector<double>;
  using AD = Eigen::AutoDiffScalar<parameter_vector>;
  using VectorAD = model::parameter_vector<AD>;

57

Slide 251

Slide 251 text

ExampleGaussianRcppEigen.cpp VI

  parameter_vector parameters = parameters_input;
  VectorAD parameters_AD = parameters.cast<AD>();
  constexpr auto P = model::parameters_count;
  for (std::size_t p = 0; p != P; ++p)
  {
    // forward mode:
    // active differentiation variable `p`
    // (one parameter at a time)
    parameters_AD(p).derivatives() = parameter_vector::Unit(p);
  }
  const AD & loglikelihood_result = l_t_cpp_AD(xt, parameters_AD);
  gradient_output = loglikelihood_result.derivatives();
  return loglikelihood_result.value();
}

58
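For this Gaussian log-density the gradient is also available in closed form, which makes a handy cross-check for the AD output above (a sketch; the helper name is mine):

```cpp
#include <cassert>

// Analytic gradient of the Gaussian log-density
//   l(x; mu, s2) = -0.5*log(2*pi) - 0.5*log(s2) - 0.5*(x - mu)^2 / s2
// with respect to (mu, s2):
//   dl/dmu = (x - mu) / s2
//   dl/ds2 = -0.5 / s2 + 0.5 * (x - mu)^2 / (s2 * s2)
void l_t_gradient_analytic(double x, double mu, double s2, double g[2]) {
    const double e = x - mu;
    g[0] = e / s2;
    g[1] = -0.5 / s2 + 0.5 * e * e / (s2 * s2);
}
```

Comparing this against the `gradient_output` produced by `Eigen::AutoDiffScalar` is a quick sanity check that the forward-mode seeding is wired up correctly.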

Slide 252

Slide 252 text

Data Parallel Objective Function I

// First: make the data parallelism (DP) explicit
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  for (std::size_t t = 0; t != sample_size; ++t)
  {
    result[t] = l_t_cpp_AD(xs[t], parameters);
  }
  return result;
}

59

Slide 253

Slide 253 text

Data Parallel Objective Function II

// Second: Parallelize using OpenMP
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP_OMP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  #pragma omp parallel for default(none) shared(result)
  for (std::size_t t = 0; t < sample_size; ++t)
  {
    result[t] = l_t_cpp_AD(xs[t], parameters);
  }
  return result;
}

60

Slide 254

Slide 254 text

Data Parallel Objective Function Performance I

> require("microbenchmark")
> microbenchmark(
+ l_t(xs, fixed_parameters),
+ l_t_value_cppDP(xs, fixed_parameters),
+ l_t_value_cppDP_OMP(xs, fixed_parameters)
+ )
Unit: microseconds
expr median neval
l_t_value_cppDP(xs, fixed_parameters) 458.618 100
l_t_value_cppDP_OMP(xs, fixed_parameters) 213.526 100

Note: changing l_t_cpp_AD to l_t_cpp_AD_DP (analogously) can also help, but we’re not done, yet... 61

Slide 255

Slide 255 text

Data Parallel Gradient I

// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters_input,
  Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  for (std::size_t t = 0; t != sample_size; ++t)
  {
    using parameter_vector = model::parameter_vector<double>;
    using AD = Eigen::AutoDiffScalar<parameter_vector>;
    using VectorAD = model::parameter_vector<AD>;

62

Slide 256

Slide 256 text

Data Parallel Gradient II

    parameter_vector parameters = parameters_input;
    VectorAD parameters_AD = parameters.cast<AD>();
    constexpr auto P = model::parameters_count;
    for (std::size_t p = 0; p != P; ++p)
    {
      parameters_AD(p).derivatives() = parameter_vector::Unit(p);
    }
    const AD & loglikelihood_result = l_t_cpp_AD(xs[t], parameters_AD);
    gradient_output.row(t) = loglikelihood_result.derivatives();
    result[t] = loglikelihood_result.value();
  }
  return result;
}

63

Slide 257

Slide 257 text

Data Parallel Gradient III

// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP_OMP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters_input,
  Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  #pragma omp parallel for default(none) \
    shared(result, gradient_output)
  for (std::size_t t = 0; t < sample_size; ++t)
  {
    using parameter_vector = model::parameter_vector<double>;
    using AD = Eigen::AutoDiffScalar<parameter_vector>;
    using VectorAD = model::parameter_vector<AD>;

64

Slide 258

Slide 258 text

Data Parallel Gradient IV

    parameter_vector parameters = parameters_input;
    VectorAD parameters_AD = parameters.cast<AD>();
    constexpr auto P = model::parameters_count;
    for (std::size_t p = 0; p != P; ++p)
    {
      parameters_AD(p).derivatives() = parameter_vector::Unit(p);
    }
    const AD & loglikelihood_result = l_t_cpp_AD(xs[t], parameters_AD);
    gradient_output.row(t) = loglikelihood_result.derivatives();
    result[t] = loglikelihood_result.value();
  }
  return result;
}

65

Slide 259

Slide 259 text

Data Parallel Gradient Performance I

microbenchmark(
  l_t_value_gradient_cppDP(xs, fixed_parameters, gradient),
  l_t_value_gradient_cppDP_OMP(xs, fixed_parameters, gradient)
)
Unit: microseconds
expr median neval
l_t_value_gradient_cppDP 631.3375 100
l_t_value_gradient_cppDP_OMP 258.6945 100

66

Slide 260

Slide 260 text

Data Parallel Gradient Performance — Conclusions Worth noting: • speed-up over naive serial version thanks to OpenMP 67

Slide 261

Slide 261 text

Data Parallel Gradient Performance — Conclusions Worth noting: • speed-up over naive serial version thanks to OpenMP • we can do better: truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, reverse mode AD 67

Slide 262

Slide 262 text

Data Parallel Gradient Performance — Conclusions Worth noting: • speed-up over naive serial version thanks to OpenMP • we can do better: truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, reverse mode AD • left as an exercise for the audience! :-) 67

Slide 263

Slide 263 text

Caveats • Source code 68

Slide 264

Slide 264 text

Caveats • Source code • Recall: AD requires the access to the source code 68

Slide 265

Slide 265 text

Caveats • Source code • Recall: AD requires the access to the source code • We don’t always have it: Consider closed source software (proprietary, third party) 68

Slide 266

Slide 266 text

Caveats • Source code • Recall: AD requires the access to the source code • We don’t always have it: Consider closed source software (proprietary, third party) • Another challenge: Source code spanning multiple programming languages 68

Slide 267

Slide 267 text

Caveats • Source code • Recall: AD requires the access to the source code • We don’t always have it: Consider closed source software (proprietary, third party) • Another challenge: Source code spanning multiple programming languages • Reverse mode — memory requirements 68

Slide 268

Slide 268 text

Caveats • Source code • Recall: AD requires the access to the source code • We don’t always have it: Consider closed source software (proprietary, third party) • Another challenge: Source code spanning multiple programming languages • Reverse mode — memory requirements • Perverse (but valid) code: if (x == 1.0) then y = 0.0 else y = x − 1.0 68

Slide 269

Slide 269 text

Caveats • Source code • Recall: AD requires the access to the source code • We don’t always have it: Consider closed source software (proprietary, third party) • Another challenge: Source code spanning multiple programming languages • Reverse mode — memory requirements • Perverse (but valid) code: if (x == 1.0) then y = 0.0 else y = x − 1.0 • derivative at x = 1? 68
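The branch caveat can be demonstrated with a minimal dual number (an illustrative sketch, not a specific AD library): AD differentiates the branch actually taken, so at x = 1 it reports a derivative of 0 even though the function equals x − 1 everywhere and the true derivative is 1.

```cpp
#include <cassert>

// Minimal forward-mode dual number: value plus derivative.
struct Dual { double v, d; };
Dual operator-(Dual a, double c) { return {a.v - c, a.d}; }

// The "perverse (but valid)" function from the slide:
//   if (x == 1.0) y = 0.0; else y = x - 1.0;
// Mathematically y = x - 1 everywhere, but AD only sees one branch.
Dual f(Dual x) {
    if (x.v == 1.0) return {0.0, 0.0};  // constant branch: derivative 0
    return x - 1.0;                      // derivative 1
}
```

Away from x = 1 the derivative comes out correctly; exactly at x = 1 the constant branch wins and the AD derivative is 0.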

Slide 270

Slide 270 text

Resources

Slide 271

Slide 271 text

General http://www.autodiff.org/ http://en.wikipedia.org/wiki/Automatic_differentiation 70

Slide 272

Slide 272 text

Resources I A Gentle Introduction to Backpropagation http://numericinsight.blogspot.com/2014/07/a-gentle-introduction-to-backpropagation.html http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Ba A Multi-Core Benchmark Used to Improve Algorithmic Differentiation http://www.seanet.com/~bradbell/cppad_thread.htm A Multi-Core Algorithmic Differentiation Benchmarking System — Brad Bell https://vimeo.com/39008544 71

Slide 273

Slide 273 text

Resources II A Step by Step Backpropagation Example http://mattmazur.com/2015/03/17/a-step-by-step- backpropagation-example/ Adjoint Methods in Computational Finance http://www.hpcfinance.eu/sites/www.hpcfinance.eu/files/Uwe%20Nau http://www.nag.com/Market/seminars/Uwe_AD_Slides_July13.pdf Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance http://arxiv.org/abs/1107.1831v1 Algorithmic Differentiation in More Depth http://www.nag.com/pss/ad-in-more-depth 72

Slide 274

Slide 274 text

Resources III Algorithmic Differentiation of a GPU Accelerated Application http://www.nag.co.uk/Market/events/jdt-hpc-new-thinking-in-finance-presentation.pdf Automatic Differentiation and QuantLib https://quantlib.wordpress.com/tag/automatic-differentiation/ Automatic differentiation in deep learning by Shawn Tan https://cdn.rawgit.com/shawntan/presentations/master/Deep Learning-Copy1.slides.html Calculus on Computational Graphs: Backpropagation https://colah.github.io/posts/2015-08-Backprop/ 73

Slide 275

Slide 275 text

Resources IV Computing derivatives for nonlinear optimization: Forward mode automatic differentiation http://nbviewer.ipython.org/github/joehuchette/OR-software-tools-2015/blob/master/6-nonlinear-opt/Nonlinear-DualNumbers.ipynb Introduction to Automatic Differentiation http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/ Jarrett Revels: Automatic differentiation https://www.youtube.com/watch?v=PrXUl0sanro 74

Slide 276

Slide 276 text

Resources V Neural Networks (Part I) – Understanding the Mathematics behind backpropagation https://biasvariance.wordpress.com/2015/08/18/neural-networks-understanding-the-math-behind-backpropagation-part-i/ 75

Slide 277

Slide 277 text

Floating Point Numbers • http://www.johndcook.com/blog/2009/04/06/anatomy-of-a-floating-point-number/ Everything by Bruce Dawson: • https://randomascii.wordpress.com/category/floating-point/ In particular: • https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ • https://randomascii.wordpress.com/2012/04/05/floating-point-complexities/ • https://randomascii.wordpress.com/2013/01/03/top-eight-entertaining-blog-facts-for-2012/ 76

Slide 278

Slide 278 text

Software I https://en.wikipedia.org/wiki/Automatic_differentiation#Software ADNumber, Adept, ADOL-C, CppAD, Eigen (Auto Diff module) CasADi https://github.com/casadi/casadi/wiki CasADi is a symbolic framework for algorithmic (a.k.a. automatic) differentiation and numeric optimization. CppAD https://github.com/coin-or/CppAD/ 77

Slide 279

Slide 279 text

Software II Dali https://github.com/JonathanRaiman/Dali An automatic differentiation library that uses reverse-mode differentiation (backpropagation) to differentiate recurrent neural networks, or most mathematical expressions through control flow, while loops, recursion. Open Porous Media Automatic Differentiation Library https://github.com/OPM/opm-autodiff Utilities for automatic differentiation and simulators based on AD. The Stan Math Library (stan::math: includes a C++ reverse-mode automatic differentiation library) https://github.com/stan-dev/math 78

Slide 280

Slide 280 text

References

Slide 281

Slide 281 text

References I Bollerslev, Tim, and Robert Engle. 1993. “Common Persistence in Conditional Variances.” Econometrica 61 (1). The Econometric Society: pp. 167–86. http://www.jstor.org/stable/2951782. Giles, Michael, and Paul Glasserman. 2006. “Smoking Adjoints: Fast Monte Carlo Greeks.” Risk 19: 88–92. Griewank, Andreas. 2012. “Who Invented the Reverse Mode of Differentiation?” Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012): 389–400. Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Other Titles in Applied Mathematics 105. 80

Slide 282

Slide 282 text

References II Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/OT105.html. Homescu, C. 2011. “Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance.” ArXiv E-Prints, July. http://arxiv.org/abs/1107.1831. Naumann, Uwe. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. Computer software, Environments, and Tools 24. Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/SE24.html. Nocedal, Jorge, and Stephen J. Wright. 2006. Numerical Optimization. 2. ed. Springer: New York, NY. 81

Slide 283

Slide 283 text

Thank You! Questions? 82