Slide 1

Slide 1 text

Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen
Matt P. Dziubinski
useR! 2015 in Aalborg, Denmark
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)

Slide 2

Slide 2 text

Outline
Why?
Numerical Optimization
Calculating Derivatives
Example
Source Code
Resources
References

Slide 3

Slide 3 text

Why?

Slide 7

Slide 7 text

Motivation: Parametric Models
• Parametric model: parameter vector θ ∈ Θ ⊆ R^p, Θ — parameter space
• An estimator θ̂ is an extremum estimator if ∃Q : Θ → R s.t. θ̂ = argmax_{θ∈Θ} Q(θ)
• can be obtained as an extremum (maximum or minimum) of a scalar objective function
• class includes OLS (Ordinary Least Squares), NLS (Nonlinear Least Squares), GMM (Generalized Method of Moments), and QMLE (Quasi-Maximum Likelihood Estimation)

Slide 12

Slide 12 text

Motivation: Practice
• Nowadays one of the most widely applied approaches for estimation (econometrics, statistics), calibration (finance), and training ((supervised) machine learning)
• Example: (Q)MLE:
• Likelihood: L(θ; y) = f(y; θ), f — data PDF
• Maximum-likelihood estimator (MLE): θ̂_MLE = argmax_{θ∈Θ} L(θ; y)
• In practice: generally no closed-form expressions are available — estimation relies on numerical optimization

Slide 13

Slide 13 text

Numerical Optimization

Slide 18

Slide 18 text

Numerical Optimization: Algorithms
• Numerical optimization algorithms can be broadly divided into two categories:
• derivative-free — do not rely on knowledge of the objective function's gradient in the optimization process
• gradient-based — need the gradient of the objective function
• iteration form:
  initial iterate: θ_0
  new iterate: θ_k = θ_{k−1} + s_k, iteration k ≥ 1
  step: s_k = α_k p_k
  length: scalar α_k > 0
  direction: vector p_k s.t. H_k p_k = ∇Q(θ_k)
• H_k:
  H_k = I (Steepest Ascent) — many variants, including SGD (Stochastic Gradient Descent)
  H_k = ∇²Q(θ_k) (Newton)
  H_k = Hessian approximation satisfying the secant equation H_{k+1} s_k = ∇Q_{k+1} − ∇Q_k (Quasi-Newton)

Slide 23

Slide 23 text

Numerical Optimization: Gradient
• Gradient-based algorithms often exhibit superior convergence rates (superlinear or even quadratic)
• However: this property depends on the gradient either being exact — or at minimum sufficiently accurate in the limit sense, Nocedal and Wright (2006, Chapter 3)
• For many modern models analytical gradient formulas are not available in closed form — or are non-trivial to derive
• This often leads to the use of numerical approximations — in particular, finite difference methods (FDMs), Nocedal and Wright (2006, Chapter 8)
• Inference: covariance matrix estimators also involve the use of derivative information

Slide 33

Slide 33 text

Numerical Optimization: Algorithmic Differentiation
• Need a way to obtain the gradient — preferably: fast and accurate
• This is where Algorithmic Differentiation (AD) comes in:
• fast
• and accurate
• and simple
• Essentially:
• an automatic computational differentiation method
• based on the systematic application of the chain rule
• exact to the maximum extent allowed by the machine precision
• How does it compare to other computational differentiation methods?

Slide 34

Slide 34 text

Calculating Derivatives

Slide 38

Slide 38 text

Calculating Derivatives
• Main approaches:
• Finite Differencing
• Algorithmic Differentiation (AD), a.k.a. Automatic Differentiation
• Symbolic Differentiation

Slide 44

Slide 44 text

Finite Differencing
• Recall f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
• forward-difference approximation: ∂f/∂x_i(x) ≈ [f(x + h e_i) − f(x)] / h
• central-difference approximation: ∂f/∂x_i(x) ≈ [f(x + h e_i) − f(x − h e_i)] / (2h)
• Computational cost:
• forward-difference formula: p + 1 function evaluations
• central-difference formula: 2p function evaluations

Slide 50

Slide 50 text

Finite Differencing — Truncation Error
• Truncation Error Analysis — Taylor's theorem
• Consider a univariate function f and the forward-difference formula: f′(x) ≈ [f(x + h) − f(x)] / h
• Recall Taylor's formula: f(x + h) = f(x) + h f′(x) + (1/2) h² f′′(x) + (1/6) h³ f′′′(x) + O(h⁴)
• Hence f′(x) = [f(x + h) − f(x)] / h − (1/2) h f′′(x) − (1/6) h² f′′′(x) − O(h³)
• Truncation error: c₁ h + c₂ h² + ...
• Decreases with smaller h

Slide 58

Slide 58 text

Floating Point Arithmetic
• Floating Point Arithmetic: NO associativity (or distributivity) of addition (and multiplication)
• Example:
• Input:
  a = 1234.567; b = 45.678; c = 0.0004
  a + b + c
  c + b + a
  (a + b + c) - (c + b + a)
• Output:
• 1280.2454
• 1280.2454
• -2.273737e-13
• "Many a serious mathematician has attempted to give rigorous analyses of a sequence of floating point operations, but has found the task to be so formidable that he has tried to content himself with plausibility arguments instead." — Donald E. Knuth, TAOCP2

Slide 66

Slide 66 text

Floating Point Representation
• Floating point representation: inspired by scientific notation
• The IEEE Standard for Floating-Point Arithmetic (IEEE 754)
• Form: significant digits × base^exponent
• Machine epsilon, ϵ_M := inf{ϵ > 0 : 1.0 + ϵ ≠ 1.0}
• the difference between 1.0 and the next representable floating-point value 1.0 + ϵ
• IEEE 754 binary32 (single precision): 2⁻²³ ≈ 1.19209e-07 (1 sign bit, 8 exponent bits, leaving 23 bits for the significand)
• IEEE 754 binary64 (double precision): 2⁻⁵² ≈ 2.22045e-16 (1 sign bit, 11 exponent bits, leaving 52 bits for the significand)
• floating point number representation: fl(x) = x(1 + δ), where |δ| < ϵ_M

Slide 78

Slide 78 text

Floating Point Representation Properties
• floating point number representation: fl(x) = x(1 + δ), where |δ| < ϵ_M
• representation (round-off) error: x − fl(x)
• absolute round-off error: |x − fl(x)| < ϵ_M |x|
• relative round-off error: |x − fl(x)| / |x| < ϵ_M
• ϵ_M a.k.a. unit round-off, u = ULP(1.0)
• ULP — Unit in the Last Place
• the distance between the two closest straddling floating-point numbers a and b (i.e., those with a ≤ x ≤ b and a ≠ b)
• a measure of precision in numeric calculations. In base b, if x has exponent E, then ULP(x) = ϵ_M · b^E

Slide 86

Slide 86 text

Finite Differencing — Round-off Error I
• Consider f′(x) ≈ [f(x + h) − f(x)] / h
• Suppose that x and x + h are exact
• Round-off errors when evaluating f:
  [f(x + h)(1 + δ₁) − f(x)(1 + δ₂)] / h = [f(x + h) − f(x)] / h + [δ₁ f(x + h) − δ₂ f(x)] / h
• We have |δᵢ| ≤ ϵ_M for i ∈ {1, 2}
• ⟹ round-off error bound: ϵ_M (|f(x + h)| + |f(x)|) / h
• ⟹ round-off error bound order: O(ϵ_M / h)
• round-off error — increases with smaller h
• error contribution inversely proportional to h, with proportionality constant O(ϵ_M)

Slide 91

Slide 91 text

Finite Differencing — Round-off Error II
• ϵ_M ≠ 0 ⟹ ∃ĥ : x + ĥ = x
• i.e., x and x + h are not guaranteed to be exactly representable, either
• ⟹ f(x) = f(x + ĥ)
• ⟹ [f(x + ĥ) − f(x)] / ĥ = 0
• ⟹ no correct digits in the result at (or below) step-size ĥ

Slide 104

Slide 104 text

Finite Differencing — Truncation—Round-off Trade-off
• Total error: O(h) + O(ϵ_M / h) (forward-difference formula)
• Best accuracy when the contributions are approximately equal
• h ≈ ϵ_M / h
• h² ≈ ϵ_M
• h ≈ √ϵ_M
• binary64: 1.490116e-08
• total error's minimum value ∈ O(√ϵ_M)
• binary64: 1.490116e-08
• Total error: O(h²) + O(ϵ_M / h) (central-difference formula)
• h ≈ ϵ_M^(1/3)
• binary64: 6.055454e-06
• total error's minimum value ∈ O(ϵ_M^(2/3))
• binary64: 3.666853e-11

Slide 105

Slide 105 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: 20

Slide 106

Slide 106 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f ′(x) = [f (x + h) − f (x − h)]/(2h) − O(h2) 20

Slide 107

Slide 107 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f ′(x) = [f (x + h) − f (x − h)]/(2h) − O(h2) • Truncation error ∈ O(h2) (second-order approximation) 20

Slide 108

Slide 108 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f ′(x) = [f (x + h) − f (x − h)]/(2h) − O(h2) • Truncation error ∈ O(h2) (second-order approximation) • Round-off error ∈ O(ϵM/h) 20

Slide 109

Slide 109 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f ′(x) = [f (x + h) − f (x − h)]/(2h) − O(h2) • Truncation error ∈ O(h2) (second-order approximation) • Round-off error ∈ O(ϵM/h) • Balanced trade-off: 20

Slide 110

Slide 110 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f ′(x) = [f (x + h) − f (x − h)]/(2h) − O(h2) • Truncation error ∈ O(h2) (second-order approximation) • Round-off error ∈ O(ϵM/h) • Balanced trade-off: • h ≈ ϵM 1/3 20

Slide 111

Slide 111 text

Finite Differencing — Central-Difference Formula • Subtracting the backward-difference approximation from the forward-difference approximation: • (central-difference) f ′(x) = [f (x + h) − f (x − h)]/(2h) − O(h2) • Truncation error ∈ O(h2) (second-order approximation) • Round-off error ∈ O(ϵM/h) • Balanced trade-off: • h ≈ ϵM 1/3 • total error's minimum value ∈ O(ϵM 2/3) 20

Slide 112

Slide 112 text

Finite Difference Approximation Error Example: f (x) = cos(sin(x) ∗ cos(x)) f ′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) 21

Slide 113

Slide 113 text

Finite Difference Approximation Error Example: f (x) = cos(sin(x) ∗ cos(x)) f ′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 21

Slide 114

Slide 114 text

Finite Difference Approximation Error Example: f (x) = cos(sin(x) ∗ cos(x)) f ′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error's minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 21

Slide 115

Slide 115 text

Finite Difference Approximation Error Example: f (x) = cos(sin(x) ∗ cos(x)) f ′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error's minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 • Total error: O(h2) + O(ϵM/h) (central-difference formula) 21

Slide 116

Slide 116 text

Finite Difference Approximation Error Example: f (x) = cos(sin(x) ∗ cos(x)) f ′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error's minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 • Total error: O(h2) + O(ϵM/h) (central-difference formula) • h ≈ ϵM 1/3 — binary64: 6.055454e-06 21

Slide 117

Slide 117 text

Finite Difference Approximation Error Example: f (x) = cos(sin(x) ∗ cos(x)) f ′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x)) • Total error: O(h) + O(ϵM/h) (forward-difference formula) • h ≈ √ϵM — binary64: 1.490116e-08 • total error's minimum value ∈ O(√ϵM) — binary64: 1.490116e-08 • Total error: O(h2) + O(ϵM/h) (central-difference formula) • h ≈ ϵM 1/3 — binary64: 6.055454e-06 • total error's minimum value ∈ O(ϵM 2/3) — binary64: 3.666853e-11 21

Slide 118

Slide 118 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient 22

Slide 119

Slide 119 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. 22

Slide 120

Slide 120 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. • Disadvantages: 22

Slide 121

Slide 121 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. • Disadvantages: • narrow applicability 22

Slide 122

Slide 122 text

Symbolic Differentiation • Algebraic specification for f manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient • Commonly used symbolic manipulation tools can be found in the packages Mathematica (also Wolfram Alpha), Macsyma, Maple, or Mathics. • Disadvantages: • narrow applicability • expression swell 22

Slide 123

Slide 123 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions 23

Slide 124

Slide 124 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression 23

Slide 125

Slide 125 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program 23

Slide 126

Slide 126 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? 23

Slide 127

Slide 127 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates 23

Slide 128

Slide 128 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches 23

Slide 129

Slide 129 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches • joint allocations / mutation / overwriting 23

Slide 130

Slide 130 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches • joint allocations / mutation / overwriting • iterative loops 23

Slide 131

Slide 131 text

Numerical Objective Function & Algorithmic Differentiation I • Symbolic differentiation (SD) applicable to "paper-and-pencil" algebraic expressions • Input: function specified by an algebraic formula or closed-form expression • Objective function in numerical optimization — a computer program • How are computer programs different? • named intermediates • program branches • joint allocations / mutation / overwriting • iterative loops • Do we have to resort to finite differences? 23

Slide 132

Slide 132 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. 24

Slide 133

Slide 133 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. • AD — a computational differentiation method with roots in the late 1960s, Griewank (2012) 24

Slide 134

Slide 134 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. • AD — a computational differentiation method with roots in the late 1960s, Griewank (2012) • Machine Learning — Neural Networks — Backpropagation: Bryson, Denham, Dreyfus (1963), Werbos (1974) 24

Slide 135

Slide 135 text

Numerical Objective Function & Algorithmic Differentiation II • Algorithmic Differentiation (AD) offers another solution. • AD — a computational differentiation method with roots in the late 1960s, Griewank (2012) • Machine Learning — Neural Networks — Backpropagation: Bryson, Denham, Dreyfus (1963), Werbos (1974) • AD widely known and applied in natural sciences — Naumann (2012), and computational finance — Giles and Glasserman (2006), Homescu (2011) 24

Slide 136

Slide 136 text

Algorithmic Differentiation — Idea I • Idea: 25

Slide 137

Slide 137 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down 25

Slide 138

Slide 138 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations 25

Slide 139

Slide 139 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied 25

Slide 140

Slide 140 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ 25

Slide 141

Slide 141 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program 25

Slide 142

Slide 142 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program • operating at the source code level 25

Slide 143

Slide 143 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program • operating at the source code level • performed automatically 25

Slide 144

Slide 144 text

Algorithmic Differentiation — Idea I • Idea: • computer code for evaluating the function can be broken down • into a composition of elementary arithmetic operations • to which the chain rule (one of the basic rules of calculus) can be applied • e.g., given an objective function Q(θ) = f (g(θ)), we have Q′ = (f ◦ g)′ = (f ′ ◦ g) · g′ • Relies on the systematic application of the chain rule by a computer program • operating at the source code level • performed automatically • with minimal or no user intervention 25

Slide 145

Slide 145 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly 26

Slide 146

Slide 146 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly • to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error 26

Slide 147

Slide 147 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly • to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error • truncation error: none 26

Slide 148

Slide 148 text

Algorithmic Differentiation — Idea II • Mathematical operations performed exactly • to the maximum extent allowed by the machine precision — i.e., accurate to within round-off error • truncation error: none • round-off error: equivalent to using source code that analogously implements the analytical derivatives. 26

Slide 149

Slide 149 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation 27

Slide 150

Slide 150 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: 27

Slide 151

Slide 151 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions 27

Slide 152

Slide 152 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006, Section 8.2). 27

Slide 153

Slide 153 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006, Section 8.2). • Note: less than required by SD, more than required by finite differences (which can operate on black box implementations) 27

Slide 154

Slide 154 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006, Section 8.2). • Note: less than required by SD, more than required by finite differences (which can operate on black box implementations) • AD Input: (source code of) a computer program for evaluating a vector function 27

Slide 155

Slide 155 text

Algorithmic Differentiation — Idea III • AD similar to symbolic differentiation • major difference: • SD restricted to symbolic manipulation of algebraic expressions • AD operates on (and integrates with) the existing source code of a computer program, Nocedal and Wright (2006, Section 8.2). • Note: less than required by SD, more than required by finite differences (which can operate on black box implementations) • AD Input: (source code of) a computer program for evaluating a vector function • AD Output: (source code of) a computer program implementing derivative(s) thereof 27

Slide 156

Slide 156 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): 28

Slide 157

Slide 157 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) 28

Slide 158

Slide 158 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values 28

Slide 159

Slide 159 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading 28

Slide 160

Slide 160 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs 28

Slide 161

Slide 161 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs • Two modes: forward mode and reverse mode 28

Slide 162

Slide 162 text

Algorithmic Differentiation — Implementations • Two implementation styles, Naumann (2012): • source code transformation (SCT) • e.g., ADIFOR — produce new code that calculates both function and derivative values • operator overloading • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs • Two modes: forward mode and reverse mode • Example: Eigen's Auto Diff module 28

Slide 163

Slide 163 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation 29

Slide 164

Slide 164 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample 29

Slide 165

Slide 165 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample • we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ2), with µ ∈ R, and σ ∈ R+ 29

Slide 166

Slide 166 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample • we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ2), with µ ∈ R, and σ ∈ R+ • our parameter vector is θ = (µ, σ2) 29

Slide 167

Slide 167 text

Algorithmic Differentiation — Example MLE • a more concrete illustration: example in the context of statistical estimation • evaluation and differentiation of the loglikelihood function of a normally distributed sample • we have T > 1 observations x1, x2, ..., xT i.i.d. ∼ N(µ, σ2), with µ ∈ R, and σ ∈ R+ • our parameter vector is θ = (µ, σ2) • the likelihood function is L(θ; x) = f (x; θ) = (σ√2π)^(−T) exp[−(1/2) ∑T t=1 (xt − µ)2/σ2] 29

Slide 168

Slide 168 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = logL(θ; x) = ∑T t=1 ℓt(θ) 30

Slide 169

Slide 169 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑T t=1 ℓt(θ) • a contribution given by ℓt(θ) = −(1/2) log(2π) − (1/2) log(σ2) − (1/2)(xt − µ)2/σ2. 30

Slide 170

Slide 170 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑T t=1 ℓt(θ) • a contribution given by ℓt(θ) = −(1/2) log(2π) − (1/2) log(σ2) − (1/2)(xt − µ)2/σ2. • goal: compute the value of ℓt corresponding to the arguments xt = 6, µ = 6, and σ2 = 4. 30

Slide 171

Slide 171 text

Algorithmic Differentiation — Evaluation Trace • objective function: the loglikelihood function, ℓ(θ) = log L(θ; x) = ∑T t=1 ℓt(θ) • a contribution given by ℓt(θ) = −(1/2) log(2π) − (1/2) log(σ2) − (1/2)(xt − µ)2/σ2. • goal: compute the value of ℓt corresponding to the arguments xt = 6, µ = 6, and σ2 = 4. • A computer program may execute the sequence of operations represented by an evaluation trace, Griewank and Walther (2008) 30

Slide 172

Slide 172 text

Algorithmic Differentiation — Evaluation Trace
Variable — Operation — Expression — Value
v−1 = µ = 6
v0 = σ2 = 4
v1 = ϕ1(v0) = log(v0) = 1.386294
v2 = ϕ2(v1) = −(1/2) log(2π) − (1/2) v1 = −1.612086
v3 = ϕ3(v−1) = xt − v−1 = 0
v4 = ϕ4(v3) = v3^2 = 0
v5 = ϕ5(v0, v4) = v4/v0 = 0
v6 = ϕ6(v5) = −(1/2) v5 = 0
v7 = ϕ7(v2, v6) = v2 + v6 = −1.612086
ℓt = v7 = −1.612086
31

Slide 173

Slide 173 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down 32

Slide 174

Slide 174 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} 32

Slide 175

Slide 175 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives 32

Slide 176

Slide 176 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation 32

Slide 177

Slide 177 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation • We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7 32

Slide 178

Slide 178 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation • We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7 • Recall: Two modes of propagating the intermediate information: the forward mode and the reverse mode 32

Slide 179

Slide 179 text

Algorithmic Differentiation — Evaluation Trace • evaluation of ℓt broken down • into a composition of intermediate subexpressions vi , with i ∈ {1, . . . , 7} • intermediate subexpressions: elementary operations and functions ϕi with known, predefined derivatives • note: similar to the static single assignment form (SSA) used in compiler design and implementation • We can now apply the chain rule to the intermediate operations vi and collect the intermediate derivative results in order to obtain the derivative of the final expression ℓt = v7 • Recall: Two modes of propagating the intermediate information: the forward mode and the reverse mode • forward mode: Eigen's Auto Diff module used here 32

Slide 180

Slide 180 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) 33

Slide 181

Slide 181 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. 33

Slide 182

Slide 182 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. • for each variable vi : introduce another variable, ˙vi = ∂vi/∂µ 33

Slide 183

Slide 183 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. • for each variable vi : introduce another variable, ˙vi = ∂vi/∂µ • then: apply the chain rule mechanically to each intermediate subexpression 33

Slide 184

Slide 184 text

Algorithmic Differentiation — Forward Mode • goal: the derivative of the output variable ℓt (dependent variable) with respect to the input variable µ (independent variable) • idea: compute the numerical value of the derivative of each of the intermediate subexpressions, keeping track of the resulting derivative values along the way. • for each variable vi : introduce another variable, ˙vi = ∂vi/∂µ • then: apply the chain rule mechanically to each intermediate subexpression • assigning the resulting numerical value to each ˙vi. 33

Slide 185

Slide 185 text

Algorithmic Differentiation — Forward Mode • in our case 34

Slide 186

Slide 186 text

Algorithmic Differentiation — Forward Mode • in our case • start with ˙v−1 = 1.0 (active differentiation variable µ), 34

Slide 187

Slide 187 text

Algorithmic Differentiation — Forward Mode • in our case • start with ˙v−1 = 1.0 (active differentiation variable µ), • ˙v0 = 0.0 (passive: we are not differentiating with respect to σ2 and thus treat it as a constant) 34

Slide 188

Slide 188 text

Algorithmic Differentiation — Forward Mode • in our case • start with ˙v−1 = 1.0 (active differentiation variable µ), • ˙v0 = 0.0 (passive: we are not differentiating with respect to σ2 and thus treat it as a constant) • then, given that v1 = ϕ1(v0) = log(v0), we obtain ˙v1 = (∂ϕ1(v0)/∂v0) ˙v0 = ˙v0/v0 = 0.0 34

Slide 189

Slide 189 text

Algorithmic Differentiation — Forward Mode • in our case • start with ˙v−1 = 1.0 (active differentiation variable µ), • ˙v0 = 0.0 (passive: we are not differentiating with respect to σ2 and thus treat it as a constant) • then, given that v1 = ϕ1(v0) = log(v0), we obtain ˙v1 = (∂ϕ1(v0)/∂v0) ˙v0 = ˙v0/v0 = 0.0 • analogously, v2 = ϕ2(v1) = −(1/2) log(2π) − (1/2) v1, hence ˙v2 = (∂ϕ2(v1)/∂v1) ˙v1 = −(1/2) ˙v1 = 0.0 34

Slide 190

Slide 190 text

Algorithmic Differentiation — Forward Mode • in our case • start with ˙v−1 = 1.0 (active differentiation variable µ), • ˙v0 = 0.0 (passive: we are not differentiating with respect to σ2 and thus treat it as a constant) • then, given that v1 = ϕ1(v0) = log(v0), we obtain ˙v1 = (∂ϕ1(v0)/∂v0) ˙v0 = ˙v0/v0 = 0.0 • analogously, v2 = ϕ2(v1) = −(1/2) log(2π) − (1/2) v1, hence ˙v2 = (∂ϕ2(v1)/∂v1) ˙v1 = −(1/2) ˙v1 = 0.0 • by continuing this procedure we ultimately obtain an augmented, forward-derived evaluation trace 34

Slide 191

Slide 191 text

Algorithmic Differentiation — Augmented Evaluation Trace
Variable — Operation — Expression — Value
v−1 = µ = 6
˙v−1 = ∂v−1/∂v−1 = 1
v0 = σ2 = 4
˙v0 = ∂v0/∂v−1 = 0
v1 = ϕ1(v0) = log(v0) = 1.386294
˙v1 = (∂ϕ1/∂v0) ˙v0 = ˙v0/v0 = 0
. . .
v7 = ϕ7(v2, v6) = v2 + v6 = −1.612086
˙v7 = (∂ϕ7/∂v2) ˙v2 + (∂ϕ7/∂v6) ˙v6 = ˙v2 + ˙v6 = 0
ℓt = v7 = −1.612086
˙ℓt = ˙v7 = 0
35

Slide 192

Slide 192 text

Algorithmic Differentiation — Augmented Evaluation Trace • The basic forward mode of AD — called “forward” because the derivative values ˙vi are carried along simultaneously with the values vi themselves 36

Slide 193

Slide 193 text

Algorithmic Differentiation — Augmented Evaluation Trace • The basic forward mode of AD — called “forward” because the derivative values ˙vi are carried along simultaneously with the values vi themselves • The problem addressed by the implementations: 36

Slide 194

Slide 194 text

Algorithmic Differentiation — Augmented Evaluation Trace • The basic forward mode of AD — called “forward” because the derivative values ˙vi are carried along simultaneously with the values vi themselves • The problem addressed by the implementations: • "transform a program with a particular evaluation trace into an augmented program whose evaluation trace also contains exactly the same extra variables and additional lines as in the derived evaluation trace," Griewank and Walther (2008). 36

Slide 195

Slide 195 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. 37

Slide 196

Slide 196 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. • Griewank and Walther (2008): "Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables." 37

Slide 197

Slide 197 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. • Griewank and Walther (2008): "Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables." • This talk: we shall not treat it in any further detail and refer anyone interested to Griewank and Walther (2008). 37

Slide 198

Slide 198 text

AD — Reverse Mode • An alternative to the forward approach — the reverse or adjoint mode. • Griewank and Walther (2008): "Rather than choosing an input variable and calculating the sensitivity of every intermediate variable with respect to that input, we choose instead an output variable and calculate the sensitivity of that output with respect to each of the intermediate variables. As it turns out adjoint sensitivities must naturally be computed backward, i.e., starting from the output variables." • This talk: we shall not treat it in any further detail and refer anyone interested to Griewank and Walther (2008). • Worth noting: the choice between the modes depends on the computational cost. 37

Slide 199

Slide 199 text

AD — Forward Mode vs. Reverse Mode • In general, for f : Rn → Rm 38

Slide 200

Slide 200 text

AD — Forward Mode vs. Reverse Mode • In general, for f : Rn → Rm • rough CostAD,Forward(f) ∈ n O(Cost(f)) 38

Slide 201

Slide 201 text

AD — Forward Mode vs. Reverse Mode • In general, for f : Rn → Rm • rough CostAD,Forward(f) ∈ n O(Cost(f)) • rough CostAD,Reverse(f) ∈ m O(Cost(f)) 38

Slide 202

Slide 202 text

AD — Forward Mode vs. Reverse Mode • In general, for f : Rn → Rm • rough CostAD,Forward(f) ∈ n O(Cost(f)) • rough CostAD,Reverse(f) ∈ m O(Cost(f)) • Which is better? Depends: e.g., gradient vs. Jacobian 38

Slide 203

Slide 203 text

AD — Forward Mode vs. Reverse Mode • In general, for f : Rn → Rm • rough CostAD,Forward(f) ∈ n O(Cost(f)) • rough CostAD,Reverse(f) ∈ m O(Cost(f)) • Which is better? Depends: e.g., gradient vs. Jacobian • Forward mode performance: can establish a worst-case performance bound for AD when used for gradient computation 38
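Forward mode is just arithmetic on (value, tangent) pairs — dual numbers. The sketch below is not from the talk (the function f and all names are illustrative): each evaluation propagates one seeded tangent direction, so a full gradient of f : Rn → R takes n passes, matching the rough n·O(Cost(f)) bound above.

```cpp
#include <cassert>
#include <cmath>

// Minimal forward-mode AD: a dual number carries a value and the
// derivative along one seeded input direction.
struct Dual {
    double v;  // value
    double d;  // tangent
};

Dual operator-(Dual a, Dual b) { return { a.v - b.v, a.d - b.d }; }
Dual operator*(Dual a, Dual b) { return { a.v * b.v, a.d * b.v + a.v * b.d }; }
Dual log(Dual a) { return { std::log(a.v), a.d / a.v }; }  // chain rule

// example function f(x, y) = log(x) * (x - y);
// seed one input per pass: {value, 1} for the active input, {value, 0} otherwise
Dual f(Dual x, Dual y) { return log(x) * (x - y); }
```

Two passes recover the gradient: seeding x gives ∂f/∂x = (x − y)/x + log(x); seeding y gives ∂f/∂y = −log(x).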

Slide 204

Slide 204 text

Illustration: Simulation Study — Setup The Model: GARCH(1, 1) The spot asset price S follows the following process under the physical probability measure P: log(S_{t+1}/S_t) ≡ r_{t+1} = µ + ϵ_{t+1} (1), ϵ_{t+1} = √v_{t+1} w_{t+1} (2), v_{t+1} = ω + a ϵ_t² + b v_t (3), w ∼_P GWN(0, 1) (4) 39
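The recursion (1)–(4) is straightforward to simulate. The sketch below is not from the talk — the function name, seed, and the choice to start the recursion at the unconditional variance are all illustrative assumptions; the parameter values match the DGP used later in the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Simulate n returns from the GARCH(1, 1) recursion under P:
//   v_{t+1} = omega + a eps_t^2 + b v_t
//   eps_{t+1} = sqrt(v_{t+1}) w_{t+1},  w ~ GWN(0, 1)
//   r_{t+1} = mu + eps_{t+1}
std::vector<double> simulate_garch(std::size_t n, double mu, double omega,
                                   double a, double b, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> w(0.0, 1.0);  // Gaussian white noise
    double v = omega / (1.0 - a - b);  // assumption: start at unconditional variance
    double eps = 0.0;
    std::vector<double> r(n);
    for (std::size_t t = 0; t != n; ++t) {
        v = omega + a * eps * eps + b * v;
        eps = std::sqrt(v) * w(gen);
        r[t] = mu + eps;
    }
    return r;
}
```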

Slide 205

Slide 205 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ 40

Slide 206

Slide 206 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price 40

Slide 207

Slide 207 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r 40

Slide 208

Slide 208 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r 40

Slide 209

Slide 209 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 40

Slide 210

Slide 210 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise 40

Slide 211

Slide 211 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r) 40

Slide 212

Slide 212 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate 40

Slide 213

Slide 213 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv 40

Slide 214

Slide 214 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks 40

Slide 215

Slide 215 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks • pv = 1 — integrated GARCH (IGARCH) process 40

Slide 216

Slide 216 text

Illustration: Simulation Study — Setup — Notation • θ = (µ, ω, a, b)⊤ • r — logarithmic return of the spot price • µ — conditional mean of r • v — conditional variance of r • non-negativity: ω, a, b > 0 • w — randomness source (innovation): standard Gaussian white noise • mv = ω/(1 − a − b) — the unconditional mean of v (unconditional variance of r) • pv = (a + b) — the mean-reversion rate or the persistence rate • when pv < 1, weak stationarity, the conditional variance will revert to the unconditional variance at a geometric rate of pv • smaller pv — less persistent sensitivity of the volatility expectation to past market shocks • pv = 1 — integrated GARCH (IGARCH) process • common finding: pv close to 1 in financial data, "near-integrated" process, Bollerslev and Engle (1993). 40
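For the DGP used in the next slides (ω = 2e−6, a = 0.10, b = 0.85) the moments above are easy to check numerically. This snippet is not from the talk; the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// p_v = a + b: persistence rate (p_v < 1 required for weak stationarity)
double persistence(double a, double b) { return a + b; }

// m_v = omega / (1 - a - b): unconditional variance of r
double unconditional_variance(double omega, double a, double b) {
    return omega / (1.0 - a - b);
}
```

With the DGP values this gives pv = 0.95 (near-integrated) and mv = 4e−5.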

Slide 217

Slide 217 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85 41

Slide 218

Slide 218 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) 41

Slide 219

Slide 219 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) • N = 2,000 ... 16,000 (sample size) 41

Slide 220

Slide 220 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) • N = 2,000 ... 16,000 (sample size) • 2,000 observations / 252 business days — roughly 8 years of data 41

Slide 221

Slide 221 text

Illustration: Monte Carlo Simulation • DGP (data-generating process): near-integrated scenario, θDGP : µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85 • M = 10,000 (MC estimation experiments / replications count) • N = 2,000 ... 16,000 (sample size) • 2,000 observations / 252 business days — roughly 8 years of data • S&P 500 from January 3rd, 1950 — over 16,000 observations 41

Slide 222

Slide 222 text

Illustration: Reliability — AD, L-BFGS 42

Slide 223

Slide 223 text

Illustration: Reliability — FD, TNR 43

Slide 224

Slide 224 text

Illustration: Findings — Reliability • Reliability of estimation: 44

Slide 225

Slide 225 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments 44

Slide 226

Slide 226 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio 44

Slide 227

Slide 227 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: 44

Slide 228

Slide 228 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: • L-BFGS(AD): average = 0.811, median: 0.99999903, 44

Slide 229

Slide 229 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: • L-BFGS(AD): average = 0.811, median: 0.99999903, • L-BFGS(FD): average = 0.446, median: 0.02893254 (sic), 44

Slide 230

Slide 230 text

Illustration: Findings — Reliability • Reliability of estimation: • using AD: successful convergence in the vast majority of Monte Carlo experiments • using FD: very high convergence failure ratio • Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs are as follows: • L-BFGS(AD): average = 0.811, median: 0.99999903, • L-BFGS(FD): average = 0.446, median: 0.02893254 (sic), • Truncated Newton with restarting (TNR) (FD): average = 0.708, median: 0.99856261. 44

Slide 231

Slide 231 text

Illustration: Findings • Overall: When using the fastest successful algorithm-gradient pair for both AD and FD: 45

Slide 232

Slide 232 text

Illustration: Findings • Overall: When using the fastest successful algorithm-gradient pair for both AD and FD: • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD) 45

Slide 233

Slide 233 text

Illustration: Findings • Overall: When using the fastest successful algorithm-gradient pair for both AD and FD: • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD) • while achieving 4 times higher accuracy and a 1.1× speedup compared to TNR(FD). 45

Slide 234

Slide 234 text

Example Source Code

Slide 235

Slide 235 text

Example in R I # Gaussian model # 2 parameters: mu, sigma^2 # # parameters = vector(mode = "numeric", length = 2L) # names(parameters) = c("mu", "sigma^2") example_parameters = function() { parameters = vector(mode = "numeric", length = 2L) names(parameters) = c("mu", "sigma^2") parameters["mu"] = 6 parameters["sigma^2"] = 4 parameters } 47

Slide 236

Slide 236 text

Example in R II # `l_t`: objective function contribution l_t = function(xt, parameters) { mu = parameters["mu"] s2 = parameters["sigma^2"] -.5 * log(2. * pi) - .5 * log(s2) - .5 * (xt - mu)^2 / s2 } 48

Slide 237

Slide 237 text

Example in R III # evaluation trace of `l_t` evaluation_trace = function(xt, parameters) { mu = parameters["mu"] s2 = parameters["sigma^2"] v_1 = mu; v0 = s2 v1 = log(v0) v2 = -.5 * log(2. * pi) - .5 * v1 v3 = xt - v_1 v4 = v3^2 v5 = v4 / v0 v6 = -.5 * v5 v7 = v2 + v6 v7 } 49

Slide 238

Slide 238 text

Example in R IV require("testthat") describe("evaluation_trace", { it("matches the objective function", { parameters = example_parameters() objective_given = function(xt) l_t(xt, parameters) trace_given = function(xt) evaluation_trace(xt, parameters) expect_equivalent(expected = objective_given(3), object = trace_given(3)) expect_equivalent(expected = objective_given(6), object = trace_given(6)) expect_equivalent(expected = objective_given(9), object = trace_given(9)) }) }) 50

Slide 239

Slide 239 text

Example in R V # analytical derivative w.r.t. `mu` dl_t_dmu = function(xt, parameters) { mu = parameters["mu"] s2 = parameters["sigma^2"] (xt - mu) / s2 } # analytical derivative w.r.t. `sigma^2` dl_t_ds2 = function(xt, parameters) { mu = parameters["mu"] s2 = parameters["sigma^2"] (xt - mu)^2 / (2 * s2^2) - 1 / (2 * s2) } 51

Slide 240

Slide 240 text

Example in R VI # forward-difference approximation of derivative w.r.t. `mu` dl_t_dmu_FD = function(xt, parameters, h) { mu = parameters["mu"] s2 = parameters["sigma^2"] parameters_FD = parameters parameters_FD["mu"] = mu + h (l_t(xt, parameters_FD) - l_t(xt, parameters)) / h } 52

Slide 241

Slide 241 text

Example in R VII # central-difference approximation of derivative w.r.t. `mu` dl_t_dmu_CD = function(xt, parameters, h) { mu = parameters["mu"] s2 = parameters["sigma^2"] parameters_CD_forward = parameters parameters_CD_forward["mu"] = mu + h parameters_CD_backward = parameters parameters_CD_backward["mu"] = mu - h (l_t(xt, parameters_CD_forward) - l_t(xt, parameters_CD_backward)) / (2. * h) } 53

Slide 242

Slide 242 text

Example in R VIII # step too large xs = c(3, 6, 9) h = 1e-1 expect_equivalent(expected = dl_t_dmu(xs, fixed_parameters), object = dl_t_dmu_FD(xs, fixed_parameters, h)) Error: dl_t_dmu_FD(xs, fixed_parameters, h) not equal to expected 3/3 mismatches (average diff: 0.0125). First 3: pos x y diff 1 -0.75 -0.7625 0.0125 2 0.00 -0.0125 0.0125 3 0.75 0.7375 0.0125 54

Slide 243

Slide 243 text

Example in R IX # step too small xs = c(3, 6, 9) h = 1e-10 expect_equivalent(expected = dl_t_dmu(xs, fixed_parameters), object = dl_t_dmu_FD(xs, fixed_parameters, h)) Error: dl_t_dmu_FD(xs, fixed_parameters, h) not equal to expected 2/3 mismatches (average diff: 6.21e-08). First 2: pos x y diff 1 -0.75 -0.75 6.21e-08 3 0.75 0.75 -6.21e-08 55

Slide 244

Slide 244 text

Example in R X # optimal step size xs = c(3, 6, 9) h = sqrt(.Machine$double.eps) expect_equivalent(expected = dl_t_dmu(xs, fixed_parameters), object = dl_t_dmu_FD(xs, fixed_parameters, h)) # test passes 56

Slide 245

Slide 245 text

Example in R XI # truncation - round-off trade-off xs = c(3, 6, 9) hs = 10 ^ -(1:16) for (h in hs) { grad_analytical = dl_t_dmu(xs, fixed_parameters) grad_FD = dl_t_dmu_FD(xt = xs, fixed_parameters, h = h) err = grad_analytical - grad_FD err_2 = sqrt(sum(err^2)) err_max = max(abs(err)) cat(h, "\t", err_2, "\t", err_max, "\n") } 57

Slide 246

Slide 246 text

Example in R XII h err_2 err_max 0.1 0.02165064 0.0125 0.01 0.002165064 0.00125 0.001 0.0002165064 0.000125 1e-04 2.165064e-05 1.25e-05 1e-05 2.165038e-06 1.250006e-06 1e-06 2.165256e-07 1.251159e-07 1e-07 2.307615e-08 1.343989e-08 1e-08 2.495572e-08 1.764636e-08 1e-09 8.775942e-08 6.205528e-08 1e-10 8.775942e-08 6.205528e-08 1e-11 3.148961e-05 2.226652e-05 1e-12 9.429331e-05 6.667544e-05 1e-13 0.0007223303 0.0005107646 1e-14 0.0070027 0.004951657 1e-15 0.4326232 0.3059108 1e-16 1.06066 0.75 58
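The trade-off in the table above can be reproduced outside R as well. The sketch below is not from the talk (all names are illustrative): it evaluates the forward-difference error for d/dµ of the Gaussian contribution at xt = 3, µ = 6, σ² = 4, where the exact derivative is (xt − µ)/σ² = −0.75, and shows that h = sqrt(machine epsilon) beats both a too-large and a too-small step.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gaussian log-likelihood contribution, as in the R examples
double l_t(double xt, double mu, double s2) {
    const double pi = std::acos(-1.0);
    return -0.5 * std::log(2.0 * pi) - 0.5 * std::log(s2)
           - 0.5 * (xt - mu) * (xt - mu) / s2;
}

// absolute error of the forward-difference derivative w.r.t. mu
// at the fixed point xt = 3, mu = 6, s2 = 4
double fd_error(double h) {
    const double xt = 3.0, mu = 6.0, s2 = 4.0;
    const double exact = (xt - mu) / s2;  // -0.75
    const double fd = (l_t(xt, mu + h, s2) - l_t(xt, mu, s2)) / h;
    return std::fabs(fd - exact);
}
```

Large h is dominated by truncation error, tiny h by round-off — the same V-shape as the R table.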

Slide 247

Slide 247 text

Example in R XIII # Manual Differentiation # Manually Forward-Derived Evaluation Trace # of The Normal Loglikelihood Contribution forward_evaluation_trace = function(xt, parameters) { mu = parameters["mu"] s2 = parameters["sigma^2"] v_1 = mu v_1. = 1.0 # active differentiation variable v0 = s2 v0. = 0.0 # passive differentiation variable v1 = log(v0) dv1_dv0 = 1 / v0 v1. = dv1_dv0 * v0. 59

Slide 248

Slide 248 text

Example in R XIV v2 = -.5 * log(2. * pi) - .5 * v1 dv2_dv1 = -.5 v2. = dv2_dv1 * v1. v3 = xt - v_1 dv3_dv_1 = -1. v3. = dv3_dv_1 * v_1. v4 = v3^2 dv4_dv3 = 2. * v3 v4. = dv4_dv3 * v3. v5 = v4 / v0 dv5_dv0 = - v4 / (v0^2) dv5_dv4 = 1 / v0 v5. = dv5_dv0 * v0. + dv5_dv4 * v4. 60

Slide 249

Slide 249 text

Example in R XV v6 = -.5 * v5 dv6_dv5 = -.5 v6. = dv6_dv5 * v5. v7 = v2 + v6 dv7_dv2 = 1. dv7_dv6 = 1. v7. = dv7_dv2 * v2. + dv7_dv6 * v6. data.frame(f = v7, g = v7.) } dl_t_dmu_AD = function(xt, parameters) { forward_evaluation_trace(xt, parameters)$g } 61

Slide 250

Slide 250 text

Example in R XVI # NO step size # NO truncation - round-off trade-off # NO problem :-) xs = c(3, 6, 9) expect_equivalent(expected = dl_t_dmu(xt = xs, fixed_parameters), object = dl_t_dmu_AD(xt = xs, fixed_parameters)) 62

Slide 251

Slide 251 text

Example in R XVII N = 10000L # sample size xs = rnorm(N) err_AD = dl_t_dmu(xs, fixed_parameters) - dl_t_dmu_AD(xs, fixed_parameters) err_FD = dl_t_dmu(xs, fixed_parameters) - dl_t_dmu_FD(xs, fixed_parameters, sqrt(.Machine$double.eps)) err_CD = dl_t_dmu(xs, fixed_parameters) - dl_t_dmu_CD(xs, fixed_parameters, .Machine$double.eps^(1/3)) 63

Slide 252

Slide 252 text

Example in R XVIII > head(err_AD) [1] 0 0 0 0 0 0 > head(err_FD) [1] 7.520500e-08 -3.242974e-08 4.690289e-08 6.174995e-08 -1.4761 > head(err_CD) [1] 1.116505e-10 1.861267e-11 -4.390932e-11 1.224840e-10 3.1222 > sum(err_AD) [1] 0 > sum(err_FD) [1] 2.248303e-05 > sum(err_CD) [1] 3.520195e-07 64

Slide 253

Slide 253 text

Example using Rcpp Rcpp::sourceCpp('ExampleGaussianRcpp.cpp') describe("objective function in Rcpp", { it("matches the objective function in R", { parameters = example_parameters() objectiveR_at = function(xt) l_t(xt, parameters) objectiveRcpp_at = function(xt) l_t_cpp(xt, parameters) expect_equivalent(expected = objectiveR_at(3), object = objectiveRcpp_at(3)) expect_equivalent(expected = objectiveR_at(6), object = objectiveRcpp_at(6)) expect_equivalent(expected = objectiveR_at(9), object = objectiveRcpp_at(9)) }) }) 65

Slide 254

Slide 254 text

ExampleGaussianRcpp.cpp I // [[Rcpp::plugins("cpp11")]] #include <cstddef> // [[Rcpp::depends(BH)]] #include <boost/math/constants/constants.hpp> // [[Rcpp::depends(RcppEigen)]] #include <RcppEigen.h> #include <Rcpp.h> 66

Slide 255

Slide 255 text

ExampleGaussianRcpp.cpp II namespace model { // 2 parameters: mu, sigma^2 enum parameter : std::size_t { mu, s2 }; constexpr std::size_t parameters_count = 2; } 67

Slide 256

Slide 256 text

ExampleGaussianRcpp.cpp III // [[Rcpp::export]] double l_t_cpp(double xt, const Eigen::Map<Eigen::VectorXd> parameters) { const auto mu = parameters[model::parameter::mu]; const auto s2 = parameters[model::parameter::s2]; constexpr auto two_pi = boost::math::constants::two_pi<double>(); using std::log; using std::pow; return -.5 * log(two_pi) - .5 * log(s2) - .5 * pow(xt - mu, 2) / s2; } 68

Slide 257

Slide 257 text

AD Example using RcppEigen Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp') 69

Slide 258

Slide 258 text

ExampleGaussianRcppEigen.cpp I // [[Rcpp::plugins("cpp11")]] #include <cstddef> // [[Rcpp::depends(BH)]] #include <boost/math/constants/constants.hpp> // [[Rcpp::depends(RcppEigen)]] #include <RcppEigen.h> #include <Rcpp.h> #include <unsupported/Eigen/AutoDiff> 70

Slide 259

Slide 259 text

ExampleGaussianRcppEigen.cpp II namespace model { // 2 parameters: mu, sigma^2 enum parameter : std::size_t { mu, s2 }; constexpr std::size_t parameters_count = 2; template <typename ScalarType> using parameter_vector = Eigen::Matrix<ScalarType, parameters_count, 1>; } 71

Slide 260

Slide 260 text

ExampleGaussianRcppEigen.cpp III // note: data `xt` -- double-precision number(s), just as before // only the parameters adjusted to `ScalarType` template <typename VectorType> typename VectorType::Scalar l_t_cpp_AD(double xt, const VectorType & parameters) { using Scalar = typename VectorType::Scalar; const Scalar & mu = parameters[model::parameter::mu]; const Scalar & s2 = parameters[model::parameter::s2]; // note: `two_pi` is, as always, a double-precision constant constexpr auto two_pi = boost::math::constants::two_pi<double>(); using std::log; using std::pow; return -.5 * log(two_pi) - .5 * log(s2) - .5 * pow(xt - mu, 2) / s2; } 72

Slide 261

Slide 261 text

ExampleGaussianRcppEigen.cpp IV // wrapper over `l_t_cpp_AD` // (simply dispatching to the existing implementation) // // (Rcpp-exportable explicit template instantiations // would render this unnecessary) // [[Rcpp::export]] double l_t_value_cpp(double xt, const Eigen::Map<Eigen::VectorXd> parameters) { return l_t_cpp_AD(xt, parameters); } 73

Slide 262

Slide 262 text

ExampleGaussianRcppEigen.cpp V // objective function together with its gradient // `xt`: input (data) // `parameters_input`: input (model parameters) // `gradient_output`: output (computed gradient) // returns: computed objective function value // [[Rcpp::export]] double l_t_value_gradient_cpp ( double xt, const Eigen::Map<Eigen::VectorXd> parameters_input, Eigen::Map<Eigen::VectorXd> gradient_output ) { using parameter_vector = model::parameter_vector<double>; using AD = Eigen::AutoDiffScalar<parameter_vector>; using VectorAD = model::parameter_vector<AD>; 74

Slide 263

Slide 263 text

ExampleGaussianRcppEigen.cpp VI parameter_vector parameters = parameters_input; VectorAD parameters_AD = parameters.cast<AD>(); constexpr auto P = model::parameters_count; for (std::size_t p = 0; p != P; ++p) { // forward mode: // active differentiation variable `p` // (one parameter at a time) parameters_AD(p).derivatives() = parameter_vector::Unit(p); } const AD & loglikelihood_result = l_t_cpp_AD(xt, parameters_AD); gradient_output = loglikelihood_result.derivatives(); return loglikelihood_result.value(); } 75

Slide 264

Slide 264 text

AD Example using RcppEigen I Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp') derivatives_match_given = function(data) { fixed_parameters = example_parameters() gradient = vector(mode = "numeric", length = 2L) names(gradient) = c("mu", "sigma^2") l_t_value_gradient_cpp(data, fixed_parameters, gradient) expect_equivalent(expected = dl_t_dmu(data, fixed_parameters), object = gradient["mu"]) expect_equivalent(expected = dl_t_ds2(data, fixed_parameters), object = gradient["sigma^2"]) } 76

Slide 265

Slide 265 text

AD Example using RcppEigen II describe("algorithmically differentiated function using RcppEigen", { it("matches the analytical derivatives implemented in R", { derivatives_match_given(data = 3) derivatives_match_given(data = 6) derivatives_match_given(data = 9) }) }) 77

Slide 266

Slide 266 text

AD Example using RcppEigen — Performance I # Recall: # `dl_t_dmu_AD` uses our # Manually Forward-Derived Evaluation Trace # operating on the (whole) data vector N = 10000L # sample size xs = rnorm(N) system.time( { err_AD = dl_t_dmu(xs, fixed_parameters) - dl_t_dmu_AD(xs, fixed_parameters) } ) 78

Slide 267

Slide 267 text

AD Example using RcppEigen — Performance II # accuracy: R-some :-) > head(err_AD) [1] 0 0 0 0 0 0 > sum(err_AD) [1] 0 # performance: good enough for now ;-) user system elapsed 0 0 0 79

Slide 268

Slide 268 text

AD Example using RcppEigen — Performance III gradient = vector(mode = "numeric", length = 2L) names(gradient) = c("mu", "sigma^2") errs_AD_cpp = vector(mode = "numeric", length = N) # in contrast: we now have a # scalar-taking function system.time( { for (t in seq_along(xs)) { l_t_value_gradient_cpp(xs[t], fixed_parameters, gradient) errs_AD_cpp[t] = dl_t_dmu(xs[t], fixed_parameters) - dl_t_dmu_AD(xs[t], fixed_parameters) } } ) 80

Slide 269

Slide 269 text

AD Example using RcppEigen — Performance IV # accuracy: awesome > head(errs_AD_cpp) [1] 0 0 0 0 0 0 > sum(errs_AD_cpp) [1] 0 # performance: not ideal user system elapsed 3.56 0.00 3.56 81

Slide 270

Slide 270 text

Data Parallel Objective Function I // First: make the data parallelism (DP) explicit // [[Rcpp::export]] Eigen::VectorXd l_t_value_cppDP ( const Eigen::Map<Eigen::VectorXd> xs, const Eigen::Map<Eigen::VectorXd> parameters ) { const std::size_t sample_size = xs.size(); Eigen::VectorXd result(sample_size); for (std::size_t t = 0; t != sample_size; ++t) { result[t] = l_t_cpp_AD(xs[t], parameters); } return result; } 82

Slide 271

Slide 271 text

Data Parallel Objective Function II // Second: Parallelize using OpenMP // [[Rcpp::export]] Eigen::VectorXd l_t_value_cppDP_OMP ( const Eigen::Map<Eigen::VectorXd> xs, const Eigen::Map<Eigen::VectorXd> parameters ) { const std::size_t sample_size = xs.size(); Eigen::VectorXd result(sample_size); #pragma omp parallel for default(none) shared(result) for (std::size_t t = 0; t < sample_size; ++t) { result[t] = l_t_cpp_AD(xs[t], parameters); } return result; } 83

Slide 272

Slide 272 text

Data Parallel Objective Function Performance I errs_l = l_t(xs, fixed_parameters) - l_t_value_cppDP(xs, fixed_parameters) > head(errs_l) [1] 0 0 0 0 0 0 > sum(errs_l) [1] 0 errs_l = l_t(xs, fixed_parameters) - l_t_value_cppDP_OMP(xs, fixed_parameters) > head(errs_l) [1] 0 0 0 0 0 0 > sum(errs_l) [1] 0 84

Slide 273

Slide 273 text

Data Parallel Objective Function Performance II > require("microbenchmark") > microbenchmark( + l_t(xs, fixed_parameters), + l_t_value_cppDP(xs, fixed_parameters), + l_t_value_cppDP_OMP(xs, fixed_parameters) + ) Unit: microseconds expr median neval l_t(xs, fixed_parameters) 102.657 100 l_t_value_cppDP(xs, fixed_parameters) 458.618 100 l_t_value_cppDP_OMP(xs, fixed_parameters) 213.526 100 Note: changing l_t_cpp_AD to l_t_cpp_AD_DP (analogously) can also help, but we're not done, yet... 85

Slide 274

Slide 274 text

Data Parallel Gradient I // [[Rcpp::export]] Eigen::VectorXd l_t_value_gradient_cppDP ( const Eigen::Map<Eigen::VectorXd> xs, const Eigen::Map<Eigen::VectorXd> parameters_input, Eigen::Map<Eigen::MatrixXd> gradient_output ) { const std::size_t sample_size = xs.size(); Eigen::VectorXd result(sample_size); for (std::size_t t = 0; t != sample_size; ++t) { using parameter_vector = model::parameter_vector<double>; using AD = Eigen::AutoDiffScalar<parameter_vector>; using VectorAD = model::parameter_vector<AD>; 86

Slide 275

Slide 275 text

Data Parallel Gradient II parameter_vector parameters = parameters_input; VectorAD parameters_AD = parameters.cast<AD>(); constexpr auto P = model::parameters_count; for (std::size_t p = 0; p != P; ++p) { parameters_AD(p).derivatives() = parameter_vector::Unit(p); } const AD & loglikelihood_result = l_t_cpp_AD(xs[t], parameters_AD); gradient_output.row(t) = loglikelihood_result.derivatives(); result[t] = loglikelihood_result.value(); } return result; } 87

Slide 276

Slide 276 text

Data Parallel Gradient III // [[Rcpp::export]] Eigen::VectorXd l_t_value_gradient_cppDP_OMP ( const Eigen::Map<Eigen::VectorXd> xs, const Eigen::Map<Eigen::VectorXd> parameters_input, Eigen::Map<Eigen::MatrixXd> gradient_output ) { const std::size_t sample_size = xs.size(); Eigen::VectorXd result(sample_size); #pragma omp parallel for default(none) \ shared(result, gradient_output) for (std::size_t t = 0; t < sample_size; ++t) { using parameter_vector = model::parameter_vector<double>; using AD = Eigen::AutoDiffScalar<parameter_vector>; using VectorAD = model::parameter_vector<AD>; 88

Slide 277

Slide 277 text

Data Parallel Gradient IV parameter_vector parameters = parameters_input; VectorAD parameters_AD = parameters.cast<AD>(); constexpr auto P = model::parameters_count; for (std::size_t p = 0; p != P; ++p) { parameters_AD(p).derivatives() = parameter_vector::Unit(p); } const AD & loglikelihood_result = l_t_cpp_AD(xs[t], parameters_AD); gradient_output.row(t) = loglikelihood_result.derivatives(); result[t] = loglikelihood_result.value(); } return result; } 89

Slide 278

Slide 278 text

Data Parallel Gradient Performance I parameters_count = 2L # mu, sigma^2 N = 10000L # sample size gradient = matrix(data = 0, nrow = N, ncol = parameters_count) colnames(gradient) = c("mu", "sigma^2") l_value = l_t_value_gradient_cppDP(xs, fixed_parameters, gradient) errs_g = dl_t_dmu_AD(xs, fixed_parameters) - gradient[,"mu"] > str(gradient) num [1:10000, 1:2] -1.39 -1.32 -1.38 -1.53 -1.88 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:2] "mu" "sigma^2" > head(errs_g) [1] 0 0 0 0 0 0 > sum(errs_g) [1] 0 90

Slide 279

Slide 279 text

Data Parallel Gradient Performance II l_value = l_t_value_gradient_cppDP_OMP(xs, fixed_parameters, gradient) errs_g = dl_t_dmu_AD(xs, fixed_parameters) - gradient[,"mu"] > str(gradient) num [1:10000, 1:2] -1.39 -1.32 -1.38 -1.53 -1.88 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:2] "mu" "sigma^2" > head(errs_g) [1] 0 0 0 0 0 0 > sum(errs_g) [1] 0 91

Slide 280

Slide 280 text

Data Parallel Gradient Performance III microbenchmark( dl_t_dmu_AD(xs, fixed_parameters), l_t_value_gradient_cppDP(xs, fixed_parameters, gradient), l_t_value_gradient_cppDP_OMP(xs, fixed_parameters, gradient) ) Unit: microseconds expr median neval dl_t_dmu_AD 608.2400 100 l_t_value_gradient_cppDP 631.3375 100 l_t_value_gradient_cppDP_OMP 258.6945 100 92

Slide 281

Slide 281 text

Data Parallel Gradient Performance — Conclusions Worth noting: • dl_t_dmu_AD only computes derivative w.r.t. µ, whereas l_t_value_gradient_cppDP also computes derivative w.r.t. σ2 93

Slide 282

Slide 282 text

Data Parallel Gradient Performance — Conclusions Worth noting: • dl_t_dmu_AD only computes derivative w.r.t. µ, whereas l_t_value_gradient_cppDP also computes derivative w.r.t. σ2 • thus naive serial version already effectively twice as fast 93

Slide 283

Slide 283 text

Data Parallel Gradient Performance — Conclusions Worth noting: • dl_t_dmu_AD only computes derivative w.r.t. µ, whereas l_t_value_gradient_cppDP also computes derivative w.r.t. σ2 • thus naive serial version already effectively twice as fast • we have an additional speed-up on top of that thanks to OpenMP 93

Slide 284

Slide 284 text

Data Parallel Gradient Performance — Conclusions Worth noting: • dl_t_dmu_AD only computes derivative w.r.t. µ, whereas l_t_value_gradient_cppDP also computes derivative w.r.t. σ2 • thus naive serial version already effectively twice as fast • we have an additional speed-up on top of that thanks to OpenMP • we can do better: truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, reverse mode AD 93

Slide 285

Slide 285 text

Data Parallel Gradient Performance — Conclusions Worth noting: • dl_t_dmu_AD only computes derivative w.r.t. µ, whereas l_t_value_gradient_cppDP also computes derivative w.r.t. σ2 • thus naive serial version already effectively twice as fast • we have an additional speed-up on top of that thanks to OpenMP • we can do better: truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, reverse mode AD • left as an exercise for the audience! :-) 93

Slide 286

Slide 286 text

Caveats • Source code spanning multiple programming languages 94

Slide 287

Slide 287 text

Caveats • Source code spanning multiple programming languages • Closed source software (proprietary, third party) 94

Slide 288

Slide 288 text

Caveats • Source code spanning multiple programming languages • Closed source software (proprietary, third party) • Reverse mode — memory requirements 94

Slide 289

Slide 289 text

Caveats • Source code spanning multiple programming languages • Closed source software (proprietary, third party) • Reverse mode — memory requirements • Perverse (but valid) code: if (x == 1.0) then y = 0.0 else y = x − 1.0 94

Slide 290

Slide 290 text

Caveats • Source code spanning multiple programming languages • Closed source software (proprietary, third party) • Reverse mode — memory requirements • Perverse (but valid) code: if (x == 1.0) then y = 0.0 else y = x − 1.0 • derivative at x = 1? 94
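The "perverse but valid" branch above is easy to demonstrate with a minimal dual number (an illustrative sketch, not from the talk): forward-mode AD differentiates the branch actually taken, so at x = 1 it reports derivative 0, even though f(x) = x − 1 everywhere and the true derivative is 1.

```cpp
#include <cassert>

// minimal forward-mode pair: value and tangent
struct Dual { double v, d; };

// if (x == 1.0) then y = 0.0 else y = x - 1.0
Dual f(Dual x) {
    if (x.v == 1.0) return { 0.0, 0.0 };  // constant branch: zero tangent
    return { x.v - 1.0, x.d };            // d/dx (x - 1) = 1
}
```

The function computes x − 1 for every input, yet AD "sees" only the control-flow path that executed.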

Slide 291

Slide 291 text

Resources

Slide 292

Slide 292 text

Resources http://www.autodiff.org/ https://en.wikipedia.org/wiki/Automatic_differentiation#Implementation ADNumber, Adept, ADOL-C, CppAD, Eigen (Auto Diff module) http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/ 96

Slide 293

Slide 293 text

References

Slide 294

Slide 294 text

References I Bollerslev, Tim, and Robert Engle. 1993. "Common Persistence in Conditional Variances." Econometrica 61 (1). The Econometric Society: 167–86. http://www.jstor.org/stable/2951782. Giles, Michael, and Paul Glasserman. 2006. "Smoking Adjoints: Fast Monte Carlo Greeks." Risk 19: 88–92. Griewank, Andreas. 2012. "Who Invented the Reverse Mode of Differentiation?" Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012): 389–400. Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Other Titles in Applied Mathematics 105. 98

Slide 295

Slide 295 text

References II Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/OT105.html. Homescu, C. 2011. "Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance." ArXiv E-Prints, July. http://arxiv.org/abs/1107.1831. Naumann, Uwe. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. Software, Environments, and Tools 24. Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/SE24.html. Nocedal, Jorge, and Stephen J. Wright. 2006. Numerical Optimization. 2nd ed. New York, NY: Springer. 99

Slide 296

Slide 296 text

Thank You! Questions? 100