Slide 1

Slide 1 text

DIFFERENTIABLE SEQUENCE ALIGNMENT

Michiel Stock, Dimitri Boeckaerts, Steff Taelman & Wim Van Criekinge
@michielstock
[email protected]
KERMIT

Photo by Andrew Schultz on Unsplash

Slide 2

Slide 2 text

SEQUENCE ALIGNMENT

- Sequence alignment to analyse biosequences
- “Deep learning” methods for biosequences

Slide 3

Slide 3 text

DIFFERENTIABLE COMPUTING

The deep learning revolution was largely made possible by automatic differentiation, which makes computing gradients easy, accurate and performant. Differentiable computing is a paradigm in which gradients are used in general computer programs (e.g., the neural Turing machine).

Our goal: given a sequence alignment algorithm with parameters θ that aligns two sequences s and t and yields an alignment score v, how will this score change under an infinitesimal change of the parameters θ?

Slide 4

Slide 4 text

DIFFERENTIATING A MAXIMUM

Sequence alignment is just dynamic programming, i.e. computing a bunch of maxima over subproblems. Why can't we differentiate that?

For example, the maximum of 0.12, 0.90, 0.22, 0.85, 0.43, 0.77 is 0.90. The partial derivatives ∂/∂xᵢ correspond to the argmax, i.e. 0, 1, 0, 0, 0, 0: the gradient only uses the identity of the largest element. A great loss of information!

Solution: create a smoothed maximum operator with better partial derivatives, e.g. 0, 0.69, 0, 0.26, 0, 0.05.
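As a minimal numeric check of the example above (plain Julia, not part of the original slides): the gradient of the ordinary maximum is a one-hot indicator of the largest element.

```julia
# Gradient of the ordinary maximum: a one-hot indicator of the argmax.
x = [0.12, 0.90, 0.22, 0.85, 0.43, 0.77]
hard_grad = float.(x .== maximum(x))   # [0, 1, 0, 0, 0, 0]: only the largest element matters
```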

Slide 5

Slide 5 text

SMOOTH MAXIMUM OPERATORS

$$\max{}_{\Omega}(x) = \max_{q \in \triangle^{n-1}} \langle q, x \rangle - \Omega(q)$$

with $\Omega$ a convex regularizer:
- regular maximum: $\Omega(q) = 0$
- negative entropy: $\Omega(q) = \gamma \sum_i q_i \log(q_i)$
- $\ell_2$ regularization: $\Omega(q) = \gamma \sum_i q_i^2$
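A minimal sketch (not the DiffDynProg.jl implementation) of the entropy-regularized case: with $\Omega(q) = \gamma \sum_i q_i \log(q_i)$, the smooth maximum reduces to $\gamma \log \sum_i \exp(x_i/\gamma)$ and its gradient is the softmax of $x/\gamma$. The function name and signature are illustrative; a closure over γ would match the `max_argmaxᵧ` argument used in the code on the following slides.

```julia
# Entropy-regularized smooth maximum: maxΩ(x) = γ log Σᵢ exp(xᵢ/γ),
# with gradient q = softmax(x/γ) (a point on the simplex).
function entropy_max_argmax(x; γ=1.0)
    xmax = maximum(x)                 # shift for numerical stability
    e = exp.((x .- xmax) ./ γ)
    q = e ./ sum(e)                   # soft argmax = gradient of the smooth max
    v = xmax + γ * log(sum(e))        # smooth maximum value
    return v, q
end
```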

Slide 6

Slide 6 text

DIFFERENTIATING NEEDLEMAN-WUNSCH (GLOBAL ALIGNMENT)

- $\theta$: substitution scores; in general, $\theta$ is simply obtained from the substitution matrix, i.e. $\theta_{i,j} = S_{s_i, t_j}$
- $D$: DP matrix, filled via $D_{i+1,j+1} = \max{}_{\Omega}\left(D_{i+1,j} + c^s_i,\; D_{i,j} + \theta_{i,j},\; D_{i,j+1} + c^t_j\right)$
- $E$: gradient, with $E_{i,j} = \partial D_{n+1,m+1} / \partial D_{i,j}$
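For concreteness, a small sketch of how θ could be assembled from a substitution matrix S for integer-encoded sequences s and t (the helper name is illustrative, not from the slides):

```julia
# θ[i,j] = S[s[i], t[j]]: score of aligning the i-th symbol of s with the j-th symbol of t
build_theta(S, s, t) = [S[si, tj] for si in s, tj in t]
```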

Slide 7

Slide 7 text

DIFFERENTIATING NEEDLEMAN-WUNSCH

```julia
function ∇needleman_wunsch(max_argmaxᵧ, θ, (cˢ, cᵗ))
    n, m = size(θ)
    # initialize arrays for dynamic programming (D), backtracking (Q) and the gradient (E)
    D = zeros(n+1, m+1)            # dynamic programming matrix
    D[2:n+1,1] .= -cumsum(cˢ)      # cost of starting with gaps in s
    D[1,2:m+1] .= -cumsum(cᵗ)      # cost of starting with gaps in t
    E = zeros(n+2, m+2)            # matrix for the gradient
    E[n+2,m+2] = 1.0
    Q = zeros(n+2, m+2, 3)         # matrix for backtracking
    Q[n+2,m+2,2] = 1.0
    # forward pass, performing dynamic programming:
    # compute the optimal local choice and store the soft maximum and its gradient
    for i in 1:n, j in 1:m
        v, q = max_argmaxᵧ((D[i+1,j] - cˢ[i],   # gap in first sequence
                            D[i,j] + θ[i,j],    # extending the alignment
                            D[i,j+1] - cᵗ[j]))  # gap in second sequence
        D[i+1,j+1] = v            # store smooth max
        Q[i+1,j+1,:] .= q         # store directions
    end
    # backtracking through the directions to compute the gradient
    for i in n:-1:1, j in m:-1:1
        E[i+1,j+1] = Q[i+1,j+2,1] * E[i+1,j+2] +
                     Q[i+2,j+2,2] * E[i+2,j+2] +
                     Q[i+2,j+1,3] * E[i+2,j+1]
    end
    return D[n+1,m+1], E[2:n+1,2:m+1]  # value and gradient
end
```
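A hypothetical usage example (not from the slides), combining the function above with the entropy-regularized smooth max sketched on slide 5; the toy substitution matrix, sequences and gap costs are made up for illustration:

```julia
S = [2.0 -1.0; -1.0 2.0]                            # toy substitution matrix
s, t = [1, 2, 2, 1], [1, 2, 1]                      # integer-encoded sequences
θ = [S[si, tj] for si in s, tj in t]                # substitution scores
cˢ, cᵗ = fill(0.5, length(s)), fill(0.5, length(t)) # linear gap costs
smax = x -> entropy_max_argmax(x; γ=0.1)            # smooth max operator from slide 5
v, E = ∇needleman_wunsch(smax, θ, (cˢ, cᵗ))         # alignment score and gradient ∂v/∂D
```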

Slide 8

Slide 8 text

PLAYING WITH THE MAXIMUM OPERATOR

- squared regularization, $\Omega(q) = \gamma \sum_i q_i^2$, yields sparser smoothing
- more regularization (entropic, $\Omega(q) = \gamma \sum_i q_i \log(q_i)$, with large $\gamma$) encourages random-walking behaviour
- the regular maximum, $\Omega(q) = 0$ (or $\gamma \to 0$), recovers the vanilla alignment algorithm
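A quick numeric check of these limits, using the entropy-regularized sketch from slide 5 (illustrative, not in the original slides):

```julia
x = (0.12, 0.90, 0.22, 0.85, 0.43, 0.77)
entropy_max_argmax(x; γ=0.01)  # gradient ≈ (0, 1, 0, 0, 0, 0): recovers the vanilla argmax
entropy_max_argmax(x; γ=10.0)  # gradient ≈ uniform: heavy smoothing, random-walking behaviour
```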

Slide 9

Slide 9 text

DIFFERENTIATING SMITH-WATERMAN (LOCAL ALIGNMENT)

- $\theta$: substitution scores
- $D$: DP matrix, filled via $D_{i+1,j+1} = \max{}_{\Omega}\left(D_{i+1,j} + c^s_i,\; D_{i,j} + \theta_{i,j},\; D_{i,j+1} + c^t_j,\; 0\right)$
- $M$: $v = \max{}_{\Omega}(D)$ and $M = \nabla_D \max{}_{\Omega}(D)$
- $E$: gradient, with $E_{i,j} = \partial v / \partial D_{i,j}$

Slide 10

Slide 10 text

DIFFERENTIATING SMITH-WATERMAN

```julia
function ∇smith_waterman(max_argmaxᵧ, θ, (cˢ, cᵗ))
    n, m = size(θ)
    # initialize arrays for dynamic programming (D), backtracking (Q) and the gradient (E)
    D = zeros(n+1, m+1)       # dynamic programming matrix
    E = zeros(n+2, m+2)       # matrix for the gradient
    Q = zeros(n+2, m+2, 3)    # matrix for backtracking
    # forward pass: compute the optimal local choice and store the soft maximum and its gradient
    for i in 1:n, j in 1:m
        v, q = max_argmaxᵧ((D[i+1,j] - cˢ[i],   # gap in first sequence
                            D[i,j] + θ[i,j],    # extending the alignment
                            D[i,j+1] - cᵗ[j],   # gap in second sequence
                            0.0))
        D[i+1,j+1] = v                          # store smooth max
        Q[i+1,j+1,:] .= q[1], q[2], q[3]        # store directions
    end
    # take the smooth maximum of D and its gradient
    v, M = max_argmaxᵧ(D[2:n+1, 2:m+1])  # compute smooth max and gradient
    # backtracking through the directions to compute the gradient
    for i in n:-1:1, j in m:-1:1
        E[i+1,j+1] = M[i,j] +                   # contribution to v
                     Q[i+1,j+2,1] * E[i+1,j+2] +
                     Q[i+2,j+2,2] * E[i+2,j+2] +
                     Q[i+2,j+1,3] * E[i+2,j+1]
    end
    return v, E[2:n+1,2:m+1]  # value and gradient
end
```
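Hypothetical usage, mirroring the global-alignment example above (reusing the toy θ, cˢ, cᵗ and the smooth max operator smax; note that the entropy-regularized sketch also handles the matrix-valued call inside this function):

```julia
v_local, E_local = ∇smith_waterman(smax, θ, (cˢ, cᵗ))  # local alignment score and gradient ∂v/∂D
```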

Slide 11

Slide 11 text

PROPAGATING GRADIENTS OF ALIGNMENT SCORES

Up to now, we provided the gradient of the alignment score w.r.t. the DP matrix, $\partial v / \partial D_{i,j}$.

By applying the chain rule to the DP update rules, we can easily obtain the derivatives w.r.t. the parameters, e.g.

$$\frac{\partial v}{\partial \theta_{i,j}} = \frac{\partial v}{\partial D_{i,j}} \, \frac{\partial D_{i,j}}{\partial \theta_{i,j}}$$

Autodiff can propagate these gradients further, e.g. towards the substitution matrix or as part of a larger artificial neural network!

(Figure: the gradient matrix E decomposes as the sum of three direction-specific contributions: gap in s, substitution, and gap in t.)
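A sketch of this chain-rule step (not part of the original code): assuming the full internal matrices E and Q from the forward/backward pass above are kept around, the gradient w.r.t. θ weights each entry of E by the "extending the alignment" direction, since θ[i,j] only enters D[i+1,j+1] through that branch. The helper name is hypothetical:

```julia
# ∂v/∂θ[i,j] = ∂v/∂D[i+1,j+1] · ∂D[i+1,j+1]/∂θ[i,j]
#            = E[i+1,j+1] · Q[i+1,j+1,2]   (indexing follows the code: Q[...,2] is the substitution branch)
grad_theta(E, Q, n, m) = [E[i+1,j+1] * Q[i+1,j+1,2] for i in 1:n, j in 1:m]
```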

Slide 12

Slide 12 text

COMPUTATION TIME

Running time in seconds, 32-bit precision (excluding compilation and array initialization):

regularizer  length    NW        NW + grad   SW        SW + grad
max          10        0.00001   0.00002     0.00001   0.00002
max          100       0.00004   0.00064     0.00019   0.00062
max          500       0.00109   0.01930     0.00673   0.02339
max          1000      0.00370   0.06548     0.01511   0.06109
entropy      10        0.00002   0.00002     0.00002   0.00002
entropy      100       0.00077   0.00113     0.00104   0.00145
entropy      500       0.01925   0.02630     0.02884   0.04001
entropy      1000      0.06677   0.09468     0.08752   0.12903
squared      10        0.00002   0.00001     0.00002   0.00002
squared      100       0.00046   0.00053     0.00079   0.00092
squared      500       0.01149   0.01384     0.01856   0.02207
squared      1000      0.04039   0.04867     0.07319   0.08620

Slide 13

Slide 13 text

GRADIENTS AVAILABLE VIA CHAINRULES.JL

Repo: https://github.com/MichielStock/DiffDynProg.jl

Custom adjoints are provided via ChainRulesCore.jl and are interoperable with various automatic differentiation libraries, so derivatives of arbitrary pieces of Julia code can be computed. Also interoperable with, e.g., bioinformatics libraries.
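A minimal, self-contained sketch of how such a custom adjoint can look with ChainRulesCore.jl; the score and gradient functions here are trivial stand-ins for illustration (not the DiffDynProg.jl code), only the rrule/NoTangent pattern is the real API:

```julia
using ChainRulesCore

align_score(θ) = sum(θ)              # stand-in for the (smooth) alignment score
align_score_grad(θ) = ones(size(θ))  # stand-in for its gradient w.r.t. θ

function ChainRulesCore.rrule(::typeof(align_score), θ)
    v = align_score(θ)
    ∇θ = align_score_grad(θ)                     # in practice: computed by the backtracking pass
    align_pullback(v̄) = (NoTangent(), v̄ .* ∇θ)  # propagate the incoming cotangent to θ
    return v, align_pullback
end
```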

Slide 14

Slide 14 text

CONCLUSION

Our work builds upon the framework by Mensch and Blondel:

Mensch, A., & Blondel, M. (2018). Differentiable dynamic programming for structured prediction and attention. Retrieved from https://arxiv.org/pdf/1802.03676.pdf

Differentiable sequence alignment is a natural generalization of vanilla alignment.