position $i$ in sequence $S_x$, and let $f_m(u_i^{S_1}, u_j^{S_2})$ be a function that returns the number of mismatching positions between $u_i^{S_1}$ and $u_j^{S_2}$. The gkm kernel leverages the fact that:

$$\langle f^{S_1}_{gkm}, f^{S_2}_{gkm} \rangle = \sum_i \sum_j h\big(f_m(u_i^{S_1}, u_j^{S_2})\big) \tag{3}$$

where the indexes $i$ and $j$ sum over all $l$-mers in $S_1$ and $S_2$ respectively, and $h$ is a function that returns the contribution of an $l$-mer pair with $m$ mismatches to the dot product $\langle f^{S_1}_{gkm}, f^{S_2}_{gkm} \rangle$. For example, if we have a pair of $l$-mers with $m$ mismatches between them, the number of gapped $k$-mers they share is $\binom{l-m}{k}$. Thus, in the traditional gapped $k$-mer kernel, $h(m) = \binom{l-m}{k}$. Ghandi et al. (2014) additionally proposed variants of the gapped $k$-mer kernel, such as the truncated gkm-filter, that differ in the function $h$ but otherwise have an identical formulation.

For computational efficiency, the gkm-SVM implementation accepts a parameter $d$ that sets $h(m) = 0$ for all $m > d$ (regardless of $l$ and $k$). This is efficient because it limits the number of $l$-mer pairs that need to be considered to those where $m \le d$. In the implementation, $d$ is referred to as the maximum number of mismatches.

[...] that returns the number of mismatches, and $h$ depends on the specific variant of the kernel used. For convenience, we will denote $f_m(l_j^{S_x}, l_k^{S_z})$ as $m$ here.

[...] Thus, by design, we again have $\sum_i \phi^{wgkmrbf}_{i,S_x,S_z} = K_{wgkmrbf}(S_x, S_z) - \exp(-\gamma)$.
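The double sum in Eqn. 3 can be sketched directly in code. The following is a minimal, unoptimized illustration rather than the gkm-SVM implementation: the function names (`mismatches`, `h_gkm`, `lmers`, `gkm_dot`) are hypothetical, and the optional argument `d` plays the role of the maximum number of mismatches described above.

```python
from math import comb
from typing import List, Optional


def mismatches(a: str, b: str) -> int:
    """f_m: number of mismatching positions between two equal-length l-mers."""
    return sum(x != y for x, y in zip(a, b))


def h_gkm(m: int, l: int, k: int, d: Optional[int] = None) -> int:
    """Traditional gapped k-mer choice h(m) = C(l - m, k).

    If a mismatch cap d is given, h(m) is set to 0 for all m > d.
    """
    if d is not None and m > d:
        return 0
    return comb(l - m, k)  # math.comb returns 0 when l - m < k


def lmers(seq: str, l: int) -> List[str]:
    """All overlapping l-mers of seq, in order of start position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def gkm_dot(s1: str, s2: str, l: int, k: int, d: Optional[int] = None) -> int:
    """Unnormalized dot product of Eqn. 3: sum of h(m) over all l-mer pairs."""
    return sum(
        h_gkm(mismatches(u, v), l, k, d)
        for u in lmers(s1, l)
        for v in lmers(s2, l)
    )
```

A call such as `gkm_dot("ACGTAC", "ACGAAC", l=4, k=2)` enumerates every pair of 4-mers; passing `d=1` would skip the contribution of any pair with more than one mismatch, which is the source of the efficiency gain mentioned in the text.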
This summation property is what allows GkmExplain to address the saturation issue in Figure 1: even in situations where perturbing any individual base typically produces a near-zero change in the kernel output $K_{wgkmrbf}(S_x, S_z)$, the GkmExplain importance scores are not saturated, because they account for the entire difference $K_{wgkmrbf}(S_x, S_z) - \exp(-\gamma)$. Specifically, in the case of the toy example in Figure 1, GkmExplain recognizes that similarity with the positively-weighted support vector is above its baseline value, and therefore grants positive importance to gapped $k$-mer matches between the input sequence and the support vector, even though similarity with the support vector is not sensitive to specific individual perturbations. The fact that the importance scores sum to the 'difference from baseline' suggests a theoretical connection between GkmExplain and Integrated Gradients, which we discuss in Supplementary Section SA.2.

Once we have the contribution of base $i$ to the kernel output $K_{wgkmrbf}(S_x, S_z)$, we can find the contribution of base $i$ to the wgkmrbf-SVM output by simply taking the weighted sum of $\phi^{wgkmrbf}_{i,S_x,S_{z_j}}$ over all support vectors $S_{z_j}$, analogous to Eqn. 11, giving:

$$\phi^{wgkmrbfsvm}_{i,S_x} = \sum_{j=1}^{m} \alpha_j y_j \, \phi^{wgkmrbf}_{i,S_x,S_{z_j}}$$

How should we distribute the quantity $w_j w_k h(m)$ over the bases in $S_x$? Consider a position $i$ in sequence $S_x$ that overlaps the $l$-mer $l_j^{S_x}$. Let us denote the base at position $i$ as $B_x = S_x^i$, and the corresponding base within the other $l$-mer $l_k^{S_z}$ (from position $k$ in support vector $S_z$) as $B_z = S_z^{k+(i-j)}$ ('corresponding' means that the offset of $B_x$ relative to the start of $l_j^{S_x}$ is the same as the offset of $B_z$ relative to the start of $l_k^{S_z}$). If we evenly distribute $w_j w_k h(m)$ among the $(l - m)$ matching positions between $l$-mers $l_j^{S_x}$ and $l_k^{S_z}$, then position $i$ would inherit an importance of $\frac{w_j w_k h(m)}{l - m}$ if $B_x = B_z$, and would inherit an importance of 0 if $B_x \neq B_z$.
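As a small worked illustration of the even split, take the traditional gapped $k$-mer choice $h(m) = \binom{l-m}{k}$ with the arbitrarily chosen values $l = 6$, $k = 4$ and a pair of $l$-mers with $m = 1$ mismatch (these numbers are assumptions for illustration only):

```latex
h(1) = \binom{6 - 1}{4} = 5,
\qquad
\frac{w_j w_k \, h(1)}{l - m} = \frac{5 \, w_j w_k}{5} = w_j w_k
```

Each of the five matching positions inherits $w_j w_k$, the single mismatching position inherits 0, and the five shares add back up to $w_j w_k h(1) = 5\, w_j w_k$.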
Leaving out the weights $w_j w_k$ (which don't depend on $m$, $B_x$ or $B_z$), we can encapsulate the core logic in the function $\mathrm{eff}(m, B_x, B_z)$, defined as:

$$\mathrm{eff}(m, B_x, B_z) := \begin{cases} \dfrac{h(m)}{l - m} & \text{if } B_x = B_z \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

Using this function, we write the weighted contribution to $\langle f^{S_x}_{wgkm}, f^{S_z}_{wgkm} \rangle$ inherited by position $i$ from the $l$-mer pair $(l_j^{S_x}, l_k^{S_z})$ as:

$$\mathrm{imp}(i, j, k, S_x, S_z) := w_j w_k \, \mathrm{eff}\big(f_m(l_j^{S_x}, l_k^{S_z}),\, S_x^i,\, S_z^{k+(i-j)}\big) \tag{9}$$
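Eqns. 8 and 9 translate almost line for line into code. The sketch below is illustrative only: the function names, the 0-based indexing, and the representation of the position weights as two lists `wx` and `wz` indexed by $l$-mer start position are assumptions, and `h` is passed in as a callable so that any kernel variant can be substituted.

```python
from typing import Callable, Sequence


def mismatches(a: str, b: str) -> int:
    """f_m: number of mismatching positions between two equal-length l-mers."""
    return sum(x != y for x, y in zip(a, b))


def eff(m: int, bx: str, bz: str, l: int, h: Callable[[int], float]) -> float:
    """Eqn. 8: share of h(m) given to one position of the l-mer pair.

    A matching base receives h(m) / (l - m); a mismatching base receives 0.
    When bx == bz the pair has at least one match, so l - m > 0.
    """
    return h(m) / (l - m) if bx == bz else 0.0


def imp(i: int, j: int, k: int, sx: str, sz: str,
        wx: Sequence[float], wz: Sequence[float],
        l: int, h: Callable[[int], float]) -> float:
    """Eqn. 9: weighted importance inherited by position i of sx from the
    l-mer starting at j in sx paired with the l-mer starting at k in sz.

    Assumes j <= i <= j + l - 1, i.e. the l-mer at j overlaps position i.
    """
    m = mismatches(sx[j:j + l], sz[k:k + l])
    return wx[j] * wz[k] * eff(m, sx[i], sz[k + (i - j)], l, h)
```

Summing `imp` over the `l` positions covered by a single $l$-mer pair recovers $w_j w_k h(m)$ for that pair, provided $h(l) = 0$, which is the even-split property the definition is built around.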
If we sum this quantity over all possible $l$-mer pairs where $l_j^{S_x}$ overlaps position $i$ (that is, we sum over the range $j = \max(i - (l - 1), 0)$ to $j = \min(i, \mathrm{len}(S_x) - l)$) and normalize by the total wgkm feature vector lengths $\|f^{S_x}_{wgkm}\| \, \|f^{S_z}_{wgkm}\|$ (as is done in the wgkm kernel), we arrive at the total contribution of position $i$ in $S_x$ to $K_{wgkm}(S_x, S_z)$, which we denote as $\phi^{wgkm}_{i,S_x,S_z}$:

$$\phi^{wgkm}_{i,S_x,S_z} := \sum_{j=\max(i-(l-1),\,0)}^{\min(i,\,\mathrm{len}(S_x)-l)} \; \sum_k \frac{\mathrm{imp}(i, j, k, S_x, S_z)}{\|f^{S_x}_{wgkm}\| \, \|f^{S_z}_{wgkm}\|} \tag{10}$$

$\|f^{S_x}_{wgkm}\| = \sqrt{K_{wgkm}(S_x, S_x)}$ is efficiently computed using the kernel function.
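The per-position score of Eqn. 10 can be sketched with three nested loops. This is a brute-force illustration, not the efficient implementation: the names `raw_dot` and `phi_wgkm` are hypothetical, the weighted dot product is assumed to be $\sum_j \sum_k w_j w_k h(m)$ (consistent with the quantity being distributed above), the norms are taken from that unnormalized dot product, and indexing is 0-based.

```python
from math import sqrt
from typing import Callable, List, Sequence


def mismatches(a: str, b: str) -> int:
    """f_m: number of mismatching positions between two equal-length l-mers."""
    return sum(x != y for x, y in zip(a, b))


def raw_dot(sx: str, sz: str, wx: Sequence[float], wz: Sequence[float],
            l: int, h: Callable[[int], float]) -> float:
    """Assumed unnormalized wgkm dot product: sum_j sum_k w_j w_k h(m)."""
    return sum(
        wx[j] * wz[k] * h(mismatches(sx[j:j + l], sz[k:k + l]))
        for j in range(len(sx) - l + 1)
        for k in range(len(sz) - l + 1)
    )


def phi_wgkm(sx: str, sz: str, wx: Sequence[float], wz: Sequence[float],
             l: int, h: Callable[[int], float]) -> List[float]:
    """Eqn. 10: contribution of every position i in sx to K_wgkm(sx, sz)."""
    norm = sqrt(raw_dot(sx, sx, wx, wx, l, h)) * sqrt(raw_dot(sz, sz, wz, wz, l, h))
    phi = [0.0] * len(sx)
    for i in range(len(sx)):
        # l-mers of sx that overlap position i
        for j in range(max(i - (l - 1), 0), min(i, len(sx) - l) + 1):
            for k in range(len(sz) - l + 1):
                # only matching bases inherit importance (Eqn. 8)
                if sx[i] == sz[k + (i - j)]:
                    m = mismatches(sx[j:j + l], sz[k:k + l])
                    phi[i] += wx[j] * wz[k] * h(m) / (l - m)
        phi[i] /= norm
    return phi
```

Under these assumptions, and for any $h$ with $h(l) = 0$, the entries of `phi_wgkm(...)` sum to `raw_dot(sx, sz, ...)` divided by the product of the two norms, that is, to the normalized kernel value. The cost here is quadratic in sequence length times $l$, so the sketch is only meant to make the index ranges concrete.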
Now that we have defined the contribution of base $i$ to the wgkm kernel output $K_{wgkm}(S_x, S_z)$, we can define the final contribution of base $i$ to the output of the wgkm-SVM as simply being the weighted sum of $\phi^{wgkm}_{i,S_x,S_{z_j}}$ over all support vectors $S_{z_j}$, where the weights are $\alpha_j y_j$ (as per Eqn. 1). We denote this quantity as $\phi^{wgkmsvm}_{i,S_x}$:

$$\phi^{wgkmsvm}_{i,S_x} = \sum_{j=1}^{m} \alpha_j y_j \, \phi^{wgkm}_{i,S_x,S_{z_j}} \tag{11}$$

[...] we compute a similar quantity for per-base attributions [...]

In terms of implementation, GkmExplain can be computed efficiently by modifying the [...] search originally used to compute the [...]. [...] implementation is at https://github.c [...]

5.2 Mutation impact scores

An intuitive approach to estimating the impact of mutations is in-silico mutagenesis (ISM). In ISM, a mutation is introduced [...] the predicted output [...] Figures 1 and 2, ISM [...] model has saturated [...] when RBF variants of the [...] can be beneficial to [...] quantity that we [...], which we describe in [...]. Consider a pair [...] changes base $B_x$ in [...] the same relative position [...]