[…] gapped k-mer SVMs

Avanti Shrikumar1,*,†, Eva Prakash2,† and Anshul Kundaje1,3,*

1Department of Computer Science, Stanford University, Stanford, CA 94305, USA, 2Computer Science, BASIS Independent Silicon Valley, San Jose, CA 95126, USA and 3Department of Genetics, Stanford University, Stanford, CA 94305, USA

*To whom correspondence should be addressed. †The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

Abstract

Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, […]

Bioinformatics, 35, 2019, i173–i182, doi: 10.1093/bioinformatics/btz322, ISMB/ECCB 2019. Downloaded from https://academic.oup.com/bioinformatics/article-abstract/35/14/i1[…]
$\xi_n = 0$; $\xi_n = |t_n - y(x_n)|$; $0 \le a_n \le C$ (PRML, Chapter 7)

[…] the main recent literature (i.e. 32 papers) related to applications of deep models to clinical imaging, EHRs, genomics and wearable device data.

Electronic health records
More recently deep learning has been applied to process aggregated EHRs, including both structured (e.g. diagnosis, medica[…]

Figure 1. Comparison between ANNs and deep architectures. While ANNs are usually composed of three layers and one transformation toward the final outputs, deep learning architectures are constituted by several layers of neural networks. Layer-wise unsupervised pre-training allows deep networks to be tuned efficiently and to extract deep structure from inputs to serve as higher-level features that are used to obtain better predictions. (Miotto et al., Brief. Bioinf. (2018))

• The importance of interpreting machine learning results is growing.
• Exhaustive analysis is time-consuming and, in addition, cannot detect combinatorial effects.
Let […] $Z_i$ be the $i$th support vector, $y_i$ be the label (+1 or -1) associated with the $i$th support vector, $\alpha_i$ be the weight associated with the $i$th support vector, $b$ be a constant bias term and $K$ be a kernel function that is used to compute a similarity score between $Z_i$ and $x$. SVMs produce an output of the form:

$$F(x) = b + \sum_{i=1}^{m} \alpha_i y_i K(Z_i, x) \qquad (1)$$

Kernel functions can be thought of as implicitly mapping their inputs to vectors in some feature space and then computing a dot product. For example, the gapped k-mer kernel implicitly maps its DNA sequence inputs to feature vectors representing the normalized counts of distinct gapped k-mers. Formally, it can be written as:

$$K_{gkm}(S_1, S_2) = \left\langle \frac{f^{S_1}_{gkm}}{\|f^{S_1}_{gkm}\|}, \frac{f^{S_2}_{gkm}}{\|f^{S_2}_{gkm}\|} \right\rangle = \frac{\langle f^{S_1}_{gkm}, f^{S_2}_{gkm} \rangle}{\sqrt{\langle f^{S_1}_{gkm}, f^{S_1}_{gkm} \rangle}\,\sqrt{\langle f^{S_2}_{gkm}, f^{S_2}_{gkm} \rangle}} \qquad (2)$$

where $f^{S_1}_{gkm}$ and $f^{S_2}_{gkm}$ are feature vectors representing the counts of distinct gapped k-mers in sequences $S_1$ and $S_2$ respectively. As the feature space corresponding to gapped k-mer counts can be large, for computational efficiency the gkm kernel (Ghandi et al., 2014; Leslie and Kuang, 2004) computes the dot product $\langle f^{S_1}_{gkm}, f^{S_2}_{gkm} \rangle$ without explicitly computing $f^{S_1}_{gkm}$ or $f^{S_2}_{gkm}$.
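To make Eqn. 2 concrete, the following is a minimal sketch that builds the gapped k-mer count vectors explicitly, which is exactly what the real gkm kernel avoids doing. It assumes that a gapped k-mer is identified by a choice of k informative positions within an l-mer together with the bases at those positions; the function names are hypothetical and are not taken from any gkm-SVM implementation.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers: every choice of k informative positions in each l-mer."""
    counts = Counter()
    for start in range(len(seq) - l + 1):
        lmer = seq[start:start + l]
        for positions in combinations(range(l), k):
            # A gapped k-mer is keyed by (informative positions, bases there).
            counts[(positions, tuple(lmer[p] for p in positions))] += 1
    return counts


def dot(c1, c2):
    """Dot product of two sparse count vectors stored as Counters."""
    return sum(v * c2[key] for key, v in c1.items())


def gkm_kernel_explicit(s1, s2, l, k):
    """Normalized gkm kernel (Eqn. 2) from explicit feature vectors.

    Assumes both sequences have length >= l so that the norms are nonzero.
    """
    f1 = gapped_kmer_counts(s1, l, k)
    f2 = gapped_kmer_counts(s2, l, k)
    return dot(f1, f2) / (sqrt(dot(f1, f1)) * sqrt(dot(f2, f2)))
```

Because the kernel is a cosine similarity between count vectors, a sequence compared with itself scores 1 and sequences sharing no gapped k-mers score 0. The explicit enumeration grows quickly with l and k, which is the cost the gkm kernel is designed to avoid.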
Let $u^{S_x}_i$ represent the identity of the l-mer at position $i$ in sequence $S_x$, and let $f_m(u^{S_1}_i, u^{S_2}_j)$ be a function that returns the number of mismatching positions between $u^{S_1}_i$ and $u^{S_2}_j$. The gkm kernel leverages the fact that:

$$\langle f^{S_1}_{gkm}, f^{S_2}_{gkm} \rangle = \sum_i \sum_j h\left(f_m(u^{S_1}_i, u^{S_2}_j)\right) \qquad (3)$$

where the indexes $i$ and $j$ sum over all l-mers in $S_1$ and $S_2$ respectively.

• Let $u^{S_x}_i$ be the l-mer starting at position $i$ of $S_x$, $f_m(a, b)$ the number of mismatching positions, and $h(m)$ the contribution to the dot product. Ordinarily, $h(m) = \binom{l-m}{k}$.
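As an illustration of Eqn. 3, here is a minimal sketch that computes the unnormalized dot product from pairwise l-mer mismatch counts alone, assuming the traditional choice $h(m) = \binom{l-m}{k}$. The optional cutoff `d` mirrors the maximum-mismatch parameter described for the gkm-SVM implementation; the function names are hypothetical, and the quadratic loop over l-mer pairs is for clarity only, not a claim about how the real implementation searches.

```python
from math import comb


def num_mismatches(a, b):
    """f_m: number of positions at which two equal-length l-mers differ."""
    return sum(x != y for x, y in zip(a, b))


def h_traditional(m, l, k, d=None):
    """h(m) = C(l - m, k), set to 0 when m exceeds the optional cutoff d."""
    if d is not None and m > d:
        return 0
    return comb(l - m, k)


def gkm_dot(s1, s2, l, k, d=None):
    """Unnormalized gkm dot product via Eqn. 3 (sum of h over all l-mer pairs)."""
    lmers1 = [s1[i:i + l] for i in range(len(s1) - l + 1)]
    lmers2 = [s2[j:j + l] for j in range(len(s2) - l + 1)]
    return sum(
        h_traditional(num_mismatches(u, v), l, k, d)
        for u in lmers1
        for v in lmers2
    )
```

For example, two identical 4-mers with k = 2 share all $\binom{4}{2} = 6$ gapped k-mers, while 4-mers with one mismatch share $\binom{3}{2} = 3$. Setting `d=0` discards every pair that is not an exact match.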
In Eqn. 3, $h(m)$ is a function that returns the contribution of an l-mer pair with $m$ mismatches to the dot product $\langle f^{S_1}_{gkm}, f^{S_2}_{gkm} \rangle$. For example, if we have a pair of l-mers with $m$ mismatches between them, the number of gapped k-mers they share is $\binom{l-m}{k}$. Thus, in the traditional gapped k-mer kernel, $h(m) = \binom{l-m}{k}$. Ghandi et al. (2014) additionally proposed variants of the gapped k-mer kernel, such as the truncated gkm-filter, that differ in the function $h$ but otherwise have an identical formulation.

For computational efficiency, the gkm-SVM implementation accepts a parameter $d$ that sets $h(m) = 0$ for all $m > d$ (regardless of $l$ and $k$). This is efficient because it limits the number of l-mer pairs that need to be considered to those where $m \le d$. In the implementation, $d$ is referred to as the maximum number of mismatches.

[…] $f_m$ is a function that returns the number of mismatches and $h$ depends on the specific variant of the kernel used. For convenience, we will denote $f_m(l^{S_x}_j, l^{S_z}_k)$ as $m$ here.

[…] Thus, by design, we again have $\sum_i \phi^{wgkmrbf}_{i,S_x,S_z} = K_{wgkmrbf}(S_x, S_z) - \exp(-\gamma)$.
This is what allows GkmExplain to address the saturation issue of Figure 1: even in situations where perturbing any individual base typically produces a near-zero change in the kernel output $K_{wgkmrbf}$, GkmExplain importance scores are not saturated, because they sum to the entire difference $K_{wgkmrbf}(S_x, S_z) - \exp(-\gamma)$. Specifically, in the case of the toy example in Figure 1, GkmExplain recognizes that similarity with the positively-weighted support vector is above its baseline value, and therefore grants positive importance to gapped k-mer matches between the input sequence and the support vector, even though similarity with the support vector is not sensitive to specific individual perturbations. Because the importance scores sum to the 'difference from baseline', there is a theoretical connection between GkmExplain and Integrated Gradients, which we discuss in Supplementary Appendix SA.2.

Once we have the contribution of base $i$ to the wgkmrbf kernel output $K_{wgkmrbf}(S_x, S_z)$, we can find the contribution to the wgkmrbf-SVM output by simply taking the weighted sum of $\phi^{wgkmrbf}_{i,S_x,S_{z_j}}$ over all support vectors $S_{z_j}$, analogous to Eqn. 11, giving:

$$\phi^{wgkmrbfsvm}_{i,S_x} = \sum_{j=1}^{m} \alpha_j y_j \phi^{wgkmrbf}_{i,S_x,S_{z_j}}$$

How should we distribute the quantity $w_j w_k h(m)$ over the bases in $S_x$? Consider a position $i$ in sequence $S_x$ that overlaps the l-mer $l^{S_x}_j$. Let us denote the base at position $i$ as $B_x = S_x^i$, and the corresponding base within the other l-mer $l^{S_z}_k$ (from position $k$ in support vector $S_z$) as $B_z = S_z^{k+(i-j)}$ ('corresponding' means that the offset of $B_x$ relative to the start of $l^{S_x}_j$ is the same as the offset of $B_z$ relative to the start of $l^{S_z}_k$). If we evenly distribute $w_j w_k h(m)$ among the $(l - m)$ matching positions between l-mers $l^{S_x}_j$ and $l^{S_z}_k$, then position $i$ would inherit an importance of $\frac{w_j w_k h(m)}{l - m}$ if $B_x = B_z$, and would inherit an importance of 0 if $B_x \neq B_z$.
Leaving out the weights $w_j w_k$ (which don't depend on $m$, $B_x$ or $B_z$), we can encapsulate the core logic in the function $\mathrm{eff}(m, B_x, B_z)$, defined as:

$$\mathrm{eff}(m, B_x, B_z) := \begin{cases} \dfrac{h(m)}{l - m} & \text{if } B_x = B_z \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

Using this function, we write the weighted contribution to $\langle f^{S_x}_{wgkm}, f^{S_z}_{wgkm} \rangle$ inherited by position $i$ from the l-mer pair $(l^{S_x}_j, l^{S_z}_k)$ as:

$$\mathrm{imp}(i, j, k, S_x, S_z) := w_j w_k \, \mathrm{eff}\left(f_m(l^{S_x}_j, l^{S_z}_k), S_x^i, S_z^{k+(i-j)}\right) \qquad (9)$$
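Eqns. 8 and 9 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes the traditional $h(m) = \binom{l-m}{k}$, takes the position weights $w_j$ and $w_k$ as plain numbers defaulting to 1, and renames the support-vector index $k$ to `kpos` so that it does not collide with the gapped k-mer size `k`.

```python
from math import comb


def eff(m, bx, bz, l, k):
    """Eqn. 8: share of h(m) given to one matching position, 0 on a mismatch."""
    if bx != bz:
        return 0.0
    # bx == bz implies at least one matching position, so l - m >= 1 here.
    return comb(l - m, k) / (l - m)


def imp(i, j, kpos, sx, sz, l, k, wj=1.0, wk=1.0):
    """Eqn. 9: contribution to position i of sx from l-mers sx[j:j+l], sz[kpos:kpos+l]."""
    if not (j <= i < j + l):
        raise ValueError("position i must lie inside the l-mer starting at j")
    lmer_x = sx[j:j + l]
    lmer_z = sz[kpos:kpos + l]
    m = sum(a != b for a, b in zip(lmer_x, lmer_z))  # f_m for this l-mer pair
    return wj * wk * eff(m, sx[i], sz[kpos + (i - j)], l, k)
```

Summing `imp` over the positions of a single l-mer pair recovers $w_j w_k h(m)$, since each of the $(l - m)$ matching positions receives an equal share and mismatching positions receive nothing.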
If we sum this quantity over all possible l-mer pairs where $l^{S_x}_j$ overlaps position $i$ (that is, we sum over the range $j = \max(i - (l - 1), 0)$ to $j = \min(i, \mathrm{len}(S_x) - l)$) and normalize by the total wgkm feature vector lengths $\|f^{S_x}_{wgkm}\| \|f^{S_z}_{wgkm}\|$ (as is done in the wgkm kernel), we arrive at the total contribution of position $i$ in $S_x$ to $K_{wgkm}(S_x, S_z)$, which we denote as $\phi^{wgkm}_{i,S_x,S_z}$ and define in Eqn. 10.
$$\phi^{wgkm}_{i,S_x,S_z} := \sum_{j=\max(i-(l-1),0)}^{\min(i,\mathrm{len}(S_x)-l)} \sum_{k} \frac{\mathrm{imp}(i, j, k, S_x, S_z)}{\|f^{S_x}_{wgkm}\| \|f^{S_z}_{wgkm}\|} \qquad (10)$$

Here, $\|f^{S_x}_{wgkm}\| = \sqrt{K_{wgkm}(S_x, S_x)}$ is efficiently computed using the kernel function.
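A brute-force sketch of Eqn. 10 is given below, under stated assumptions: $h(m) = \binom{l-m}{k}$, the weighted dot product is taken to be $\sum_j \sum_k w_j w_k h(m)$ over l-mer pairs, and each norm is the square root of the corresponding unnormalized self dot product. The names `weighted_dot` and `phi_wgkm` are hypothetical, `kpos` stands for the support-vector l-mer index $k$, and the triple loop is for illustration rather than efficiency.

```python
from math import comb, sqrt


def mismatches(a, b):
    """Number of differing positions between two equal-length l-mers."""
    return sum(x != y for x, y in zip(a, b))


def weighted_dot(sx, sz, l, k, wx, wz):
    """Unnormalized weighted dot product: sum of w_j * w_k * h(m) over l-mer pairs."""
    total = 0.0
    for j in range(len(sx) - l + 1):
        for kpos in range(len(sz) - l + 1):
            m = mismatches(sx[j:j + l], sz[kpos:kpos + l])
            total += wx[j] * wz[kpos] * comb(l - m, k)
    return total


def phi_wgkm(sx, sz, l, k, wx=None, wz=None):
    """Per-position contributions of sx to the normalized kernel (Eqn. 10).

    Assumes both sequences have length >= l so that the norms are nonzero.
    """
    if wx is None:
        wx = [1.0] * (len(sx) - l + 1)
    if wz is None:
        wz = [1.0] * (len(sz) - l + 1)
    norm = sqrt(weighted_dot(sx, sx, l, k, wx, wx)) * sqrt(weighted_dot(sz, sz, l, k, wz, wz))
    phi = [0.0] * len(sx)
    for i in range(len(sx)):
        # l-mers of sx that overlap position i
        for j in range(max(i - (l - 1), 0), min(i, len(sx) - l) + 1):
            for kpos in range(len(sz) - l + 1):
                if sx[i] != sz[kpos + (i - j)]:
                    continue  # eff is 0 when the aligned bases differ
                m = mismatches(sx[j:j + l], sz[kpos:kpos + l])
                phi[i] += wx[j] * wz[kpos] * comb(l - m, k) / (l - m)
    return [p / norm for p in phi]
```

Under these assumptions the per-position scores sum to the normalized kernel value, because every l-mer pair spreads its $w_j w_k h(m)$ evenly over its matching positions.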
Now that we have defined the contribution of base $i$ to the wgkm kernel output $K_{wgkm}(S_x, S_z)$, we can define the final contribution of base $i$ to the output of the wgkm-SVM as simply being the weighted sum of $\phi^{wgkm}_{i,S_x,S_{z_j}}$ over all support vectors $S_{z_j}$, where the weights are $\alpha_j y_j$ (as per Eqn. 1). We denote this quantity as $\phi^{wgkmsvm}_{i,S_x}$:

$$\phi^{wgkmsvm}_{i,S_x} = \sum_{j=1}^{m} \alpha_j y_j \phi^{wgkm}_{i,S_x,S_{z_j}} \qquad (11)$$

[…] we compute a similar quantity for per-base attributions […]

In terms of implementation, GkmExplain can be computed efficiently by modifying […] search originally used to compute t[…]. […] implementation is at https://github.c[…]

5.2 Mutation impact scores

An intuitive approach to e[…] mutations is in-silico mutagenesis (ISM). In ISM, a mutation is in[…] the predicted output […] Figures 1 and 2, ISM […] model has saturated […] when RBF variants of th[…] can be beneficial to e[…]ent quantity that we […] which we describe in […] Consider a pair […] changes base $B_x$ in l[…] the same relative pos[…]
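Returning to Eqn. 11, the SVM-level attribution is just a weighted sum of the per-support-vector attributions. The sketch below assumes those attributions have already been computed and are passed in as one list per support vector; the function and argument names are hypothetical.

```python
def svm_attributions(phi_per_sv, alphas, labels):
    """Eqn. 11: phi_i = sum over support vectors j of alpha_j * y_j * phi_{i,j}.

    phi_per_sv[j][i] is the contribution of position i with respect to support
    vector j; alphas[j] and labels[j] are that support vector's weight and label.
    """
    if not (len(phi_per_sv) == len(alphas) == len(labels)):
        raise ValueError("need one alpha and one label per support vector")
    n_positions = len(phi_per_sv[0])
    return [
        sum(a * y * phi[i] for a, y, phi in zip(alphas, labels, phi_per_sv))
        for i in range(n_positions)
    ]
```

A negatively labeled support vector therefore subtracts its share from positions that match it, so a base can receive negative importance even though every per-support-vector score is non-negative.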
[…] the positive set from the negative set (due to the non-additive nature of the interaction between the TAL1 and GATA1 motifs, a regular gkm kernel does not perform well on this task). As shown in Figure 5, the core TAL1 and GATA1 motifs are 6 bp long, hence the choice of $l = 6$. The gkmrbf-SVM attained 90% auROC.

Figure 2 illustrates the behavior of different algorithms on a sequence containing three GATA1 motifs and one TAL1 motif. For deltaSVM and ISM, the importance of a position was computed as the negative of the mean impact of all 3 possible mutations at that position (positions that produce negative deltas when mutated will therefore receive positive importance). SHAP was run with the following different settings: 2000 samples per example sequence with 20 dinuc-shuffled backgrounds each (for a total of 40 000 model evaluations per sequence), 2000 samples per example sequence with 200 dinuc-shuffled backgrounds each (for a total of 400 000 model evaluations per sequence), and 20 000 samples per example sequence with 20 dinuc-shuffled backgrounds each (for a total of 400 000 model evaluations per sequence). See Section 4.3 for more […]

[…]quence for the output to be positive). […] some individual GATA1 motifs in this […] presence of multiple GATA1 motifs ha[…] nonlinear decision function. SHAP sho[…] the relevant motifs, but only when man[y] […] used. Unfortunately, at 40 000+ model evaluations […], SHAP has a very slow runtime comp[ared] […] (Fig. 3). To confirm this was not an isol[ated] […], […] the ability of GkmExplain, ISM and SHAP […] motifs across all examples in the […]. GkmExplain does indeed perform better ([…] Fig. SA.1).
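The ISM scoring rule described above (importance as the negative of the mean impact of the three possible mutations) can be sketched as follows. The `model` argument is a hypothetical stand-in for any function that maps a DNA string to a score; it is not the gkm-SVM interface.

```python
def ism_importance(seq, model, alphabet="ACGT"):
    """Per-position ISM importance: negative mean change in output over all mutations."""
    base_score = model(seq)
    scores = []
    for i, ref in enumerate(seq):
        deltas = [
            model(seq[:i] + alt + seq[i + 1:]) - base_score
            for alt in alphabet
            if alt != ref
        ]
        # Mutations that lower the output give negative deltas, hence positive importance.
        scores.append(-sum(deltas) / len(deltas))
    return scores


def toy_model(seq):
    """Toy stand-in model (an assumption for illustration): counts GATA occurrences."""
    return float(seq.count("GATA"))
```

With `toy_model`, each base of a single embedded GATA receives importance 1 and flanking bases receive 0. If the toy model instead returned 1 whenever at least one GATA was present, a sequence with two copies would score 0 everywhere, since no single mutation changes the output. That is the kind of saturation that single-base perturbation cannot see.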
As further confirmation that GkmExplain […] embedded TAL1 and GATA1 motifs, […] derived importance score profiles across […] set to the recently-developed importance […] tool TF-MoDISco (Shrikumar […]). TF-MoDISco identifies subsequences (termed seqlets) […] importance in all the input sequences, builds an affinity matrix between seqlets using a cross-correlati[on] […], […] seqlets using the affinity matrix, and the[n] […] in each cluster to form consolidated motifs. […] both importance score profiles as well a[s] […] score profiles of multiple sequences. Hy[pothetical] […] can be intuitively thought of as revealing […]fier for seeing alternative bases at any g[iven] […] (see Section 5.3). We also normalized […] Supplementary Appendix SA.3, as we […]

Fig. 3. Time taken to compute importance on a single sequence for various algorithms (log scale). Evaluation was done on model and data described in […]
[…] and SHAP on an individual simulated sequence. An SVM with a gkmrbf kernel with $l = 6$, $k = 5$ and $d = 1$ was used to distinguish sequences containing both TAL1 and GATA1 motifs from sequences containing only one kind of motif or neither kind of motif. The locations […]

Fig. 4. GkmExplain outperforms ISM at identifying GATA1 motifs. A gkmrbf SVM was t[…] set were scanned using 10 bp windows (the length of the GATA_disc1 motif). Windows […] positive. Windows containing no portion of any embedded motif were labeled negative. […] according to the total importance score produced by the importance scoring method […] method outperforms ISM and SHAP on both metrics. The dashed black line shows the pe[…]

[…] example, ISM requires a minimum of 601 model evaluations for a sequence of length 200 (one for the original sequence, and 600 for all three possible mutations at every position). SHAP with 2K samples per sequence and 20 backgrounds requires a minimum of 40K model evaluations (see Section 4.3 for more details on the SHAP algorithm).
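The evaluation counts quoted above follow from simple arithmetic, sketched here with hypothetical helper names. The SHAP count assumes one model evaluation per (sample, background) pair, which matches the totals stated in the text.

```python
def ism_evaluations(seq_len, alternatives_per_position=3):
    """One evaluation of the original sequence plus one per single-base mutation."""
    return 1 + alternatives_per_position * seq_len


def shap_evaluations(samples_per_sequence, backgrounds):
    """Assumed cost model: each sample is evaluated against every background."""
    return samples_per_sequence * backgrounds
```

For a 200 bp sequence this gives 601 evaluations for ISM, against 40 000 for SHAP at 2000 samples and 20 backgrounds, and 400 000 for either of the two larger SHAP settings.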