Latent probabilistic modeling for mutational signature in cancer genomes

Yuichi Shiraishi(1), George Tremmel(1), Satoru Miyano(1) and Matthew Stephens(2)
1.  The University of Tokyo
2.  The University of Chicago
Yuichi Shiraishi et al. “A simple model-based approach to inferring and visualizing cancer mutation signatures”, PLoS Genetics, 2015.

Outline of the talk
•  Introduction and motivation
•  Proposed method
•  Experiments on real data

Mutational signature in cancer genome
•  Mutation patterns reflect
   –  mutagenic exposures
   –  DNA repair defects
•  Classical analysis revealed
   –  skin cancer: C>T, CC>TT
   –  lung cancer (smoking history): C>A
•  Sequence context is also important
   –  C>T at NpCpG sites
•  Problems in classical analysis
   –  Mutations are aggregated across samples within the same cancer type
   –  Sample-wise mutation patterns could not be explored.
Pfeifer et al., 2009

Huge mutation data produced by NGS
•  Extracting ‘sample-wise’ mutation signatures becomes possible.
•  New statistical methods are necessary to extract characteristic mutational signatures efficiently.
Alexandrov et al., Nature, 2013

Mutational Signature Extraction using NMF
1.  Classify mutations into 96 categories (considering one upstream and one downstream base)
    –  e.g., TpApG → TpCpG
    –  4 (previous base) × 4 (original base) × 4 (next base) × 3 (alternate base) ÷ 2 (complementary strand redundancy) = 96
2.  For each individual cancer genome, calculate the proportion of the 96 mutation patterns
3.  Perform Nonnegative Matrix Factorization (NMF).

Excerpt shown on the slide (Alexandrov et al., “Deciphering the Signatures of Mutational Processes from Somatic Mutational Catalogs of Cancer Genomes”): Equation 2 can be generalized to all K mutation types and G genomes by expressing the mutational catalogs, the signatures of mutational processes and the exposures as matrices, which simplifies to the matrix form

\[ M \approx P \times E \]  (Equation 3)

where M is the K × G matrix of mutational catalogs, P is the K × N matrix of signatures and E is the N × G matrix of exposures.

[Figure 1 of Alexandrov et al.: Modeling Signatures of Mutational Processes Operative in Cancer Genomes. (A) Simulated example of three mutational processes operative in a single cancer genome. The mutational catalog of the cancer genome is modeled as a linear superposition of the signatures of the three processes and the respective number of mutations contributed by each signature, plus added nonsystematic noise. (B) Simulated example illustrating mutational processes operative in a set of G cancer genomes. The mutational catalogs of these G cancer genomes can be used to decipher the signatures of N mutational processes as well as the number of mutations caused by each of the processes in each of the genomes. The extracted signatures and contributions do not allow an exact reconstruction of the original set, thus resulting in genome-specific reconstruction error.]
Alexandrov et al., Cell Reports, 2013
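The count of 96 categories in step 1 can be checked with a short sketch. The names `normalize` and `CATEGORIES` are illustrative and not taken from any package; reporting every mutation on the strand where the reference base is a pyrimidine (C or T) is assumed here as the way the strand redundancy is removed.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def normalize(prev_base, ref, next_base, alt):
    """Return the mutation as seen from the strand where the reference base is C or T."""
    if ref in "CT":
        return prev_base, ref, next_base, alt
    # Reverse-complement: the flanking bases swap sides and every base is complemented.
    return COMPLEMENT[next_base], COMPLEMENT[ref], COMPLEMENT[prev_base], COMPLEMENT[alt]

# Enumerate all 4 x 4 x 4 x 3 = 192 raw patterns and collapse strand-equivalent ones.
CATEGORIES = sorted({
    normalize(p, r, n, a)
    for p, r, n, a in product(BASES, repeat=4)
    if r != a
})

print(len(CATEGORIES))                # 96
print(normalize("T", "A", "G", "C"))  # ('C', 'T', 'A', 'G'): TpApG -> TpCpG on the other strand
```

Each genome's mutation catalog is then a vector of 96 counts (or proportions) indexed by `CATEGORIES`, which becomes one column of the matrix passed to NMF.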

C > T at NpCpG (Aging) C > A (smoking)
Mutation signatures based on the 96-category classification
http://cancer.sanger.ac.uk/cosmic/signatures

21 mutation signatures from ~7000 pan-cancer genomes (Alexandrov et al., Nature, 2013)

Problems of previous approach
•  The current framework cannot include many contextual factors.
•  What happens when considering two 5’ and two 3’ flanking bases?
   –  1536 patterns!
   –  Unstable estimates because of too many parameters
   –  Interpretation is very difficult
Alexandrov et al., Cell Reports, 2013
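The 1536 figure follows from the same counting as the 96 categories: 6 strand-collapsed substitution types times 4 choices for every flanking base. A small sketch, with the hypothetical helper `n_patterns`, shows how quickly the number of patterns grows with the context width.

```python
def n_patterns(n_flank, strand=False):
    """Number of distinct mutation patterns with n_flank bases on each side."""
    substitutions = 6                # C>A, C>G, C>T, T>A, T>C, T>G
    contexts = 4 ** (2 * n_flank)    # every flanking position can be A, C, G or T
    return substitutions * contexts * (2 if strand else 1)

for n in (1, 2, 3):
    print(n, n_patterns(n))
# 1 96
# 2 1536
# 3 24576
```

A signature defined as a free distribution over all patterns therefore needs one parameter per pattern, which is what makes the estimates unstable.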

Proposed approach
•  A new method that can treat “many contextual factors.”
•  Two extensions
   1.  Redefine a mutation signature using a probabilistic framework
   2.  Model mutational processes with a mixed membership model

Probabilistic mutation signature
•  Probabilistic model for generating each element of a mutation pattern
   –  Two 5’ and two 3’ flanking bases, substitution pattern and transcription strand
•  Each component is assumed to be determined independently
•  The number of parameters is dramatically reduced (3071 → 18)

Example signature (values as shown on the slide)

flanking sequence
pos    -2     -1     0      1      2
A      0.05   0.45   0      0.2    0.2
C      0.7    0.5    0.02   0.2    0.15
G      0.1    0.02   0      0.2    0.3
T      0.05   0.03   0.8    0.4    0.35

alternate base
alt    A      C      G      T
C      0.5    0      0      0.5
T      0.05   0.95   0      0

strand specificity
strand   +      -
         0.2    0.8
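The reduction from 3071 to 18 parameters can be reproduced by counting free parameters under the two definitions. The sketch below is illustrative only: the function names and the example signature are hypothetical, and the feature layout (one 6-valued substitution feature, four 4-valued flanking positions, one 2-valued strand feature) is an assumption chosen to match the counts on the slide.

```python
import math

# Assumed feature layout: substitution (6 values), four flanking positions
# (4 values each), transcription strand (2 values).
FEATURE_SIZES = [6, 4, 4, 4, 4, 2]

def free_parameters_full(sizes):
    """One probability per joint pattern, minus one for the sum-to-one constraint."""
    return math.prod(sizes) - 1

def free_parameters_independent(sizes):
    """Each feature has its own distribution with (size - 1) free parameters."""
    return sum(s - 1 for s in sizes)

def pattern_probability(signature, pattern):
    """Probability of a pattern under independence: product of per-feature probabilities."""
    return math.prod(dist[value] for dist, value in zip(signature, pattern))

# Hypothetical signature: a strong preference for one substitution type,
# uniform over every other feature.
signature = [[0.05, 0.05, 0.75, 0.05, 0.05, 0.05]] + [[0.25] * 4] * 4 + [[0.5, 0.5]]
pattern = (2, 0, 1, 2, 3, 0)  # one index per feature

print(free_parameters_full(FEATURE_SIZES))         # 3071
print(free_parameters_independent(FEATURE_SIZES))  # 18
print(pattern_probability(signature, pattern))     # 0.00146484375
```

The price of the independence assumption is that a single signature cannot express interactions between features, which is a modelling choice rather than a property of the data.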

Mutational process via mixed membership model
•  Mixed membership model
•  Latent Dirichlet Allocation (Blei et al., 2003)
•  Population structure estimation (Pritchard et al., 2000)
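A mixed membership model of this kind can be read as a generative process. The sketch below assumes that each mutation in a sample first picks a signature according to the sample's membership vector and then draws every feature independently from that signature. The function `sample_mutation` and all numeric values are hypothetical.

```python
import random

def sample_mutation(q_i, f, rng):
    """Draw one mutation feature vector for a sample with membership vector q_i."""
    # Choose which signature generated this mutation.
    k = rng.choices(range(len(q_i)), weights=q_i)[0]
    # Draw every feature independently from the chosen signature.
    return tuple(rng.choices(range(len(dist)), weights=dist)[0] for dist in f[k])

# Two hypothetical signatures over two features (3 values and 2 values).
f = [
    [[0.8, 0.1, 0.1], [0.9, 0.1]],
    [[0.1, 0.1, 0.8], [0.2, 0.8]],
]
q_i = [0.7, 0.3]  # this sample is 70% signature 0 and 30% signature 1

rng = random.Random(0)
mutations = [sample_mutation(q_i, f, rng) for _ in range(1000)]
print(mutations[:5])
```

In the topic-model analogy, a sample plays the role of a document, a mutation the role of a word, and a signature the role of a topic.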

Parameter estimation
•  Two sets of parameters need to be estimated from the available mutation data.
   –  f_{k,l}: mutation signature values for the k-th signature and the l-th feature
   –  q_i: membership parameters for the i-th individual
•  EM algorithm to maximise the likelihood, where g_{i,m} denotes the number of mutations in the i-th sample that have mutation feature vector m
   –  E-step: calculate the auxiliary variables

      \[ \theta_{i,k,m} = \frac{q_{i,k} \prod_{l=1}^{L} f_{k,l,m_l}}{\sum_{k'=1}^{K} q_{i,k'} \prod_{l=1}^{L} f_{k',l,m_l}} \tag{2} \]

   –  M-step: update the parameters, summing over all samples i for the signature values

      \[ f_{k,l,p} = \frac{\sum_{i} \sum_{m : m_l = p} g_{i,m}\, \theta_{i,k,m}}{\sum_{p'} \sum_{i} \sum_{m : m_l = p'} g_{i,m}\, \theta_{i,k,m}} \tag{3} \]

      \[ q_{i,k} = \frac{\sum_{m} g_{i,m}\, \theta_{i,k,m}}{\sum_{k'} \sum_{m} g_{i,m}\, \theta_{i,k',m}} \tag{4} \]

•  Used SQUAREM (Varadhan et al., Scandinavian Journal of Statistics, 2009), an R package implementing a general approach to accelerate the convergence of any fixed-point iterative scheme such as an EM algorithm.
•  To address potential convergence to local optima, the EM algorithm is run several times (10 times in the paper) from different initial points.
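The E-step and M-step on this slide can be sketched in plain Python. This is a minimal illustration and not the pmsignature implementation: it runs unaccelerated EM (no SQUAREM) from a single random start, the update for f is assumed to sum over all samples, and the function names and toy counts are hypothetical.

```python
import math
import random

def normalize(values):
    """Scale non-negative values to sum to one (uniform if they are all zero)."""
    total = sum(values)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def pattern_weights(q_i, f, m):
    """q_{i,k} * prod_l f_{k,l,m_l} for every signature k."""
    return [q_i[k] * math.prod(f[k][l][m[l]] for l in range(len(m)))
            for k in range(len(f))]

def log_likelihood(g, f, q):
    """Sum over samples and patterns of g_{i,m} * log(mixture probability of m)."""
    return sum(count * math.log(sum(pattern_weights(q[i], f, m)))
               for i, counts in enumerate(g) for m, count in counts.items())

def em_step(g, f, q):
    """One E-step plus M-step. g[i] maps a feature vector m to the count g_{i,m}."""
    K = len(f)
    f_num = [[[0.0] * len(dist) for dist in f[k]] for k in range(K)]
    q_num = [[0.0] * K for _ in g]
    for i, counts in enumerate(g):
        for m, count in counts.items():
            # E-step: theta_{i,k,m}, the share of pattern m attributed to signature k.
            theta = normalize(pattern_weights(q[i], f, m))
            for k in range(K):
                expected = count * theta[k]
                q_num[i][k] += expected             # numerator of the q update
                for l, value in enumerate(m):
                    f_num[k][l][value] += expected  # numerator of the f update
    # M-step: renormalize the expected counts.
    f_new = [[normalize(dist) for dist in f_num[k]] for k in range(K)]
    q_new = [normalize(row) for row in q_num]
    return f_new, q_new

def fit(g, feature_sizes, K, n_iter=200, seed=0):
    """Run EM from one random starting point and record the log-likelihood."""
    rng = random.Random(seed)
    f = [[normalize([rng.random() + 0.1 for _ in range(s)]) for s in feature_sizes]
         for _ in range(K)]
    q = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in g]
    history = [log_likelihood(g, f, q)]
    for _ in range(n_iter):
        f, q = em_step(g, f, q)
        history.append(log_likelihood(g, f, q))
    return f, q, history

# Toy data: two samples, two features (3 values and 2 values).
g = [
    {(0, 0): 40, (0, 1): 5, (2, 1): 5},
    {(2, 1): 35, (2, 0): 5, (0, 0): 10},
]
f_hat, q_hat, history = fit(g, feature_sizes=[3, 2], K=2)
print(history[0], history[-1])
print(q_hat)
```

The log-likelihood recorded in `history` should never decrease from one iteration to the next, which is a convenient sanity check for any EM implementation. Repeating `fit` with several values of `seed` and keeping the run with the highest final log-likelihood is one way to handle multiple starting points.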

Reanalysis of pan-cancer data (Alexandrov et al., Nature, 2013)
•  30 cancer types
•  7,042 cancer genomes
•  4,938,462 mutations
•  Applied the proposed method to each cancer type individually
   –  If signatures with the same characteristics emerge across different cancer types, those characteristics are likely to be real.
   –  For the summary, similar signatures were merged.

Reanalysis of Alexandrov et al.
[Figure: plots of the 27 merged signatures, labelled signature 1 to signature 27, each with a strand (+/−) panel.]

Reanalysis of Alexandrov et al.
[Figure: the same 27 signatures with annotation labels. Labels shown: temozolomide, Pol epsilon, smoking, APOBEC, “Signature 14”, “Signature R1”, smoking, “Signature 21”, Pol epsilon, Age, ultraviolet, APOBEC?, BRCA1, “Signature 15”, “Signature 19”, “Signature 21”, “Signature 16”, “Signature 17”, “Signature 9”.]

Strand-specific signatures
[Figure: signatures Head-and-Neck_1, Lung-Adeno_3, Lung-Squamous_2, Lung-Small-Cell_1, Head-and-Neck_3 and Melanoma_3, with a bar plot of intensity by base (A, C, G, T) for Head-and-Neck_3 and Melanoma_3. Annotations: tobacco, ultraviolet, transcription-coupled repair.]

APOBEC signature
[Figure: signatures ALL_2, Bladder_2, Breast_1, Cervix_1, Head-and-Neck_2, Kidney-Papillary_2, Lung-Adeno_2, Lung-Squamous_1, Lymphoma-B-Cell_2, Myeloma_2, Pancreas_2 and Thyroid_1, with a bar plot of intensity by base (A, C, G, T) for each.]
-  Estimated signatures are strongly consistent across cancer types
-  At the second 5’ base (position −2), C and T are frequent whereas G is rare.

Pol epsilon
•  “Signature 10” (C>A at TpCpT and C>T at TpCpG)
•  Two signatures are needed to represent somewhat complex mutation patterns.
[Figure: signatures Colorectum_4, Uterus_1, Colorectum_3 and Uterus_2, with the probability of each substitution type (C>A, C>G, C>T, T>A, T>C, T>G).]

Summary
•  A new method that can robustly treat many contextual factors.
•  The proposed method is closely related to mixed membership models (topic models, population structure models)
   –  We can utilize the techniques accumulated in other fields.
•  We found several characteristic patterns in the two 5’ flanking bases.

R package
•  The R package is available at …
   –  https://github.com/friend1ws/pmsignature
   –  Google “pmsignature”
   –  Uses Rcpp for the EM-algorithm part
•  A Shiny web app is also available at …
   –  https://friend1ws.shinyapps.io/pmsignature_shiny/
•  The paper has been published:
   –  Yuichi Shiraishi et al. “A simple model-based approach to inferring and visualizing cancer mutation signatures”, PLoS Genetics, 2015.

Acknowledgement
•  University of Tokyo
   –  George Tremmel
   –  Satoru Miyano
•  University of Chicago
   –  Matthew Stephens
