Slide 1

Slide 1 text

Sparse coding methods applied to DNA replication analysis
Clara Lage (Univ Lyon, ENS de Lyon, CNRS, Laboratoire de Physique)
CentraleSupélec, 2023

Slide 2

Slide 2 text

My scientific journey
- PhD thesis: Price signal quality in energy optimization (2017 - 2020)
  Institutions: Paris Sorbonne University and IMPA (Instituto Nacional de Matematica Pura e Aplicada).
  Advisors: Mikhail Solodov, Claudia Sagastizabal and Jean-Marc Bonnisseau.
  § Stochastic optimization
  § Energy generation
- Postdoc at Ecole Polytechnique (2020 - 2022). Collaborator: Emmanuel Gobet.
  § Optimal transport optimization
  § Dictionary learning
  § Application in finance (credit model)
- Postdoc at ENS de Lyon (2022 - ). Collaborators: Benjamin Audit, Nelly Pustelnik and Jean-Michel Arbona.
  § Inverse problems
  § Off-the-grid sparse coding
  § Application in biology (DNA replication)

Slide 3

Slide 3 text

Presentation guide
1. DNA replication in single molecules
2. Dictionary approach
3. Off-the-grid and hybrid approaches
4. Ongoing work

Slide 4

Slide 4 text

General framework
- Microscope images are widely used to study the structure and function of cells.
- Some characteristics of replication can be studied from images.
- This work aims to understand replication at a more detailed level, combining genetic sequencing data and DNA replication analysis.

Slide 5

Slide 5 text

Context
Main objective: better characterize the replication of DNA.
- Position of the origins of replication
- Speed and direction of replication
Related application:
- Characterization of replication stress in cancer cells

Slide 7

Slide 7 text

Single cell characterization of DNA replication
- The experiment is performed during DNA replication.
- A labeled nucleotide (BrdU), incorporated into the DNA, reports the fork progression.

Slide 10

Slide 10 text

Data
Case 1: Isolated fork

Slide 11

Slide 11 text

Data
Case 2: Origin
[Figure panels: experimental configuration; signal]

Slide 12

Slide 12 text

Data
Case 3: Replication starts or terminates after the beginning of the pulse

Slide 13

Slide 13 text

Data
Case 4: Anomalies during replication (the fork stops during replication)

Slide 14

Slide 14 text

BrdU signal
- Original data: estimated BrdU incorporation probabilities.
- BrdU incorporation profiles = spatial signals that recapitulate the temporal variation of the intracellular BrdU concentration.
- Non-Gaussian noise.

Slide 15

Slide 15 text

NanoForkSpeed (NFS)
- NFS approximates forks by piecewise linear functions.
- With NFS we cannot detect terminations or initiations that occur after the beginning of the pulse, nor anomalies in replication.

Slide 16

Slide 16 text

Main objectives
- Precision in the detection of the speed, position and direction of the fork
- Detection of forks in different phases of replication
- Detection of anomalies in fork progression

Slide 17

Slide 17 text

Dictionary approach

Slide 18

Slide 18 text

Dictionary approach
- The profile of BrdU concentration in time is a known function $\psi(t)$.
- The resulting spatial profile depends on the speed of the fork.
- The level of residual BrdU differs between yeast and human cells.
[Figure panels: BrdU profile; different scales/speeds]
Different positions and scales: $\varphi(\cdot; (x_0, s)) = \xi_s \, \psi\left(\frac{\cdot - x_0}{s}\right)$
- $x_0$: position of the fork at time $t = 0$
- $s$: speed of the fork

Slide 19

Slide 19 text

Dictionary approach
Examples of dictionary fit:

Slide 20

Slide 20 text

Time-scale dictionary
Elements of the dictionary = the BrdU function $\varphi(\cdot; (x, s))$. For a choice of $(x, s)$ in the parameter space $\Theta = X \times S$ (space $\times$ speed):
$\mathcal{D} = \left\{ \{\varphi(\cdot; (x, s))\}_{(x,s) \in \Theta} : \varphi(\cdot; (x, s)) = \zeta_s \, \psi\left(\frac{\cdot - x}{s}\right), \ \zeta_s \in \mathbb{R} \right\}$
A discrete choice of parameters results in a discrete set of dictionary elements $\{\tilde d_1, \dots, \tilde d_L\}$, where $L = K \times N$: the number of different speeds times the number of different positions.
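As a rough illustration of the discretized dictionary, the sketch below builds the $L = K \times N$ elements on a sampling grid. The pulse profile `psi`, the grid, the candidate positions and speeds, and the reading of $\zeta_s$ as a unit-norm scaling are all assumptions made for this example, not values from the talk.

```python
import numpy as np

def psi(t):
    """Hypothetical pulse profile: zero before the pulse, smooth rise, slow decay."""
    t = np.clip(np.asarray(t, dtype=float), 0.0, None)
    return t**2 * np.exp(-t)

def atom(grid, x, s):
    """Sampled element phi(.; (x, s)) = zeta_s * psi((. - x) / s), scaled to unit norm."""
    values = psi((grid - x) / s)
    return values / np.linalg.norm(values)

def build_dictionary(grid, positions, speeds):
    """One column per (speed, position) pair, so L = K x N columns."""
    return np.column_stack([atom(grid, x, s) for s in speeds for x in positions])

grid = np.arange(0.0, 200.0)               # sampled genomic positions (arbitrary units)
positions = np.arange(20.0, 160.0, 20.0)   # N = 7 candidate fork positions
speeds = [-3.0, -1.5, 1.5, 3.0]            # K = 4 signed speeds; the sign encodes direction
D = build_dictionary(grid, positions, speeds)
print(D.shape)  # (200, 28)
```

Storing every (speed, position) pair as a column is what makes the grid-based dictionary large, which motivates the convolutional form used later in the talk.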

Slide 21

Slide 21 text

Standard sparse coding formulation
$\tilde D = \begin{bmatrix} | & & | \\ \tilde d_1 & \cdots & \tilde d_L \\ | & & | \end{bmatrix}$
Optimization problem: $\min_{\alpha \in \mathbb{R}^L} \|\tilde D \alpha - z\|_2^2 + \lambda \|\alpha\|_*$
Parameters of interest:
§ Position ($x_0$ / $\alpha$): position in the genome of each fork
§ Scale ($|s|$ / $d_l$): speed of replication of each fork
§ Direction ($\mathrm{sign}(s)$ / $d_l$): direction of replication
§ Amplitude ($\alpha$): value of the maximum BrdU concentration

Slide 22

Slide 22 text

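Standard sparse coding formulation
Optimization problem using convolution:
$\min_{\alpha \in \mathbb{R}^L} \|\tilde D \alpha - z\|_2^2 + \lambda \|\alpha\|_*$
becomes
$\min_{(a_1, \dots, a_K) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n} \left\| \sum_{k=1}^{K} d_k * a_k - z \right\|_2^2 + \lambda \sum_{k=1}^{K} \|a_k\|_1$
Reduction of dictionary dimension: $K$ kernels $d_k$ instead of $L = K \times N$ shifted elements.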
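A minimal sketch of how the convolutional objective can be evaluated, assuming one kernel $d_k$ per speed and one activation vector $a_k$ per kernel. The Hann-window kernels, the signal length and the value of `lam` are placeholders chosen for the example only.

```python
import numpy as np

def conv_objective(kernels, activations, z, lam):
    """|| sum_k d_k * a_k - z ||_2^2 + lam * sum_k ||a_k||_1, using 'same'-size convolutions."""
    model = np.zeros_like(z, dtype=float)
    for d_k, a_k in zip(kernels, activations):
        model += np.convolve(a_k, d_k, mode="same")
    residual = model - z
    penalty = sum(np.abs(a_k).sum() for a_k in activations)
    return float(residual @ residual + lam * penalty)

n = 50
kernels = [np.hanning(11), np.hanning(21)]   # K = 2 hypothetical kernels
activations = [np.zeros(n), np.zeros(n)]
activations[0][10] = 2.0                     # one spike for the first kernel
activations[1][30] = 1.0                     # one spike for the second kernel
z = sum(np.convolve(a, d, mode="same") for d, a in zip(kernels, activations))
value = conv_objective(kernels, activations, z, lam=0.1)
print(value)  # only the l1 penalty remains, since the fit is exact
```

Each nonzero entry of $a_k$ marks a fork position for kernel $k$, so positions no longer need to be enumerated as separate dictionary columns.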

Slide 23

Slide 23 text

Kullback-Leibler as fidelity term
The noise in the case under study is not Gaussian, so the Euclidean norm $\|\cdot\|$ is replaced by the Kullback-Leibler divergence:
$\mathrm{KL}(z \mid w) = \sum_n z_n \log\left(\frac{z_n}{w_n}\right) + w_n - z_n$
The optimization problem is given by:
$\min_{(a_1, \dots, a_K) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n} \mathrm{KL}\left(z \,\middle|\, \sum_{k=1}^{K} d_k * a_k\right) + \lambda \sum_{k} \|a_k\|_1$
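The fidelity term above can be computed as in the following sketch. The small constant `eps` and the convention $0 \log 0 = 0$ are implementation choices assumed here for numerical safety, not details taken from the talk.

```python
import numpy as np

def kl_divergence(z, w, eps=1e-12):
    """Generalized KL divergence: sum_n z_n * log(z_n / w_n) + w_n - z_n."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    # eps keeps the logarithm finite; entries with z_n = 0 contribute only w_n.
    log_term = np.where(z > 0, z * np.log((z + eps) / (w + eps)), 0.0)
    return float(np.sum(log_term + w - z))

z = np.array([1.0, 2.0])
w = np.array([2.0, 1.0])
print(kl_divergence(z, w))   # close to log(2)
print(kl_divergence(z, z))   # close to 0
```

Unlike the squared Euclidean norm, this quantity is not symmetric in its two arguments, so the data $z$ and the model $w$ must be passed in the order shown.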

Slide 24

Slide 24 text

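Matching Pursuit
Optimization problem for Matching Pursuit (before convolution):
$\min_{\{\alpha \,:\, \|\alpha\|_0 = M_0\}} \left\| \sum_{l=1}^{L} \alpha_l \tilde d_l - z \right\|^2$
Main idea: for a dictionary element $\tilde d_l$ with $\|\tilde d_l\| = 1$, we write $z = \langle \tilde d_l, z \rangle \tilde d_l + r$. Then clearly $\langle \tilde d_l, r \rangle = 0$, and:
$\|z\|^2 = |\langle \tilde d_l, z \rangle|^2 + \|r\|^2$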
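For completeness, the orthogonality and the energy split used on this slide follow in two lines from $\|\tilde d_l\| = 1$ (a short derivation, written with the squared inner product):

```latex
\langle \tilde d_l, r \rangle
  = \langle \tilde d_l, z \rangle - \langle \tilde d_l, z \rangle \, \|\tilde d_l\|^2 = 0,
\qquad
\|z\|^2
  = |\langle \tilde d_l, z \rangle|^2 \, \|\tilde d_l\|^2
    + 2 \, \langle \tilde d_l, z \rangle \langle \tilde d_l, r \rangle
    + \|r\|^2
  = |\langle \tilde d_l, z \rangle|^2 + \|r\|^2 .
```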

Slide 25

Slide 25 text

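Matching Pursuit
$\|z\|^2 = |\langle \tilde d_l, z \rangle|^2 + \|r\|^2$
Since we would like to minimize $\|r\|^2$, we should find $l$ such that $\mathrm{score}^0_l = \frac{\langle z, \tilde d_l \rangle}{\|z\| \, \|\tilde d_l\|}$ is maximum.
- At each iteration $\ell$ we compute $\max_l \mathrm{score}^\ell_l$, where $\mathrm{score}^\ell_l = \frac{\left\langle z - \sum_{j=1}^{\ell} \alpha_{l(j)} \tilde d_{l(j)}, \ \tilde d_l \right\rangle}{\|\tilde d_l\| \, \left\| z - \sum_{j=1}^{\ell} \alpha_{l(j)} \tilde d_{l(j)} \right\|}$ and $l(j)$ denotes the element selected at iteration $j$.
- The convolution integrates well into the MP approach, which makes the computation very fast.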
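The greedy selection can be sketched as follows, assuming unit-norm dictionary columns and non-negative amplitudes (so the signed correlation is maximized, as in the score above). The boxcar dictionary is a toy example chosen so that the result is easy to check by hand; it is not the BrdU dictionary.

```python
import numpy as np

def matching_pursuit(D, z, n_atoms):
    """Greedy Matching Pursuit for unit-norm columns of D; returns (alpha, residual)."""
    alpha = np.zeros(D.shape[1])
    residual = np.array(z, dtype=float)
    for _ in range(n_atoms):
        correlations = D.T @ residual          # <residual, d_l> for every element
        best = int(np.argmax(correlations))    # same argmax as the normalized score
        alpha[best] += correlations[best]
        residual = residual - correlations[best] * D[:, best]
    return alpha, residual

# Toy dictionary: 8 orthonormal boxcar elements (hypothetical, for checking only).
D = np.zeros((40, 8))
for k in range(8):
    D[5 * k:5 * k + 5, k] = 1.0 / np.sqrt(5.0)
z = 3.0 * D[:, 2] + 1.5 * D[:, 6]
alpha, residual = matching_pursuit(D, z, n_atoms=2)
print(np.flatnonzero(alpha), np.linalg.norm(residual))
```

Dividing by $\|z - \sum_j \alpha_{l(j)} \tilde d_{l(j)}\|$ rescales all scores by the same positive factor, so the code skips that normalization when choosing the element.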

Slide 26

Slide 26 text

Results for simulated data: human and yeast cases
§ Simulated data generated with Poisson noise with parameter $\beta$
§ Total number of forks for each $\beta$: 200
§ Speed values: between 500 and 3000 bp/min

Slide 27

Slide 27 text

Matching Pursuit vs. NFS
A simulation that reproduces real conditions allows us to visualize and measure the improvements brought by the approach.

Slide 28

Slide 28 text

Matching Pursuit and extra dictionaries
- Extra dictionaries: we are able to detect origins that were not detected by NFS.
- Score = accuracy of fork detection

Slide 29

Slide 29 text

MP vs. NFS: simulated data examples

Slide 30

Slide 30 text

MP vs. NFS
Comparison of speed detection for MP and NFS:

Slide 31

Slide 31 text

Off-the-grid methods

Slide 32

Slide 32 text

Off-the-grid sparse coding (BLASSO)
Context:
§ The parameter set is continuous and the elements of the dictionary are functions: $\mathcal{D} = \{\varphi(\cdot; \theta) \in L^2(\Theta) : \theta \in \Theta\}$, $\Theta = X \times S$
§ The optimization problem is formulated in the measure space $\mathcal{M}(\Theta)$.
§ Sparsity is replaced by the assumption that the solution can be well approximated by a sum of Dirac measures: $m = \sum_{i=1}^{L} \alpha_i \delta_{\theta_i}$, where $\alpha_i \in \mathbb{R}$ and $\theta_i \in \Theta$.
Optimization problem: $\min_{m \in \mathcal{M}(\Theta)} \|\Phi(m) - z\|_2^2 + \gamma \, |m|(\Theta)$

Slide 33

Slide 33 text

Frank-Wolfe Method
Objective: solve problems of the type $\min_{w \in C} f(w)$, where
§ $C$ is compact and convex
§ $f$ is differentiable
Frank-Wolfe algorithm:
1. $p_k \in \arg\min_{p \in C} df(w_k)(p - w_k)$
2. choose $\nu \in [0, 1]$ and define $w_{k+1} := w_k + \nu \, (p_k - w_k)$
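To make the two steps concrete, here is a small finite-dimensional sketch: least squares over an $\ell_1$ ball, where the linear minimization in step 1 always returns a vertex of the ball. The problem instance, the exact line search for $\nu$, and the stopping rule based on the Frank-Wolfe gap are choices made for this example and are not taken from the talk.

```python
import numpy as np

def frank_wolfe_l1(A, z, radius, n_iter=500, tol=1e-10):
    """Minimize ||A w - z||_2^2 over the l1 ball {w : ||w||_1 <= radius}."""
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - z)
        # Step 1: the linear problem over the l1 ball is solved at a signed vertex.
        i = int(np.argmax(np.abs(grad)))
        p = np.zeros_like(w)
        p[i] = -radius * np.sign(grad[i])
        direction = p - w
        gap = -grad @ direction            # Frank-Wolfe gap, non-negative on the ball
        if gap <= tol:
            break
        # Step 2: exact line search for a quadratic objective, clipped to [0, 1].
        Ad = A @ direction
        nu = min(1.0, gap / (2.0 * (Ad @ Ad)))
        w = w + nu * direction
    return w

A = np.eye(3)
z = np.array([2.0, 0.0, 0.0])
w_hat = frank_wolfe_l1(A, z, radius=1.0)
print(w_hat)
```

Each iterate is a convex combination of vertices, which is the finite-dimensional analogue of building a measure one spike at a time.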

Slide 34

Slide 34 text

Sliding Frank-Wolfe Method
$\min_{m \in \mathcal{M}(\Theta)} \|\Phi(m) - z\|_2^2 + \gamma \, |m|(\Theta)$ (BLASSO)

Slide 35

Slide 35 text

Sliding Frank-Wolfe Method
$\min_{\{m \in \mathcal{M}(\Theta) \,:\, |m|(\Theta) \le C\}} \|\Phi(m) - z\|_2^2 + \gamma \, |m|(\Theta)$, with $C = \frac{\|z\|^2}{\gamma}$
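A short justification of the bound, reading the constant as $C = \|z\|_2^2 / \gamma$ (an interpretation of the slide): comparing any minimizer $m^\star$ of (BLASSO) with the zero measure, and using the linearity of $\Phi$, gives

```latex
\gamma \, |m^\star|(\Theta)
  \;\le\; \|\Phi(m^\star) - z\|_2^2 + \gamma \, |m^\star|(\Theta)
  \;\le\; \|\Phi(0) - z\|_2^2 + \gamma \cdot 0
  \;=\; \|z\|_2^2
\quad \Longrightarrow \quad
|m^\star|(\Theta) \le \frac{\|z\|_2^2}{\gamma} = C .
```

Under this reading, adding the constraint $|m|(\Theta) \le C$ does not remove any minimizer of the original problem.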

Slide 38

Slide 38 text

Sliding Frank-Wolfe Method
$\min_{\{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) \,:\, |m|(\Theta) \le r \le C\}} \|\Phi(m) - z\|_2^2 + \gamma r$
1. First remark: $\mathcal{C} = \{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) : |m|(\Theta) \le r \le C\}$ is compact in the weak-* topology.
2. Second remark: $T(r, m) := \|\Phi(m) - z\|_2^2 + \gamma r$ is differentiable on $\mathcal{C}$.
3. We can apply the Frank-Wolfe algorithm!

Slide 39

Slide 39 text

Sliding Frank-Wolfe Method
1. First remark: $\mathcal{C} = \{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) : |m|(\Theta) \le r \le C\}$ is compact in the weak-* topology.
2. Second remark: $T(r, m) := \|\Phi(m) - z\|_2^2 + \gamma r$ is differentiable on $\mathcal{C}$.
3. We can apply the Frank-Wolfe algorithm!
4. It is necessary to simplify the first step of Frank-Wolfe: optimization problem in $\mathcal{C}$ $\Longrightarrow$ optimization problem in $\Theta$.

Slide 40

Slide 40 text

Sliding Frank-Wolfe Method
Step 1: Estimate an additional spike: $\bar\theta = \arg\max_{\theta \in \Theta} \eta(\theta)$, where
$\eta(\theta) = \left| \frac{1}{2\gamma} \left\langle \underbrace{z - \Phi\left(\sum_{i=1}^{M} \alpha_i \delta_{\theta_i}\right)}_{\text{last iteration error}}, \ \overbrace{\varphi(\cdot, \theta)}^{\text{dictionary}} \right\rangle \right|$
and $M$ is the number of spikes at the last iteration.
Steps 2 and 3: Re-estimate the amplitudes and the spike parameters, starting from the set of spikes accumulated up to the last iteration.
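Step 1 can be approximated numerically by scoring candidate parameters against the current residual. The sketch below uses a coarse search over $(x, s)$ pairs as a stand-in for the continuous maximization; the profile `psi`, the unit-norm scaling of the elements, the candidate values and `gamma` are assumptions made for illustration only.

```python
import numpy as np

def psi(t):
    """Hypothetical pulse profile: zero before the pulse, smooth rise, slow decay."""
    t = np.clip(np.asarray(t, dtype=float), 0.0, None)
    return t**2 * np.exp(-t)

def atom(grid, x, s):
    """Sampled element phi(.; (x, s)), scaled to unit Euclidean norm."""
    values = psi((grid - x) / s)
    return values / np.linalg.norm(values)

def eta(grid, residual, x, s, gamma):
    """|<residual, phi(.; (x, s))>| / (2 * gamma), with a discrete inner product."""
    return abs(residual @ atom(grid, x, s)) / (2.0 * gamma)

def new_spike(grid, residual, xs, speeds, gamma):
    """Coarse search for the (x, s) pair that maximizes eta."""
    scores = np.array([[eta(grid, residual, x, s, gamma) for s in speeds] for x in xs])
    ix, i_s = np.unravel_index(np.argmax(scores), scores.shape)
    return xs[ix], speeds[i_s], scores[ix, i_s]

grid = np.arange(0.0, 200.0)
residual = 2.0 * atom(grid, 60.0, 2.0)    # error left by one fork not yet explained
xs = np.arange(40.0, 81.0, 5.0)
speeds = np.array([1.0, 2.0, 3.0])
x_new, s_new, score = new_spike(grid, residual, xs, speeds, gamma=0.5)
print(x_new, s_new, score)
```

In an actual off-the-grid solver the coarse maximizer would only initialize a continuous optimization over $\theta$, followed by the re-estimation of steps 2 and 3.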

Slide 41

Slide 41 text

Scale dictionary specificity
- The function $\nabla_x \eta$ presents instabilities with respect to the scale $s$.
- These instabilities are related to oscillations in the gradient $\nabla_x \psi$ with a relatively large grid.
[Figure: $\gamma \nabla_x \eta$ for $z = \varphi(\cdot; (\bar x, \bar s))$, where $(\bar x, \bar s) = (100\ \text{kbp}, 1\ \text{kbp/min})$, as a function of the variables $x$ and $s$.]

Slide 42

Slide 42 text

Hybrid Method
Hybrid formulation:
§ Speed as a discrete parameter
§ Continuous positions
§ Parameter set: $\Theta = X \times \{s_1, \dots, s_K\}$
§ The optimization variables are measures $m_k \in \mathcal{M}(X)$
Hybrid optimization problem:
$\min_{m \in \mathcal{M}(X)^K} \left\| \sum_{k=1}^{K} \Phi_k(m_k) - z \right\|_2^2 + \gamma \sum_{k=1}^{K} |m_k|(X)$,
where $m = (m_1, \dots, m_K)$ and $m_k = \sum_{j=1}^{N_k} \alpha_{k,j} \delta_{x_{k,j}}$ with $x_{k,j} \in X$.
A new algorithm is able to solve the hybrid problem.

Slide 43

Slide 43 text

Hybrid Method
$\min_{m = (m_1, \dots, m_K) \in \mathcal{M}(X)^K} \left\| \sum_{k=1}^{K} \Phi_k(m_k) - z \right\|_2^2 + \lambda \sum_{k=1}^{K} |m_k|(X)$ (Hybrid)

Slide 44

Slide 44 text

Hybrid Method
$\min_{\{m = (m_1, \dots, m_K) \in \mathcal{M}(X)^K \,:\, \sum_k |m_k|(X) \le C\}} \left\| \sum_{k=1}^{K} \Phi_k(m_k) - z \right\|_2^2 + \lambda \sum_{k=1}^{K} |m_k|(X)$

Slide 47

Slide 47 text

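Hybrid Method
$\min_{\{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K \,:\, \sum_k |m_k|(X) \le r \le C\}} \left\| \sum_{k=1}^{K} \Phi_k(m_k) - z \right\|_2^2 + \lambda r$
1. First remark: $\mathcal{C} = \{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K : \sum_k |m_k|(X) \le r \le C\}$ is compact in the weak-* topology.
2. Second remark: $T(r, m) := \left\| \sum_{k=1}^{K} \Phi_k(m_k) - z \right\|_2^2 + \lambda r$ is differentiable on $\mathcal{C}$.
3. We can apply the Frank-Wolfe algorithm!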

Slide 48

Slide 48 text

Hybrid Method
1. First remark: $\mathcal{C} = \{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K : \sum_k |m_k|(X) \le r \le C\}$ is compact in the weak-* topology.
2. Second remark: $T(r, m) := \left\| \sum_{k=1}^{K} \Phi_k(m_k) - z \right\|_2^2 + \lambda r$ is differentiable on $\mathcal{C}$.
3. We can apply the Frank-Wolfe algorithm!
4. It is necessary to simplify the first step of Frank-Wolfe: optimization problem in $\mathcal{C}$ $\Longrightarrow$ optimization problem in $X \times \{s_1, \dots, s_K\}$.

Slide 49

Slide 49 text

Hybrid Algorithm
Step 1: Estimate an additional spike: $(\bar x, \bar s_i) = \arg\max_{x \in X, \ s \in \{s_1, s_2, \dots, s_K\}} \eta_H(x, s)$, where
$\eta_H(x, s) = \left| \frac{1}{2\gamma} \left\langle \underbrace{z - \sum_{k=1}^{K} \sum_{i=1}^{M_k} \alpha_{k,i} \, \psi\left(\frac{\cdot - x_{k,i}}{s_k}\right)}_{\text{last iteration error}}, \ \overbrace{\varphi(\cdot, (x, s))}^{\text{dictionary}} \right\rangle \right|$
and $M_k$ is the number of spikes for the speed $s_k$ at the last iteration. Add $\bar x$ to the set of spikes of speed $\bar s_i$.
Steps 2 and 3: Re-estimate the amplitudes and the spike positions, starting from the set of spikes accumulated up to the last iteration.
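One possible way to carry out Step 1 with discrete speeds and continuous positions is sketched below: for each speed, a coarse scan over positions is refined by a one-dimensional golden-section search, and the best (position, speed) pair is kept. The profile `psi`, the unit-norm scaling, the golden-section refinement and all numerical values are assumptions of this example, not the algorithm presented in the talk.

```python
import numpy as np

def psi(t):
    """Hypothetical pulse profile: zero before the pulse, smooth rise, slow decay."""
    t = np.clip(np.asarray(t, dtype=float), 0.0, None)
    return t**2 * np.exp(-t)

def atom(grid, x, s):
    """Sampled element phi(.; (x, s)), scaled to unit Euclidean norm."""
    values = psi((grid - x) / s)
    return values / np.linalg.norm(values)

def golden_max(f, lo, hi, tol=1e-3):
    """Golden-section search for a maximizer of f, assumed unimodal on [lo, hi]."""
    ratio = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

def hybrid_new_spike(grid, residual, coarse_xs, speeds, gamma):
    """Maximize eta_H over continuous x for each discrete speed; keep the best pair."""
    step = coarse_xs[1] - coarse_xs[0]
    best = (None, None, -np.inf)
    for s in speeds:
        score = lambda x, s=s: abs(residual @ atom(grid, x, s)) / (2.0 * gamma)
        x0 = coarse_xs[int(np.argmax([score(x) for x in coarse_xs]))]
        x_ref = golden_max(score, x0 - step, x0 + step)
        if score(x_ref) > best[2]:
            best = (x_ref, s, score(x_ref))
    return best

grid = np.arange(0.0, 200.0)
residual = 2.0 * atom(grid, 63.7, 2.0)    # one unexplained fork at an off-grid position
x_hat, s_hat, score = hybrid_new_spike(
    grid, residual, np.arange(40.0, 81.0, 5.0), np.array([1.0, 2.0, 3.0]), gamma=0.5
)
print(round(x_hat, 2), s_hat)
```

Keeping the speed on a finite list means only a one-dimensional search in $x$ is needed per speed, which avoids differentiating $\eta$ with respect to the scale.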

Slide 50

Slide 50 text

Comparative test (Hybrid vs. MP)
§ Simulated data generated with Poisson noise with parameter $\beta$
§ Human cells simulation (residual BrdU = 0)

Slide 51

Slide 51 text

Comparative test (Hybrid vs. MP)
§ Simulated data generated with Poisson noise with parameter $\beta$
§ Human cells simulation (residual BrdU = 0), $\beta = 700$
§ Total number of forks for each $\beta$: 200
Hybrid: lower mean absolute error

Slide 52

Slide 52 text

Comparative test (Hybrid vs. MP)
§ Simulated data generated with Poisson noise with parameter $\beta$
§ Human cells simulation (residual BrdU = 0), $\beta = 700$
§ Total number of forks for each $\beta$: 200
Hybrid: the quality of the parameter estimation can be inferred from the reconstruction error

Slide 53

Slide 53 text

Comparative test (Hybrid vs. MP)
Most precise method with respect to the type of cell and noise level:
Noise level   Human    Yeast
High          MP       MP
Low           Hybrid   MP

Slide 54

Slide 54 text

Ongoing work

Slide 55

Slide 55 text

Extra dictionaries for MP
(+) Extra dictionaries: we are able to detect origins
(-) Larger number of dictionaries
(-) Limitation in terms of speed

Slide 56

Slide 56 text

Anomalies of replication
Anomalies in replication and terminations cannot be identified by a dictionary approach.

Slide 57

Slide 57 text

Ongoing work
- The timing profile contains all replication parameters.
- Constant speed $\Rightarrow$ the timing profile is a piecewise linear function.
- $\tau : X \to \mathbb{R}$, the time at which replication passes through position $x$

Slide 58

Slide 58 text

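Ongoing work
For a timing profile $(\tau_1, \dots, \tau_n)$, define:
$B : \mathbb{R}^n_+ \longrightarrow \mathbb{R}^n$, $(\tau_1, \dots, \tau_n) \mapsto (\psi(\tau_1), \dots, \psi(\tau_n))$.
Then: $z \approx B(\tau)$
Property: $\#\{\psi^{-1}(b)\} = 2, \ \forall b \in B(\mathbb{R}_+)$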
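The two-preimage property can be illustrated with a toy pulse whose inverses have a closed form. The profile $\psi(t) = 4t/(1+t)^2$ used below is hypothetical (chosen only because $\psi(t) = b$ is a quadratic equation in $t$), and the branch selection relies on the known timing profile, which is exactly what is unavailable with real data.

```python
import numpy as np

def psi(t):
    """Hypothetical unimodal pulse: rises on [0, 1], peaks at psi(1) = 1, then decays."""
    t = np.asarray(t, dtype=float)
    return 4.0 * t / (1.0 + t) ** 2

def psi_inverses(b):
    """Both solutions of psi(t) = b for 0 < b <= 1: (rising branch, falling branch)."""
    b = np.asarray(b, dtype=float)
    root = 2.0 * np.sqrt(np.clip(1.0 - b, 0.0, None))
    return ((2.0 - b) - root) / b, ((2.0 - b) + root) / b

x = np.linspace(0.0, 10.0, 21)
tau = 0.2 + x / 2.5                           # linear timing profile (constant speed)
z = psi(tau)                                  # noiseless signal z = B(tau)
early, late = psi_inverses(z)
tau_rec = np.where(tau <= 1.0, early, late)   # oracle branch choice, for checking only
print(np.allclose(tau_rec, tau))
```

Each sample of $z$ is compatible with two times, one on each branch, so recovering $\tau$ amounts to deciding where the profile switches from one inverse to the other.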

Slide 59

Slide 59 text

Ongoing work
We consider the two possible inverses.
Goal: alternate between them to find the best piecewise linear fit.

Slide 60

Slide 60 text

Ongoing work
Main challenges:
- Find a suitable norm for comparing the two inverses
- Find the points at which the fit switches between the two inverses

Slide 61

Slide 61 text

Ongoing work
Example with real data:

Slide 62

Slide 62 text

Ongoing work
Example with real data:

Slide 63

Slide 63 text

Conclusion: comparative tables
Most precise method with respect to the type of cell and noise level:
Noise level   Human    Yeast
High          MP       MP
Low           Hybrid   MP
Methods with respect to the detection of different replication events:
Method         Isolated fork   Initiation   Termination   Anomalies
NFS            ✓               x            x             x
MP             ✓               ✓            x             x
Hybrid         ✓               x            x             x
Ongoing work   ✓               ✓            ✓             ?

Slide 64

Slide 64 text

Conclusion
Considering the DNA replication problem:
§ Matching Pursuit has proven to be the best of the grid-based methods for the dictionary approach.
§ The off-the-grid Hybrid method can be superior at lower noise levels in the case of human cells.
§ Working directly with the timing profile is a promising approach, especially because of its flexibility with respect to the shape of the forks.
Thank you for your attention!
Link HAL: hal.science/hal-04146737/
Link Gretsi: gretsi.fr/data/colloque/pdf/2023 lage1264.pdf