
Clara Lage

S³ Seminar
October 13, 2023

Clara Lage (ENS de Lyon, Physics laboratory)

Title: Sparse coding methods applied to DNA replication analysis

Abstract: Cellular replication is widely studied using modern imaging methods. This work aims to understand DNA replication at its most detailed level, which cannot be achieved with microscopy images. The data combine genetic sequencing and replication progress [1]. Our goal is to estimate important parameters such as the positions of replication origins along the DNA strand and the speed and direction of replication. We will discuss a time-scale dictionary approach with several sparse coding methods: LASSO, Matching Pursuit and Sliding Frank-Wolfe [2,4] (an off-the-grid approach). A hybrid method [3], which treats speed as a discrete parameter and position as a continuous one, will be presented and compared with the others in the case of DNA replication. We will also discuss other ways of approaching this problem by reformulating it as an inverse problem.

References

[1] B. Theulot et al. “Genome-wide mapping of individual replication fork velocities using nanopore sequencing”. In: Nature Communications 13 (2022). doi: 10.1038/s41467-022-31012-0.

[2] B. Laville, L. Blanc-Féraud, and G. Aubert. “Off-The-Grid Variational Sparse Spike Recovery: Methods and Algorithms”. In: Journal of Imaging 7.12 (2021). issn: 2313-433X. doi: 10.3390/jimaging7120266. url: https://www.mdpi.com/2313-433X/7/12/266.

[3] Clara Lage et al. Time-Scale Hybrid Continuous-Discrete Sliding Frank-Wolfe Method. eprint: https://hal.science/hal-04146737/.

[4] Q. Denoyelle, V. Duval, G. Peyré, and E. Soubies. “The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy”. In: Inverse Problems 36.1 (2019), p. 014001. doi: 10.1088/1361-6420/ab2a29.


Transcript

  1. Sparse coding methods applied to DNA replication analysis

    Clara Lage (Univ Lyon, ENS de Lyon, CNRS, Laboratoire de Physique). CentraleSupélec, 2023
  2. My scientific journey

    - PhD thesis: Price signal quality in energy optimization (2017 - 2020). Institutions: Paris Sorbonne University and IMPA (Instituto Nacional de Matematica Pura e Aplicada). Advisors: Mikhail Solodov, Claudia Sagastizabal and Jean-Marc Bonnisseau.
      § Stochastic optimization
      § Energy generation
    - Postdoc at Ecole Polytechnique (2020 - 2022). Collaborator: Emmanuel Gobet.
      § Optimal transport optimization
      § Dictionary learning
      § Application in finance (credit model)
    - Postdoc at ENS de Lyon (2022 - ). Collaborators: Benjamin Audit, Nelly Pustelnik and Jean-Michel Arbona.
      § Inverse problems
      § Off-the-grid sparse coding
      § Application in biology (DNA replication)
  3. Presentation Guide

    1. DNA replication in single molecule
    2. Dictionary approach
    3. Off-the-grid and hybrid approaches
    4. Ongoing work
  4. General framework

    Microscope images are widely used to study the structure and function of cells. Some characteristics of replication can be studied from images. This work aims to understand replication at a more detailed level, combining genetic sequencing data and DNA replication analysis.
  5. Context

    Main objective: better characterize the replication of DNA.
    - Position of origins of replication
    - Speed and direction of replication
    Related application:
    - Characterization of replication stress in cancer cells
  7. Single cell characterization of DNA replication

    The experiment is carried out during DNA replication. A marked nucleotide (BrdU), incorporated into the DNA, reports the fork progression.
  10. BrdU signal

    Original data: estimated BrdU incorporation probabilities. BrdU incorporation profiles = spatial signals that recapitulate the temporal variation of the intracellular BrdU concentration. Non-Gaussian noise.
  11. NanoForkSpeed

    NanoForkSpeed (NFS) approximates forks by piecewise linear functions. With NFS we are not able to detect terminations or initiations that occur after the beginning of the pulse, nor anomalies in replication.
  12. Main objectives

    - Precision in the detection of the speed, position and direction of the fork
    - Detection of forks in different phases of replication
    - Detection of anomalies in fork replication
  13. Dictionary approach

    The profile of BrdU concentration in time is a known function $\psi(t)$. The resulting spatial profiles depend on the speed of the fork. The level of residual BrdU differs between yeast and human cells.
    [Figure panels: BrdU profile; different scales/speeds; different positions and scales]
    $$\varphi(\cdot; (x_0, s)) = \xi_s\, \psi\Big(\frac{\cdot - x_0}{s}\Big)$$
    $x_0$: position of the fork at time $t = 0$; $s$: speed of the fork.
  14. Time-scale dictionary

    Elements of the dictionary = the BrdU function $\varphi(\cdot; (x, s))$. For a choice of $(x, s)$ in the parameter space $\Theta = X \times S$ (space × speed):
    $$D = \Big\{ \{\varphi(\cdot; (x, s))\}_{(x,s) \in \Theta} : \varphi(\cdot; (x, s)) = \zeta_s\, \psi\Big(\frac{\cdot - x}{s}\Big),\ \zeta_s \in \mathbb{R} \Big\}$$
    A discrete parameter choice results in a finite set of dictionary elements $\{\tilde{d}_1, \dots, \tilde{d}_L\}$, where $L = K \times N$ is the number of different speeds times the number of different positions.
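A discrete time-scale dictionary of this kind can be sketched as follows. The pulse `psi` below is a hypothetical stand-in (the real BrdU profile is not given in the transcript), and taking $\zeta_s$ as a unit-norm normalisation is an assumption made for illustration, as are the grid, the candidate positions and the signed speeds.

```python
import numpy as np

def psi(t):
    """Hypothetical pulse shape; NOT the real BrdU time profile."""
    t = np.clip(t, 0.0, None)          # zero before the pulse starts
    return t * np.exp(-t)

def atom(grid, x, s):
    """phi(.; (x, s)) = zeta_s * psi((. - x) / s), zeta_s chosen for unit norm (assumption)."""
    values = psi((grid - x) / s)
    return values / np.linalg.norm(values)

def build_dictionary(grid, positions, speeds):
    """Stack the L = K * N atoms as columns, grouped by speed."""
    return np.column_stack([atom(grid, x, s) for s in speeds for x in positions])

grid = np.linspace(0.0, 100.0, 501)           # genomic coordinate, arbitrary units
positions = np.linspace(10.0, 90.0, 9)        # N = 9 candidate positions
speeds = np.array([-2.0, -1.0, 1.0, 2.0])     # K = 4 signed speeds, sign = direction
D = build_dictionary(grid, positions, speeds)
```

Each column of `D` is one pair (position, speed), so the column index alone encodes position, scale and direction.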
  15. Standard sparse coding formulation

    $$\tilde{D} = \begin{bmatrix} | & & | \\ \tilde{d}_1 & \cdots & \tilde{d}_L \\ | & & | \end{bmatrix}$$
    Optimization problem:
    $$\min_{\alpha \in \mathbb{R}^L} \|\tilde{D}\alpha - z\|_2^2 + \lambda \|\alpha\|_*$$
    Parameters of interest:
    § Position ($x_0$, from $\alpha$): position in the genome for each fork
    § Scale ($|s|$, from $\tilde{d}_l$): speed of replication for each fork
    § Direction ($\mathrm{sign}(s)$, from $\tilde{d}_l$): direction of replication
    § Amplitude ($\alpha$): the value of the maximum BrdU concentration
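  16. Standard sparse coding formulation

    Optimization problem:
    $$\min_{\alpha \in \mathbb{R}^L} \|\tilde{D}\alpha - z\|_2^2 + \lambda \|\alpha\|_*$$
    Optimization problem using convolution:
    $$\min_{(a_1, \dots, a_K) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n} \Big\| \sum_{k=1}^{K} d_k * a_k - z \Big\|_2^2 + \lambda \sum_{k=1}^{K} \|a_k\|_1$$
    Reduction of dictionary dimension.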
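The equivalence between the explicit dictionary and the convolutional form can be checked on a toy case. In this sketch the two kernels and the activation vectors are made-up numbers, and `shifted_atoms` is a hypothetical helper that places one kernel at every position (full convolution, zero padding).

```python
import numpy as np

def shifted_atoms(d, n):
    """Matrix whose column j is the kernel d placed at position j."""
    m = len(d)
    M = np.zeros((n + m - 1, n))
    for j in range(n):
        M[j:j + m, j] = d
    return M

n = 5
kernels = [np.array([1.0, 2.0, 1.0]), np.array([1.0, 0.0, -1.0])]   # K = 2 speeds
a = [np.array([0.0, 1.0, 0.0, 0.0, 2.0]), np.array([0.5, 0.0, 0.0, -1.0, 0.0])]

# Explicit dictionary: K * n columns, one per (speed, position) pair.
D_full = np.hstack([shifted_atoms(d, n) for d in kernels])
explicit = D_full @ np.concatenate(a)

# Convolutional form: only the K kernels are stored.
convolutional = sum(np.convolve(d, ak) for d, ak in zip(kernels, a))
```

Both vectors coincide, while the convolutional form stores K kernels instead of K × n shifted columns.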
  17. Kullback-Leibler as fidelity term

    The noise in the case under study is not Gaussian. The Euclidean norm $\|\cdot\|$ is replaced by the Kullback-Leibler divergence:
    $$KL(z|w) = \sum_n z_n \log\Big(\frac{z_n}{w_n}\Big) + w_n - z_n.$$
    The optimization problem is given by:
    $$\min_{(a_1, \dots, a_K) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n} KL\Big(z \,\Big|\, \sum_{k=1}^{K} d_k * a_k\Big) + \lambda \sum_k \|a_k\|_1.$$
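As a small worked example, the fidelity term above can be evaluated directly. The convention 0 · log 0 = 0 and the floor `eps` on the model values are assumptions added here for numerical safety; they are not stated on the slide.

```python
import math

def generalized_kl(z, w, eps=1e-12):
    """KL(z|w) = sum_n z_n * log(z_n / w_n) + w_n - z_n, with 0 * log 0 taken as 0."""
    total = 0.0
    for zn, wn in zip(z, w):
        wn = max(wn, eps)              # keep the model value strictly positive
        if zn > 0:
            total += zn * math.log(zn / wn)
        total += wn - zn
    return total

same = generalized_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical signals
swapped = generalized_kl([1.0, 2.0], [2.0, 1.0])          # equals log(2)
```

The value is zero when the model matches the data exactly and positive otherwise, which is what makes it usable in place of the squared Euclidean norm.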
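  18. Matching Pursuit

    Optimization problem for Matching Pursuit (before convolution):
    $$\min_{\alpha:\ \|\alpha\|_0 = M_0} \Big\| \sum_{l=1}^{L} \alpha_l \tilde{d}_l - z \Big\|^2$$
    Main idea: for a dictionary element $\tilde{d}_l$ with $\|\tilde{d}_l\| = 1$, we write:
    $$z = \langle \tilde{d}_l, z \rangle \tilde{d}_l + r$$
    Then clearly $\langle \tilde{d}_l, r \rangle = 0$, and:
    $$\|z\|^2 = |\langle \tilde{d}_l, z \rangle|^2 + \|r\|^2$$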
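  19. Matching Pursuit

    $$\|z\|^2 = |\langle \tilde{d}_l, z \rangle|^2 + \|r\|^2$$
    Since we would like to minimize $\|r\|^2$, we should find $l$ such that
    $$\mathrm{score}^0_l = \frac{\langle z, \tilde{d}_l \rangle}{\|z\| \, \|\tilde{d}_l\|}$$
    is maximum.
    - At each iteration $\ell$ we compute $\max_l \mathrm{score}^\ell_l$, where
    $$\mathrm{score}^\ell_l = \frac{\big\langle z - \sum_{j=1}^{\ell} \alpha_{l(j)} \tilde{d}_{l(j)},\ \tilde{d}_l \big\rangle}{\|\tilde{d}_l\| \, \big\| z - \sum_{j=1}^{\ell} \alpha_{l(j)} \tilde{d}_{l(j)} \big\|}$$
    and $l(j)$ is the index selected at iteration $j$.
    - The convolution integrates well into the MP approach, which makes the computation very fast.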
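The greedy selection described above can be sketched in a few lines. This is a minimal version under stated assumptions: the columns of `D` have unit norm, the atom is chosen by the absolute correlation with the residual (dividing by the residual norm, as in the score, does not change which atom wins), and the toy dictionary is orthonormal so the result is easy to verify by hand.

```python
import numpy as np

def matching_pursuit(D, z, n_atoms):
    """Greedy MP with unit-norm columns; returns coefficients and final residual."""
    alpha = np.zeros(D.shape[1])
    residual = z.astype(float).copy()
    for _ in range(n_atoms):
        correlations = D.T @ residual                 # <residual, d_l> for every l
        best = int(np.argmax(np.abs(correlations)))   # atom with the highest score
        alpha[best] += correlations[best]
        residual -= correlations[best] * D[:, best]   # remove the explained part
    return alpha, residual

D = np.eye(4)                                  # toy orthonormal dictionary
z = np.array([0.0, 3.0, 0.0, -1.0])
alpha, residual = matching_pursuit(D, z, n_atoms=2)
```

With this toy input the first pass picks the second atom (coefficient 3) and the second pass picks the fourth (coefficient -1), leaving a zero residual.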
  20. Results for simulated data: Human and yeast cases

    § Simulated data generated with Poisson noise with parameter β
    § Total number of forks for each β: 200
    Speed values: between 500 and 3000 bp/min
  21. Matching Pursuit vs. NFS

    A simulation that reproduces real conditions allows us to visualize and measure the improvements brought by the approach.
  22. Matching Pursuit and extra dictionaries

    Extra dictionaries: we are able to detect origins that were not detected by NFS.
    Score = accuracy of fork detection
  23. Off-the-grid sparse coding (BLASSO)

    Context:
    § The parameter set is continuous and the elements of the dictionary are functions: $D = \{\varphi(\cdot; \theta) \in L^2(\Theta) : \theta \in \Theta\}$, $\Theta = X \times S$
    § The optimization problem is formulated in the measure space $\mathcal{M}(\Theta)$.
    § Sparsity is replaced by the assumption that the solution can be well approximated by a sum of Dirac measures:
    $$m = \sum_{i=1}^{L} \alpha_i \delta_{\theta_i}, \quad \text{where } \alpha_i \in \mathbb{R} \text{ and } \theta_i \in \Theta.$$
    Optimization problem:
    $$\min_{m \in \mathcal{M}(\Theta)} \|\Phi(m) - z\|_2^2 + \gamma |m|(\Theta)$$
  24. Frank-Wolfe Method

    Objective: solve problems of the type $\min_{w \in C} f(w)$, where
    § $C$ is compact
    § $f$ is differentiable
    Frank-Wolfe algorithm:
    1. $p_k := \arg\min_{p \in C} df(w_k)(p - w_k)$
    2. Choose $\nu \in [0, 1]$ and define $w_{k+1} := w_k + \nu (p_k - w_k)$
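The two steps can be illustrated on a toy problem. In this sketch the compact set is the probability simplex (chosen here only because the linear minimisation has a closed form: pick the vertex with the smallest gradient component), the step is the classical ν = 2/(k+2), and the update is the usual convex combination of the current iterate and the selected vertex. None of these choices come from the slides.

```python
def frank_wolfe_simplex(grad, dim, n_iter=2000):
    """Frank-Wolfe on the probability simplex with step 2 / (k + 2)."""
    w = [1.0] + [0.0] * (dim - 1)                  # start at a vertex
    for k in range(n_iter):
        g = grad(w)
        j = min(range(dim), key=lambda i: g[i])    # step 1: best vertex e_j
        nu = 2.0 / (k + 2.0)
        w = [(1.0 - nu) * wi for wi in w]          # step 2: move towards e_j
        w[j] += nu
    return w

b = [0.2, 0.5, 0.3]                                # target inside the simplex
w = frank_wolfe_simplex(lambda v: [2.0 * (vi - bi) for vi, bi in zip(v, b)], dim=3)
```

Here f(w) = ||w - b||² and b already lies in the simplex, so the iterates approach b while every iterate stays feasible by construction.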
  25. Sliding Frank-Wolfe Method

    $$\min_{m \in \mathcal{M}(\Theta)} \|\Phi(m) - z\|_2^2 + \gamma |m|(\Theta) \qquad \text{(BLASSO)}$$
  26. Sliding Frank-Wolfe Method

    $$\min_{\{m \in \mathcal{M}(\Theta) \,:\, |m|(\Theta) \le C\}} \|\Phi(m) - z\|_2^2 + \gamma |m|(\Theta), \qquad C = \frac{\|z\|^2}{\gamma}$$
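  27. Sliding Frank-Wolfe Method

    $$\min_{\{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) \,:\, |m|(\Theta) \le r \le C\}} \|\Phi(m) - z\|_2^2 + \gamma r$$
    1. First remark: $\mathcal{C} = \{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) : |m|(\Theta) \le r \le C\}$ is compact in the weak-* topology.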
  28. Sliding Frank-Wolfe Method

    $$\min_{\{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) \,:\, |m|(\Theta) \le r \le C\}} \|\Phi(m) - z\|_2^2 + \gamma r$$
    1. First remark: $\mathcal{C} = \{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) : |m|(\Theta) \le r \le C\}$ is compact in the weak-* topology.
    2. Second remark: $T(r, m) := \|\Phi(m) - z\|_2^2 + \gamma r$ is differentiable in $\mathcal{C}$.
    We can apply the Frank-Wolfe algorithm!
  29. Sliding Frank-Wolfe Method

    1. First remark: $\mathcal{C} = \{(r, m) \in \mathbb{R}_+ \times \mathcal{M}(\Theta) : |m|(\Theta) \le r \le C\}$ is compact in the weak-* topology.
    2. Second remark: $T(r, m) := \|\Phi(m) - z\|_2^2 + \gamma r$ is differentiable in $\mathcal{C}$.
    We can apply the Frank-Wolfe algorithm!
    3. It is necessary to simplify the first step of Frank-Wolfe: optimization problem in $\mathcal{C}$ $\Longrightarrow$ optimization problem in $\Theta$.
  30. Sliding Frank-Wolfe Method

    Step 1: Estimate an additional spike:
    $$\bar{\theta} = \arg\max_{\theta \in \Theta} \eta(\theta)$$
    where:
    $$\eta(\theta) = \Big| \frac{1}{2\gamma} \Big\langle \underbrace{z - \Phi\Big(\sum_{i=1}^{M} \alpha_i \delta_{\theta_i}\Big)}_{\text{last iteration error}},\ \overbrace{\varphi(\cdot, \theta)}^{\text{dictionary}} \Big\rangle \Big|$$
    and $M$ is the number of spikes of the last iteration.
    Steps 2 and 3: Re-estimate amplitudes and spikes based on the set of spikes found up to the last iteration.
  31. Scale dictionary specificity

    The function $\nabla_x \eta$ presents instabilities with respect to the scale $s$. These instabilities are related to oscillations in the gradient $\nabla_x \psi$ on a relatively large grid.
    [Figure: $\gamma \nabla_x \eta$ for $z = \varphi(\cdot; (\bar{x}, \bar{s}))$, where $(\bar{x}, \bar{s}) = (100\ \text{kbp}, 1\ \text{kbp/min})$, with respect to the variables $x$ and $s$.]
  32. Hybrid Method

    Hybrid formulation:
    § Speed as a discrete parameter
    § Continuous positions
    § Parameter set: $\Theta = X \times \{s_1, \dots, s_K\}$
    § The optimization variables are measures $m_k \in \mathcal{M}(X)$
    Hybrid optimization problem:
    $$\min_{m \in \mathcal{M}(X)^K} \Big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \Big\|_2^2 + \gamma \sum_{k=1}^{K} |m_k|(X),$$
    where $m = (m_1, \dots, m_K)$, $m_k = \sum_{j=1}^{N_k} \alpha_{k,j} \delta_{x_{k,j}}$ and $x_{k,j} \in X$.
    New algorithm that is able to solve the Hybrid problem.
  33. Hybrid Method

    $$\min_{m = (m_1, \dots, m_K) \in \mathcal{M}(X)^K} \Big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \Big\|_2^2 + \gamma \sum_{k=1}^{K} |m_k|(X) \qquad \text{(Hybrid)}$$
  34. Hybrid Method

    $$\min_{\{m = (m_1, \dots, m_K) \in \mathcal{M}(X)^K \,:\, \sum_k |m_k|(X) \le C\}} \Big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \Big\|_2^2 + \gamma \sum_{k=1}^{K} |m_k|(X)$$
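  35. Hybrid Method

    $$\min_{\{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K \,:\, \sum_k |m_k|(X) \le r \le C\}} \Big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \Big\|_2^2 + \gamma r$$
    1. First remark: $\mathcal{C} = \{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K : \sum_k |m_k|(X) \le r \le C\}$ is compact in the weak-* topology.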
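  36. Hybrid Method

    $$\min_{\{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K \,:\, \sum_k |m_k|(X) \le r \le C\}} \Big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \Big\|_2^2 + \gamma r$$
    1. First remark: $\mathcal{C} = \{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K : \sum_k |m_k|(X) \le r \le C\}$ is compact in the weak-* topology.
    2. Second remark: $T(r, m) := \big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \big\|_2^2 + \gamma r$ is differentiable in $\mathcal{C}$.
    We can apply the Frank-Wolfe algorithm!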
  37. Hybrid Method

    1. First remark: $\mathcal{C} = \{(r, m_1, \dots, m_K) \in \mathbb{R}_+ \times \mathcal{M}(X)^K : \sum_k |m_k|(X) \le r \le C\}$ is compact in the weak-* topology.
    2. Second remark: $T(r, m) := \big\| \sum_{k=1}^{K} \Phi_k(m_k) - z \big\|_2^2 + \gamma r$ is differentiable in $\mathcal{C}$.
    We can apply the Frank-Wolfe algorithm!
    3. It is necessary to simplify the first step of Frank-Wolfe: optimization problem in $\mathcal{C}$ $\Longrightarrow$ optimization problem in $X \times \{s_1, \dots, s_K\}$.
  38. Hybrid Algorithm

    Step 1: Estimate an additional spike:
    $$(\bar{x}, \bar{s}_i) = \arg\max_{x \in X,\ s \in \{s_1, s_2, \dots, s_K\}} \eta_H(x, s)$$
    where
    $$\eta_H(x, s) = \Big| \frac{1}{2\gamma} \Big\langle \underbrace{z - \sum_{k=1}^{K} \sum_{i=1}^{M_k} \alpha_{k,i}\, \psi\Big(\frac{\cdot - x_{k,i}}{s_k}\Big)}_{\text{last iteration error}},\ \overbrace{\varphi(\cdot, (x, s))}^{\text{dictionary}} \Big\rangle \Big|,$$
    and $M_k$ is the number of spikes for the speed $s_k$ in the last iteration. Add $\bar{x}$ to the set of spikes of speed $\bar{s}_i$.
    Steps 2 and 3: Re-estimate amplitudes and spikes based on the set of spikes found up to the last iteration.
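Step 1 of the hybrid scheme can be sketched as a loop over the discrete speeds with a continuous search over the position for each of them. Everything specific below is an assumption for illustration: the pulse `psi` is a hypothetical stand-in for the BrdU profile, atoms are normalised to unit norm, and the continuous maximisation is done with a coarse scan followed by a golden-section search, which is not necessarily what the author's algorithm uses.

```python
import numpy as np

def psi(t):
    """Hypothetical pulse shape; NOT the real BrdU time profile."""
    t = np.clip(t, 0.0, None)
    return t * np.exp(-t)

def atom(grid, x, s):
    """Unit-norm sample of psi((. - x) / s) on the grid."""
    values = psi((grid - x) / s)
    return values / np.linalg.norm(values)

def eta(residual, grid, x, s, gamma):
    """|<residual, phi(.; (x, s))>| / (2 * gamma)."""
    return abs(residual @ atom(grid, x, s)) / (2.0 * gamma)

def refine_position(f, lo, hi, tol=1e-4):
    """Golden-section search for a maximiser of f on [lo, hi]."""
    ratio = (np.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
        if f(c) > f(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

def new_spike(residual, grid, speeds, coarse_positions, gamma=1.0):
    """Discrete loop over speeds, continuous refinement over positions."""
    step = coarse_positions[1] - coarse_positions[0]
    best = None
    for s in speeds:
        scores = [eta(residual, grid, x, s, gamma) for x in coarse_positions]
        x0 = coarse_positions[int(np.argmax(scores))]
        x = refine_position(lambda u: eta(residual, grid, u, s, gamma),
                            x0 - step, x0 + step)
        value = eta(residual, grid, x, s, gamma)
        if best is None or value > best[0]:
            best = (value, x, s)
    return best[1], best[2]

grid = np.linspace(0.0, 100.0, 1001)
z = 3.0 * atom(grid, 37.3, 2.0)     # noiseless single fork; no spikes found yet
x_new, s_new = new_spike(z, grid, speeds=[1.0, 2.0, 3.0],
                         coarse_positions=np.arange(5.0, 96.0, 1.0))
```

On this noiseless toy signal the residual is the signal itself, and the selected speed and position land on the simulated fork even though 37.3 is not one of the coarse positions.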
  39. Comparative test (Hybrid vs. MP)

    § Simulated data generated with Poisson noise with parameter β
    § Human cells simulation (residual BrdU = 0)
  40. Comparative test (Hybrid vs. MP)

    § Simulated data generated with Poisson noise with parameter β
    § Human cells simulation (residual BrdU = 0), β = 700
    § Total number of forks for each β: 200
    Hybrid: lower mean absolute error
  41. Comparative test (Hybrid vs. MP)

    § Simulated data generated with Poisson noise with parameter β
    § Human cells simulation (residual BrdU = 0), β = 700
    § Total number of forks for each β: 200
    Hybrid: the quality of the parameter estimation can be inferred from the reconstruction error
  42. Comparative Test (Hybrid vs. MP)

    Most precise methods with respect to the type of cell and noise level:

    Noise \ Cell    Human     Yeast
    High level      MP        MP
    Low level       Hybrid    MP
  43. Extra dictionaries for MP

    (+) Extra dictionaries: we are able to detect origins
    (-) Larger number of dictionaries
    (-) Limitation in terms of speed
  44. Anomalies of replication

    Anomalies in replication and terminations cannot be identified by a dictionary approach.
  45. Ongoing work

    The timing profile contains all replication parameters.
    Constant speed $\Rightarrow$ the timing profile is a piecewise linear function.
    $\tau : X \to \mathbb{R}$, the time at which replication passes through position $x$.
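  46. Ongoing work

    For a timing profile $(\tau_1, \dots, \tau_n)$, define:
    $$B : \mathbb{R}^n_+ \longrightarrow \mathbb{R}^n, \qquad (\tau_1, \dots, \tau_n) \mapsto (\psi(\tau_1), \dots, \psi(\tau_n)).$$
    Then: $z \approx B(\tau)$.
    Property: $\#\{\psi^{-1}(b)\} = 2, \ \forall b \in B(\mathbb{R}_+)$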
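The two-preimage property can be made concrete with a toy pulse. The function `psi` below is a hypothetical rise-and-fall profile with peak value 1 at t = 1, picked only so that every value strictly between 0 and 1 has one preimage on the rising branch and one on the falling branch. The bisection helper and the bound `t_max` are also assumptions.

```python
import math

def psi(t):
    """Hypothetical pulse (NOT the real BrdU profile): rises on [0, 1], then decays."""
    return t * math.exp(1.0 - t)

def bisect(f, lo, hi, target, n_iter=80):
    """Solve f(t) = target on [lo, hi], assuming f is monotone there."""
    increasing = f(hi) > f(lo)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def both_inverses(b, t_max=50.0):
    """Return the two times t with psi(t) = b: rising branch, falling branch."""
    return bisect(psi, 0.0, 1.0, b), bisect(psi, 1.0, t_max, b)

tau = [0.4, 2.5]                       # one time on each branch
z = [psi(t) for t in tau]              # noiseless observations z = B(tau)
candidates = [both_inverses(b) for b in z]
```

Each observation yields two candidate times, and only one of them is the true τ, which is why a rule for switching between the two inverses is needed.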
  47. Ongoing work

    We consider the two possible inverses.
    Goal: alternate between them to find the best piecewise linear fit.
  48. Ongoing work

    Main challenges:
    - Find a suitable norm that is able to compare both inverses
    - Find the points at which to switch between the two inverses
  49. Conclusion: comparative tables

    Most precise method with respect to the type of cell and noise level:

    Noise \ Cell    Human     Yeast
    High level      MP        MP
    Low level       Hybrid    MP

    Methods with respect to the detection of different replication events:

    Method \ Event   Isolated fork   Initiation   Termination   Anomalies
    NFS              ✓               x            x             x
    MP               ✓               ✓            x             x
    Hybrid           ✓               x            x             x
    Ongoing work     ✓               ✓            ✓             ?
  50. Conclusion

    Considering the DNA replication problem:
    § Matching Pursuit has proven to be the best of the grid-based methods for the dictionary approach
    § The off-the-grid Hybrid method can be superior for lower noise levels in the case of human cells
    § Working directly with the timing profile is a promising approach, especially because of its flexibility with respect to fork shapes.
    Thank you for your attention!!
    Link HAL: hal.science/hal-04146737/
    Link Gretsi: gretsi.fr/data/colloque/pdf/2023 lage1264.pdf