Deciphering the mysteries of human genomes

This keynote was presented at EuroPython 2024 in Prague, Czech Republic.

Abstract

Have you ever wondered why we are the way we are? Why some individuals develop diseases while others remain healthy? And what does Python have to do with all of this? Join this talk, in which we will explore the interface between biology, technology and medicine in the context of research into rare genetic diseases. Learn what Moore’s law has to do with advances in genetics and medicine, and why bigger is not always better.

Anna Pristoupilova

July 16, 2024

Transcript

  1. What happens when there is a change in the code?

    • Nothing happens • Differences between individuals • Disease – Error – Performance Issues • Death – Out of space – Out of memory = death
  2. T>C

  3. Effect size vs. variant frequency (%)

    [Figure] Tick labels: very rare, low, common, modest, intermediate, high. Regions: Lethal, Rare diseases, Oligogenic or polygenic diseases, Modifying factors, Beneficial alleles
  4. Examples of rare diseases • Cystic fibrosis • Huntington's disease

    • Phenylketonuria (PKU) • Spinal muscular atrophy (SMN1) • Amyotrophic lateral sclerosis (ALS) • Duchenne muscular dystrophy • Homocystinuria (HCU) • Gaucher disease • Niemann-Pick disease • Pompe disease • Tay-Sachs disease • …… https://rarediseases.org/rare-diseases/ https://en.wikipedia.org/wiki/Ice_Bucket_Challenge
  5. Rare diseases • Affect people of ALL AGES • Are

    PROGRESSIVE – worsening over time, leading to death • Can affect ANY PART of the body • Wide range of SYMPTOMS – physical, neurological, developmental, behavioral • Genetic CAUSE is UNKNOWN for many • Patients are often UNDIAGNOSED for years or decades • Social and emotional impact on the patient and their families - feelings of isolation, depression, and anxiety.
  6. Why study rare diseases • Help patients with rare

    diseases – Treatment • 5% or fewer of rare diseases are estimated to have at least one approved treatment • Diagnose early! – Diagnosis – Family planning – Peace of mind • Are not so rare! • Extend knowledge about human biology and help other patients – Model for the study of cell physiology and pathophysiology – New insights into biological function of disease-causing genes – Better understanding of complex diseases
  7. Sanger Sequencing – 1st gen

    • Next Generation Sequencing (NGS) – 2nd gen • Next Generation Sequencing (NGS) – 3rd gen
  8. van Dijk, E. L., Jaszczyszyn, Y., Naquin, D., & Thermes,

    C. (2018). The Third Revolution in Sequencing Technology. Trends in Genetics, 34(9), 666–681. https://doi.org/10.1016/j.tig.2018.05.008
  9. Read mapping … is like a jigsaw puzzle … • You have the box – reference genome

    • Align the reads to the reference genome • Missing pieces (coverage bias) • Broken pieces (sequencing errors) • Duplicate pieces (repeats) • Pieces from another puzzle (contamination) → Some pieces easier to place than others Image from https://en.wikipedia.org/wiki/Genomics
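
The jigsaw idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not how production aligners work: it slides each read along the reference and keeps the placement with the fewest mismatches. The function name `map_read`, the `max_mismatches` threshold and the example reads are assumptions made for this sketch; the reference string is the 20-base toy sequence that appears on the following slides.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (start, mismatches) for the best placement of read, or None.

    Naive approach: try every start position and count mismatching bases.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches > max_mismatches:
            continue  # too different to be placed here
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


reference = "ACTGGATCATACCGGGATCA"

# Hypothetical reads: two exact pieces, one piece carrying a single
# difference, and one piece "from another puzzle".
reads = ["GGATCAT", "ACCGGGA", "CATAACG", "TTTTTTT"]
placements = {read: map_read(read, reference) for read in reads}

for read, placement in placements.items():
    print(read, placement)
```

Reads that fit nowhere within the mismatch budget come back as `None`, which mirrors the "pieces from another puzzle" case on the slide.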
  10. A – C – A – G – G –

    A – T – G – A – T – A – A – C – G – G – G – T – T – C - A patient
  11. A – C – T – G – G –

    A – T – C – A – T – A – C – C – G – G – G – A – T – C - A A – C – A – G – G – A – T – G – A – T – A – A – C – G – G – G – T – T – C - A patient reference
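
A minimal sketch of what comparing the two sequences on this slide amounts to. It assumes the sequence labelled "patient" on the previous slide (starting A – C – A – G – G) is the patient and the other one is the reference; the transcript does not make the label order explicit. The helper name `call_variants` is made up for this example, and real variant callers work on many overlapping reads rather than one clean sequence.

```python
def call_variants(reference, sample):
    """List (position, ref_base, alt_base) for every differing position.

    Positions are 1-based. Only substitutions are handled, so both
    sequences must have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Assumption: which sequence is the reference and which is the patient.
reference = "ACTGGATCATACCGGGATCA"
patient = "ACAGGATGATAACGGGTTCA"

variants = call_variants(reference, patient)
for position, ref_base, alt_base in variants:
    print(f"{position}: {ref_base}>{alt_base}")
```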
  12. A – C – T – G – G –

    A – T – C – A – T – A – C – C – G – G – G – A – T – C - A A – C – A – G – G – A – T – G – A – T – A – A – C – G – G – G – T – T – C - A patient reference A – C – A – G – G – A – T – C – A – T – A – C – C – G – G – G – A – T – C - A A – C – T – G – G – A – T – C – A – T – A – C – C – G – G – G – T – T – C – A A – C – A – G – G – A – T – C – A – T – A – C – C – C – G – G – A – T – C – A A – C – T – G – A – A – T – G – A – T – A – C – C – G – G – G – A – T – C – A A – C – T – G – G – A – T – C – A – T – A – C – C – G – G – G – A – T – C – A A – C – T – G – A – A – T – G – A – T – A – C – C – G – G – G – A – T – C – T A – C – A – G – G – A – T – C – A – T – A – C – C – C – G – G – A – T – C – A A – C – T – G – G – A – T – C – A – T – A – C – C – G – G – G – A – T – C – A population
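
The point of adding the population can be sketched as a frequency lookup: a patient variant that is also common in the population is unlikely to explain a rare disease. The sequences below are copied from the slide; treating the sequence starting A – C – T – G – G as the reference is an assumption, and the helper name `allele_frequency` is hypothetical.

```python
reference = "ACTGGATCATACCGGGATCA"  # assumed to be the reference
patient = "ACAGGATGATAACGGGTTCA"    # assumed to be the patient

population = [
    "ACAGGATCATACCGGGATCA",
    "ACTGGATCATACCGGGTTCA",
    "ACAGGATCATACCCGGATCA",
    "ACTGAATGATACCGGGATCA",
    "ACTGGATCATACCGGGATCA",
    "ACTGAATGATACCGGGATCT",
    "ACAGGATCATACCCGGATCA",
    "ACTGGATCATACCGGGATCA",
]


def allele_frequency(sequences, position, allele):
    """Fraction of sequences carrying allele at the 1-based position."""
    carriers = sum(sequence[position - 1] == allele for sequence in sequences)
    return carriers / len(sequences)


# Patient variants as (1-based position, reference base, patient base).
patient_variants = [
    (position, ref_base, alt_base)
    for position, (ref_base, alt_base) in enumerate(zip(reference, patient), start=1)
    if ref_base != alt_base
]

frequencies = {
    variant: allele_frequency(population, variant[0], variant[2])
    for variant in patient_variants
}

# Variants never seen in the population are the interesting candidates.
candidates = [variant for variant, frequency in frequencies.items() if frequency == 0]
print(frequencies)
print(candidates)
```

With eight sequences this is only a toy; population databases hold far more individuals, but the filtering logic is the same.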
  13. Making sense of the variants • Variant Filtering – Removes

    low-quality or irrelevant variants to focus on high-confidence data. • Variant Annotation – Adds biological context and functional predictions to each variant. • Variant Prioritization – Ranks variants based on predicted impact, known associations, and relevance to the study. • Variant Interpretation – Assesses the biological and clinical significance of prioritized variants using functional predictions, clinical databases, and literature review. Population specific databases
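
As a rough illustration of the filtering and prioritization steps, the sketch below keeps only high-quality, rare variants and then ranks them by predicted impact. Everything in it is an assumption made for the example: the `Variant` fields, the thresholds (quality 30, population frequency 1%) and the impact categories are hypothetical and not taken from the talk or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    quality: float               # hypothetical call quality score
    population_frequency: float  # fraction of the population carrying it
    predicted_impact: str        # "high", "moderate" or "low"


IMPACT_RANK = {"high": 0, "moderate": 1, "low": 2}


def filter_variants(variants, min_quality=30.0, max_frequency=0.01):
    """Drop low-quality calls and variants too common to cause a rare disease."""
    return [
        v for v in variants
        if v.quality >= min_quality and v.population_frequency <= max_frequency
    ]


def prioritize(variants):
    """Rank by predicted impact first, then by rarity."""
    return sorted(
        variants,
        key=lambda v: (IMPACT_RANK[v.predicted_impact], v.population_frequency),
    )


variants = [
    Variant("var_a", 55.0, 0.20, "high"),       # too common
    Variant("var_b", 12.0, 0.0, "high"),        # low quality
    Variant("var_c", 60.0, 0.001, "moderate"),
    Variant("var_d", 48.0, 0.0, "high"),
    Variant("var_e", 70.0, 0.005, "low"),
]

ranked = prioritize(filter_variants(variants))
print([v.name for v in ranked])
```

Interpretation, the last step on the slide, is the part that still needs a human looking at clinical databases and literature.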
  14. How long did it take? → HPC, Cloud

    Analysis Type                    Initial Processing   Alignment and Variant Calling   Secondary Analysis   Total
                                     (CPU hours)          (CPU hours)                     (CPU hours)          (CPU hours)
    Whole Genome Sequencing (WGS)    50-100               100-200                         50-100               200-400
    Whole Exome Sequencing (WES)     20-40                50-100                          20-40                90-180
    Targeted Sequencing              5-10                 10-20                           5-10                 20-40
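
To see why these numbers point to HPC or the cloud, the total CPU hours can be converted into wall-clock time for a given number of cores. The sketch assumes perfect parallel scaling, which real pipelines do not reach, so the results are lower bounds; the core count of 32 is an arbitrary example.

```python
# Total CPU hours (low, high) per analysis type, taken from the slide.
TOTAL_CPU_HOURS = {
    "WGS": (200, 400),
    "WES": (90, 180),
    "Targeted": (20, 40),
}


def wall_clock_range(analysis, cores):
    """Best-case wall-clock hours, assuming the work splits evenly over cores."""
    low, high = TOTAL_CPU_HOURS[analysis]
    return (low / cores, high / cores)


for analysis in TOTAL_CPU_HOURS:
    low, high = wall_clock_range(analysis, cores=32)
    print(f"{analysis}: {low:.2f}-{high:.2f} hours on 32 cores")
```

On a single core the same whole-genome analysis would take roughly 8 to 17 days.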
  15. Bioinformatics Workflow FASTQ SAM/ BAM VCF annotated VCF mapping variant

    calling variant annotation variant filtering variant prioritization variant interpretation candidate variant/s
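
The VCF files in this workflow are tab-separated text, so a record can be read with plain Python. The sketch below handles only the eight fixed columns defined by the VCF format and ignores per-sample genotype columns; the example record (chromosome, position, INFO values) is invented for illustration. In practice a dedicated library would be the safer choice.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Turn 'DP=30;DB' into {'DP': '30', 'DB': True}."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        parsed[key] = value if separator else True  # flags have no value
    return parsed


def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are allowed
    record["INFO"] = parse_info(record["INFO"])
    return record


def read_vcf(lines):
    """Skip header lines (starting with '#') and parse the rest."""
    return [parse_vcf_line(line) for line in lines if line.strip() and not line.startswith("#")]


# Invented example content, not real data.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tC\tT\t50\tPASS\tDP=30;DB",
]

records = read_vcf(example)
print(records[0])
```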
  16. Variant confirmed as disease causing! What next? • Existing treatment

    → Yupii!! • Some drug that affects the same metabolic pathway? → Repurpose • No treatment – Help the family • Genetic counselling • Family planning (IVF) – Spread the word • Publish • Submit to databases • Connect with other researchers – Start working on one
  17. Variant not identified? What now? • Was it well sequenced?

    – NO → Sequence again – YES → Check for other types of variants • Reanalyze with different tools, settings, reference, annotations • Wait and reanalyze later with updated annotations • Used Illumina? → Try 3rd generation sequencer • Maybe it is not genetic?
  18. Bioinformatics Workflow FASTQ SAM/ BAM VCF annotated VCF mapping BWA

    Bowtie2 Novoalign variant calling GATK FreeBayes VarScan SAMtools variant annotation ANNOVAR SnpEff VEP variant filtering VCFtools Gemini variant prioritization SIFT PolyPhen CADD Exomiser variant interpretation ClinVar OMIM HGMD
  19. Bioconda • Lets you install thousands of software packages related

    to biomedical research using the conda package manager. https://bioconda.github.io/
  20. Workflow management system • Snakemake – Python • Nextflow –

    Polyglot • Open WDL – A community driven data processing language
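
What a workflow management system does at its core can be sketched with the standard library: each step declares what it depends on, and the system works out a valid execution order. This toy uses `graphlib.TopologicalSorter` (Python 3.9+) and is not Snakemake, Nextflow or WDL syntax; the step names follow the workflow slide, and real systems additionally track files, rerun only what changed and dispatch jobs to clusters.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can run.
WORKFLOW = {
    "mapping": set(),
    "variant_calling": {"mapping"},
    "variant_annotation": {"variant_calling"},
    "variant_filtering": {"variant_annotation"},
    "variant_prioritization": {"variant_filtering"},
    "variant_interpretation": {"variant_prioritization"},
}


def execution_order(workflow):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(" -> ".join(order))
```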
  21. Bioinformatics Workflow FASTQ SAM/ BAM VCF annotated VCF mapping BWA

    Bowtie2 Novoalign variant calling GATK FreeBayes VarScan SAMtools variant annotation ANNOVAR SnpEff VEP variant filtering VCFtools Gemini variant prioritization SIFT PolyPhen CADD Exomiser variant interpretation ClinVar OMIM HGMD
  22. Biopython Python Tools for Computational Molecular Biology • Wiki documentation

    Seq and SeqRecord objects • Bio.SeqIO - sequence input/output • Bio.AlignIO - alignment input/output • Bio.PopGen - population genetics • Bio.PDB - structural bioinformatics • Biopython’s BioSQL interface
  23. Biopython

    from Bio.Seq import Seq

    # Define a DNA sequence
    dna_sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
    print("DNA Sequence:", dna_sequence)

    # Take the reverse complement, then transcribe it (T -> U)
    reverse_complement_dna_sequence = dna_sequence.reverse_complement()
    rna_sequence = reverse_complement_dna_sequence.transcribe()

    # Print the RNA sequence
    print("RNA Sequence:", rna_sequence)

    # Translate the RNA sequence to a protein sequence
    protein_sequence = rna_sequence.translate()

    # Print the protein sequence
    print("Protein Sequence:", protein_sequence)

    Output:
    DNA Sequence: ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
    RNA Sequence: CUAUCGGGCACCCUUUCAGCGGCCCAUUACAAUGGCCAU
    Protein Sequence: LSGTLSAAHYNGH
  24. Bioinformatics Workflow FASTQ SAM/ BAM VCF annotated VCF mapping BWA

    Bowtie2 Novoalign variant calling GATK FreeBayes VarScan SAMtools variant annotation ANNOVAR SnpEff VEP variant filtering VCFtools Gemini variant prioritization SIFT PolyPhen CADD Exomiser variant interpretation ClinVar OMIM HGMD
  25. CASE STUDY Patient Profile: • Demographics: Healthy Caucasian female •

    Reproductive History: Experienced 3 miscarriages at 8-10 weeks (Recurrent Pregnancy Loss). Medical Investigations: • Comprehensive Examinations: gynecological examination, genetics, reproductive immunology, general immunology, hysteroscopy, myology, hematology, physiotherapy • Immunological Findings: Detected meningococcal and streptococcal infections in sputum, treated successfully with antibiotics. • Other findings: healthy
  26. Whole genome sequencing Genetic Findings: • Gene: MTHFR (Methylenetetrahydrofolate Reductase)

    • Mutation: Homozygous C677T Clinical Significance: • The C677T mutation in the MTHFR gene results in a substitution of cytosine (C) with thymine (T) at position 677. • Homozygous individuals (TT) for this mutation have significantly reduced MTHFR enzyme activity (30%-35% of normal activity).
  27. Literature search

    Group           Regimen / description                                             Patients   Delivery rate
    Group 1         100 mg/day aspirin, 5 mg/day folic acid                           123        46.3%
    Group 2         100 mg/day aspirin, 5 mg/day folic acid, 0.4 mg/day enoxaparin    123        79.7%
    Control group   triparous women without RM or thrombophilia                       117        86.3%

    Conclusion: Treatment with low-dose aspirin, enoxaparin and folic acid was the most effective therapy in women with RM who carried a C677T MTHFR mutation.
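
The percentages on this slide can be turned back into approximate head counts with a little arithmetic. The sketch assumes the numbers map to the groups in the order they appear (Group 1, Group 2, Control group); the delivery counts are back-calculated from the rounded percentages and are therefore estimates, not figures reported in the study.

```python
# (patients, delivery rate in %) per group, read off the slide in order.
GROUPS = {
    "Group 1": (123, 46.3),
    "Group 2": (123, 79.7),
    "Control group": (117, 86.3),
}


def approximate_deliveries(patients, rate_percent):
    """Estimate the number of deliveries from a rounded percentage."""
    return round(patients * rate_percent / 100)


deliveries = {
    name: approximate_deliveries(patients, rate)
    for name, (patients, rate) in GROUPS.items()
}

# Gain from adding enoxaparin, in percentage points.
difference_points = round(GROUPS["Group 2"][1] - GROUPS["Group 1"][1], 1)

print(deliveries)
print(f"Group 2 vs Group 1: +{difference_points} percentage points")
```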
  28. Treatment Treatment Approach: • Protocol: The patient underwent a treatment

    regimen described in a study comparing preventive treatments for patients with recurrent miscarriages carrying the C677T MTHFR mutation. • Modification: Instead of folic acid, the patient used active folic acid (5-methyl THF). Outcome: • Following the treatment with aspirin, 5-methyl THF and enoxaparin, the patient successfully conceived and delivered a healthy baby.