2020-05-11_-_Botvinnik_-_DIB_Lab_Meeting_v2.pdf

Olga Botvinnik
May 11, 2020

Transcript

  1. 1.

    Functional prediction of transcriptomic “dark matter” across species

    Olga Botvinnik, Data Scientist, Chan Zuckerberg Biohub
    olga.botvinnik@czbiohub.org, @olgabot
    2020-05-11 Data Intensive Biology Lab Meeting
  2. 2.

    Overview Introduction Methods Applications

    OUTLINE

    Introduction
    • Motivation
      • Less than 0.1% of species on earth have genome assemblies and annotations
      • Genomes alone are not sufficient for comparative transcriptomics
      • Beyond gene orthology, need a common language to embed transcriptomes across species
    • Prior Art
      • K-mers from reduced amino acid alphabets approximate orthology
      • Sketching algorithms compress sequence data and enable scalable analyses

    Methods
    • Translate RNA-seq into protein via sencha
    • Translate and compress RNA-seq data into protein k-mers via kmermaid
    • Get functional prediction of differential k-mer expression with nf-predictorthologs

    Applications
    • Identify “missing” genes, not in the genome assembly, in a primate brain RNA-seq dataset
  6. 6.

    TRANSCRIPTOMES OF INDIVIDUAL CELLS ARE AN INTERMEDIATE BETWEEN DNA AND PHENOTYPE

    In multicellular organisms, nearly every cell contains the same genome, but not every gene is transcriptionally active in every cell.
    The transcriptome offers a closer view of the real-time gene expression in a cell.
    Angela Oliveira Pisco
    Figure labels: DNA, Phenotype
  13. 13.

    TRANSCRIPTOMES ARE INSTRUCTIONS FOR THE BUILDING BLOCKS OF CELLS

    Central dogma of biology: DNA → RNA → Protein
    Different species, different language of instructions
    Figure labels: Instructions, Outcome, DNA, mRNA, Protein, Cell
    Can’t just call IKEA for help…
  18. 18.

    GOAL: ALIGN TRANSCRIPTOMES TO NON-MODEL ORGANISMS TO FIND “MATCHING” CELL TYPES AND “MISSING” GENES

    Tabula Muris Senis: AO Pisco et al., bioRxiv (2019)

    Finding “matching” cell types
    • Neurons
    • Epithelial cells
    • Stem cells

    Finding “missing” genes: cell type-enriched sequences vs genome
    • Present in genome
    • Not in genome: “missing” genes
  24. 25.

    Huge diversity of species on Earth
    Sources: Wikipedia; Mora et al, PLoS Biology (2011)

    Group        Predicted species
    Chromists    27,500
    Protozoa     36,400
    Fungi        611,000
    Plants       298,000
    Metazoans    7,770,000

    8,740,000 Eukaryotes predicted to exist

    # Eukaryotes   % All Euks   Category
    11,691         0.01332%     Genome assembly submitted on NCBI
    1,695          0.00193%     High-quality genome in ENSEMBL (Metazoa + Vertebrates)
    190            0.00022%     Annotated gene orthology to human
    47             0.00004%     UniProt Reference Proteomes with disputed orthology
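    The "% All Euks" column can be recomputed from the counts on this slide. The sketch below assumes the denominator is the 8,740,000 predicted eukaryotes; the function and variable names are illustrative only. Under that assumption the computed percentages come out roughly ten times larger than the printed ones, so the printed column may have used a different denominator or scaling.

```python
# Recompute "% All Euks" from the counts shown on the slide.
# Assumption: the denominator is the 8,740,000 predicted eukaryotes.
PREDICTED_EUKARYOTES = 8_740_000

COUNTS = {
    "Genome assembly submitted on NCBI": 11_691,
    "High-quality genome in ENSEMBL (Metazoa + Vertebrates)": 1_695,
    "Annotated gene orthology to human": 190,
    "UniProt Reference Proteomes with disputed orthology": 47,
}


def percent_of_eukaryotes(n_species):
    """Return n_species as a percentage of all predicted eukaryotes."""
    return 100 * n_species / PREDICTED_EUKARYOTES


for category, n_species in COUNTS.items():
    print(f"{n_species:>6,}  {percent_of_eukaryotes(n_species):.5f}%  {category}")
```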
  32. 33.

    MORE GENOMES DON’T SOLVE THE FUNDAMENTAL CHALLENGE OF DEFINING ORTHOLOGOUS GENE RELATIONSHIPS

    The fundamental problem in comparative transcriptomics is that defining orthologous genes is hard.

    Gene tree for Insulin in Human, Mouse and Rat (https://omabrowser.org/oma/type/)
    Figure labels: Speciation event, Duplication event, Speciation event

    Quest for Orthologs Consortium Papers
    • 2014 Bioinformatics
    • 2016 Nature Methods
    • 2017 Bioinformatics
    • 2019 Molecular Biology and Evolution
  44. 45.

    K-MERS FROM REDUCED AMINO ACID ALPHABETS RETAIN EVOLUTIONARILY CONSERVED BIOCHEMICAL PROPERTIES

    Amino acid       Property                Dayhoff
    C                Sulfur polymerization   a
    A, G, P, S, T    Small                   b
    D, E, N, Q       Acid and amide          c
    H, K, R          Basic                   d
    I, L, M, V       Hydrophobic             e
    F, W, Y          Aromatic                f

    Protein: FLAWLESS
    Dayhoff: febfecbb

    Reduced alphabet k-mers are resilient to amino acid changes
    Single amino acid change: Leucine (L) → Valine (V), k = 3

    Protein alphabet: k k-mers affected
      Original: FLA LAW AWL WLE LES ESS
      L → V:    FLA LAW AWV WVE VES ESS

    Dayhoff alphabet: no change in k-mers!
      Original: feb ebf bfe fec ecb cbb
      L → V:    feb ebf bfe fec ecb cbb

    Recent paper used reduced amino acid encodings to identify orthologous genes

    Dayhoff MO (1965). An Atlas of Protein Sequence.
    Phillips R, Kondev J, & Theriot J. (2012). Physical Biology of the Cell.
    Peris, P., López, D., & Campos, M. (2008). IgTM: An algorithm to predict transmembrane domains and topology in proteins. BMC Bioinformatics, 9(1), 367. http://doi.org/10.1186/1471-2105-9-367
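    The FLAWLESS example can be reproduced with a few lines of Python. This is a minimal sketch using the six-letter Dayhoff table from this slide; the helper names `dayhoff_encode` and `kmers` are illustrative and are not taken from sencha or kmermaid.

```python
# Six-letter Dayhoff alphabet, as tabulated on the slide.
DAYHOFF_GROUPS = {
    "a": "C",      # sulfur polymerization
    "b": "AGPST",  # small
    "c": "DENQ",   # acid and amide
    "d": "HKR",    # basic
    "e": "ILMV",   # hydrophobic
    "f": "FWY",    # aromatic
}
# Invert the table: amino acid letter -> Dayhoff letter.
DAYHOFF = {aa: code for code, aas in DAYHOFF_GROUPS.items() for aa in aas}


def dayhoff_encode(protein):
    """Re-encode a protein string into the Dayhoff alphabet."""
    return "".join(DAYHOFF[aa] for aa in protein.upper())


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


k = 3
original, mutated = "FLAWLESS", "FLAWVESS"  # single L -> V change

# k-mers present in one sequence but not the other, in each alphabet.
protein_changed = set(kmers(original, k)) ^ set(kmers(mutated, k))
dayhoff_changed = (set(kmers(dayhoff_encode(original), k))
                   ^ set(kmers(dayhoff_encode(mutated), k)))

print(dayhoff_encode(original))  # febfecbb
print(sorted(protein_changed))   # the k-mers overlapping the change
print(sorted(dayhoff_changed))   # empty: L and V share Dayhoff letter "e"
```

    The trade-off is that a smaller alphabet makes unrelated sequences collide more often, so a reduced alphabet is typically paired with a longer k.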
  53. 54.

    SKETCHING ALGORITHMS ENABLE SCALABLE SEQUENCE ANALYSES

    Papers shown: 2016 Genome Biology; 2019 F1000 Research

    Each sequencing experiment as a bag of k-mer words
    • Millions of reads (ACT..TCG, CTA..TTC, ..) ⇢ “bag of words” (ATCGA, CTGAG, AATCG, CTAGG, TTTTC, …)
    • Except the “words” are actually integer hashes (4865228557101083157, 18248316805952202104, 12187299837937207145, 1556650891075415342, 15754894269028698533, …)

    Compress sequence data with MinHash
    • Typically, MinHashing scales down the data by several orders of magnitude, e.g. from ~10^6 reads to ~10^3 “words”

    What is the overlap in k-mers between two sequencing datasets, A and B?
    • Jaccard Index = Intersection / Union = |A ∩ B| / |A ∪ B|
    • Overlap in MinHash-subsampled bag of words ≅ true overlap of k-mers between two read datasets
    • There exists a proof that subsampling k-mers using MinHashing approximates the true Jaccard similarity
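    The idea of comparing hash-subsampled bags of k-mers can be sketched as follows. This is an illustrative bottom-k MinHash, not the implementation behind the cited tools. The hash function (first 8 bytes of SHA-1) and the names `bottom_sketch` and `jaccard_estimate` are assumptions made for this example.

```python
import hashlib


def kmer_set(seq, k):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (assumed hash: truncated SHA-1)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_sketch(kmers, sketch_size):
    """Keep only the sketch_size smallest hashes: the MinHash sketch."""
    return sorted(hash_kmer(kmer) for kmer in kmers)[:sketch_size]


def jaccard_exact(a, b):
    """True Jaccard index: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def jaccard_estimate(sketch_a, sketch_b, sketch_size):
    """Estimate the Jaccard index from two bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    # The smallest hashes of the merged sketches act as a random
    # sample of the union of the two original k-mer sets.
    merged = sorted(a | b)[:sketch_size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


dataset_a = kmer_set("FLAWLESS", 3)
dataset_b = kmer_set("FLAWVESS", 3)
size = 100  # larger than the union here, so the estimate is exact
estimate = jaccard_estimate(bottom_sketch(dataset_a, size),
                            bottom_sketch(dataset_b, size), size)
print(jaccard_exact(dataset_a, dataset_b), estimate)
```

    With real data the sketch size is far smaller than the number of distinct k-mers, and the estimate then carries sampling error that shrinks as the sketch grows.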
  58. 59.

    Overview Introduction Methods Applications SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT

    PROTEIN-CODING READING FRAME 10 https://github.com/czbiohub/sencha GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG 1: 2: 3: -1: -2: -3: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR LGTLALALGILGGLGTA ALLIGVILWQRRQRRG WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE bddadeaddebbebdbb ebdbbdebdbdbdebb bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd ebbebebebeebbebbb beeebeeefcddcddb (1) Six-frame translation (2) Re-encode to reduced alphabet ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe Stop codon ✗ Stop codon ✗ deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb Stop codon ✗ beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee (3) k-merize each frame RNA- seq read
  59. 60.

    Overview Introduction Methods Applications SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT

    PROTEIN-CODING READING FRAME 10 https://github.com/czbiohub/sencha GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG 1: 2: 3: -1: -2: -3: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR LGTLALALGILGGLGTA ALLIGVILWQRRQRRG WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE bddadeaddebbebdbb ebdbbdebdbdbdebb bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd ebbebebebeebbebbb beeebeeefcddcddb (1) Six-frame translation (2) Re-encode to reduced alphabet ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe Stop codon ✗ Stop codon ✗ deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb Stop codon ✗ beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee (3) k-merize each frame (4) k-mers present in database? UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter RNA- seq read
  60. 61.

    Overview Introduction Methods Applications SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT

    PROTEIN-CODING READING FRAME 10 https://github.com/czbiohub/sencha GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG 1: 2: 3: -1: -2: -3: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR LGTLALALGILGGLGTA ALLIGVILWQRRQRRG WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE bddadeaddebbebdbb ebdbbdebdbdbdebb bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd ebbebebebeebbebbb beeebeeefcddcddb (1) Six-frame translation (2) Re-encode to reduced alphabet ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe Stop codon ✗ Stop codon ✗ deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb Stop codon ✗ beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb 
efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee (3) k-merize each frame (4) k-mers present in database? UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter 23/25 < 95% of k-mers matched 100% matching k-mers 100% matching k-mers RNA- seq read
  61. 62.

    SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME

    https://github.com/czbiohub/sencha

    RNA-seq read: GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG

    (1) Six-frame translation
    1: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ (Stop codon ✗)
    2: PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS
    3: LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP (Stop codon ✗)
    -1: AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR
    -2: LGTLALALGILGGLGTA ALLIGVILWQRRQRRG
    -3: WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE (Stop codon ✗)

    (2) Re-encode to reduced alphabet
    2: bddadeaddebbebdbb ebdbbdebdbdbdebb
    -1: bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd
    -2: ebbebebebeebbebbb beeebeeefcddcddb

    (3) k-merize each frame
    2: ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe
    -1: deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb
    -2: beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee

    (4) k-mers present in database? UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter
    2: 23/25 < 95% of k-mers matched
    -1: 100% matching k-mers
    -2: 100% matching k-mers

    (5) Use reading frames with >95% matching k-mers
    2: Too few matches ✗
    -1: ✔ Accepted reading frame
    -2: ✔ Accepted reading frame
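
Steps (1) to (5) can be sketched in Python as below. This is an illustration under stated assumptions, not sencha's implementation or interface: the function names are hypothetical, a plain Python set stands in for the Bloom filter, the codon table is the standard genetic code, the six-letter mapping is the usual Dayhoff grouping (it reproduces the encoded strings on the slide), and k = 9 is inferred from the 9-letter k-mers shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Dayhoff six-letter reduced alphabet.
DAYHOFF_GROUPS = {"C": "a", "AGPST": "b", "DENQ": "c", "HKR": "d", "ILMV": "e", "FWY": "f"}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translation(read):
    """(1) Translate a read in frames 1, 2, 3 and -1, -2, -3."""
    frames = {}
    for sign, seq in ((1, read), (-1, read.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            frames[sign * (offset + 1)] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


def dayhoff_encode(peptide):
    """(2) Re-encode a peptide into the reduced alphabet."""
    return "".join(DAYHOFF.get(aa, "x") for aa in peptide)


def kmerize(seq, k):
    """(3) List all overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def accepted_frames(read, known_kmers, k=9, threshold=0.95):
    """Return {frame: peptide} for stop-free frames whose k-mers mostly match."""
    accepted = {}
    for frame, peptide in six_frame_translation(read).items():
        if "*" in peptide:
            continue  # frames containing a stop codon are rejected
        frame_kmers = kmerize(dayhoff_encode(peptide), k)
        if not frame_kmers:
            continue  # peptide shorter than k
        matched = sum(km in known_kmers for km in frame_kmers)  # (4)
        if matched / len(frame_kmers) > threshold:  # (5)
            accepted[frame] = peptide
    return accepted


# Toy usage: the reference "database" is built from the peptide we expect.
read = "ATG" + "GCT" * 9
reference = set(kmerize(dayhoff_encode("MAAAAAAAAA"), 9))
print(accepted_frames(read, reference))
```

A set gives exact membership; a Bloom filter, as named on the slide, trades a small false-positive rate for much lower memory.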
  62. 63.

    SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME

    https://github.com/czbiohub/sencha

    RNA-seq read: GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG

    Same steps (1) to (5) and k-mers as on the previous slide, with the two accepted frames set apart from the rest.

    1: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ (Stop codon ✗)
    2: PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS → bddadeaddebbebdbb ebdbbdebdbdbdebb; 23/25 < 95% of k-mers matched; Too few matches ✗
    3: LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP (Stop codon ✗)
    -3: WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE (Stop codon ✗)

    -1: AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR → bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd; 100% matching k-mers; ✔ Accepted reading frame
    -2: LGTLALALGILGGLGTA ALLIGVILWQRRQRRG → ebbebebebeebbebbb beeebeeefcddcddb; 100% matching k-mers; ✔ Accepted reading frame

    Database for step (4): UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter
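
The slide says the reference peptide k-mers are stored as a Bloom filter. A minimal sketch of that data structure follows, assuming a salted SHA-256 as the family of hash functions; the class, its parameters, and `fraction_matching` are hypothetical names for illustration and do not reflect how sencha builds or stores its index.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions; no false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def fraction_matching(kmers, bloom):
    """Fraction of a frame's k-mers reported present by the filter."""
    return sum(km in bloom for km in kmers) / len(kmers)


# Toy usage with two reduced-alphabet k-mers from the slide.
reference = BloomFilter(num_bits=4096, num_hashes=3)
for kmer in ["ddadeadde", "bebdbbdeb"]:
    reference.add(kmer)
print(fraction_matching(["ddadeadde", "bebdbbdeb", "fffffffff"], reference))
```

Because a Bloom filter can report false positives but never false negatives, a frame's match fraction can only be overestimated, which is one reason a high threshold such as 95% is a sensible cutoff.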
  64. 65.

    SENCHA ACCURATELY FINDS PROTEIN-CODING SEQUENCES

    https://github.com/czbiohub/sencha

    Accepted reading frames (100% matching k-mers):
    -1: AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR → bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd
    -2: LGTLALALGILGGLGTA ALLIGVILWQRRQRRG → ebbebebebeebbebbb beeebeeefcddcddb

    BLAT search of sequencing read shows multiple reading frames
    - Gene is on negative strand
    - Alternative splicing event upstream caused frameshift and thus there are two possible reading frames

    Sencha accurately finds reads in CDS sequences
    [ROC curve: True positive rate vs. False positive rate; AUC = 0.933; ENSEMBL 97 human peptides; Protein alphabet, K-mer size = 7]
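
For readers unfamiliar with the AUC figure, here is a small sketch of how a ROC AUC can be computed from per-read scores. The slide does not say which score was thresholded, so the scores below (imagined as the fraction of matching k-mers per read) and the labels (whether a read lies in an annotated CDS) are hypothetical toy values.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = read lies in a coding sequence, 0 = it does not.
match_fraction = [0.98, 0.40, 0.96, 0.10, 0.99, 0.55]
is_coding = [1, 1, 0, 0, 1, 0]
print(roc_auc(match_fraction, is_coding))
```

An AUC of 1.0 means every coding read scores above every noncoding read; 0.5 is what random scores would give.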
  66. 67.

    KMERMAID TRANSLATES RNA-SEQ READS INTO PROTEINS AND COMPRESSES SEQUENCES INTO BAGS OF K-MER WORDS

    https://github.com/nf-core/kmermaid/

    Input read datasets (ACT..TCG, CTA..TTC, ..)
    ⇢ RNA → Protein: Translate RNA-seq reads with sencha
    ⇢ Protein ↓ Dayhoff: Convert to reduced alphabet and compress sequences with sourmash
    ⇢ “Bags of k-mer words”
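
To see why the reduced alphabet helps across species, the sketch below compares k-mer overlap before and after Dayhoff encoding. It is a toy illustration, not kmermaid or sourmash code: the second peptide is a hypothetical ortholog invented for the example, k = 7 is an arbitrary choice, and the mapping is the usual six-letter Dayhoff grouping.

```python
# Dayhoff six-letter reduced alphabet.
DAYHOFF_GROUPS = {"C": "a", "AGPST": "b", "DENQ": "c", "HKR": "d", "ILMV": "e", "FWY": "f"}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_encode(peptide):
    """Replace each amino acid by its Dayhoff group letter."""
    return "".join(DAYHOFF[aa] for aa in peptide)


def kmer_set(seq, k):
    """Set of all overlapping k-mers (the "bag of words")."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b)


peptide_a = "LGTLALALGILGGLGTA"  # peptide fragment from the sencha slide
peptide_b = "LGTLALALGVLGGLGTA"  # hypothetical ortholog with one I -> V change
k = 7

protein_similarity = jaccard(kmer_set(peptide_a, k), kmer_set(peptide_b, k))
dayhoff_similarity = jaccard(
    kmer_set(dayhoff_encode(peptide_a), k),
    kmer_set(dayhoff_encode(peptide_b), k),
)
print("protein k-mers:", protein_similarity)
print("Dayhoff k-mers:", dayhoff_similarity)
```

Isoleucine and valine fall in the same Dayhoff group, so the conservative substitution disappears after encoding while it disrupts every protein k-mer that spans it.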
  74. 75.

    NF-PREDICTORTHOLOGS FINDS FUNCTIONAL ANNOTATIONS OF DIFFERENTIAL K-MER EXPRESSION

    https://github.com/czbiohub/nf-predictorthologs

    Differential gene expression (Bang et al, Scientific Reports (2019)) ≅ Differential k-mer expression

    [Workflow diagram. Labels: Alignments; Translated protein sequences; Compressed k-mers; Differential groups; Logistic regression; Differential k-mer expression; Differential k-mers; Search for k-mer in NCBI RefSeq protein sequences (sourmash search); In genome / Not in genome]

    [Under construction: In a gene / Not in a gene; Singly-mapped / Multi-mapped; Ortholog / Non-ortholog]
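
The slide names logistic regression as the step that turns compressed k-mers and sample groups into differential k-mers. One way this could look is sketched below, assuming a separate one-feature logistic regression per k-mer hash on presence/absence data; the function names, the toy samples, and the choice of plain gradient descent with a small L2 penalty are all assumptions, and the pipeline's actual model may differ.

```python
import math


def fit_logistic(x, y, lr=0.5, steps=500, l2=0.01):
    """Fit y ~ sigmoid(w * x + b) for one feature by gradient descent."""
    w = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            grad_w += (p - yi) * xi
            grad_b += p - yi
        w -= lr * (grad_w / n + l2 * w)  # small L2 penalty keeps w finite
        b -= lr * grad_b / n
    return w, b


def differential_kmers(presence, groups):
    """Rank k-mer hashes by the coefficient linking presence to the group."""
    coefs = {h: fit_logistic(col, groups)[0] for h, col in presence.items()}
    return sorted(coefs.items(), key=lambda item: item[1], reverse=True)


# Toy data: six samples, group 1 = brain, group 0 = liver.
groups = [1, 1, 1, 0, 0, 0]
presence = {
    "brain_kmer": [1, 1, 1, 0, 0, 0],  # present only in brain samples
    "shared": [1, 1, 1, 1, 1, 1],      # present everywhere
    "liver_kmer": [0, 0, 0, 1, 1, 1],  # present only in liver samples
}
for kmer_hash, coef in differential_kmers(presence, groups):
    print(kmer_hash, round(coef, 3))
```

K-mers with large positive coefficients are enriched in the first group, large negative ones in the second, and those near zero are uninformative.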
  79. 80.

    DIFFERENTIAL K-MER EXPRESSION FINDS UNALIGNED “DARK MATTER” IN TRANSLATED TRANSCRIPTOMES

    Brawand et al, Nature (2011)
    • Performed differential k-mer expression on Brain vs Liver on bonobo and rhesus RNA-seq data
    • Used protein k-mers as evolutionary distance is <10 million years

    [Sample groups: bonobo, rhesus, rhesus vs bonobo, rhesus, rhesus]

    Brain-enriched differential k-mers, not present in some genomes:

    | K-mer | Translated read with k-mer | RefSeq match | Not in genome |
    |---|---|---|---|
    | QSLFFHFPP | LGQSLFFHFPPLLRDGENY | NP_001004067.1 nodal modulator 3 precursor [Homo sapiens] | Bonobo (Pan paniscus) |
    | RLDLMREMY | THYWSLEKLKQRLDLMREMYDRAG | NP_032466.2 kinesin-like protein KIF1A isoform a [Mus musculus] | Rhesus macaque (Macaca mulatta) |
    | TYFSKNYQD | EKLIETYFSKNYQDYEYLINV | NP_000524.3 myelin proteolipid protein isoform 1 [Homo sapiens] | Rhesus macaque (Macaca mulatta) |
    | CSAVPVYIY | LLVFACSAVPVYIYFNTWTT | NP_000524.3 myelin proteolipid protein isoform 1 [Homo sapiens] | Rhesus macaque (Macaca mulatta) |
    | GDRNNSSCR | VTGDRNNSSCRNYNKQASEQNWAN | NP_001185877.1 gap junction alpha-1 protein [Oryctolagus cuniculus] | Rhesus macaque (Macaca mulatta) |
    | VLFVPKMRR | IVFSSYITLVVLFVPKMRR | NP_062312.3 gamma-aminobutyric acid type B receptor subunit 1 precursor [Mus musculus] | Rhesus macaque (Macaca mulatta) |
    | ASIRDANLY | KVSYARPSSASIRDANLYVSG | NP_001361166.1 ELAV-like protein 2 isoform 5 [Mus musculus] | Rhesus macaque (Macaca mulatta) |

    Not a typo: two differential hashes were found in the same RefSeq protein sequence (NP_000524.3).
  80. 81.

    CONCLUSIONS AND FUTURE WORK

    Conclusions
    • Combining Dayhoff-encoded reduced protein alphabet k-mers with sketching algorithms enables scalable, genome-agnostic cross-species analyses
    • sencha translates RNA-seq data into the correct protein-coding frame (https://github.com/czbiohub/sencha/)
    • kmermaid is a Nextflow pipeline that translates RNA-seq reads into protein and subsamples k-mers by MinHashing (https://github.com/nf-core/kmermaid/)
    • nf-predictorthologs is a Nextflow pipeline that performs differential k-mer expression and searches for functional annotations of differential k-mers (https://github.com/czbiohub/nf-predictorthologs)

    Potential applications:
    • Aligning cell atlases across large evolutionary distances
    • Identifying orthologous differential genes in multi-species transcriptomes without genomes or gene annotations

    Future Work
    • sencha
      • Benchmark against synthetically generated coding/noncoding dataset
      • Compute ROC AUC for protein-coding reads as aligning into CDS-annotated regions for “real” data
    • kmermaid
      • Polish for 1.0.0 release
    • nf-predictorthologs
      • Add gene counting of differential k-mers
      • Use diamond blastp as backup protein search if k-mer is not present in NCBI RefSeq
      • Apply to full set of Brawand2011 data
      • Apply to datasets from larger evolutionary distances
  82. 85.

    ACKNOWLEDGEMENTS

    K-mermidons group
    - Phoenix Logan
    - Pranathi Vemuri
    - Saba Nafees
    - Lekha Karanam

    Jim Karkanias, VP Data Sciences and IT

    Spyros Darmanis and group

    Outside of Biohub (@github)
    - Sourmash (github.com/dib-lab/sourmash/): C. Titus Brown (@ctb), Luiz Irber (@luizirber), Tessa Pierce (@bluegenes)
    - Nextflow (github.com/nextflow-io/nextflow/): Paolo Di Tommaso (@pditommaso), @KochTobi, Rad Suchecki (@rsuchecki)
    - nf-core (nf-co.re): Phil Ewels (@ewels), Alexander Peltzer (@apeltzer), Harshil Patel (@drpatelh)

    CZ Biohub Data Sciences and Information Technology Team: Jim Karkanias, Joshua Batson, James Webber, Aaron McGeever, Angela Oliveira Pisco, Jenny Folkesson, Samantha Hao, Phoenix Logan, Giana Cirolia, Olga Botvinnik, Saransh Kaul, Lekha Karanam, Jack Kamm, David Dynerman, Lucy Li, Pranathi Vemuri, Saba Nafees, Clarissa Vasquez

    Questions? olga.botvinnik@czbiohub.org @olgabot