Slide 1

Slide 1 text

Functional prediction of transcriptomic “dark matter” across species
Olga Botvinnik, Data Scientist, Chan Zuckerberg Biohub
@olgabot
2020-05-11, Data Intensive Biology Lab Meeting

Slide 2

Slide 2 text

Overview Introduction Methods Applications
OUTLINE
Introduction
• Motivation
  • Less than 0.1% of species on Earth have genome assemblies and annotations
  • Genomes alone are not sufficient for comparative transcriptomics
  • Beyond gene orthology, we need a common language to embed transcriptomes across species
• Prior Art
  • K-mers from reduced amino acid alphabets approximate orthology
  • Sketching algorithms compress sequence data and enable scalable analyses
Methods
• Translate RNA-seq into protein via sencha
• Translate and compress RNA-seq data into protein k-mers via kmermaid
• Get functional prediction of differential k-mer expression with nf-predictorthologs
Applications
• Identify “missing” genes, absent from the genome assembly, in a primate brain RNA-seq dataset

Slide 6

Slide 6 text

TRANSCRIPTOMES OF INDIVIDUAL CELLS ARE AN INTERMEDIATE BETWEEN DNA AND PHENOTYPE
• In multicellular organisms, nearly every cell contains the same genome, but not every gene is transcriptionally active in every cell
• The transcriptome offers a closer view of real-time gene expression in a cell
Angela Oliveira Pisco
[Figure labels: DNA, Phenotype]

Slide 13

Slide 13 text

TRANSCRIPTOMES ARE INSTRUCTIONS FOR THE BUILDING BLOCKS OF CELLS
• Central dogma of biology: DNA → RNA → Protein
• Different species, different language of instructions
• Can’t just call IKEA for help…
[Figure labels: Instructions, Outcome, DNA, mRNA, Protein, Cell]

Slide 18

Slide 18 text

GOAL: ALIGN TRANSCRIPTOMES TO NON-MODEL ORGANISMS TO FIND “MATCHING” CELL TYPES AND “MISSING” GENES
Finding “matching” cell types
• Tabula Muris Senis (AO Pisco et al., bioRxiv, 2019)
• Neurons, epithelial cells, stem cells
Finding “missing” genes
• Cell type-enriched sequences vs genome
• Present in genome
• Not in genome: “missing” genes

Slide 25

Slide 25 text

Huge diversity of species on Earth
8,740,000 eukaryotes predicted to exist (Wikipedia; Mora et al., PLoS Biology, 2011)
• Chromists: 27,500
• Protozoa: 36,400
• Fungi: 611,000
• Plants: 298,000
• Metazoans: 7,770,000

# Eukaryotes   % All Euks   Category
11,691         0.01332%     Genome assembly submitted on NCBI
1,695          0.00193%     High-quality genome in ENSEMBL (Metazoa + Vertebrates)
190            0.00022%     Annotated gene orthology to human
47             0.00004%     UniProt Reference Proteomes with disputed orthology

Slide 33

Slide 33 text

MORE GENOMES DON’T SOLVE THE FUNDAMENTAL CHALLENGE OF DEFINING ORTHOLOGOUS GENE RELATIONSHIPS
• The fundamental problem in comparative transcriptomics is that defining orthologous genes is hard.
• Gene tree for Insulin in Human, Mouse and Rat (https://omabrowser.org/oma/type/), with nodes labeled: speciation event, duplication event, speciation event
• Quest for Orthologs Consortium papers: 2014 Bioinformatics; 2016 Nature Methods; 2017 Bioinformatics; 2019 Molecular Biology and Evolution

Slide 45

Slide 45 text

K-MERS FROM REDUCED AMINO ACID ALPHABETS RETAIN EVOLUTIONARILY CONSERVED BIOCHEMICAL PROPERTIES

Amino acid       Property                Dayhoff
C                Sulfur polymerization   a
A, G, P, S, T    Small                   b
D, E, N, Q       Acid and amide          c
H, K, R          Basic                   d
I, L, M, V       Hydrophobic             e
F, W, Y          Aromatic                f

Protein: FLAWLESS
Dayhoff: febfecbb

Reduced alphabet k-mers are resilient to amino acid changes
Single amino acid change: Leucine (L) → Valine (V), k = 3
• Protein alphabet: k k-mers affected
  Original (FLAWLESS): FLA LAW AWL WLE LES ESS
  L → V (FLAWVESS): FLA LAW AWV WVE VES ESS
• Dayhoff alphabet: no change in k-mers!
  Original (febfecbb): feb ebf bfe fec ecb cbb
  L → V (febfecbb): feb ebf bfe fec ecb cbb
Recent paper used reduced amino acid encodings to identify orthologous genes

Dayhoff MO (1965). An Atlas of Protein Sequence.
Phillips R, Kondev J, & Theriot J. (2012). Physical Biology of the Cell.
Peris, P., López, D., & Campos, M. (2008). IgTM: An algorithm to predict transmembrane domains and topology in proteins. BMC Bioinformatics, 9, 367. http://doi.org/10.1186/1471-2105-9-367
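
The FLAWLESS example can be reproduced with a few lines of Python. This is a minimal sketch, not code from any of the tools named in this talk: the names `DAYHOFF_GROUPS`, `dayhoff_encode`, and `kmers` are illustrative, and the grouping is copied from the slide's amino acid table.

```python
# Dayhoff groups from the slide's table: amino acids -> one-letter class.
DAYHOFF_GROUPS = {
    "C": "a",       # sulfur polymerization
    "AGPST": "b",   # small
    "DENQ": "c",    # acid and amide
    "HKR": "d",     # basic
    "ILMV": "e",    # hydrophobic
    "FWY": "f",     # aromatic
}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_encode(protein):
    """Re-encode a protein string into the six-letter Dayhoff alphabet."""
    return "".join(DAYHOFF[aa] for aa in protein.upper())


def kmers(sequence, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


original, mutated = "FLAWLESS", "FLAWVESS"  # single L -> V substitution
k = 3

# k-mers present in the original but missing after the substitution.
protein_lost = kmers(original, k) - kmers(mutated, k)
dayhoff_lost = kmers(dayhoff_encode(original), k) - kmers(dayhoff_encode(mutated), k)

print(dayhoff_encode(original))  # febfecbb
print(sorted(protein_lost))      # ['AWL', 'LES', 'WLE']
print(sorted(dayhoff_lost))      # []
```

The resilience only covers substitutions within a Dayhoff group. A change across groups, such as L → K, would still alter k k-mers in the reduced alphabet.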

Slide 54

Slide 54 text

SKETCHING ALGORITHMS ENABLE SCALABLE SEQUENCE ANALYSES
• Each sequencing experiment as a bag of k-mer words
• Compress sequence data with MinHash: millions of reads (ACT..TCG, CTA..TTC, ..) ⇢ “bag of words” (ATCGA, CTGAG, AATCG, CTAGG, TTTTC, …)
• Except the “words” are actually integer hashes: 4865228557101083157, 18248316805952202104, 12187299837937207145, 1556650891075415342, 15754894269028698533, …
• Typically, MinHashing scales down the data by several orders of magnitude, e.g. from ~10^6 reads to ~10^3 “words”
• What is the overlap in k-mers between two sequencing datasets, A and B?
  Jaccard index = Intersection / Union = |A ∩ B| / |A ∪ B|
• Overlap in MinHash-subsampled bag of words ≅ true overlap of k-mers between two read datasets
• A proof exists that subsampling k-mers using MinHashing approximates the true Jaccard similarity
References: 2016 Genome Biology; 2019 F1000 Research
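
To make the sketching idea concrete, here is a minimal bottom-k MinHash sketch in Python. It is an illustration under stated assumptions, not the implementation used by the tools cited on this slide: SHA-1 truncated to 64 bits stands in for whatever hash function a real sketching library uses, and the function names and toy sequences are hypothetical.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is a stand-in hash here)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(kmer_set, num_hashes):
    """Bottom-k sketch: keep only the num_hashes smallest hash values."""
    return sorted(hash_kmer(kmer) for kmer in kmer_set)[:num_hashes]


def true_jaccard(set_a, set_b):
    """Exact Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def estimated_jaccard(sketch_a, sketch_b, num_hashes):
    """Estimate Jaccard from two bottom-k sketches built with the same hash."""
    shared = set(sketch_a) & set(sketch_b)
    # The smallest hashes of the merged sketches form the bottom-k sketch of A ∪ B.
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:num_hashes]
    if not union_bottom:
        return 0.0
    return sum(h in shared for h in union_bottom) / len(union_bottom)


# Toy "datasets": two sequences that differ at a single position.
kmers_a = kmers("ACTGGTCATCGATTCGAGCTAGGCTA", k=5)
kmers_b = kmers("ACTGGTCATCGTTTCGAGCTAGGCTA", k=5)
sketch_a = minhash_sketch(kmers_a, num_hashes=10)
sketch_b = minhash_sketch(kmers_b, num_hashes=10)

print(true_jaccard(kmers_a, kmers_b))
print(estimated_jaccard(sketch_a, sketch_b, num_hashes=10))
```

With only 10 hashes the estimate is coarse. Larger sketches reduce the error, and when the sketch is at least as large as the k-mer sets the estimate equals the exact Jaccard index.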

Slide 59

Slide 59 text

Overview Introduction Methods Applications SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME 10 https://github.com/czbiohub/sencha GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG 1: 2: 3: -1: -2: -3: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR LGTLALALGILGGLGTA ALLIGVILWQRRQRRG WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE bddadeaddebbebdbb ebdbbdebdbdbdebb bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd ebbebebebeebbebbb beeebeeefcddcddb (1) Six-frame translation (2) Re-encode to reduced alphabet ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe Stop codon ✗ Stop codon ✗ deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb Stop codon ✗ beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee (3) k-merize each frame RNA- seq read

Slide 60

Slide 60 text

Overview Introduction Methods Applications SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME 10 https://github.com/czbiohub/sencha GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG 1: 2: 3: -1: -2: -3: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR LGTLALALGILGGLGTA ALLIGVILWQRRQRRG WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE bddadeaddebbebdbb ebdbbdebdbdbdebb bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd ebbebebebeebbebbb beeebeeefcddcddb (1) Six-frame translation (2) Re-encode to reduced alphabet ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe Stop codon ✗ Stop codon ✗ deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb Stop codon ✗ beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee (3) k-merize each frame (4) k-mers present in database? UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter RNA- seq read

Slide 61

Slide 61 text

Overview Introduction Methods Applications SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME 10 https://github.com/czbiohub/sencha GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG 1: 2: 3: -1: -2: -3: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR LGTLALALGILGGLGTA ALLIGVILWQRRQRRG WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE bddadeaddebbebdbb ebdbbdebdbdbdebb bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd ebbebebebeebbebbb beeebeeefcddcddb (1) Six-frame translation (2) Re-encode to reduced alphabet ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe Stop codon ✗ Stop codon ✗ deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb Stop codon ✗ beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee 
ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee (3) k-merize each frame (4) k-mers present in database? UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter 23/25 < 95% of k-mers matched 100% matching k-mers 100% matching k-mers RNA- seq read

Slide 62

Slide 62 text

SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME
https://github.com/czbiohub/sencha

RNA-seq read: GTAACAGTAGCAGAGCCGGTGACA GCGCCAGGCTGGGCTGGGTTCTCT CTGTGGGTGTGCACGGCAAAGCTG

(1) Six-frame translation
 1: SSALPPLPQDDPNEQGG CPQASQDPQGQG*SSQ (Stop codon ✗)
 2: PRRCRLCHKMTPMSRAA VPRPPRIPRARARVPS
 3: LGVAAFATR*PQ*AGR LSPGLPGSPGPGLEFP (Stop codon ✗)
-1: AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR
-2: LGTLALALGILGGLGTA ALLIGVILWQRRQRRG
-3: WEL*PWPWGSWEAWGQ PPCSLGSSCGKGGNAE (Stop codon ✗)

(2) Re-encode to reduced alphabet
 2: bddadeaddebbebdbb ebdbbdebdbdbdebb
-1: bbcbbbbbbcbbdbbcb dbbdfbdeebdbbbbd
-2: ebbebebebeebbebbb beeebeeefcddcddb

(3) k-merize each frame
 2: ddadeadde bebdbbdeb bdebdbdbd bdbbebdbb addebbebd ebdbbdebd deaddebbe debbebdbb debdbdbde ddebbebdb dbbebdbbd bdbbdebdb bbebdbbeb ebdbdbdeb bebdbbebd adeaddebb bdbdbdebb dadeaddeb dbbdebdbd ebdbbebdb bddadeadd bbdebdbdb bbebdbbde eaddebbeb ebbebdbbe
-1: deebdbbbb bcbbbbbbc dbbcbdbbd bbdbbcbdb fbdeebdbb bdfbdeebd cbbdbbcbd bbcbbbbbb bbbbbbcbb bbbbcbbdb dfbdeebdb bdeebdbbb bbdfbdeeb cbdbbdfbd dbbdfbdee eebdbbbbd bcbbdbbcb bbcbdbbdf bbbcbbdbb bbcbbdbbc bbbbbcbbd cbbbbbbcb bcbdbbdfb bdbbdfbde bdbbcbdbb
-2: beebbebbb eebbebbbb beeebeeef bbbbeeebe bbebebebe eefcddcdd ebeeefcdd bebebebee ebbebbbbe ebbbbeeeb bebeebbeb ebeebbebb bebbbbeee ebebebeeb eebeeefcd ebebeebbe bebebeebb efcddcddb bbebbbbee ebbebebeb beeefcddc eeefcddcd eeebeeefc bbbeeebee bbeeebeee

(4) k-mers present in database?
UniProt manually curated peptide sequences for Opisthokonta (animal-like + fungi-like) stored as a Bloom filter
 2: 23/25 < 95% of k-mers matched
-1: 100% matching k-mers
-2: 100% matching k-mers

(5) Use reading frames with >95% matching k-mers
 2: Too few matches ✗
-1: ✔ Accepted reading frame
-2: ✔ Accepted reading frame
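The five steps on this slide can be sketched in a few lines of Python. This is a hedged illustration of the idea, not sencha's actual code: all function names are hypothetical, a plain Python `set` stands in for the Bloom filter of known peptide k-mers, and the Dayhoff grouping used is the common six-letter one (C; AGPST; DENQ; HKR; ILMV; FWY), which matches the encoded strings shown on the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Six-letter Dayhoff reduced amino acid alphabet.
DAYHOFF_GROUPS = {"C": "a", "AGPST": "b", "DENQ": "c",
                  "HKR": "d", "ILMV": "e", "FWY": "f"}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(read):
    """Step 1: frames 1, 2, 3 on the read and -1, -2, -3 on its reverse complement."""
    frames = {}
    for strand, seq in ((1, read), (-1, reverse_complement(read))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames


def dayhoff_encode(peptide):
    """Step 2: re-encode amino acids into the reduced alphabet."""
    return "".join(DAYHOFF.get(aa, "x") for aa in peptide)


def kmerize(seq, k):
    """Step 3: all overlapping k-mers, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coding_frames(read, known_kmers, k=9, threshold=0.95):
    """Steps 4 and 5: keep stop-free frames whose k-mers mostly match the database."""
    accepted = {}
    for frame, peptide in six_frames(read).items():
        if "*" in peptide:
            continue  # stop codon: rejected
        frame_kmers = kmerize(dayhoff_encode(peptide), k)
        if not frame_kmers:
            continue  # read too short for even one k-mer
        matched = sum(km in known_kmers for km in frame_kmers)
        if matched / len(frame_kmers) > threshold:
            accepted[frame] = peptide
    return accepted
```

A 33-residue frame with k = 9 yields 25 k-mers, which is where the 23/25 on the slide comes from. A real Bloom filter can also return false positives, which an exact `set` does not model.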

Slide 63

Slide 63 text

SENCHA TRANSLATES RNA-SEQ READS INTO CORRECT PROTEIN-CODING READING FRAME
https://github.com/czbiohub/sencha

Outcome for the example RNA-seq read, using reading frames with >95% matching k-mers:
 1: Stop codon ✗
 2: 23/25 < 95% of k-mers matched. Too few matches ✗
 3: Stop codon ✗
-1: AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR. 100% matching k-mers. ✔ Accepted reading frame
-2: LGTLALALGILGGLGTA ALLIGVILWQRRQRRG. 100% matching k-mers. ✔ Accepted reading frame
-3: Stop codon ✗

Slide 65

Slide 65 text

SENCHA ACCURATELY FINDS PROTEIN-CODING SEQUENCES
https://github.com/czbiohub/sencha

Accepted reading frames for the example read:
-1: AGNSSPGPGDPGRPGDS RPAHWGHLVAKAATPR (100% matching k-mers)
-2: LGTLALALGILGGLGTA ALLIGVILWQRRQRRG (100% matching k-mers)

BLAT search of sequencing read shows multiple reading frames
- Gene is on negative strand
- Alternative splicing event upstream caused frameshift and thus there are two possible reading frames

Sencha accurately finds reads in CDS sequences
[Figure: ROC curve, true positive rate vs. false positive rate. AUC = 0.933. ENSEMBL 97 human peptides. Protein alphabet, k-mer size = 7]

Slide 66

Slide 66 text

KMERMAID TRANSLATES RNA-SEQ READS INTO PROTEINS AND COMPRESSES SEQUENCES INTO BAGS OF K-MER WORDS
https://github.com/nf-core/kmermaid/

Slide 67

Slide 67 text

KMERMAID TRANSLATES RNA-SEQ READS INTO PROTEINS AND COMPRESSES SEQUENCES INTO BAGS OF K-MER WORDS
https://github.com/nf-core/kmermaid/

Input read datasets (ACT..TCG, CTA..TTC, ...)
⇢ RNA → Protein: Translate RNA-seq reads with sencha
⇢ Protein ↓ Dayhoff: Convert to reduced alphabet and compress sequences with sourmash
⇢ “Bags of k-mer words”
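As a rough illustration of the "Protein ↓ Dayhoff" step, the sketch below turns translated peptides into a compressed bag of integer k-mer hashes and shows why the reduced alphabet helps across species. It does not use kmermaid or the sourmash API: every name here is hypothetical, SHA-1 is an assumed stand-in hash, and the two example peptides are invented so that they differ only by substitutions within Dayhoff groups.

```python
import hashlib

# Six-letter Dayhoff reduced amino acid alphabet.
DAYHOFF_GROUPS = {"C": "a", "AGPST": "b", "DENQ": "c",
                  "HKR": "d", "ILMV": "e", "FWY": "f"}
DAYHOFF = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_encode(peptide):
    return "".join(DAYHOFF.get(aa, "x") for aa in peptide)


def identity(peptide):
    """Keep the full 20-letter protein alphabet."""
    return peptide


def hashed_kmers(seq, k):
    """Hash every k-mer of seq to a 64-bit integer."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def signature(peptides, k, num_hashes, encode=dayhoff_encode):
    """Bag of k-mer words for one sample: the num_hashes smallest k-mer hashes."""
    hashes = set()
    for peptide in peptides:
        hashes |= hashed_kmers(encode(peptide), k)
    return set(sorted(hashes)[:num_hashes])


def retained_overlap(sig_a, sig_b):
    """Jaccard index computed directly on the retained hashes."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 0.0


if __name__ == "__main__":
    species_a = ["MKVLAAGIR"]  # invented peptide
    species_b = ["MRILSSGVK"]  # same peptide with conservative substitutions
    for name, encode in (("protein", identity), ("dayhoff", dayhoff_encode)):
        sig_a = signature(species_a, k=5, num_hashes=100, encode=encode)
        sig_b = signature(species_b, k=5, num_hashes=100, encode=encode)
        print(name, retained_overlap(sig_a, sig_b))
```

In the 20-letter alphabet the two peptides share no 5-mers, while their Dayhoff encodings are identical, so the signatures match completely.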

Slide 68

Slide 68 text

Overview Introduction Methods Applications KMERMAID TRANSLATES RNA-SEQ READS INTO PROTEINS AND COMPRESSES SEQUENCES INTO BAGS OF K-MER WORDS 13 https://github.com/nf-core/kmermaid/ “Bags of k-mer words” Translate RNA- seq reads with sencha RNA → Protein Convert to reduced alphabet and compress sequences with sourmash Protein ↓ Dayhoff Input read datasets ACT..TCG CTA..TTC .. ⇢ AAAB7nicbVC7SgNBFL3rM8ZXoqXNYBCswq4Iahe1sYxgHpBdwuxkdjNkdmaZmRXCko+wsVDE1t7ej7Cz8F+cPApNPHDhcM693HtPmHKmjet+OUvLK6tr64WN4ubW9s5uqbzX1DJThDaI5FK1Q6wpZ4I2DDOctlNFcRJy2goH12O/dU+VZlLcmWFKgwTHgkWMYGOllh+yOPbzbqniVt0J0CLxZqRSu3Tfv8sfot4tffo9SbKECkM41rrjuakJcqwMI5yOin6maYrJAMe0Y6nACdVBPjl3hI6s0kORVLaEQRP190SOE62HSWg7E2z6et4bi/95ncxE50HORJoZKsh0UZRxZCQa/456TFFi+NASTBSztyLSxwoTYxMq2hC8+ZcXSfOk6p1WL25tGlcwRQEO4BCOwYMzqMEN1KEBBAbwAE/w7KTOo/PivE5bl5zZzD78gfP2AweZkxI= .. ⇢ AAAB7nicbVC7SgNBFL3rM8ZXoqXNYBCswq4Iahe1sYxgHpBdwuxkdjNkdmaZmRXCko+wsVDE1t7ej7Cz8F+cPApNPHDhcM693HtPmHKmjet+OUvLK6tr64WN4ubW9s5uqbzX1DJThDaI5FK1Q6wpZ4I2DDOctlNFcRJy2goH12O/dU+VZlLcmWFKgwTHgkWMYGOllh+yOPbzbqniVt0J0CLxZqRSu3Tfv8sfot4tffo9SbKECkM41rrjuakJcqwMI5yOin6maYrJAMe0Y6nACdVBPjl3hI6s0kORVLaEQRP190SOE62HSWg7E2z6et4bi/95ncxE50HORJoZKsh0UZRxZCQa/456TFFi+NASTBSztyLSxwoTYxMq2hC8+ZcXSfOk6p1WL25tGlcwRQEO4BCOwYMzqMEN1KEBBAbwAE/w7KTOo/PivE5bl5zZzD78gfP2AweZkxI=

Slide 69

Slide 69 text

Overview Introduction Methods Applications KMERMAID TRANSLATES RNA-SEQ READS INTO PROTEINS AND COMPRESSES SEQUENCES INTO BAGS OF K-MER WORDS 13 https://github.com/nf-core/kmermaid/ “Bags of k-mer words” Translate RNA- seq reads with sencha RNA → Protein Convert to reduced alphabet and compress sequences with sourmash Protein ↓ Dayhoff Input read datasets ACT..TCG CTA..TTC .. ⇢ AAAB7nicbVC7SgNBFL3rM8ZXoqXNYBCswq4Iahe1sYxgHpBdwuxkdjNkdmaZmRXCko+wsVDE1t7ej7Cz8F+cPApNPHDhcM693HtPmHKmjet+OUvLK6tr64WN4ubW9s5uqbzX1DJThDaI5FK1Q6wpZ4I2DDOctlNFcRJy2goH12O/dU+VZlLcmWFKgwTHgkWMYGOllh+yOPbzbqniVt0J0CLxZqRSu3Tfv8sfot4tffo9SbKECkM41rrjuakJcqwMI5yOin6maYrJAMe0Y6nACdVBPjl3hI6s0kORVLaEQRP190SOE62HSWg7E2z6et4bi/95ncxE50HORJoZKsh0UZRxZCQa/456TFFi+NASTBSztyLSxwoTYxMq2hC8+ZcXSfOk6p1WL25tGlcwRQEO4BCOwYMzqMEN1KEBBAbwAE/w7KTOo/PivE5bl5zZzD78gfP2AweZkxI= .. ⇢ AAAB7nicbVC7SgNBFL3rM8ZXoqXNYBCswq4Iahe1sYxgHpBdwuxkdjNkdmaZmRXCko+wsVDE1t7ej7Cz8F+cPApNPHDhcM693HtPmHKmjet+OUvLK6tr64WN4ubW9s5uqbzX1DJThDaI5FK1Q6wpZ4I2DDOctlNFcRJy2goH12O/dU+VZlLcmWFKgwTHgkWMYGOllh+yOPbzbqniVt0J0CLxZqRSu3Tfv8sfot4tffo9SbKECkM41rrjuakJcqwMI5yOin6maYrJAMe0Y6nACdVBPjl3hI6s0kORVLaEQRP190SOE62HSWg7E2z6et4bi/95ncxE50HORJoZKsh0UZRxZCQa/456TFFi+NASTBSztyLSxwoTYxMq2hC8+ZcXSfOk6p1WL25tGlcwRQEO4BCOwYMzqMEN1KEBBAbwAE/w7KTOo/PivE5bl5zZzD78gfP2AweZkxI=

Slide 70

Slide 70 text

NF-PREDICTORTHOLOGS FINDS FUNCTIONAL ANNOTATIONS OF DIFFERENTIAL K-MER EXPRESSION
https://github.com/czbiohub/nf-predictorthologs

Differential gene expression
[Figure: Bang et al, Scientific Reports (2019)]

Slide 75

Slide 75 text

NF-PREDICTORTHOLOGS FINDS FUNCTIONAL ANNOTATIONS OF DIFFERENTIAL K-MER EXPRESSION
https://github.com/czbiohub/nf-predictorthologs

Differential k-mer expression ≅ Differential gene expression
[Figure: Bang et al, Scientific Reports (2019)]

[Figure: workflow diagram. Labels: Alignments; Compressed k-mers; Translated protein sequences; Differential groups; Logistic regression; Differential k-mer expression; Differential k-mers; Search for k-mer in NCBI RefSeq protein sequences (sourmash search); In genome / Not in genome; In a gene / Not in a gene; Singly-mapped / Multi-mapped; Ortholog / Non-ortholog; Under construction]
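The "Logistic regression ⇢ Differential k-mers" step can be pictured as fitting a classifier on a samples-by-k-mers presence matrix and ranking k-mers by their coefficients. The sketch below is a toy version under stated assumptions: it is not the pipeline's implementation, the gradient-descent fit, the hyperparameters, and the k-mer names are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500, l2=0.01):
    """L2-regularised logistic regression fitted by batch gradient descent.

    X: rows are samples, columns are k-mers (1 = present, 0 = absent).
    y: group label per sample (1 or 0).
    Returns (weights, bias).
    """
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(z) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        weights = [w - lr * (g / n_samples + l2 * w)
                   for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def rank_kmers(names, weights):
    """Sort k-mers from most enriched in group 1 to most enriched in group 0."""
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical k-mer names and presence/absence data.
    names = ["kmer_brain", "kmer_shared", "kmer_liver"]
    X = [[1, 1, 0], [1, 1, 0], [1, 1, 0],   # brain samples
         [0, 1, 1], [0, 1, 1], [0, 1, 1]]   # liver samples
    y = [1, 1, 1, 0, 0, 0]                  # 1 = brain, 0 = liver
    weights, _ = fit_logistic(X, y)
    for name, weight in rank_kmers(names, weights):
        print(f"{name}\t{weight:+.2f}")
```

K-mers with large positive coefficients are candidates for group-1-enriched "differential k-mers", which would then be searched against a protein database as the slide describes.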

Slide 78

Slide 78 text

DIFFERENTIAL K-MER EXPRESSION FINDS UNALIGNED “DARK MATTER” IN TRANSLATED TRANSCRIPTOMES
Brawand et al, Nature (2011)

• Performed differential k-mer expression on Brain vs Liver on bonobo and rhesus RNA-seq data
• Used protein k-mers as evolutionary distance is <10 million years

[Figure: two sample groups compared ("vs"), each labeled bonobo, rhesus, rhesus]

Slide 80

Slide 80 text

DIFFERENTIAL K-MER EXPRESSION FINDS UNALIGNED “DARK MATTER” IN TRANSLATED TRANSCRIPTOMES
Brawand et al, Nature (2011)

• Performed differential k-mer expression on Brain vs Liver on bonobo and rhesus RNA-seq data
• Used protein k-mers as evolutionary distance is <10 million years

[Figure: two sample groups compared ("vs"), each labeled bonobo, rhesus, rhesus]

Brain-enriched differential k-mers, not present in some genomes:

K-mer | Translated read with k-mer | RefSeq match | Not in genome
QSLFFHFPP | LGQSLFFHFPPLLRDGENY | NP_001004067.1 nodal modulator 3 precursor [Homo sapiens] | Bonobo (Pan paniscus)
RLDLMREMY | THYWSLEKLKQRLDLMREMYDRAG | NP_032466.2 kinesin-like protein KIF1A isoform a [Mus musculus] | Rhesus macaque (Macaca mulatta)
TYFSKNYQD | EKLIETYFSKNYQDYEYLINV | NP_000524.3 myelin proteolipid protein isoform 1 [Homo sapiens] | Rhesus macaque (Macaca mulatta)
CSAVPVYIY | LLVFACSAVPVYIYFNTWTT | NP_000524.3 myelin proteolipid protein isoform 1 [Homo sapiens] | Rhesus macaque (Macaca mulatta)
GDRNNSSCR | VTGDRNNSSCRNYNKQASEQNWAN | NP_001185877.1 gap junction alpha-1 protein [Oryctolagus cuniculus] | Rhesus macaque (Macaca mulatta)
VLFVPKMRR | IVFSSYITLVVLFVPKMRR | NP_062312.3 gamma-aminobutyric acid type B receptor subunit 1 precursor [Mus musculus] | Rhesus macaque (Macaca mulatta)
ASIRDANLY | KVSYARPSSASIRDANLYVSG | NP_001361166.1 ELAV-like protein 2 isoform 5 [Mus musculus] | Rhesus macaque (Macaca mulatta)

Not a typo: Differential hash found in the same RefSeq protein sequence (TYFSKNYQD and CSAVPVYIY both match NP_000524.3)

Slide 81

Slide 81 text

CONCLUSIONS AND FUTURE WORK

Conclusions
• Combining Dayhoff-encoded reduced protein alphabet k-mers with sketching algorithms enables scalable, genome-agnostic cross-species analyses
• sencha translates RNA-seq data into the correct protein-coding frame (https://github.com/czbiohub/sencha/)
• kmermaid is a Nextflow pipeline that translates RNA-seq reads into protein and subsamples k-mers by MinHashing (https://github.com/nf-core/kmermaid/)
• nf-predictorthologs is a Nextflow pipeline that performs differential k-mer expression and searches for functional annotations of differential k-mers (https://github.com/czbiohub/nf-predictorthologs)

Potential applications:
• Aligning cell atlases across large evolutionary distances
• Identifying orthologous differential genes in multi-species transcriptomes without genomes or gene annotations

Future Work
• sencha
  • Benchmark against synthetically generated coding/noncoding dataset
  • Compute ROC AUC on “real” data, treating reads that align to CDS-annotated regions as protein-coding
• kmermaid
  • Polish for 1.0.0 release
• nf-predictorthologs
  • Add gene counting of differential k-mers
  • Use diamond blastp as backup protein search if k-mer is not present in NCBI RefSeq
  • Apply to full set of Brawand2011 data
  • Apply to datasets from larger evolutionary distances

Slide 82

Slide 82 text

PAPER IN PROGRESS
https://github.com/czbiohub/de-novo-orthology-paper

Slide 85

Slide 85 text

ACKNOWLEDGEMENTS

K-mermidons group
- Phoenix Logan
- Pranathi Vemuri
- Saba Nafees
- Lekha Karanam

Jim Karkanias, VP Data Sciences and IT
Spyros Darmanis and group

Outside of Biohub (@github)
- Sourmash (github.com/dib-lab/sourmash/):
  - C. Titus Brown (@ctb), Luiz Irber (@luizirber), Tessa Pierce (@bluegenes)
- Nextflow (github.com/nextflow-io/nextflow/):
  - Paolo Di Tommaso (@pditommaso), @KochTobi, Rad Suchecki (@rsuchecki)
- nf-core (nf-co.re):
  - Phil Ewels (@ewels), Alexander Peltzer (@apeltzer), Harshil Patel (@drpatelh)

CZ Biohub Data Sciences and Information Technology Team
Jim Karkanias, Joshua Batson, James Webber, Aaron McGeever, Angela Oliveira Pisco, Jenny Folkesson, Samantha Hao, Phoenix Logan, Giana Cirolia, Olga Botvinnik, Saransh Kaul, Lekha Karanam, Jack Kamm, David Dynerman, Lucy Li, Pranathi Vemuri, Jim Karkanias, Saba Nafees, Clarissa Vasquez

Questions?
[email protected] @olgabot