Slide 1

Slide 1 text

Reference-free comparative transcriptomics
Olga Botvinnik, Data Scientist
[email protected], @olgabot
May 18th, 2019

Slide 2

Slide 2 text

CELLS ARE AN INTERMEDIATE BETWEEN DNA AND PHENOTYPE
Outline: Overview, Introduction, Methods, Results, Conclusions
[Figure: DNA and Phenotype, with the levels Cell, Tissue, Organ, Organ System, Organism]
Image credits: Technoscience.global2.vic.edu.au; background vector created by freepik

Slide 3

Slide 3 text

FROM SMOOTHIE (BULK RNA-SEQ) TO FRUIT SALAD (SINGLE-CELL RNA-SEQ)
[Figure panels: Bulk RNA-seq; Single-cell RNA-seq]

Slide 4

Slide 4 text

COMPARATIVE SINGLE-CELL TRANSCRIPTOMICS TO BUILD A PHYLOGENETIC TREE OF CELL TYPES
[Figure: tree with leaves labeled Cell type 1 through Cell type 10]
• How can cells from organisms without reference genomes be compared to annotated cell types?
• When in evolutionary time could a cell type have originally appeared?
• Can a new species be defined by the introduction of a new cell type or cell state?

Slide 5

Slide 5 text

NOT ALL GENES HAVE A 1:1 EXACT ORTHOLOGUE MATCH BETWEEN SPECIES
<50% of human genes have a 1:1 orthologue in mouse or zebrafish
[Figure labels: Alternative first exons; Single first exon; Alternative first exons]
Altschmied, J., et al. (2002). Subfunctionalization of duplicate mitf genes associated with differential degeneration of alternative exons in fish. Genetics, 161(1), 259–267.

Slide 6

Slide 6 text

NOT ALL GENES HAVE A 1:1 EXACT ORTHOLOGUE MATCH BETWEEN SPECIES
<50% of human genes have a 1:1 orthologue in mouse or zebrafish
Solution: Use protein k-mers created by six-frame translation of RNA k-mers
[Figure labels: Alternative first exons; Single first exon; Alternative first exons]
Altschmied, J., et al. (2002). Subfunctionalization of duplicate mitf genes associated with differential degeneration of alternative exons in fish. Genetics, 161(1), 259–267.
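The six-frame translation idea on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the code from the talk: the function names (`six_frame_translations`, `protein_kmers`) are hypothetical, the standard genetic code is assumed, and splitting peptides at stop codons is an assumption made here so that no protein k-mer spans a stop.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations of 3 forward and 3 reverse-complement frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or cDNA input
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def protein_kmers(seq, ksize=7):
    """Collect protein k-mers from all six frames.

    Assumption: peptides are split at stop codons ("*") so that no
    k-mer crosses a stop.
    """
    kmers = set()
    for frame in six_frame_translations(seq):
        for peptide in frame.split("*"):
            for i in range(len(peptide) - ksize + 1):
                kmers.add(peptide[i:i + ksize])
    return kmers
```

For stranded data, the same sketch reduces to three frames by translating only the strand that matches the library protocol.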

Slide 7

Slide 7 text

A “SKETCH” OF SEQUENCES IS A COMPRESSED REPRESENTATION OF THE ENTIRE DATASET
[Figure: Käthe Kollwitz, "Self Portrait", charcoal on brown laid Ingres paper, 1933. Source: "Sketch", Wikipedia (2019)]

Slide 8

Slide 8 text

COMPRESS A CELL’S CDNA CONTENT TO A “SKETCH” OF PROTEIN K-MERS
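One common way to build such a sketch is MinHash: hash every k-mer and keep only a small, reproducible subset of the hash values. The sketch below is a minimal illustration under stated assumptions, not the implementation used in the talk. It uses SHA-1 from the standard library as a stand-in hash function, and the names `hash_kmer`, `bottom_k_sketch`, and `scaled_sketch` are hypothetical. The two variants mirror the two settings quoted elsewhere in the deck: a fixed number of k-mers per cell (4096) and a fixed fraction of k-mers (1/1000).

```python
import hashlib


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is a stand-in hash here)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_k_sketch(kmers, num_hashes=4096):
    """Keep the num_hashes smallest hash values (fixed-size sketch)."""
    hashes = sorted({hash_kmer(k) for k in kmers})
    return set(hashes[:num_hashes])


def scaled_sketch(kmers, scaled=1000):
    """Keep hashes below 2**64 / scaled, i.e. roughly 1/scaled of all k-mers."""
    max_hash = 2**64 // scaled
    return {h for h in map(hash_kmer, kmers) if h < max_hash}
```

Because the same k-mer always hashes to the same value, two cells that share many k-mers will also share many sketch entries, which is what makes sketches comparable across cells.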

Slide 9

Slide 9 text

K-MERS SEPARATE CELL TYPES AND K-MER ABUNDANCE ONLY ADDS NOISE
[Figure panels: k-mer presence/absence; Binarized gene expression; Gene expression; k-mer abundance]
Parameters: Observe 1/1000 k-mers; Ksize: 27; Molecule: cDNA; Nearest neighbor graphs, n_neighbors=5
Data: Mouse Bladder, SmartSeq2 single-cell RNA-seq
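A nearest neighbor graph on k-mer presence/absence can be sketched as follows. This is an assumption-laden illustration: each cell is represented as a set of k-mers (or k-mer hashes), similarity is taken to be the Jaccard index, and the function names are hypothetical. The slide does not state which similarity measure or graph library was used.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_neighbor_graph(sketches, n_neighbors=5):
    """For each cell, list its n_neighbors most similar other cells.

    sketches: dict mapping cell name -> set of k-mers (presence/absence).
    Ties are broken alphabetically by cell name so the output is stable.
    """
    graph = {}
    for cell, sketch in sketches.items():
        scored = [
            (jaccard(sketch, other), name)
            for name, other in sketches.items()
            if name != cell
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[cell] = [name for _, name in scored[:n_neighbors]]
    return graph
```

This brute-force version compares every pair of cells, which is fine for a toy example but would need an approximate index for large datasets.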

Slide 10

Slide 10 text

SPECIES SIGNAL CURRENTLY OUTWEIGHS CELL TYPE SIGNAL
Parameters: k=7 amino acids; Observe 4096 k-mers per cell; Nearest neighbor graph with n_neighbors=5
Data: Hematopoiesis/Kidney, SmartSeq2/QUARTZ-seq of single cells in:
- Mouse Kidney
- Zebrafish Kidney Marrow (primary site of hematopoiesis)
- Human Bone Marrow (primary site of hematopoiesis)

Slide 11

Slide 11 text

CONCLUSIONS + NEXT STEPS
Conclusions
• A few thousand k-mers are sufficient to group similar cells within species
• Abundance of k-mers only adds noise
• Protein k-mers loosely identify cell types across closely related species
Next Steps
• Remove species-specific k-mers with a term frequency-inverse document frequency (TF-IDF)-like method
• Model “cell type” and “species” as latent spaces using machine (deep?) learning methods
• Use 3-frame translation for stranded RNA-seq data
• Compare protein k-mer nearest neighbor graphs to graphs built on gene counts of 1:1 orthologues
Orthogonal validation
• Compare cell type enriched protein k-mers to cell type specific peptides from bottom-up proteomics
Want to check it out? Contributions welcome! https://github.com/czbiohub/kmer-hashing
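The TF-IDF-like next step is only named on the slide, so the following is one possible reading rather than the planned method. It treats each species as a "document", counts in how many species each k-mer occurs (the document frequency), and drops k-mers seen in fewer than `min_species` species. All function names and the `min_species` threshold are hypothetical.

```python
import math
from collections import Counter


def species_document_frequency(kmers_by_species):
    """Count, for each k-mer, the number of species it appears in."""
    df = Counter()
    for kmers in kmers_by_species.values():
        df.update(set(kmers))
    return df


def inverse_document_frequency(kmer, df, n_species):
    """IDF-style weight: 0 when a k-mer occurs in every species,
    largest when it occurs in only one. Returns None for unseen k-mers."""
    if df[kmer] == 0:
        return None
    return math.log(n_species / df[kmer])


def remove_species_specific(cell_kmers, kmers_by_species, min_species=2):
    """Keep only the k-mers of a cell that occur in at least min_species species."""
    df = species_document_frequency(kmers_by_species)
    return {k for k in cell_kmers if df[k] >= min_species}
```

A hard threshold is the simplest choice; the IDF-style weight could instead be used to down-weight, rather than remove, k-mers that are close to species-specific.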

Slide 12

Slide 12 text

ACKNOWLEDGEMENTS
- Angela Pisco
- James Webber
- Josh Batson
- Ashley Maynard
- Lincoln Harris
- Spyros Darmanis
- Paolo Carnevali (CZI)
- Giana Cirolia
- Phoenix Logan
- Shayan Hosseinzadeh
- Kalani Ratnasiri
- Aaron McGeever
- Greg Huber
Outside of Biohub (@github)
- Sourmash (https://github.com/dib-lab/sourmash/): C. Titus Brown (@ctb), Luiz Irber (@luizirber), Camille Scott (@camillescott)
- Nextflow (https://github.com/nextflow-io/nextflow/): Paolo Di Tommaso (@pditommaso), @KochTobi, Rad Suchecki (@rsuchecki)
- Bamnostic (https://github.com/betteridiot/bamnostic/): Marcus D Sherman (@betteridiot)
Data Sciences