CHESS 3 paper review

Geo Pertea
March 23, 2024

Slides for the journal club review presentation:
"CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure"
Varabyou, A., Sommer, M.J., Erdogdu, B. et al.
https://doi.org/10.1186/s13059-023-03088-4


Transcript

  1. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Ales Varabyou et al. [bioRxiv, December 22, 2022] https://doi.org/10.1101/2022.12.21.521274
  2. CHESS 3. Comprehensive Human Expressed SequenceS: a project that aimed to augment the existing human reference annotation data with novel transcripts discovered through rigorous transcript assembly and curation of large RNA-Seq datasets (Genome Biology (2018) 19:208). Released: CHESS 3.0.1
  3. CHESS 3 - Results

    | Original source | Count | Coding | Description |
    |---|---|---|---|
    | BestRefSeq | 64971 | 50356 | Automated computational gene-prediction method by NCBI |
    | StringTie | 33783 | 20618 | Gene predictions generated by the StringTie software |
    | HAVANA | 27165 | 21098 | annotations generated by the HAVANA group at the Sanger Institute |
    | Gnomon | 15408 | 6981 | Gene predictions generated by the NCBI's Gnomon pipeline |
    | Curated Genomic | 14999 | 578 | NCBI curation |
    | cmsearch | 1134 | 0 | annotations from searching for conserved ncRNAs structures with InfeRNAl |
    | tRNAscan-SE | 431 | 0 | Predicted tRNA genes identified by the tRNAscan-SE software |
    | FANTOM | 318 | 98 | CHESS 2 transcripts with corroborating evidence from the FANTOM project |
    | ENSEMBL | 117 | 97 | Gene predictions generated by the Ensembl project |
    | RefSeq | 37 | 13 | The NCBI RefSeq annotation set |
  4. Transcript reconstruction from read mappings to the genome (annotation agnostic): RNA sequencing (paired reads), spliced alignment (HISAT2), transcript assembly (StringTie). [Diagram: reads from isoform1 and isoform2, built from exon1, exon2, and exon3, are aligned across GT/AG splice sites on the genome sequence; the isoform1 and isoform2 alignments are then assembled back into the two isoforms.]
  5. Transcript assembly often proposes too many potentially "novel" transfrags. There is transcriptional noise, and some may be due to alignment artifacts. [Diagram: RNA-seq read alignments and assembled transcripts at gene A and gene B on the genome sequence, compared against the reference annotation (ref tx A1, ref tx A2, ref tx B1); some alignments are marked "spurious alignments?".]
  6. CHESS 3 workflow. "The alignments are either directly assembled with StringTie2 or aggregated by tissue with TieBrush" (Figure 4). [Figure annotation: 26,335,900.]
  7. Pooled samples transcript assembly: fusion transfrags masking potentially valid transfrags? [Diagram: pooled RNA-seq read alignments over gene A and gene B on the genome sequence; assembled transfrags tf1, tf2, and tf3; reference annotation transcripts ref tx1, ref tx2, and ref tx3, with ref tx1 and ref tx2 shown as masked.]
  8. CHESS 3 workflow. "Only transcripts that were assembled directly from the individual samples or from “TieBrush”-ed files, were retained": 987,244. [Diagram labels: TieBrush assembly branch, Raw assembly branch; 26M, 3M.]
  9. CHESS 3 workflow. [Diagram labels: TieBrush assembly branch, Raw assembly branch; Raw transfrags, TieBrush transfrags; 3M, 26M, 1M.]
  10. Part 2. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Ales Varabyou et al. [bioRxiv, December 22, 2022] https://doi.org/10.1101/2022.12.21.521274
  11. Comprehensive Human Expressed SequenceS. Adjusting the existing human reference annotation data with novel transcripts discovered through transcript assembly and curation of almost 10,000 GTEx samples. GTEx v7(?) (~9,800 samples):
    • 53 body sites (including multiple brain regions)
    • samples grouped into 31 tissue types
    • 133 low-quality samples discarded
  12. CHESS3 stringency
    ❏ Compared to other gene catalogs (and even CHESS1?), CHESS3 claims to be stricter, so as to avoid including genes and transcripts that are not functional.
    ❏ The authors suggest that many of these non-functional transcripts could be the result of transcriptional noise (referencing Ales' paper from 2020 on the issue of transcriptional noise in RNA-seq).
    ❏ The authors opine that other reference annotation databases would be improved if they also excluded such non-functional transcripts, instead of including them and tagging them as such (many users do not check those tags in GFF/GTF annotation).
  13. CHESS 3 - Composition. GTEx assembled & filtered transfrags, plus:
    • MANE transcripts (~19,300); MANE = Matched Annotation from NCBI and EMBL-EBI
    • GENCODE ∩ RefSeq transcripts assembled in at least 1 GTEx sample
    • RefSeq: VDJ segments, Y RNA, tRNAs, rRNAs and nucleolar RNAs
  14. CHESS 3 - Results

    | Original source | Count | Coding | Description |
    |---|---|---|---|
    | BestRefSeq | 64971 | 50356 | Automated computational gene-prediction method by NCBI |
    | StringTie | 33783 | 20618 | Gene predictions generated by the StringTie software |
    | HAVANA | 27165 | 21098 | annotations generated by the HAVANA group at the Sanger Institute |
    | Gnomon | 15408 | 6981 | Gene predictions generated by the NCBI's Gnomon pipeline (RefSeq) |
    | Curated Genomic | 14999 | 578 | NCBI curation |
    | cmsearch | 1134 | 0 | annotations from searching for conserved ncRNAs structures with InfeRNAl |
    | tRNAscan-SE | 431 | 0 | Predicted tRNA genes identified by the tRNAscan-SE software |
    | FANTOM | 318 | 98 | CHESS 2 transcripts with corroborating evidence from the FANTOM project |
    | ENSEMBL | 117 | 97 | Gene predictions generated by the Ensembl project |
    | RefSeq | 37 | 13 | The NCBI RefSeq annotation set |
  15. CHESS 3 workflow, Raw assembly branch. [Workflow diagram labels: 9814 GTEx Samples; HISAT2 (align); 9814 GTEx alignments; StringTie (assemble); TieBrush (aggregate); 31 Tissue alignments; Coverage Filter; StringTie (assemble); Intron classifier; Reference annotation; gffcompare (merge & annotate); Raw transfrags; TieBrush transfrags; 3M, 26M, 1M.]
  16. CHESS 3 workflow, TieBrush assembly branch and Raw assembly branch. [Workflow diagram labels: 9814 GTEx Samples; HISAT2 (align); 9814 GTEx alignments; StringTie (assemble); TieBrush (aggregate); 31 Tissue alignments; Coverage Filter; StringTie (assemble); Intron classifier; Reference annotation; gffcompare (merge & annotate); Raw transfrags; TieBrush transfrags; 3M, 26M, 1M.]
  17. Part 3. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Ales Varabyou et al. [bioRxiv, December 22, 2022] https://doi.org/10.1101/2022.12.21.521274
  18. CHESS 3 workflow, TieBrush assembly branch and Raw assembly branch, ending in the "union set". [Workflow diagram labels: 9814 GTEx Samples; HISAT2 (align); 9814 GTEx alignments; StringTie (assemble); TieBrush (aggregate); 31 Tissue alignments; Coverage Filter; StringTie (assemble); Reference annotation; gffcompare (merge & annotate); Raw transfrags; TieBrush transfrags; 3M, 26M, 1M; union set.]
  19. Fusion/over-extended TieBrush transfrags masking potentially valid Raw transfrags. [Diagram: TieBrush/pooled read alignments over gene A and gene B on the genome sequence; TieBrush assemblies tb1, tb2, and tb3; Raw (per sample) assemblies tx1, tx2, and tx3, with tx1 and tx2 shown as masked.]
  20. CHESS 3 - estimated losses due to TieBrush intersection "filter" vs. the "refcomb" dataset: Gencode43 + RefSeq (267,801 transcripts). [Diagram labels: Raw transfrags, TieBrush transfrags; 3M, 26M, 1M; TieBrush 2M; 170,637 refcomb matches; 91,753 refcomb matches; 89,676 refcomb matches.] Counts listed by gffcompare transfrag class code (http://ccb.jhu.edu/software/stringtie/gffcompare.shtml#transfrag-class-codes):
    • 704,022 intronic
    • 182,414 j (jx sharing)
    • 96,375 k (contain ref, ~34k distinct)
    • 94,176 m,n (retained intron)
    • 47,985 y (ref in their intron)
    • 42,442 e
    • 34,689 o
    • 30,168 c
    • 2,791 matches
    (104,731 overlap at least 2 genes)
  21. CHESS 3 - filtering of the "union set" (min-samples & TPM filtering):
    • transfrag in at least 10 samples ("reproducibility filter")
    • average TPM across samples >= 1.0
    • TPM >= 10% of highest TPM per locus (min isoform fraction)
    [Workflow diagram labels: Raw transfrags; TieBrush transfrags; 3M, 26M, 1M; 31 Tissue alignments; TieBrush (aggregate); Coverage Filter; StringTie (assemble); Intron classifier; Reference annotation; gffcompare (merge & annotate); StringTie (assemble); min-samples & TPM filtering.]
  22. CHESS 3 - the intron classifier (slide courtesy of Ales Varabyou). Features collected per intron:
    1. number of samples supporting it
    2. coverage by uniquely mapped reads
    3. ratio of unique/total reads
    4. ratio of coverage at the donor site to the coverage at the first intronic position
    5. ratio of coverage at the acceptor site to that at the last intronic position
    6. maximum bases by which a single read extends upstream/downstream
    7. number of transfrags sharing it
    8. ratio of coverage by uniquely mapped reads between forward and reverse strand
    9. whether or not the junction was present in the guide (ref) annotation
  23. CHESS 3 - isoform filtering by intron classification (min-isoform-set filter over the "valid" intron set). Within each locus, an algorithm was devised to select a minimum set of isoforms that would explain all the "valid"-labeled introns:
    1. Sort all isoforms by decreasing cumulative TPM (computed across all samples in each group)
    2. For each transcript T (traversed from most abundant to least abundant), if T contains "valid" introns that have not been seen in more abundant transcripts, then:
      • add T to the set of transcripts to retain
      • remove all of T's introns from the list of "available" valid introns
    At the end of the run, the algorithm produced a concise set of isoforms that covered all splice junctions labeled as valid by the machine learning model.
  24. CHESS 3 - further filtering of the "union set": Intron Classifier; union set (1M); min-sample & TPM filtering; min-isoform-set filter; 160,482 transfrags. In each locus, the algorithm selects the minimum set of isoforms that explains all the "valid"-labeled introns. [Diagram labels: "valid" intron set; novel introns.]
  25. CHESS 3 - further filtering of the "union set": Intron Classifier; union set (1M); min-sample & TPM filtering; min-isoform-set filter; 160,482 transfrags; ORFanage (assign most likely CDS to novel isoforms); 97,661 in protein coding genes. In each locus, the algorithm selects the minimum set of isoforms that explains all the "valid"-labeled introns. [Diagram labels: "valid" intron set; novel introns.]
  26. CHESS 3 - additional supporting data. Novel coding isoforms: high AlphaFold2 (ColabFold) pLDDT scores (predicted Local Distance Difference Test). [Figure labels: novel isoform; MANE isoform; pLDDT 49.3; pLDDT: 74.5.]
  27. CHESS 3 - Composition. According to the "Adding transcripts from known sources" section in the supplement:
    • all MANE transcripts (~19,300? most of them already present in the assembled transfrags); MANE = Matched Annotation from NCBI and EMBL-EBI
    • transcripts assembled in at least 1 GTEx sample and also present in GENCODE ∩ RefSeq OR in RefSeq Select
    • RefSeq: VDJ segments, Y RNA
    • RefSeq tRNAs, tmRNAs, rRNAs and nucleolar RNAs
    ❏ "In particular, we do not include in the primary database any gene or transcript that appears to be non-functional" [?]
    ❏ "Other catalogs include thousands of these transcripts, sometimes tagged to indicate they are non-functional, but sometimes merely included without any such warning."
  28. CHESS 3 - Results

    | Original source | Count | Coding | Description |
    |---|---|---|---|
    | BestRefSeq | 64971 | 50356 | Automated computational gene-prediction method by NCBI |
    | StringTie | 33783 | 20618 | Gene predictions generated by the StringTie software |
    | HAVANA | 27165 | 21098 | annotations generated by the HAVANA group at the Sanger Institute |
    | Gnomon | 15408 | 6981 | Gene predictions generated by the NCBI's Gnomon pipeline (RefSeq) |
    | Curated Genomic | 14999 | 578 | NCBI curation |
    | cmsearch | 1134 | 0 | annotations from searching for conserved ncRNAs structures with InfeRNAl |
    | tRNAscan-SE | 431 | 0 | Predicted tRNA genes identified by the tRNAscan-SE software |
    | FANTOM | 318 | 98 | CHESS 2 transcripts with corroborating evidence from the FANTOM project |
    | ENSEMBL | 117 | 97 | Gene predictions generated by the Ensembl project |
    | RefSeq | 37 | 13 | The NCBI RefSeq annotation set |
  29. CHESS 3 - Gnomon pseudogenes, why?

    | ID | Location | Exons | Length | Gene name | Gene type |
    |---|---|---|---|---|---|
    | CHS.2781.1 | chr1:143,450,855-144,302,564(-) | 1 | 851,710 | LOC124904398 | pseudogene |
    | CHS.170074.1 | chr8:7,613,724-8,065,191(+) | 1 | 451,468 | LOC124901865 | pseudogene |
    | CHS.173173.1 | chr16_KI270728v1_random:393,422-672,099(-) | 1 | 278,678 | LOC102723945 | pseudogene |
    | CHS.167003.1 | chr1:73,129,148-73,342,515(-) | 1 | 213,368 | LOC105378800 | pseudogene |
    | CHS.169606.1 | chr6:160,773,929-160,898,708(+) | 1 | 124,780 | LOC107986665 | pseudogene |
  30. CHESS 3 - addressing past criticism?
    ➢ 86% of the predictions seem to overlap transposons and other repeat elements, known to often lead to false coding predictions
    ➢ over half are homologous to each other and also to a set of proteins or protein domains which are poorly annotated or even recently withdrawn from GenBank
    ➢ the coding predictions show poor evolutionary conservation (PhyloCSF scores were indistinguishable from non-coding RNAs)
    ➢ only 4 predictions have some mass spectrometry evidence, and that was rather weak and inconclusive
    ➢ critics claim that DE findings (between tissues) cannot be used as evidence for validation of coding status (though they might indicate RNA functionality)
    Did they address these issues in the CHESS3 preprint? Not really.
  31. Limitations
    ❏ lack of direct experimental validation of any of the novel isoforms proposed (RT-PCR, targeted proteomics, mass spectrometry?)
    ❏ lack of novel isoform confirmation using the available long-read RNA-seq datasets
    ❏ using RNA-seq alignments to propose new transcript structures in order to augment existing annotation is not a new endeavor: both Ensembl and RefSeq (Gnomon pipeline) are also adding such novel transcript structures to their annotation datasets
    ❏ insufficiently justified (seemingly arbitrary) decisions made during some of the filtering/merging steps (thresholds; assembly of TieBrush alignments)
    ❏ most of the crucial steps in the proposed workflow are not available as open source, thus the results are not verifiable
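
The two filters described on slides 21 and 23 are simple enough to sketch in a few lines. The Python below is only an illustration of the logic as stated on those slides, not the CHESS 3 code: the `Transfrag` record, its field names, and both function names are hypothetical, and it assumes that the "highest TPM per locus" is taken over the average TPM of all transfrags in the locus before any filtering.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfrag:
    """Hypothetical record for one assembled transfrag (not a CHESS 3 structure)."""
    tid: str
    locus: str
    introns: frozenset   # set of (donor, acceptor) coordinate pairs
    n_samples: int       # number of samples in which it was assembled
    mean_tpm: float      # average TPM across samples
    total_tpm: float     # cumulative TPM across samples


def expression_filter(transfrags, min_samples=10, min_mean_tpm=1.0, min_fraction=0.10):
    """Slide 21: reproducibility, average-TPM and min-isoform-fraction thresholds."""
    # Assumption: the per-locus maximum is computed before any transfrag is dropped.
    top = defaultdict(float)
    for t in transfrags:
        top[t.locus] = max(top[t.locus], t.mean_tpm)
    return [
        t for t in transfrags
        if t.n_samples >= min_samples
        and t.mean_tpm >= min_mean_tpm
        and t.mean_tpm >= min_fraction * top[t.locus]
    ]


def min_isoform_set(transfrags, valid_introns):
    """Slide 23: greedy selection of isoforms covering the 'valid' introns of one locus."""
    available = set(valid_introns)
    kept = []
    # Traverse from most abundant to least abundant (cumulative TPM).
    for t in sorted(transfrags, key=lambda t: t.total_tpm, reverse=True):
        if t.introns & available:    # T brings at least one not-yet-covered valid intron
            kept.append(t)
            available -= t.introns   # all of T's introns are no longer "available"
    return kept


# Toy locus with made-up coordinates and expression values.
a = Transfrag("a", "L1", frozenset({(100, 200), (300, 400)}), 50, 20.0, 1000.0)
b = Transfrag("b", "L1", frozenset({(100, 200)}), 40, 5.0, 200.0)
c = Transfrag("c", "L1", frozenset({(100, 200), (300, 450)}), 12, 3.0, 36.0)
d = Transfrag("d", "L1", frozenset({(100, 250)}), 3, 8.0, 24.0)

passed = expression_filter([a, b, c, d])
kept = min_isoform_set(passed, {(100, 200), (300, 400), (300, 450)})
print([t.tid for t in passed])  # ['a', 'b', 'c']  (d is seen in only 3 samples)
print([t.tid for t in kept])    # ['a', 'c']       (b adds no uncovered valid intron)
```

In this toy locus, transfrag `b` survives the expression thresholds but is dropped by the greedy step because its only intron is already explained by the more abundant `a`. That is the behavior slide 23 describes: an isoform is retained only if it contributes a "valid" intron not yet covered.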