Work Log 3/22

Liang Bo Wang
March 22, 2013

Transcript

  1. Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine

    Work Log 03/22: Adapter, RNA-Seq, Tuxedo protocol
  2. Demultiplexing: Illumina CASAVA 1.8

    •  Input: *.bcl
    •  Output: paired-end (zipped) FASTQ R1.fastq and FASTQ R2.fastq for each sample, grouped by project (Project Mr.T: Sample A, Sample B; Project Mrs.A: Sample Y, Sample Z)

    Quality Check: FastQC v0.10.1

    •  HTML *.html
    •  Figs *.png
    •  Report *.txt

    QC & Trimming: cutadapt, seqtk, …

    •  (cleaned) R2.fastq
    •  Free from adapters, PCR primers, …, contamination
  3. Illustration of different constructs and the reads produced.

    •  I = inserts
    •  R = single-end reads
    •  R1, R2 = paired-end reads
    •  LR = read length
    •  LI = insert length

    A)  LI ≥ LR
    B)  LI < LR
    C)  LI ≥ 2LR
    D)  LR < LI < 2LR
    E)  LI < LR
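The length conditions above can be checked mechanically. The following is a minimal sketch, assuming that cases A and B refer to single-end reads and cases C to E to paired-end reads (the slide text does not say this explicitly). The function names are hypothetical. The comments on adapter read-through and mate overlap follow from the definitions of insert and read length, not from the slide itself.

```python
def single_end_case(insert_len: int, read_len: int) -> str:
    """Classify a single-end construct (assumed to be cases A and B)."""
    # A: LI >= LR, the read stays inside the insert.
    # B: LI <  LR, the read runs past the insert into the adapter.
    return "A" if insert_len >= read_len else "B"


def paired_end_case(insert_len: int, read_len: int) -> str:
    """Classify a paired-end construct (assumed to be cases C, D and E)."""
    if insert_len >= 2 * read_len:
        return "C"  # LI >= 2LR: R1 and R2 do not overlap
    if insert_len > read_len:
        return "D"  # LR < LI < 2LR: R1 and R2 overlap inside the insert
    if insert_len < read_len:
        return "E"  # LI < LR: both reads run into the adapter
    # LI == LR is not one of the listed cases.
    return "LI = LR (not listed on the slide)"


if __name__ == "__main__":
    for li in (250, 150, 80):
        print(li, paired_end_case(li, read_len=100))
```

Cases B and E are the ones where adapter trimming (for example with cutadapt) matters most, since the read contains adapter sequence.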
  4. File organization by Illumina demultiplexing

    Grouped under YYMMDD_<machine name>_XXXX_FCID/

    •  Project_<Prj name>/
        •  Sample_<Smpl name>/
            •  <Smpl name>_<Index>_<Lane No>_R1_001.fastq.gz
            •  <Smpl name>_<Index>_<Lane No>_R2_001.fastq.gz
            •  SampleSheet.csv

    Example:

    •  Project_A/
        •  Sample_control/
        •  Sample_cond1/
        •  Sample_cond2/
            •  cond2_AATTCC_L005_R1_001.fastq.gz
            •  cond2_AATTCC_L005_R2_001.fastq.gz
            •  SampleSheet.csv
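The file naming pattern above can be taken apart with a regular expression. This is a minimal sketch, not part of CASAVA: the function name and the field name `chunk` (for the trailing `001`) are hypothetical, and the pattern assumes the index field contains no underscore.

```python
import re
from typing import Dict

# <Smpl name>_<Index>_<Lane No>_R1_001.fastq.gz, as listed on the slide.
FASTQ_NAME = re.compile(
    r"(?P<sample>.+)_(?P<index>[^_]+)_(?P<lane>L\d+)"
    r"_R(?P<read>[12])_(?P<chunk>\d+)\.fastq\.gz"
)


def parse_fastq_name(filename: str) -> Dict[str, str]:
    """Split a demultiplexed FASTQ file name into its fields."""
    match = FASTQ_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected FASTQ file name: {filename!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_fastq_name("cond2_AATTCC_L005_R1_001.fastq.gz"))
```

Pairing R1 with R2 then amounts to grouping files whose fields agree on everything except `read`.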
  5. RNA-Seq Tuxedo protocol = TopHat + Cufflinks + CummeRbund
  6. (cleaned) R1.fastq, (cleaned) R2.fastq

    •  Genome Alignment: TopHat v2.0.8
    •  Transcript Assembly: Cufflinks v2.0.2 → FPKM
    •  Transcript Assembly: HTSeq v0.5.4p1 → Read counts by gene
  7. Genome Alignment: TopHat v2.0.8

    Input:

    •  (cleaned) R1.fastq, (cleaned) R2.fastq
    •  iGenome DB
        •  Whole Genome Sequence, ex. hg19.fa
        •  Bowtie2 prebuilt FW index: genome.*
        •  Annotation: gene.gtf

    For a read / fragment (single / paired-end):

    •  Genome Alignment: Bowtie v2.1.0, e.g. chr15: 314,159 - 320,000 on the sequence of chr15
    •  Outcome: mapped, or not mapped → try spliced
    •  Splicing (known / novel): exon1 and exon2 on the sequence of chr15 (chr15: 271,828 - 28,000; chr15: 317,000 - 34,000)

    Output:

    •  accepted_hits.bam
    •  unmapped.bam
    •  deletions.bed
    •  insertions.bed
    •  junctions.bed
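The inputs above map onto a TopHat command line roughly as follows. This is a minimal sketch that only builds the argument list and does not run anything. The helper name and all paths are hypothetical. `-p` (threads), `-o` (output directory) and `-G` (GTF annotation) are TopHat 2 options, but the exact settings used for this work log are not given in the slides.

```python
import shlex
from typing import List, Optional


def build_tophat_command(
    index_base: str,
    reads_1: str,
    reads_2: Optional[str] = None,
    gtf: Optional[str] = None,
    out_dir: str = "tophat_out",
    threads: int = 1,
) -> List[str]:
    """Build (but do not run) a TopHat command for single or paired-end reads."""
    cmd = ["tophat", "-p", str(threads), "-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]  # known annotation, e.g. from an iGenome bundle
    cmd += [index_base, reads_1]  # Bowtie2 index prefix, then R1
    if reads_2 is not None:
        cmd.append(reads_2)  # R2 for paired-end data
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_tophat_command(
        "Bowtie2Index/genome", "cleaned_R1.fastq", "cleaned_R2.fastq",
        gtf="gene.gtf", threads=8,
    )
    print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.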
  8. Running time for TopHat

    Project  Sample  Taxonomy  Ref      # of reads (millions)  TopHat running time
    Lin      A-D     mouse     mm10     36.9                   4h 39m
             A-W     mouse     mm10     33.3                   2h 32m
             D14G    chicken   galGal4  49.0                   3h 1m
             StageX  chicken   galGal4  35.5                   2h 17m
    Chou     No94    human     hg19     61.1                   6h 13m
             No95    human     hg19     66.8                   7h 5m
             No97    human     hg19     68.1                   7h 12m
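To compare the runs above on a common scale, the running times can be converted to throughput in million reads per hour. This is a small sketch using only the numbers from the table. The helper names are hypothetical, and the slides do not state the hardware or thread count behind these timings.

```python
import re

# (sample, million reads, TopHat running time) as listed in the table.
RUNS = [
    ("A-D", 36.9, "4h 39m"),
    ("A-W", 33.3, "2h 32m"),
    ("D14G", 49.0, "3h 1m"),
    ("StageX", 35.5, "2h 17m"),
    ("No94", 61.1, "6h 13m"),
    ("No95", 66.8, "7h 5m"),
    ("No97", 68.1, "7h 12m"),
]


def hours(duration: str) -> float:
    """Convert a string such as '4h 39m' to hours."""
    match = re.fullmatch(r"(\d+)h (\d+)m", duration)
    if match is None:
        raise ValueError(f"unexpected duration: {duration!r}")
    return int(match.group(1)) + int(match.group(2)) / 60


def reads_per_hour(million_reads: float, duration: str) -> float:
    """Throughput in million reads per hour."""
    return million_reads / hours(duration)


if __name__ == "__main__":
    for sample, million_reads, duration in RUNS:
        rate = reads_per_hour(million_reads, duration)
        print(f"{sample:8s} {rate:5.1f} M reads/h")
```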
  9. Isoforms and expression rate by FPKM

    •  if a gene is alternatively spliced
    •  algorithm based on graph theory (b)
    •  isoforms (alt. splice transcripts) (c)
    •  reads map to different sets of exons in the same region
    •  a read maps to a portion of an exon
    •  expression rate by FPKM
    •  FPKM = fragments per kilobase of transcript per million mapped reads
    •  some exons can be shared
    •  expression of each isoform (transcript) is not straightforward (d)
    •  statistical inference (e)
    •  gene expression (FPKM) = the direct sum of the expression of all its isoforms
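The FPKM definition above can be written out as a short calculation. This is a minimal sketch with hypothetical function names. It assumes the fragment count per isoform is already known, whereas the slide notes that assigning fragments to isoforms that share exons needs statistical inference, which is not reproduced here.

```python
from typing import Iterable


def fpkm(fragments: float, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and totals must be positive")
    # fragments / (length / 1e3) / (total / 1e6)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def gene_fpkm(isoform_fpkms: Iterable[float]) -> float:
    """Gene-level FPKM as the sum over the gene's isoforms."""
    return sum(isoform_fpkms)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    iso_a = fpkm(100, 2000, 10_000_000)
    iso_b = fpkm(30, 1500, 10_000_000)
    print(iso_a, iso_b, gene_fpkm([iso_a, iso_b]))
```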
  10. Terminology - GTF

    •  GTF = Gene Transfer Format
    •  Gene ID, Transcript ID, feature (exon, intron, …), position
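The fields named above can be read from a GTF line as follows. This is a minimal sketch: GTF lines have nine tab-separated columns, with `gene_id` and `transcript_id` stored in the last (attributes) column. The record type, function name and example line are hypothetical and not taken from a real annotation file.

```python
import re
from typing import Dict, NamedTuple


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str      # e.g. exon
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


# Attributes look like: gene_id "X"; transcript_id "Y";
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line: str) -> GtfRecord:
    """Parse one non-comment GTF line into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    attributes = dict(_ATTRIBUTE.findall(fields[8]))
    return GtfRecord(
        fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7], attributes,
    )


# Made-up example line, for illustration only.
EXAMPLE = 'chr15\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE1"; transcript_id "GENE1.1";'

if __name__ == "__main__":
    record = parse_gtf_line(EXAMPLE)
    print(record.feature, record.start, record.end, record.attributes)
```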