Slide 1

Slide 1 text

GENETIC ANALYSIS of Complex Human Diseases Examining Gene Expression and Methylation with Next-Gen Sequencing Stephen Turner, Ph.D. Bioinformatics Core Director bioinformatics.virginia.edu University of Virginia

Slide 2

Slide 2 text

Gene expression pre-2008
- PCR
- Microarrays

Slide 3

Slide 3 text


Slide 4

Slide 4 text

Advantages of RNA-Seq
- No reference necessary
- Low background (no cross-hybridization)
- Unlimited dynamic range (FC 9000, Science 320:1344)
- Direct counting (microarrays: indirect, via hybridization)
- Can characterize full transcriptome
  - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc.)
  - Differential gene expression
  - Differential coding output
  - Differential TSS usage
  - Differential isoform expression

Slide 5

Slide 5 text

Isoform level data

Slide 6

Slide 6 text

Isoform level data

Slide 7

Slide 7 text

Differential splicing & TSS use

Slide 8

Slide 8 text

Is it accurate?
- Marioni et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008 18:1509.

Slide 9

Slide 9 text

RNA-Seq Challenges
- Library construction
  - Size selection (messenger, small)
  - Strand specificity?
- Bioinformatic challenges
  - Spliced alignment
  - Transcript deconvolution
- Statistical challenges
  - Highly variable abundance
  - Sample size: never, ever, plan n=1
  - Normalization (RPKM)
    - More reads from longer transcripts, higher sequencing depth
    - Want to compare features of different lengths
    - Want to compare conditions with different total sequence depth
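The RPKM normalization listed under the statistical challenges can be sketched as follows. This is a minimal illustration, assuming the usual definition (reads per kilobase of transcript per million mapped reads); the function name and the toy numbers are hypothetical and not taken from the slides.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Toy numbers (made up): two transcripts with the same raw read count
# but different lengths, sequenced to the same depth.
short_tx = rpkm(read_count=500, transcript_length_bp=1_000, total_mapped_reads=20_000_000)
long_tx = rpkm(read_count=500, transcript_length_bp=5_000, total_mapped_reads=20_000_000)
print(short_tx, long_tx)
```

Dividing by length makes features of different lengths comparable, and dividing by total mapped reads makes libraries of different depth comparable, which are the two goals named on the slide.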

Slide 10

Slide 10 text

RNA-Seq Overview
Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)
[Figure: mRNA (AAAAA) from each sample, library (TTTTT), and the resulting FASTQ reads]
    @HWUSI-EAS100R:6:73:941:1973#0/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATA
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*((((***+))%%%++)(%%%%).1***-
    @HWUSI-EAS100R:6:73:941:1973#0/1
    CATCGACGTAGATCGACTACATGAACTGCTCG
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*+(*+!+(*!+*(((***!%%%%!%%(+-

Slide 11

Slide 11 text

Common question #1: Depth
- Question: how much sequence do I need?
- Answer: it's complicated.
- Oversimplified answer: 20-50 million PE reads / sample (mouse/human).
- Depends on:
  - Size & complexity of transcriptome
  - Application: differential gene expression, transcript discovery
  - Tissue type, RNA quality, library preparation
  - Sequencing type: length, paired-end vs single-end, etc.
- Find a publication in your field with similar goals.
- Good news: ¼ HiSeq lane usually sufficient.

Slide 12

Slide 12 text

Common question #2: Sample Size
- Question: How many samples should I sequence?
- Oversimplified answer: At least 3 biological replicates per condition.
- Depends on:
  - Sequencing depth
  - Application
  - Goals (prioritization, biomarker discovery, etc.)
  - Effect size, desired power, statistical significance
- Find a publication with similar goals

Slide 13

Slide 13 text

Common question #3: Workflow
- How do I analyze the data?
- No standards!
  - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others!
  - Spliced aligners: STAR, Rum, Tophat, Tophat2-Bowtie1, Tophat2-Bowtie2, GSNAP, MANY others.
  - Reference builds & annotations: UCSC, Entrez, Ensembl
  - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, TransABySS
  - Quantification: Cufflinks, RSEM, eXpress, MISO, etc.
  - Differential expression: Cuffdiff, Cuffdiff2, DegSeq, DESeq, EdgeR, Myrna
- Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in a pipeline!
- Benchmarks
  - Microarray: Spike-ins (Irizarry)
  - RNA-Seq: ???, simulation, ???

Slide 14

Slide 14 text

Common question #3: Workflow
- Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993
- Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782

Slide 15

Slide 15 text

Phases of NGS Analysis
- Primary
  - Conversion of raw machine signal into sequence and qualities
- Secondary
  - Alignment of reads to reference genome or transcriptome
  - or de novo assembly of reads into contigs
- Tertiary
  - SNP discovery/genotyping
  - Peak discovery/quantification (ChIP, MeDIP)
  - Transcript assembly/quantification (RNA-seq)
- Quaternary
  - Differential expression
  - Enrichment, pathways, correlation, clustering, visualization, etc.
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529

Slide 16

Slide 16 text

Primary Analysis: Get FASTQ file
    @HWUSI-EAS100R:6:73:941:1973#0/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATA
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*((((***+))%%%++)(%%%%).1***-
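A FASTQ record like the one above is four lines: an "@" header, the sequence, a "+" separator, and one quality character per base. The following is a minimal reader sketch, assuming unwrapped four-line records; the function name `read_fastq` is hypothetical, and real pipelines would normally rely on an established parser instead.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


# The record shown on the slide, as an in-memory file.
example = io.StringIO(
    "@HWUSI-EAS100R:6:73:941:1973#0/1\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATA\n"
    "+HWUSI-EAS100R:6:73:941:1973#0/1\n"
    "!''*((((***+))%%%++)(%%%%).1***-\n"
)
records = list(read_fastq(example))
print(records[0][0], len(records[0][1]))
```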

Slide 17

Slide 17 text

"Phred-scaled" base qualities
    # $p is probability base is erroneous
    $Q = -10 * log($p) / log(10);       # Phred Q
    $q = chr(($Q<=40? $Q : 40) + 33);   # FASTQ quality character
    $Q = ord($q) - 33;                  # 33 offset

    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
    ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
    ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    |                         |    |        |                              |                     |
    33                        59   64       73                             104                   126

    S - Sanger        Phred+33,  41 values (0, 40)
    I - Illumina 1.3  Phred+64,  41 values (0, 40)
    X - Solexa        Solexa+64, 68 values (-5, 62)
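The same Phred arithmetic can be sketched in Python. The function names are hypothetical; the 33 and 64 offsets and the cap at 40 come from the slide. One deliberate difference from the Perl snippet: this sketch rounds Q to the nearest integer before encoding, whereas Perl's chr() truncates.

```python
import math


def error_prob_to_phred(p):
    """Phred quality Q = -10 * log10(p), where p is the base-call error probability."""
    return -10 * math.log10(p)


def phred_to_char(q, offset=33, cap=40):
    """Encode a Phred score as a FASTQ quality character (Sanger offset 33 by default)."""
    return chr(min(int(round(q)), cap) + offset)


def char_to_phred(c, offset=33):
    """Decode a FASTQ quality character back to a Phred score."""
    return ord(c) - offset


def phred_to_error_prob(q):
    """Invert the Phred scale: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


q30 = error_prob_to_phred(0.001)   # a 1-in-1000 error rate
print(q30, phred_to_char(q30), char_to_phred("I"), char_to_phred("h", offset=64))
```

Decoding with the wrong offset shifts every score by 31, which is why the encoding (S, I or X in the chart) needs to be known before filtering on quality.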

Slide 18

Slide 18 text

Secondary analysis
- Alignment back to the reference
  - Computationally demanding; can't use BLAST
  - Many algorithms (Maq, BWA, bowtie, bowtie2, Mosaik, NovoAlign, SOAP2, SSAHA, …)
  - http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
  - Sensitivity to sequencing errors, polymorphisms, indels, rearrangements
  - Tradeoffs in time vs. memory vs. performance

Slide 19

Slide 19 text

RNA-Seq Workflow 1: Differential Gene Expression

Slide 20

Slide 20 text

RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage

Slide 21

Slide 21 text

Download data & software
- Public data from GEO, e.g. GSE32038
  - http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE32038
  - Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012; 7:562.
- Sequence, annotation, indexes (Ensembl)
  - iGenomes: http://tophat.cbcb.umd.edu/igenomes.html
  - Genes: /Annotation/Genes/genes.gtf
  - Indexes: /Sequence/BowtieIndex/genome.*
- Software:
  - Samtools: http://samtools.sourceforge.net/
  - FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  - Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
  - Tophat: http://tophat.cbcb.umd.edu/
  - HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
  - R: http://www.r-project.org/
  - DESeq2: http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  - Cufflinks: http://cufflinks.cbcb.umd.edu/
  - cummeRbund: http://compbio.mit.edu/cummeRbund/

Slide 22

Slide 22 text

Do some quality assessment
Software:
- Picard: picard.sourceforge.net
- FastQC: bioinformatics.bbsrc.ac.uk/projects/fastqc
- RSeQC: code.google.com/p/rseqc
- FastX Toolkit: hannonlab.cshl.edu/fastx_toolkit
- R/ShortRead: bioconductor.org/packages/bioc/html/ShortRead.html

Slide 23

Slide 23 text

Mapping across splice junctions: tophat
1. Map reads to genome
2. Collect unmappable reads
3. Break reads into segments. Small segments often align independently. If segments align 100 bp to several kb apart, infer a splice junction.
    tophat -G genes.gtf -o C1_R1_tophatout /path/bowtieindex/genome C1_R1_1.fq C1_R1_2.fq
    tophat -G genes.gtf -o C1_R2_tophatout /path/bowtieindex/genome C1_R2_1.fq C1_R2_2.fq
    tophat -G genes.gtf -o C1_R3_tophatout /path/bowtieindex/genome C1_R3_1.fq C1_R3_2.fq
    tophat -G genes.gtf -o C2_R1_tophatout /path/bowtieindex/genome C2_R1_1.fq C2_R1_2.fq
    tophat -G genes.gtf -o C2_R2_tophatout /path/bowtieindex/genome C2_R2_1.fq C2_R2_2.fq
    tophat -G genes.gtf -o C2_R3_tophatout /path/bowtieindex/genome C2_R3_1.fq C2_R3_2.fq
Arguments, in order: gene annotation (-G), output directory (-o), Bowtie index, read 1, read 2.
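The six tophat invocations differ only in the sample name, so they can be generated rather than typed. The sketch below only builds and prints the command lines; the sample naming scheme and paths are the ones shown on the slide, while the helper name `tophat_command` is hypothetical.

```python
conditions = ["C1", "C2"]
replicates = ["R1", "R2", "R3"]


def tophat_command(sample, gtf="genes.gtf", index="/path/bowtieindex/genome"):
    """Build the tophat argument list for one paired-end sample."""
    return [
        "tophat",
        "-G", gtf,                    # gene annotation
        "-o", f"{sample}_tophatout",  # output directory
        index,                        # Bowtie index prefix
        f"{sample}_1.fq",             # read 1
        f"{sample}_2.fq",             # read 2
    ]


commands = [tophat_command(f"{c}_{r}") for c in conditions for r in replicates]
for cmd in commands:
    # To actually run a command, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```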

Slide 24

Slide 24 text

Workflow 1: Differential Gene Expression
Step 1: Align to Genome
Step 2: Count Reads overlapping genes
Step 3: Differential expression

Slide 25

Slide 25 text

Workflow 1: Differential Gene Expression
Step 1: Align to Genome
Step 2: Count Reads overlapping genes
Step 3: Differential expression
Software: HTSeq http://www-huber.embl.de/users/anders/HTSeq
Run htseq-count on each of the alignments:
    htseq-count
First convert binary .bam file to text .sam file using samtools:
    samtools view accepted_hits.bam > C1_R1.sam
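Conceptually, step 2 assigns each aligned read to the gene it overlaps. The toy sketch below illustrates that idea only; it is not how htseq-count is implemented, and the gene coordinates, reads, and the rule for reads that hit several genes are all simplifying assumptions made up for this example.

```python
from collections import Counter

# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}

# Hypothetical aligned reads: (chromosome, start, end).
reads = [("chr1", 120, 170), ("chr1", 460, 490), ("chr1", 600, 650), ("chr2", 10, 60)]


def count_reads(reads, genes):
    """Count reads per gene; reads hitting zero or several genes are tallied separately."""
    counts = Counter({gene: 0 for gene in genes})
    for chrom, start, end in reads:
        hits = [
            gene
            for gene, (gchrom, gstart, gend) in genes.items()
            if gchrom == chrom and start <= gend and end >= gstart
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["ambiguous"] += 1
        else:
            counts["no_feature"] += 1
    return counts


counts = count_reads(reads, genes)
print(dict(counts))
```

The resulting per-gene table, one per sample, is the kind of input that step 3 consumes.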

Slide 26

Slide 26 text

Workflow 1: Differential Gene Expression
Step 1: Align to Genome
Step 2: Count Reads overlapping genes
Step 3: Differential expression
Software: DESeq2 http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
    > library(DESeq2)
    > sampleFiles <- c("C1_R1.counts.txt", "C1_R2.counts.txt", "C1_R3.counts.txt",
                       "C2_R1.counts.txt", "C2_R2.counts.txt", "C2_R3.counts.txt")
    > sampleCondition <- factor(substr(sampleFiles, 1, 2))
    > sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
    > sampleTable
            sampleName         fileName condition
    1 C1_R1.counts.txt C1_R1.counts.txt        C1
    2 C1_R2.counts.txt C1_R2.counts.txt        C1
    3 C1_R3.counts.txt C1_R3.counts.txt        C1
    4 C2_R1.counts.txt C2_R1.counts.txt        C2
    5 C2_R2.counts.txt C2_R2.counts.txt        C2
    6 C2_R3.counts.txt C2_R3.counts.txt        C2
    dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
    dds <- DESeq(dds)
    res <- results(dds)
    # Sort by adjusted p-value (the DESeq2 results column is named padj)
    res <- res[order(res$padj), ]
    plotMA(dds)
    ...
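The adjusted p-values used to rank genes control the false discovery rate. As a rough illustration of what such an adjustment does, here is a sketch of the Benjamini-Hochberg procedure, which is the default adjustment in DESeq2; the function name and the example p-values are made up, and this is not DESeq2's own code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values for four genes
padj = benjamini_hochberg(raw)
print(padj)
```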

Slide 27

Slide 27 text

RNA-Seq Workflow 2: Differential Isoform Expression, Exon Usage

Slide 28

Slide 28 text

A change in fragment count for a gene does not necessarily mean a change in expression.
Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2012): 46-53.

Slide 29

Slide 29 text

Workflow 2a: Assemble transcripts for each sample: cufflinks
- Cufflinks
  - Identifies mutually incompatible fragments
  - Identifies the minimal set of transcripts that explains all the fragments
    cufflinks -o C1_R1_cufflinksout C1_R1_tophatout/accepted_hits.bam
    cufflinks -o C1_R2_cufflinksout C1_R2_tophatout/accepted_hits.bam
    cufflinks -o C1_R3_cufflinksout C1_R3_tophatout/accepted_hits.bam
    cufflinks -o C2_R1_cufflinksout C2_R1_tophatout/accepted_hits.bam
    cufflinks -o C2_R2_cufflinksout C2_R2_tophatout/accepted_hits.bam
    cufflinks -o C2_R3_cufflinksout C2_R3_tophatout/accepted_hits.bam
Arguments, in order: output directory (-o), path to alignment.

Slide 30

Slide 30 text

Merge assemblies: cuffmerge
- Merge assemblies to create a single merged transcriptome annotation.
  - Option 1: Pool alignments and assemble all at once.
    - Computationally demanding
    - Assembler faces a complex mixture of isoforms -> more error
  - Option 2: Assemble alignments individually, then merge the resulting assemblies
    - Cuffmerge: meta-assembler using parsimony.
    - Genes with low expression -> insufficient coverage for reconstruction.
    - Merging often recovers the complete gene.
    - Newly discovered isoforms are integrated with known ones (RABT).

Slide 31

Slide 31 text

Merge assemblies: cuffmerge
- Create a "manifest" with the location of all assemblies (assemblies.txt):
    ./C1_R1_cufflinksout/transcripts.gtf
    ./C1_R2_cufflinksout/transcripts.gtf
    ./C1_R3_cufflinksout/transcripts.gtf
    ./C2_R1_cufflinksout/transcripts.gtf
    ./C2_R2_cufflinksout/transcripts.gtf
    ./C2_R3_cufflinksout/transcripts.gtf
- Run Cuffmerge on the assemblies using RABT:
    cuffmerge -g /path/to/annotation/genes.gtf -s /path/to/refgenome/genome.fa assemblies.txt
Arguments, in order: reference gene annotation (-g), reference genome sequence (-s), manifest from above.
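The manifest is just a text file with one assembly path per line, so it can be produced from the sample names. A small sketch, assuming the directory naming used on the slides; the helper name `manifest_text` is hypothetical, and the file is printed rather than written.

```python
samples = [f"{cond}_{rep}" for cond in ("C1", "C2") for rep in ("R1", "R2", "R3")]


def manifest_text(samples):
    """One transcripts.gtf path per line, in the layout cufflinks -o produced above."""
    lines = [f"./{sample}_cufflinksout/transcripts.gtf" for sample in samples]
    return "\n".join(lines) + "\n"


manifest = manifest_text(samples)
# To save it: open("assemblies.txt", "w").write(manifest)
print(manifest, end="")
```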

Slide 32

Slide 32 text

Differential expression: cuffdiff
- Identify differentially expressed genes & transcripts
    cuffdiff -o cuffdiff_out -b genome.fa -u merged.gtf \
        ./C1_R1_tophatout/accepted_hits.bam,./C1_R2_tophatout/accepted_hits.bam,./C1_R3_tophatout/accepted_hits.bam \
        ./C2_R1_tophatout/accepted_hits.bam,./C2_R2_tophatout/accepted_hits.bam,./C2_R3_tophatout/accepted_hits.bam
Arguments: output directory (-o), reference sequence (-b), merged assembly (merged.gtf), then the location of the alignments as one comma-separated list per condition.
[Figure: example locus with 1 gene, 2 TSS, 2 CDS, 3 isoforms]

Slide 33

Slide 33 text

Downstream analysis & visualization

Slide 34

Slide 34 text

Visualization with cummeRbund
- Install cummeRbund:
  - Install from BioConductor:
        source("http://bioconductor.org/biocLite.R")
        biocLite("cummeRbund")
  - Or download and install the latest version from http://compbio.mit.edu/cummeRbund/
- Load the package:
        library(cummeRbund)
- Read in the data:
        cuff <- readCufflinks("/path/to/cuffdiff/output")

Slide 35

Slide 35 text

Visualization with cummeRbund
    csDensity(genes(cuff))
    csBoxplot(genes(cuff))
    csScatter(genes(cuff), "C1", "C2", smooth=T)
    csVolcano(genes(cuff), "C1", "C2")

Slide 36

Slide 36 text

Visualization with cummeRbund
    mygene2 <- getGene(cuff, "Rala")
    expressionBarplot(mygene2)
    expressionBarplot(isoforms(mygene2))

Slide 37

Slide 37 text

DEXSeq
- Differential Gene Expression (e.g. DESeq)
- Differential Isoform Expression (e.g. Cufflinks)
- Differential Exon Usage
- What's different about DEXSeq?
  - Doesn't do full transcript assembly (Cufflinks)
  - Doesn't count fragments mapping to genes (DESeq)
  - Avoids assembly and looks for differences in reads mapping to individual exons.
  - Uses counts (negative binomial)

Slide 38

Slide 38 text

Using DEXSeq: Installation
- Installation & load:
        source("http://bioconductor.org/biocLite.R")
        biocLite("DEXSeq")
        library(DEXSeq)
- The installation comes bundled with useful Python scripts in the python_scripts directory of the library. Put these in your PATH.

Slide 39

Slide 39 text

Using DEXSeq: Data preparation
- First, prepare the "flattened" GFF (script comes with DEXSeq; arguments: reference annotation, flattened output):
    dexseq_prepare_annotation.py input.gtf exons.gff
- Create sorted SAM files:
    samtools view C1_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R1.sam
    samtools view C1_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R2.sam
    samtools view C1_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R3.sam
    samtools view C2_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R1.sam
    samtools view C2_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R2.sam
    samtools view C2_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R3.sam
- Count reads overlapping counting bins (script comes with DEXSeq; arguments: flattened annotation, alignment, output file):
    dexseq_count.py -p no -s no exons.gff C1_R1.sam C1_R1.counts.txt
    dexseq_count.py -p no -s no exons.gff C1_R2.sam C1_R2.counts.txt
    dexseq_count.py -p no -s no exons.gff C1_R3.sam C1_R3.counts.txt
    dexseq_count.py -p no -s no exons.gff C2_R1.sam C2_R1.counts.txt
    dexseq_count.py -p no -s no exons.gff C2_R2.sam C2_R2.counts.txt
    dexseq_count.py -p no -s no exons.gff C2_R3.sam C2_R3.counts.txt

Slide 40

Slide 40 text

Using DEXSeq: Data import
- The pasilla package vignette gives detailed instructions on how to do this: http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html
    > design <- data.frame(condition=c(rep("C1",3), rep("C2",3)), replicate=rep(1:3,2))
    > rownames(design) <- with(design, paste(condition, "_R", replicate, sep=""))
    > design
          condition replicate
    C1_R1        C1         1
    C1_R2        C1         2
    C1_R3        C1         3
    C2_R1        C2         1
    C2_R2        C2         2
    C2_R3        C2         3
    > countfiles <- file.path(".", paste(rownames(design), ".counts.txt", sep=""))
    > countfiles
    [1] "./C1_R1.counts.txt" "./C1_R2.counts.txt" "./C1_R3.counts.txt" "./C2_R1.counts.txt"
    [5] "./C2_R2.counts.txt" "./C2_R3.counts.txt"
    > flattenedfile <- "/Users/sdt5z/smb/u/genomes/dexseq/exons_dme_ens_bdgp525.gff"
    > exons <- read.HTSeqCounts(countfiles=countfiles, design=design, flattenedfile=flattenedfile)
    > sampleNames(exons) <- rownames(design)

Slide 41

Slide 41 text

Using DEXSeq: Data Analysis
    # Estimate size factors (normalizes for sequencing depth)
    exons <- estimateSizeFactors(exons)
    sizeFactors(exons)
    # Estimate dispersion
    exons <- estimateDispersions(exons)
    exons <- fitDispersionFunction(exons)
    # Test for Differential Exon Usage
    exons <- testForDEU(exons)
    exons <- estimatelog2FoldChanges(exons)
    result <- DEUresultTable(exons)
    # How many are significant at FDR 0.0001?
    table(result$padjust<0.0001)
    # M vs A plot
    plot(result$meanBase, result[, "log2fold(C2/C1)"], log="x")
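The size-factor step normalizes for sequencing depth. A sketch of the median-of-ratios idea behind it follows, written from the published DESeq method rather than from the package source; the function name and the toy count matrix are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of rows (one per exon or gene), each a list of per-sample counts."""
    n_samples = len(counts[0])
    # Use only features observed in every sample, so the logarithm is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean across samples, per feature.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy matrix (made up): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(toy_counts)
print(factors)
```

Dividing each sample's counts by its factor puts the samples on a common scale; taking the median makes the estimate robust to a few strongly changing features.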

Slide 42

Slide 42 text

Using DEXSeq: visualization
    plotDEXSeq(exons, "FBgn0030362", cex.axis=1.2, cex=1.3, lwd=2, legend=T, displayTranscripts=T)

Slide 43

Slide 43 text

Using DEXSeq: HTML Report
    library(biomaRt)
    mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")
    listAttributes(mart)[1:25,]
    attributes <- c("ensembl_gene_id", "external_gene_id", "description")
    DEXSeqHTML(exons, FDR=0.0001, mart=mart, filter="ensembl_gene_id", attributes=attributes)

Slide 44

Slide 44 text

Downstream analysis
- Now you have a list of:
  - Genes
  - Isoforms (genes)
  - Exons (genes)
- How to place in functional context?
- Pathway / functional analysis!
  - Gene Ontology over-representation
  - Gene Set Enrichment Analysis
  - Signaling Pathway Impact Analysis
  - Many more…
- Resources:
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529

Slide 45

Slide 45 text

Workflow Management: Galaxy
- http://usegalaxy.org

Slide 46

Slide 46 text

Workflow Management: Taverna
- Taverna: http://www.taverna.org.uk/
- TavernaPBS: http://sourceforge.net/projects/tavernapbs/

Slide 47

Slide 47 text

Further Reading
- RNA-Seq:
  - Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77.
  - Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17.
  - Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8.
  - Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98.
  - Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998.
  - Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63.
- Bowtie/Tophat:
  - Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25.
  - Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11.
- Cufflinks:
  - Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17), 2325-9.
  - Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578.
  - Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5.
- DEXSeq:
  - Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html.
  - Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.

Slide 48

Slide 48 text

Online Community Forum and Discussion
- SEQanswers:
  - http://SEQanswers.com
  - Format: Forum
  - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
- BioStar:
  - http://biostar.stackexchange.com
  - Format: Q&A
  - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011).
- Other Bioinformatics Resources: stephenturner.us/p/edu

Slide 49

Slide 49 text

DNA Methylation: Importance
- Occurs most frequently at CpG sites
- High methylation at promoters ≈ silencing
- Methylation perturbed in cancer
- Methylation associated with many other complex diseases: neural, autoimmune, response to env.
- Mapping DNA methylation -> new disease genes & drug targets.

Slide 50

Slide 50 text

DNA Methylation: Challenges
- Dynamic and tissue-specific
- DNA -> collection of cells which vary in 5meC patterns -> 5meC pattern is complex.
- Further, uneven distribution of CpG targets
- Multiple classes of methods:
  - Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
  - Affinity enrichment, count-based: Assay methylation level across many genomic loci.

Slide 51

Slide 51 text

DNA Methylation: Mapping
DNA methylation:
  BS-Seq          Whole-genome bisulfite sequencing
  RRBS-Seq        Reduced representation bisulfite sequencing
  BC-Seq          Bisulfite capture sequencing
  BSPP            Bisulfite specific padlock probes
  Methyl-Seq      Restriction enzyme based methyl-seq
  MSCC            Methyl sensitive cut counting
  HELP-Seq        HpaII fragment enrichment by ligation PCR
  MCA-Seq         Methylated CpG island amplification
  MeDIP-Seq       Methylated DNA immunoprecipitation
  MBP-Seq         Methyl-binding protein sequencing
  MethylCap-seq   Methylated DNA capture by affinity purification
  MIRA-Seq        Methylated CpG island recovery assay
Gene expression:
  RNA-Seq         High-throughput cDNA sequencing

Slide 52

Slide 52 text

Methylation: REs and PCR
- Restriction enzyme digest
  - Isoschizomers HpaII and MspI both recognize the same sequence: 5'-CCGG-3'
  - MspI digests regardless of methylation
  - HpaII only digests at unmethylated sites
- PCR -> gel electrophoresis -> Southern blot
- Pros: Highly sensitive
- Cons: Low-throughput, high false positive rate because of incomplete digestion (for reasons other than methylation).
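The HpaII/MspI comparison can be illustrated with a toy in-silico digest. This is a sketch under simplifying assumptions: a single strand, methylation recorded per CCGG site rather than per cytosine, and both enzymes cutting after the first C of the site; the sequence, helper names, and methylation state are made up.

```python
def find_sites(seq, site="CCGG"):
    """Start positions (0-based) of every occurrence of the recognition site."""
    return [i for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site]


def digest(seq, cut_sites, cut_offset=1):
    """Cut `seq` after `cut_offset` bases of each listed site and return the fragments."""
    fragments, previous = [], 0
    for start in cut_sites:
        fragments.append(seq[previous:start + cut_offset])
        previous = start + cut_offset
    fragments.append(seq[previous:])
    return fragments


seq = "AACCGGTTTTCCGGAA"   # hypothetical fragment with two CCGG sites
sites = find_sites(seq)
methylated = {10}          # assume the second site is methylated

mspi_fragments = digest(seq, sites)                                       # cuts every site
hpaii_fragments = digest(seq, [s for s in sites if s not in methylated])  # skips methylated sites
print(mspi_fragments, hpaii_fragments)
```

A fragment that appears in the MspI digest but stays uncut in the HpaII digest points to methylation at that site, which is the readout the PCR and Southern blot steps detect.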

Slide 53

Slide 53 text

Bisulfite sequencing
- Sodium bisulfite converts unmethylated (but not methylated) C's into U's.
- This introduces a methylation-specific "SNP".
- RRBS: library enriched for CpG-dense regions by digesting with MspI.
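The methylation-specific "SNP" can be sketched in a few lines. Assumptions for this toy example: only the forward strand is modeled, conversion is complete, and converted U's are read as T after amplification; the sequences and function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Sequence as read after bisulfite treatment: unmethylated C -> T, methylated C kept."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """For every reference C, report True if the read still shows C (methylated)."""
    return {i: read[i] == "C" for i, base in enumerate(reference) if base == "C"}


reference = "ACGTCCGGA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
print(read, calls)
```

Comparing the converted read to the reference is what turns a methylation state into a base difference that ordinary sequencing can detect.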

Slide 54

Slide 54 text

MeDIP-Seq
- MeDIP-Seq = Methylated DNA immunoprecipitation
- Uses an antibody against 5-methylcytosine to retrieve methylated fragments from sonicated DNA.
- Enrichment method = count number of reads

Slide 55

Slide 55 text

MethylCap-Seq
- Uses methyl-binding domain (MBD) protein to obtain DNA with similar methylation levels.
- Also a counting method.

Slide 56

Slide 56 text

Methylation: Accuracy
- Bock et al. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- MeDIP, MethylCap, RRBS largely concordant with Illumina Infinium assay

Slide 57

Slide 57 text

Methylation methods: coverage
- Coverage varies among different methods

Slide 58

Slide 58 text

Methylation: Features & Biases

Slide 59

Slide 59 text

Methylation: Bioinformatics Resources
Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG Island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG Island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal

Slide 60

Slide 60 text

Methylation: Further Reading
- Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6.
- Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056.
- Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6.
- Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105.
- Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York.
- Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203.
- Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598

Slide 61

Slide 61 text

Thank you
Web: bioinformatics.virginia.edu
E-mail: [email protected]
Blog: www.GettingGeneticsDone.com
Twitter: twitter.com/genetics_blog