Examining Gene Expression and Methylation with Next-Gen Sequencing

GENETIC ANALYSIS of Complex Human Diseases
Examining Gene Expression and Methylation with Next-Gen Sequencing
Stephen Turner, Ph.D.
Bioinformatics Core Director, University of Virginia
bioinformatics.virginia.edu

Gene expression pre-2008
- PCR
- Microarrays


Advantages of RNA-Seq
- No reference necessary
- Low background (no cross-hybridization)
- Unlimited dynamic range (fold change 9000; Science 320:1344)
- Direct counting (microarrays: indirect, via hybridization)
- Can characterize full transcriptome
  - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc.)
  - Differential gene expression
  - Differential coding output
  - Differential TSS usage
  - Differential isoform expression

Isoform level data

Differential splicing & TSS use

Is it accurate?
- Marioni et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008 18:1509.

RNA-Seq Challenges
- Library construction
  - Size selection (messenger, small)
  - Strand specificity?
- Bioinformatic challenges
  - Spliced alignment
  - Transcript deconvolution
- Statistical challenges
  - Highly variable abundance
  - Sample size: never, ever, plan n=1
  - Normalization (RPKM)
    - More reads from longer transcripts, higher sequencing depth
    - Want to compare features of different lengths
    - Want to compare conditions with different total sequence depth
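The RPKM normalization named above (reads per kilobase of transcript per million mapped reads) can be sketched in a few lines. This is only an illustration of the arithmetic; the function name and the example numbers are made up for this sketch and do not come from any particular tool.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Dividing by length (in kb) corrects for longer transcripts yielding
    more reads; dividing by depth (in millions) corrects for libraries
    sequenced to different depths.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical numbers: a 2 kb and a 4 kb transcript in a 20M-read library.
short_transcript = rpkm(500, 2000, 20_000_000)
long_transcript = rpkm(1000, 4000, 20_000_000)
print(short_transcript, long_transcript)
```

The longer transcript collects twice the reads but gets the same RPKM, which is the point of the length correction.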

RNA-Seq Overview
[Diagram: samples of interest, Condition 1 (normal colon) and Condition 2 (colon tumor) -> poly-A mRNA -> library -> sequenced reads]
  @HWUSI-EAS100R:6:73:941:1973#0/1
  GATTTGGGGTTCAAAGCAGTATCGATCAAATA
  +HWUSI-EAS100R:6:73:941:1973#0/1
  !''*((((***+))%%%++)(%%%%).1***-
  @HWUSI-EAS100R:6:73:941:1973#0/1
  CATCGACGTAGATCGACTACATGAACTGCTCG
  +HWUSI-EAS100R:6:73:941:1973#0/1
  !''*+(*+!+(*!+*(((***!%%%%!%%(+-

Common question #1: Depth
- Question: How much sequence do I need?
- Answer: It's complicated.
- Oversimplified answer: 20-50 million paired-end (PE) reads per sample (mouse/human).
- Depends on:
  - Size & complexity of transcriptome
  - Application: differential gene expression, transcript discovery
  - Tissue type, RNA quality, library preparation
  - Sequencing type: length, paired-end vs. single-end, etc.
- Find a publication in your field with similar goals.
- Good news: ¼ HiSeq lane is usually sufficient.

Common question #2: Sample Size
- Question: How many samples should I sequence?
- Oversimplified answer: At least 3 biological replicates per condition.
- Depends on:
  - Sequencing depth
  - Application
  - Goals (prioritization, biomarker discovery, etc.)
  - Effect size, desired power, statistical significance
- Find a publication with similar goals.

Common question #3: Workflow
- How do I analyze the data?
- No standards!
  - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others!
  - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others.
  - Reference builds & annotations: UCSC, Entrez, Ensembl
  - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, Trans-ABySS
  - Quantification: Cufflinks, RSEM, eXpress, MISO, etc.
  - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna
- Like the early microarray days: lots of excitement, lots of tools, little knowledge of how to integrate tools into a pipeline!
- Benchmarks
  - Microarray: Spike-ins (Irizarry)
  - RNA-Seq: ???, simulation, ???

Common question #3: Workflow
- Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993
- Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782

Phases of NGS Analysis
- Primary
  - Conversion of raw machine signal into sequence and qualities
- Secondary
  - Alignment of reads to reference genome or transcriptome
  - or de novo assembly of reads into contigs
- Tertiary
  - SNP discovery/genotyping
  - Peak discovery/quantification (ChIP, MeDIP)
  - Transcript assembly/quantification (RNA-seq)
- Quaternary
  - Differential expression
  - Enrichment, pathways, correlation, clustering, visualization, etc.
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529

Primary Analysis: Get FASTQ file
  @HWUSI-EAS100R:6:73:941:1973#0/1
  GATTTGGGGTTCAAAGCAGTATCGATCAAATA
  +HWUSI-EAS100R:6:73:941:1973#0/1
  !''*((((***+))%%%++)(%%%%).1***-
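A FASTQ record is four lines: an @ header, the sequence, a + separator, and one quality character per base. A minimal reader for that layout might look like the sketch below. It assumes unwrapped four-line records; the names `FastqRecord` and `parse_fastq` are invented for this illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of lines (four lines per record)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None:
            return                      # clean end of input
        if header == "":
            continue                    # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# The record shown on the slide.
example_lines = [
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATA",
    "+HWUSI-EAS100R:6:73:941:1973#0/1",
    "!''*((((***+))%%%++)(%%%%).1***-",
]
records = list(parse_fastq(example_lines))
print(records[0].name, len(records[0].sequence))
```

To read a real file, pass an open file handle in place of `example_lines`.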

"Phred-scaled" base qualities
  # $p is probability base is erroneous
  $Q = -10 * log($p) / log(10);        # Phred Q
  $q = chr(($Q<=40? $Q : 40) + 33);    # FASTQ quality character
  $Q = ord($q) - 33;                   # 33 offset

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
  33                        59   64       73                             104                   126

  S - Sanger        Phred+33,  41 values (0, 40)
  I - Illumina 1.3  Phred+64,  41 values (0, 40)
  X - Solexa        Solexa+64, 68 values (-5, 62)
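The same Phred arithmetic, sketched in Python for readers who do not use Perl. The function names are made up for this example. Unlike the Perl above, which truncates when converting to a character, this sketch rounds Q to the nearest integer first; that choice is an assumption of the sketch.

```python
import math


def error_prob_to_phred(p):
    """Phred score Q = -10 * log10(p), where p is the error probability."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Invert the Phred scale: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_to_char(q, offset=33, cap=40):
    """Encode a Phred score as a FASTQ quality character (Sanger: offset 33)."""
    return chr(min(int(round(q)), cap) + offset)


def char_to_phred(c, offset=33):
    """Decode a FASTQ quality character back to a Phred score."""
    return ord(c) - offset


# Decode the quality string from the example read (Sanger, Phred+33).
quality = "!''*((((***+))%%%++)(%%%%).1***-"
scores = [char_to_phred(c) for c in quality]
print(scores)
```

For Illumina 1.3 files, pass `offset=64` instead of the default 33.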

Secondary analysis
- Alignment back to the reference
  - Computationally demanding; can't use BLAST
  - Many algorithms (Maq, BWA, Bowtie, Bowtie2, Mosaik, NovoAlign, SOAP2, SSAHA, …)
  - http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
  - Sensitivity to sequencing errors, polymorphisms, indels, rearrangements
  - Tradeoffs in time vs. memory vs. performance

RNA-Seq Workflow 1: Differential Gene Expression

Isoform Expression, Exon Usage

Download data & software
- Public data from GEO, e.g. GSE32038
  - http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE32038
  - Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012; 7:562.
- Sequence, annotation, indexes (Ensembl)
  - iGenomes: http://tophat.cbcb.umd.edu/igenomes.html
  - Genes: /Annotation/Genes/genes.gtf
  - Indexes: /Sequence/BowtieIndex/genome.*
- Software:
  - Samtools: http://samtools.sourceforge.net/
  - FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  - Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
  - Tophat: http://tophat.cbcb.umd.edu/
  - HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
  - R: http://www.r-project.org/
  - DESeq2: http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  - Cufflinks: http://cufflinks.cbcb.umd.edu/
  - cummeRbund: http://compbio.mit.edu/cummeRbund/

Do some quality assessment
Software:
- Picard: picard.sourceforge.net
- FastQC: bioinformatics.bbsrc.ac.uk/projects/fastqc
- RSeQC: code.google.com/p/rseqc
- FastX Toolkit: hannonlab.cshl.edu/fastx_toolkit
- R/ShortRead: bioconductor.org/packages/bioc/html/ShortRead.html

Mapping across splice junctions: tophat
1. Map reads to genome
2. Collect unmappable reads
3. Break reads into segments. Small segments often align independently. If segments align 100 bp to several kb apart, infer a splice junction.

  tophat -G genes.gtf -o C1_R1_tophatout /path/bowtieindex/genome C1_R1_1.fq C1_R1_2.fq
  tophat -G genes.gtf -o C1_R2_tophatout /path/bowtieindex/genome C1_R2_1.fq C1_R2_2.fq
  tophat -G genes.gtf -o C1_R3_tophatout /path/bowtieindex/genome C1_R3_1.fq C1_R3_2.fq
  tophat -G genes.gtf -o C2_R1_tophatout /path/bowtieindex/genome C2_R1_1.fq C2_R1_2.fq
  tophat -G genes.gtf -o C2_R2_tophatout /path/bowtieindex/genome C2_R2_1.fq C2_R2_2.fq
  tophat -G genes.gtf -o C2_R3_tophatout /path/bowtieindex/genome C2_R3_1.fq C2_R3_2.fq

Arguments, in order: -G gene annotation, -o output directory, Bowtie index, read 1, read 2.
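With two conditions and three replicates, the six tophat invocations differ only in the sample name, so they can be generated rather than typed. The sketch below only builds and prints the command lines, using the same flags and placeholder paths as the slide; the helper name `tophat_commands` is invented for this example, and nothing is executed.

```python
def tophat_commands(conditions=("C1", "C2"), replicates=(1, 2, 3),
                    annotation="genes.gtf",
                    index="/path/bowtieindex/genome"):
    """Build one tophat argument list per sample (paired-end reads)."""
    commands = []
    for condition in conditions:
        for replicate in replicates:
            sample = f"{condition}_R{replicate}"
            commands.append([
                "tophat",
                "-G", annotation,                # gene annotation
                "-o", f"{sample}_tophatout",     # output directory
                index,                           # Bowtie index prefix
                f"{sample}_1.fq",                # read 1
                f"{sample}_2.fq",                # read 2
            ])
    return commands


commands = tophat_commands()
for command in commands:
    print(" ".join(command))
```

Each argument list could be handed to `subprocess.run` once the paths point at real files.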

Workflow 1: Differential Gene Expression
- Step 1: Align to genome
- Step 2: Count reads overlapping genes
- Step 3: Differential expression

Workflow 1, Step 2: Count reads overlapping genes
Software: HTSeq http://www-huber.embl.de/users/anders/HTSeq
First convert the binary .bam file to a text .sam file using samtools:
  samtools view accepted_hits.bam > C1_R1.sam
Then run htseq-count on each of the alignments:
  htseq-count <sam_file> <gtf_file>
htseq-count writes its counts to standard output; redirect them to one file per sample (e.g. C1_R1.counts.txt).
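To make "count reads overlapping genes" concrete, here is a toy version of the idea. It is not HTSeq's implementation: it ignores strand, splicing and read pairing, uses a linear scan, and the `__no_feature` / `__ambiguous` labels are simply names chosen for this sketch. Coordinates are assumed 1-based and inclusive, as in GTF.

```python
def count_reads_per_gene(reads, genes):
    """Count reads that overlap exactly one gene.

    reads: list of (chrom, start, end)
    genes: dict of gene name -> (chrom, start, end)
    """
    counts = {name: 0 for name in genes}
    counts["__no_feature"] = 0   # read overlaps no gene
    counts["__ambiguous"] = 0    # read overlaps more than one gene
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical genes and reads, for illustration only.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 140),   # inside geneA only
    ("chr1", 190, 195),   # in the geneA/geneB overlap
    ("chr1", 250, 280),   # inside geneB only
    ("chr2", 10, 40),     # no annotated gene
]
counts = count_reads_per_gene(reads, genes)
print(counts)
```

Discarding ambiguous reads rather than splitting them keeps the result a table of whole-number counts, which count-based differential expression methods expect.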

Workflow 1, Step 3: Differential expression
Software: DESeq2 http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  > library(DESeq2)
  > sampleFiles <- c("C1_R1.counts.txt", "C1_R2.counts.txt", "C1_R3.counts.txt",
                     "C2_R1.counts.txt", "C2_R2.counts.txt", "C2_R3.counts.txt")
  > sampleCondition <- factor(substr(sampleFiles, 1, 2))
  > sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
  > sampleTable
          sampleName         fileName condition
  1 C1_R1.counts.txt C1_R1.counts.txt        C1
  2 C1_R2.counts.txt C1_R2.counts.txt        C1
  3 C1_R3.counts.txt C1_R3.counts.txt        C1
  4 C2_R1.counts.txt C2_R1.counts.txt        C2
  5 C2_R2.counts.txt C2_R2.counts.txt        C2
  6 C2_R3.counts.txt C2_R3.counts.txt        C2
  > dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
  > dds <- DESeq(dds)
  > res <- results(dds)
  > # Sort by adjusted p-value (the FDR column in DESeq2 results is named padj)
  > res <- res[order(res$padj), ]
  > plotMA(dds)
  ...
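The adjusted p-values used to rank genes come from a false discovery rate correction; DESeq2 applies the Benjamini-Hochberg procedure by default. A plain-Python sketch of that adjustment follows. It is an illustration of the procedure, not DESeq2's code (which also applies independent filtering), and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
print(padj)
```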

Isoform Expression, Exon Usage

A change in fragment count for a gene does not necessarily equal a change in expression.
Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2012): 46-53.

Workflow 2a: Assemble transcripts for each sample: cufflinks
- Cufflinks
  - Identifies mutually incompatible fragments
  - Identifies the minimal set of transcripts that explains all the fragments.

  cufflinks -o C1_R1_cufflinksout C1_R1_tophatout/accepted_hits.bam
  cufflinks -o C1_R2_cufflinksout C1_R2_tophatout/accepted_hits.bam
  cufflinks -o C1_R3_cufflinksout C1_R3_tophatout/accepted_hits.bam
  cufflinks -o C2_R1_cufflinksout C2_R1_tophatout/accepted_hits.bam
  cufflinks -o C2_R2_cufflinksout C2_R2_tophatout/accepted_hits.bam
  cufflinks -o C2_R3_cufflinksout C2_R3_tophatout/accepted_hits.bam

Arguments, in order: -o output directory, path to alignment.

Merge assemblies: cuffmerge
- Merge assemblies to create a single merged transcriptome annotation.
  - Option 1: Pool alignments and assemble all at once.
    - Computationally demanding
    - Assembler will be faced with a complex mixture of isoforms -> more error
  - Option 2: Assemble alignments individually, then merge the resulting assemblies
    - Cuffmerge: meta-assembler using parsimony.
    - Genes with low expression -> insufficient coverage for reconstruction.
    - Merging often recovers the complete gene.
    - Newly discovered isoforms are integrated with known ones (RABT).

Merge assemblies: cuffmerge
- Create a "manifest" listing the location of all assemblies (assemblies.txt):
  ./C1_R1_cufflinksout/transcripts.gtf
  ./C1_R2_cufflinksout/transcripts.gtf
  ./C1_R3_cufflinksout/transcripts.gtf
  ./C2_R1_cufflinksout/transcripts.gtf
  ./C2_R2_cufflinksout/transcripts.gtf
  ./C2_R3_cufflinksout/transcripts.gtf
- Run Cuffmerge on the assemblies using RABT:
  cuffmerge -g /path/to/annotation/genes.gtf -s /path/to/refgenome/genome.fa assemblies.txt

Arguments, in order: -g reference gene annotation, -s reference genome sequence, manifest from above.
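The manifest is just a text file with one transcripts.gtf path per line, so it can be written by a short script instead of by hand. A sketch, assuming the directory naming used in the cufflinks step; the helper names `manifest_lines` and `write_manifest` are made up for this example.

```python
from pathlib import Path


def manifest_lines(conditions=("C1", "C2"), replicates=(1, 2, 3)):
    """One path per sample, matching the cufflinks output directories."""
    return [
        f"./{condition}_R{replicate}_cufflinksout/transcripts.gtf"
        for condition in conditions
        for replicate in replicates
    ]


def write_manifest(path="assemblies.txt"):
    """Write the manifest file that is passed to cuffmerge."""
    Path(path).write_text("\n".join(manifest_lines()) + "\n")


for line in manifest_lines():
    print(line)
```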

Differential expression: cuffdiff
- Identify differentially expressed genes & transcripts

  cuffdiff -o cuffdiff_out -b genome.fa -u merged.gtf \
    ./C1_R1_tophatout/accepted_hits.bam,./C1_R2_tophatout/accepted_hits.bam,./C1_R3_tophatout/accepted_hits.bam \
    ./C2_R1_tophatout/accepted_hits.bam,./C2_R2_tophatout/accepted_hits.bam,./C2_R3_tophatout/accepted_hits.bam

Arguments, in order: -o output directory, -b reference sequence, merged assembly, location of alignments (replicates of one condition separated by commas, conditions separated by a space).
[Figure: example locus with 1 gene, 2 TSS, 2 CDS, 3 isoforms]

Downstream analysis & visualization

Visualization with cummeRbund
- Install cummeRbund:
  - Install from Bioconductor:
    source("http://bioconductor.org/biocLite.R")
    biocLite("cummeRbund")
  - Or download and install the latest version from http://compbio.mit.edu/cummeRbund/
- Load the package
    library(cummeRbund)
- Read in the data
    cuff <- readCufflinks("/path/to/cuffdiff/output")

Visualization with cummeRbund
  csDensity(genes(cuff))
  csBoxplot(genes(cuff))
  csScatter(genes(cuff), "C1", "C2", smooth=T)
  csVolcano(genes(cuff), "C1", "C2")

Visualization with cummeRbund
  mygene2 <- getGene(cuff, "Rala")
  expressionBarplot(mygene2)
  expressionBarplot(isoforms(mygene2))

DEXSeq
- Differential gene expression (e.g. DESeq)
- Differential isoform expression (e.g. Cufflinks)
- Differential exon usage
- What's different about DEXSeq?
  - Doesn't do full transcript assembly (Cufflinks)
  - Doesn't count fragments mapping to genes (DESeq)
  - Avoids assembly and looks for differences in reads mapping to individual exons.
  - Uses counts (negative binomial)

Using DEXSeq: Installation
- Installation & load:
    source("http://bioconductor.org/biocLite.R")
    biocLite("DEXSeq")
    library(DEXSeq)
- The installation comes bundled with useful Python scripts in the python_scripts directory of the library. Put these in your PATH.

Using DEXSeq: Data preparation
- First, prepare a "flattened" GFF from the reference annotation (script comes with DEXSeq):
  dexseq_prepare_annotation.py input.gtf exons.gff
- Create sorted SAM files:
  samtools view C1_R1_tophatout/accepted_hits.bam | sort -k1,1 -k2,2n > C1_R1.sam
  samtools view C1_R2_tophatout/accepted_hits.bam | sort -k1,1 -k2,2n > C1_R2.sam
  samtools view C1_R3_tophatout/accepted_hits.bam | sort -k1,1 -k2,2n > C1_R3.sam
  samtools view C2_R1_tophatout/accepted_hits.bam | sort -k1,1 -k2,2n > C2_R1.sam
  samtools view C2_R2_tophatout/accepted_hits.bam | sort -k1,1 -k2,2n > C2_R2.sam
  samtools view C2_R3_tophatout/accepted_hits.bam | sort -k1,1 -k2,2n > C2_R3.sam
- Count reads overlapping counting bins (script comes with DEXSeq):
  dexseq_count.py -p no -s no exons.gff C1_R1.sam C1_R1.counts.txt
  dexseq_count.py -p no -s no exons.gff C1_R2.sam C1_R2.counts.txt
  dexseq_count.py -p no -s no exons.gff C1_R3.sam C1_R3.counts.txt
  dexseq_count.py -p no -s no exons.gff C2_R1.sam C2_R1.counts.txt
  dexseq_count.py -p no -s no exons.gff C2_R2.sam C2_R2.counts.txt
  dexseq_count.py -p no -s no exons.gff C2_R3.sam C2_R3.counts.txt

Arguments to dexseq_count.py, in order: options, flattened annotation, alignment, output file.

Using DEXSeq: Data import
- The pasilla package vignette gives detailed instructions on how to do this: http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html

  > design <- data.frame(condition=c(rep("C1",3), rep("C2",3)), replicate=rep(1:3,2))
  > rownames(design) <- with(design, paste(condition, "_R", replicate, sep=""))
  > design
        condition replicate
  C1_R1        C1         1
  C1_R2        C1         2
  C1_R3        C1         3
  C2_R1        C2         1
  C2_R2        C2         2
  C2_R3        C2         3
  > countfiles <- file.path(".", paste(rownames(design), ".counts.txt", sep=""))
  > countfiles
  [1] "./C1_R1.counts.txt" "./C1_R2.counts.txt" "./C1_R3.counts.txt" "./C2_R1.counts.txt"
  [5] "./C2_R2.counts.txt" "./C2_R3.counts.txt"
  > flattenedfile <- "/Users/sdt5z/smb/u/genomes/dexseq/exons_dme_ens_bdgp525.gff"
  > exons <- read.HTSeqCounts(countfiles=countfiles, design=design, flattenedfile=flattenedfile)
  > sampleNames(exons) <- rownames(design)

Using DEXSeq: Data Analysis
  # Estimate size factors (normalizes for sequencing depth)
  exons <- estimateSizeFactors(exons)
  sizeFactors(exons)

  # Estimate dispersion
  exons <- estimateDispersions(exons)
  exons <- fitDispersionFunction(exons)

  # Test for Differential Exon Usage
  exons <- testForDEU(exons)
  exons <- estimatelog2FoldChanges(exons)
  result <- DEUresultTable(exons)

  # How many are significant at FDR 0.0001?
  table(result$padjust < 0.0001)

  # M vs A plot
  plot(result$meanBase, result[, "log2fold(C2/C1)"], log="x")
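Size factors in the DESeq family are estimated with a median-of-ratios scheme: each sample's counts are divided by the per-feature geometric mean across samples, and the median of those ratios is the sample's size factor. The sketch below shows that idea in plain Python on a made-up count table; it is an illustration under those assumptions, not the package's code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per feature), each a list of per-sample counts.
    Features with a zero in any sample are skipped, because their geometric
    mean is zero and the ratio is undefined. At least one feature must have
    nonzero counts in every sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(count <= 0 for count in row):
            continue
        log_geometric_mean = sum(math.log(count) for count in row) / n_samples
        for sample, count in enumerate(row):
            log_ratios[sample].append(math.log(count) - log_geometric_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical table: sample 2 was sequenced exactly twice as deeply.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],     # skipped: contains a zero
]
factors = size_factors(toy_counts)
print(factors)
```

Using the median makes the estimate robust to a handful of strongly differential features, which would distort a simple total-count scaling.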

Using DEXSeq: visualization
  plotDEXSeq(exons, "FBgn0030362", cex.axis=1.2, cex=1.3, lwd=2, legend=T, displayTranscripts=T)

Using DEXSeq: HTML Report
  library(biomaRt)
  mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")
  listAttributes(mart)[1:25,]
  attributes <- c("ensembl_gene_id", "external_gene_id", "description")
  DEXSeqHTML(exons, FDR=0.0001, mart=mart, filter="ensembl_gene_id", attributes=attributes)

Downstream analysis
- Now you have a list of:
  - Genes
  - Isoforms (genes)
  - Exons (genes)
- How to place in functional context?
- Pathway / functional analysis!
  - Gene Ontology over-representation
  - Gene Set Enrichment Analysis
  - Signaling Pathway Impact Analysis
  - Many more…
- Resources:
  - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
  - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529
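Gene Ontology over-representation is commonly assessed with a hypergeometric (one-sided Fisher) test: given N background genes of which K carry a term, how surprising is it to see k or more annotated genes in a list of n? A sketch of that calculation follows; the function name and all numbers are invented, and real tools add multiple-testing correction on top.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N genes, K annotated, n drawn)."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)   # comb() is 0 when n - i > N - K
        for i in range(k, min(K, n) + 1)
    )
    return favourable / total


# Hypothetical example: 20,000 background genes, 200 annotated to a term,
# 100 differentially expressed genes, 8 of them annotated.
p_value = hypergeom_upper_tail(k=8, K=200, n=100, N=20_000)
print(p_value)
```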

Workflow Management: Galaxy
- http://usegalaxy.org

Workflow Management: Taverna
- Taverna: http://www.taverna.org.uk/
- TavernaPBS: http://sourceforge.net/projects/tavernapbs/

Further Reading
- RNA-Seq:
  - Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77.
  - Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17.
  - Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8.
  - Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98.
  - Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998.
  - Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63.
- Bowtie/Tophat:
  - Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25.
  - Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11.
- Cufflinks:
  - Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17), 2325-9.
  - Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578.
  - Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5.
- DEXSeq:
  - Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html
  - Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.

Online Community Forum and Discussion
- SEQanswers
  - http://SEQanswers.com
  - Format: Forum
  - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
- BioStar
  - http://biostar.stackexchange.com
  - Format: Q&A
  - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011).
- Other bioinformatics resources: stephenturner.us/p/edu

DNA Methylation: Importance
- Occurs most frequently at CpG sites
- High methylation at promoters ≈ silencing
- Methylation perturbed in cancer
- Methylation associated with many other complex diseases: neural, autoimmune, response to environment
- Mapping DNA methylation -> new disease genes & drug targets

DNA Methylation: Challenges
- Dynamic and tissue-specific
- DNA -> collection of cells which vary in 5meC patterns -> 5meC pattern is complex
- Further, uneven distribution of CpG targets
- Multiple classes of methods:
  - Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
  - Affinity enrichment, count-based: Assay methylation level across many genomic loci.

DNA Methylation: Mapping
DNA methylation:
  BS-Seq: Whole-genome bisulfite sequencing
  RRBS-Seq: Reduced representation bisulfite sequencing
  BC-Seq: Bisulfite capture sequencing
  BSPP: Bisulfite specific padlock probes
  Methyl-Seq: Restriction enzyme based methyl-seq
  MSCC: Methyl sensitive cut counting
  HELP-Seq: HpaII fragment enrichment by ligation PCR
  MCA-Seq: Methylated CpG island amplification
  MeDIP-Seq: Methylated DNA immunoprecipitation
  MBP-Seq: Methyl-binding protein sequencing
  MethylCap-seq: Methylated DNA capture by affinity purification
  MIRA-Seq: Methylated CpG island recovery assay
Gene expression:
  RNA-Seq: High-throughput cDNA sequencing

Methylation: REs and PCR
- Restriction enzyme digest
  - Isoschizomers HpaII and MspI both recognize the same sequence: 5'-CCGG-3'
  - MspI digests regardless of methylation
  - HpaII only digests at unmethylated sites
- PCR -> gel electrophoresis -> Southern blot
- Pros: Highly sensitive
- Cons: Low-throughput, high false positive rate because of incomplete digestion (for reasons other than methylation).

Bisulfite sequencing
- Sodium bisulfite converts unmethylated (but not methylated) C's into U's.
- This introduces a methylation-specific "SNP".
- RRBS: library enriched for CpG-dense regions by digesting with MspI.
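The methylation-specific "SNP" can be illustrated with a toy conversion. After bisulfite treatment and PCR, an unmethylated C is read as T (the U is amplified as T), while a methylated C is still read as C, so comparing a read to the reference reveals methylation state. The sketch assumes a same-strand, gap-free, error-free read with complete conversion; the function names and sequence are made up.

```python
def bisulfite_convert(sequence, methylated_positions=frozenset()):
    """Simulate bisulfite treatment plus PCR on the given strand."""
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(sequence.upper())
    )


def call_methylation(reference, converted_read):
    """Map each reference cytosine position to True (methylated) or False."""
    calls = {}
    for index, (ref, obs) in enumerate(zip(reference.upper(), converted_read.upper())):
        if ref != "C":
            continue
        if obs == "C":
            calls[index] = True       # protected from conversion: methylated
        elif obs == "T":
            calls[index] = False      # converted: unmethylated
    return calls


reference = "ACGTCGA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
print(read, calls)
```

Real bisulfite aligners must additionally handle the reverse strand (where G reads as A) and incomplete conversion.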

MeDIP-Seq
- MeDIP-Seq = Methylated DNA immunoprecipitation
- Uses an antibody against 5-methylcytosine to retrieve methylated fragments from sonicated DNA.
- Enrichment method = count number of reads

MethylCap-Seq
- Uses methyl-binding domain (MBD) protein to obtain DNA with similar methylation levels.
- Also a counting method.

Methylation: Accuracy
- Bock et al. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- MeDIP, MethylCap, RRBS largely concordant with Illumina Infinium assay

Methylation methods: coverage
- Coverage varies among different methods

Methylation: Features & Biases

Methylation: Bioinformatics Resources
Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG Island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG Island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal

Methylation: Further Reading
- Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
- Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6.
- Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056.
- Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6.
- Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105.
- Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York.
- Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203.
- Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598

Thank you
Web: bioinformatics.virginia.edu
E-mail: [email protected]
Blog: www.GettingGeneticsDone.com
Twitter: twitter.com/genetics_blog
