Examining Gene Expression and Methylation with Next-Gen Sequencing

Lecture on RNA-seq and methods for examining methylation with NGS data. Given at Genetic Analysis of Complex Disease short course, University of Miami, May 2013.

Stephen Turner

May 22, 2013

Transcript

  1. GENETIC ANALYSIS of Complex Human Diseases: Examining Gene Expression and Methylation with Next-Gen Sequencing

    Stephen Turner, Ph.D.
    Bioinformatics Core Director, University of Virginia
    bioinformatics.virginia.edu
  2. Advantages of RNA-Seq

    - No reference necessary
    - Low background (no cross-hybridization)
    - Unlimited dynamic range (FC 9000, Science 320:1344)
    - Direct counting (microarrays: indirect, via hybridization)
    - Can characterize the full transcriptome
      - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc.)
      - Differential gene expression
      - Differential coding output
      - Differential TSS usage
      - Differential isoform expression
  3. Is it accurate?

    - Marioni et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008 18:1509.
  4. RNA-Seq Challenges

    - Library construction
      - Size selection (messenger, small)
      - Strand specificity?
    - Bioinformatic challenges
      - Spliced alignment
      - Transcript deconvolution
    - Statistical challenges
      - Highly variable abundance
      - Sample size: never, ever, plan n=1
      - Normalization (RPKM)
        - More reads from longer transcripts, higher sequencing depth
        - Want to compare features of different lengths
        - Want to compare conditions with different total sequence depth
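The RPKM normalization mentioned above can be sketched in a few lines. This is a minimal illustration of the standard definition (reads per kilobase of transcript per million mapped reads); the function name and the example counts are hypothetical and not taken from the lecture.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("transcript length and library size must be positive")
    # Dividing by (length / 1e3) and by (library size / 1e6) gives the 1e9 factor.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical numbers: the long gene collects 4x the reads only because it is
# 4x longer, so both genes end up with the same RPKM.
short_gene = rpkm(read_count=500, transcript_length_bp=1_000, total_mapped_reads=20_000_000)
long_gene = rpkm(read_count=2_000, transcript_length_bp=4_000, total_mapped_reads=20_000_000)
print(short_gene, long_gene)
```

Dividing by transcript length addresses the "features of different lengths" problem, and dividing by library size addresses the "different total sequence depth" problem.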
  5. RNA-Seq Overview

    [Figure: samples of interest, Condition 1 (normal colon) and Condition 2 (colon tumor) → poly-A mRNA → library → FASTQ reads]

    @HWUSI-EAS100R:6:73:941:1973#0/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATA
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*((((***+))%%%++)(%%%%).1***-
    @HWUSI-EAS100R:6:73:941:1973#0/1
    CATCGACGTAGATCGACTACATGAACTGCTCG
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*+(*+!+(*!+*(((***!%%%%!%%(+-
  6. Common question #1: Depth

    - Question: how much sequence do I need?
    - Answer: it's complicated.
    - Oversimplified answer: 20-50 million PE reads / sample (mouse/human).
    - Depends on:
      - Size & complexity of transcriptome
      - Application: differential gene expression, transcript discovery
      - Tissue type, RNA quality, library preparation
      - Sequencing type: length, paired-end vs single-end, etc.
    - Find a publication in your field with similar goals.
    - Good news: ¼ HiSeq lane usually sufficient.
  7. Common question #2: Sample Size

    - Question: How many samples should I sequence?
    - Oversimplified answer: At least 3 biological replicates per condition.
    - Depends on:
      - Sequencing depth
      - Application
      - Goals (prioritization, biomarker discovery, etc.)
      - Effect size, desired power, statistical significance
    - Find a publication with similar goals.
  8. Common question #3: Workflow

    - How do I analyze the data?
    - No standards!
      - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others!
      - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others.
      - Reference builds & annotations: UCSC, Entrez, Ensembl
      - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, TransABySS
      - Quantification: Cufflinks, RSEM, eXpress, MISO, etc.
      - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna
    - Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in a pipeline!
    - Benchmarks
      - Microarray: Spike-ins (Irizarry)
      - RNA-Seq: ???, simulation, ???
  9. Common question #3: Workflow

    - Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993
    - Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782
  10. Phases of NGS Analysis

    - Primary
      - Conversion of raw machine signal into sequence and qualities
    - Secondary
      - Alignment of reads to reference genome or transcriptome
      - or de novo assembly of reads into contigs
    - Tertiary
      - SNP discovery/genotyping
      - Peak discovery/quantification (ChIP, MeDIP)
      - Transcript assembly/quantification (RNA-seq)
    - Quaternary
      - Differential expression
      - Enrichment, pathways, correlation, clustering, visualization, etc.
      - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
      - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529
  11. Primary Analysis: Get FASTQ file

    @HWUSI-EAS100R:6:73:941:1973#0/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATA
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*((((***+))%%%++)(%%%%).1***-
  12. "Phred-scaled" base qualities

    # $p is the probability that the base call is erroneous
    $Q = -10 * log($p) / log(10);         # Phred Q
    $q = chr(($Q <= 40 ? $Q : 40) + 33);  # FASTQ quality character
    $Q = ord($q) - 33;                    # 33 offset

    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
    ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
    ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    |                         |    |        |                              |                     |
    33                        59   64       73                             104                   126

    S - Sanger        Phred+33,  41 values (0, 40)
    I - Illumina 1.3  Phred+64,  41 values (0, 40)
    X - Solexa        Solexa+64, 68 values (-5, 62)
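The same Phred conversions can be sketched in Python. This is a minimal illustration, assuming the Sanger Phred+33 encoding by default and the cap at Q40 used on the slide; the function names are hypothetical.

```python
import math


def error_prob_to_phred(p):
    """Phred quality Q = -10 * log10(p), where p is the error probability."""
    return -10 * math.log10(p)


def phred_to_char(q, offset=33, cap=40):
    """Encode a Phred score as a FASTQ quality character (capped, as on the slide)."""
    return chr(min(int(round(q)), cap) + offset)


def char_to_phred(c, offset=33):
    """Decode a FASTQ quality character back to a Phred score."""
    return ord(c) - offset


def phred_to_error_prob(q):
    """Invert the Phred scale: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Decode the quality string from the example FASTQ record (Phred+33 assumed).
quality = "!''*((((***+))%%%++)(%%%%).1***-"
scores = [char_to_phred(c) for c in quality]
print(scores[:5])
```

Passing `offset=64` decodes the older Illumina 1.3 encoding shown in the chart; the Solexa scale uses a different formula and is not covered by this sketch.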
  13. Secondary analysis

    - Alignment back to the reference
      - Computationally demanding: can't use BLAST
      - Many algorithms (Maq, BWA, bowtie, bowtie2, Mosaik, NovoAlign, SOAP2, SSAHA, …)
      - http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
      - Sensitivity to sequencing errors, polymorphisms, indels, rearrangements
      - Tradeoffs in time vs. memory vs. performance
  14. Download data & software

    - Public data from GEO, e.g. GSE32038
      - http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE32038
      - Trapnell et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols 2012: 7:562.
    - Sequence, annotation, indexes (Ensembl)
      - iGenomes: http://tophat.cbcb.umd.edu/igenomes.html
      - Genes: /Annotation/Genes/genes.gtf
      - Indexes: /Sequence/BowtieIndex/genome.*
    - Software:
      - Samtools: http://samtools.sourceforge.net/
      - FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
      - Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
      - Tophat: http://tophat.cbcb.umd.edu/
      - HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
      - R: http://www.r-project.org/
      - DESeq2: http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
      - Cufflinks: http://cufflinks.cbcb.umd.edu/
      - cummeRbund: http://compbio.mit.edu/cummeRbund/
  15. Do some quality assessment

    Software:
    - Picard: picard.sourceforge.net
    - FastQC: bioinformatics.bbsrc.ac.uk/projects/fastqc
    - RSeQC: code.google.com/p/rseqc
    - FastX Toolkit: hannonlab.cshl.edu/fastx_toolkit
    - R/ShortRead: bioconductor.org/packages/bioc/html/ShortRead.html
  16. Mapping across splice junctions: tophat

    1. Map reads to genome
    2. Collect unmappable reads
    3. Break reads into segments. Small segments often align independently. If segments align 100 bp to several kb apart, infer a splice.

    tophat -G genes.gtf -o C1_R1_tophatout /path/bowtieindex/genome C1_R1_1.fq C1_R1_2.fq
    tophat -G genes.gtf -o C1_R2_tophatout /path/bowtieindex/genome C1_R2_1.fq C1_R2_2.fq
    tophat -G genes.gtf -o C1_R3_tophatout /path/bowtieindex/genome C1_R3_1.fq C1_R3_2.fq
    tophat -G genes.gtf -o C2_R1_tophatout /path/bowtieindex/genome C2_R1_1.fq C2_R1_2.fq
    tophat -G genes.gtf -o C2_R2_tophatout /path/bowtieindex/genome C2_R2_1.fq C2_R2_2.fq
    tophat -G genes.gtf -o C2_R3_tophatout /path/bowtieindex/genome C2_R3_1.fq C2_R3_2.fq

    Arguments, in order: gene annotation (-G), output directory (-o), Bowtie index, read 1, read 2.
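The segment idea in step 3 can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions (exact matching, unique segments, and a junction that happens to fall on a segment boundary); it is not how TopHat is implemented, and the function name, sequences, and thresholds are hypothetical.

```python
def infer_intron(genome, read, segment_len=8, min_intron=20):
    """Split a read into segments, place each by exact match, and report the
    first gap between consecutive segments that is large enough to be an intron.
    Returns (intron_start, intron_end) as 0-based half-open coordinates, or None."""
    segments = [read[i:i + segment_len] for i in range(0, len(read), segment_len)]
    positions = [genome.find(segment) for segment in segments]
    if -1 in positions:
        return None  # at least one segment could not be placed
    for index in range(len(segments) - 1):
        left_end = positions[index] + len(segments[index])
        gap = positions[index + 1] - left_end
        if gap >= min_intron:
            return (left_end, positions[index + 1])
    return None


# Toy genome: two exons separated by a 30 bp intron.
exon1 = "ACGTTGCAGGATCCTA"
intron = "GT" + "A" * 26 + "AG"
exon2 = "CTTGACCG"
genome = "TTTT" + exon1 + intron + exon2 + "TTTT"

spliced_read = exon1 + exon2      # spans the junction, so it cannot align end to end
junction = infer_intron(genome, spliced_read)
print(junction)
```

The first two segments land next to each other in exon 1, while the third lands 30 bp downstream, which is what flags the splice.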
  17. Workflow 1: Differential Gene Expression

    - Step 1: Align to genome
    - Step 2: Count reads overlapping genes
    - Step 3: Differential expression
  18. Workflow 1: Differential Gene Expression

    - Step 1: Align to genome
    - Step 2: Count reads overlapping genes
    - Step 3: Differential expression

    Software: HTSeq http://www-huber.embl.de/users/anders/HTSeq

    First convert the binary .bam file to a text .sam file using samtools:

    samtools view accepted_hits.bam > C1_R1.sam

    Then run htseq-count on each of the alignments:

    htseq-count <sam_file> <gtf_file>
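Counting reads that overlap genes (step 2) can be sketched as a simple interval-overlap tally. This is a toy illustration loosely modeled on the idea behind htseq-count's default handling of reads that hit zero or several genes; it ignores strand, spliced reads, and exon structure, and the coordinates and category labels are hypothetical.

```python
from collections import Counter


def count_reads(genes, reads):
    """genes: name -> (chrom, start, end); reads: list of (chrom, start, end).
    Intervals are 0-based half-open. A read is counted for a gene only when it
    overlaps exactly one gene."""
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 140),  # inside geneA only
    ("chr1", 190, 220),  # overlaps both genes
    ("chr1", 250, 280),  # inside geneB only
    ("chr1", 400, 430),  # intergenic
    ("chr2", 110, 140),  # different chromosome
]
counts = count_reads(genes, reads)
print(dict(counts))
```

The per-gene totals from a table like this are what the differential expression step takes as input.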
  19. Workflow 1: Differential Gene Expression

    - Step 1: Align to genome
    - Step 2: Count reads overlapping genes
    - Step 3: Differential expression

    Software: DESeq2 http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

    library(DESeq2)
    sampleFiles <- c("C1_R1.counts.txt", "C1_R2.counts.txt", "C1_R3.counts.txt",
                     "C2_R1.counts.txt", "C2_R2.counts.txt", "C2_R3.counts.txt")
    # Condition is the first two characters of each file name (C1 or C2)
    sampleCondition <- factor(substr(sampleFiles, 1, 2))
    sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition)
    sampleTable
    #         sampleName         fileName condition
    # 1 C1_R1.counts.txt C1_R1.counts.txt        C1
    # 2 C1_R2.counts.txt C1_R2.counts.txt        C1
    # 3 C1_R3.counts.txt C1_R3.counts.txt        C1
    # 4 C2_R1.counts.txt C2_R1.counts.txt        C2
    # 5 C2_R2.counts.txt C2_R2.counts.txt        C2
    # 6 C2_R3.counts.txt C2_R3.counts.txt        C2
    dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=".", design=~condition)
    dds <- DESeq(dds)
    res <- results(dds)
    res <- res[order(res$padj), ]  # sort by adjusted p-value (the column is padj, not FDR)
    plotMA(dds)
    ...
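The adjusted p-values that the results are sorted by are, by default in DESeq2, Benjamini-Hochberg corrected. A minimal sketch of that adjustment is below; the function name and the example p-values are hypothetical, and this is an illustration of the procedure rather than DESeq2's own code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
print(padj)
```

Calling genes significant at `padj < 0.05` then controls the expected fraction of false discoveries among the called genes at about 5%.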
  20. A change in fragment count for a gene does not necessarily equal a change in expression.

    - Trapnell, Cole, et al. "Differential analysis of gene regulation at transcript resolution with RNA-seq." Nature biotechnology 31.1 (2012): 46-53.
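A small worked example makes the point concrete. It assumes a deliberately simplified model in which the expected number of fragments from a transcript is proportional to its abundance times its length; the isoform lengths and molecule counts are hypothetical.

```python
def expected_fragments(isoforms, fragments_per_kb_per_molecule=1.0):
    """isoforms: list of (molecule_count, length_bp) for one gene.
    Simplified model: fragments scale with abundance times transcript length."""
    return sum(
        molecules * (length_bp / 1000) * fragments_per_kb_per_molecule
        for molecules, length_bp in isoforms
    )


# Same number of transcript molecules in both conditions (100), but the gene
# switches from a 3 kb isoform to a 1 kb isoform.
condition1 = [(100, 3000)]
condition2 = [(100, 1000)]

frags1 = expected_fragments(condition1)
frags2 = expected_fragments(condition2)
print(frags1, frags2)
```

Gene-level counts drop threefold here even though the number of transcript molecules is unchanged, which is why isoform-aware quantification can matter.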
  21. Workflow 2a: Assemble transcripts for each sample: cufflinks

    - Cufflinks
      - Identifies mutually incompatible fragments
      - Identifies the minimal set of transcripts that explains all the fragments.

    cufflinks -o C1_R1_cufflinksout C1_R1_tophatout/accepted_hits.bam
    cufflinks -o C1_R2_cufflinksout C1_R2_tophatout/accepted_hits.bam
    cufflinks -o C1_R3_cufflinksout C1_R3_tophatout/accepted_hits.bam
    cufflinks -o C2_R1_cufflinksout C2_R1_tophatout/accepted_hits.bam
    cufflinks -o C2_R2_cufflinksout C2_R2_tophatout/accepted_hits.bam
    cufflinks -o C2_R3_cufflinksout C2_R3_tophatout/accepted_hits.bam

    Arguments, in order: output directory (-o), path to alignment.
  22. Merge assemblies: cuffmerge

    - Merge assemblies to create a single merged transcriptome annotation.
      - Option 1: Pool alignments and assemble all at once.
        - Computationally demanding
        - Assembler is faced with a complex mixture of isoforms → more error
      - Option 2: Assemble alignments individually, merge resulting assemblies
        - Cuffmerge: meta-assembler using parsimony.
        - Genes with low expression → insufficient coverage for reconstruction.
        - Merging often recovers the complete gene.
        - Newly discovered isoforms integrated w/ known ones (RABT).
  23. Merge assemblies: cuffmerge

    - Create a "manifest" listing the location of all assemblies (assemblies.txt):

    ./C1_R1_cufflinksout/transcripts.gtf
    ./C1_R2_cufflinksout/transcripts.gtf
    ./C1_R3_cufflinksout/transcripts.gtf
    ./C2_R1_cufflinksout/transcripts.gtf
    ./C2_R2_cufflinksout/transcripts.gtf
    ./C2_R3_cufflinksout/transcripts.gtf

    - Run Cuffmerge on the assemblies using RABT:

    cuffmerge -g /path/to/annotation/genes.gtf -s /path/to/refgenome/genome.fa assemblies.txt

    Arguments, in order: reference gene annotation (-g), reference genome sequence (-s), manifest from above.
  24. Differential expression: cuffdiff

    - Identify differentially expressed genes & transcripts

    cuffdiff -o cuffdiff_out -b genome.fa -u merged.gtf \
      ./C1_R1_tophatout/accepted_hits.bam,\
    ./C1_R2_tophatout/accepted_hits.bam,\
    ./C1_R3_tophatout/accepted_hits.bam \
      ./C2_R1_tophatout/accepted_hits.bam,\
    ./C2_R2_tophatout/accepted_hits.bam,\
    ./C2_R3_tophatout/accepted_hits.bam

    Arguments: output directory (-o), reference sequence (-b), merged assembly (merged.gtf), location of alignments (replicates separated by commas, conditions separated by a space).

    [Figure: example locus with 1 gene, 2 TSS, 2 CDS, 3 isoforms]
  25. Visualization with cummeRbund

    - Install cummeRbund:
      - Install from Bioconductor:
        source("http://bioconductor.org/biocLite.R")
        biocLite("cummeRbund")
      - Or download and install the latest version from http://compbio.mit.edu/cummeRbund/
    - Load the package:
        library(cummeRbund)
    - Read in the data:
        cuff <- readCufflinks("/path/to/cuffdiff/output")
  26. Visualization with cummeRbund

    csDensity(genes(cuff))
    csBoxplot(genes(cuff))
    csScatter(genes(cuff), "C1", "C2", smooth=T)
    csVolcano(genes(cuff), "C1", "C2")
  27. Visualization with cummeRbund

    mygene2 <- getGene(cuff, "Rala")
    expressionBarplot(mygene2)
    expressionBarplot(isoforms(mygene2))
  28. DEXSeq

    - Differential gene expression (e.g. DESeq)
    - Differential isoform expression (e.g. Cufflinks)
    - Differential exon usage
    - What's different about DEXSeq?
      - Doesn't do full transcript assembly (Cufflinks)
      - Doesn't count fragments mapping to genes (DESeq)
      - Avoids assembly and looks for differences in reads mapping to individual exons.
      - Uses counts (negative binomial)
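The intuition behind differential exon usage can be sketched with per-exon read fractions. This is only an intuition aid with hypothetical counts; DEXSeq itself fits a negative binomial model to the exon counts rather than comparing raw fractions.

```python
def exon_usage(exon_counts):
    """Fraction of a gene's reads that fall in each exon (counting bin)."""
    total = sum(exon_counts.values())
    if total == 0:
        raise ValueError("gene has no reads")
    return {exon: count / total for exon, count in exon_counts.items()}


# Hypothetical gene with three exons. Overall expression rises in the treated
# sample, but exon E2 is largely skipped.
control = {"E1": 100, "E2": 100, "E3": 100}
treated = {"E1": 200, "E2": 40, "E3": 200}

usage_control = exon_usage(control)
usage_treated = exon_usage(treated)
for exon in control:
    print(exon, round(usage_control[exon], 3), round(usage_treated[exon], 3))
```

A gene-level test would only see 300 versus 440 reads; the exon-level view shows that E2's share of the gene's reads falls from about a third to under a tenth.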
  29. Using DEXSeq: Installation

    - Installation & load:
        source("http://bioconductor.org/biocLite.R")
        biocLite("DEXSeq")
        library(DEXSeq)
    - Installation comes bundled with useful Python scripts in the python_scripts directory of the library. Put these in your PATH.
  30. Using DEXSeq: Data preparation

    - First, prepare a "flattened" GFF (script comes with DEXSeq; arguments: reference annotation, output file):

    dexseq_prepare_annotation.py input.gtf exons.gff

    - Create sorted SAM files:

    samtools view C1_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R1.sam
    samtools view C1_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R2.sam
    samtools view C1_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C1_R3.sam
    samtools view C2_R1_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R1.sam
    samtools view C2_R2_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R2.sam
    samtools view C2_R3_tophatout/accepted_hits.bam | sort -k 1,1 -k2,2n > C2_R3.sam

    - Count reads overlapping counting bins (script comes with DEXSeq; arguments: flattened annotation, alignment, output file):

    dexseq_count.py -p no -s no exons.gff C1_R1.sam C1_R1.counts.txt
    dexseq_count.py -p no -s no exons.gff C1_R2.sam C1_R2.counts.txt
    dexseq_count.py -p no -s no exons.gff C1_R3.sam C1_R3.counts.txt
    dexseq_count.py -p no -s no exons.gff C2_R1.sam C2_R1.counts.txt
    dexseq_count.py -p no -s no exons.gff C2_R2.sam C2_R2.counts.txt
    dexseq_count.py -p no -s no exons.gff C2_R3.sam C2_R3.counts.txt
  31. Using DEXSeq: Data import

    - The pasilla package vignette gives detailed instructions on how to do this: http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html

    design <- data.frame(condition=c(rep("C1",3), rep("C2",3)), replicate=rep(1:3,2))
    rownames(design) <- with(design, paste(condition, "_R", replicate, sep=""))
    design
    #       condition replicate
    # C1_R1        C1         1
    # C1_R2        C1         2
    # C1_R3        C1         3
    # C2_R1        C2         1
    # C2_R2        C2         2
    # C2_R3        C2         3
    countfiles <- file.path(".", paste(rownames(design), ".counts.txt", sep=""))
    countfiles
    # [1] "./C1_R1.counts.txt" "./C1_R2.counts.txt" "./C1_R3.counts.txt" "./C2_R1.counts.txt"
    # [5] "./C2_R2.counts.txt" "./C2_R3.counts.txt"
    flattenedfile <- "/Users/sdt5z/smb/u/genomes/dexseq/exons_dme_ens_bdgp525.gff"
    exons <- read.HTSeqCounts(countfiles=countfiles, design=design, flattenedfile=flattenedfile)
    sampleNames(exons) <- rownames(design)
  32. Using DEXSeq: Data Analysis

    # Estimate size factors (normalizes for sequencing depth)
    exons <- estimateSizeFactors(exons)
    sizeFactors(exons)
    # Estimate dispersion
    exons <- estimateDispersions(exons)
    exons <- fitDispersionFunction(exons)
    # Test for differential exon usage
    exons <- testForDEU(exons)
    exons <- estimatelog2FoldChanges(exons)
    result <- DEUresultTable(exons)
    # How many are significant at FDR 0.0001?
    table(result$padjust < 0.0001)
    # M vs A plot
    plot(result$meanBase, result[, "log2fold(C2/C1)"], log="x")
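The size-factor step can be sketched as the median-of-ratios method described by Anders and Huber for DESeq. This is a minimal pure-Python illustration, assuming that method is what `estimateSizeFactors` applies here; the function name and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: sample name -> list of per-feature counts (same feature order).
    Returns one size factor per sample using the median of ratios to the
    per-feature geometric mean; features with any zero count are skipped."""
    samples = list(counts)
    n_features = len(counts[samples[0]])
    log_geo_means = []
    for f in range(n_features):
        values = [counts[s][f] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][f]) - lgm
            for f, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Sample B was sequenced twice as deeply as sample A.
toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(toy_counts)
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; using the median keeps a few strongly changing features from dominating the estimate.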
  33. Using DEXSeq: visualization

    plotDEXSeq(exons, "FBgn0030362", cex.axis=1.2, cex=1.3, lwd=2, legend=T, displayTranscripts=T)
  34. Using DEXSeq: HTML Report

    library(biomaRt)
    mart <- useMart("ensembl", dataset="dmelanogaster_gene_ensembl")
    listAttributes(mart)[1:25,]
    attributes <- c("ensembl_gene_id", "external_gene_id", "description")
    DEXSeqHTML(exons, FDR=0.0001, mart=mart, filter="ensembl_gene_id", attributes=attributes)
  35. Downstream analysis

    - Now you have a list of:
      - Genes
      - Isoforms (genes)
      - Exons (genes)
    - How to place in functional context?
    - Pathway / functional analysis!
      - Gene Ontology over-representation
      - Gene Set Enrichment Analysis
      - Signaling Pathway Impact Analysis
      - Many more…
    - Resources:
      - http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
      - http://www.slideshare.net/turnersd/pathway-analysis-2012-17947529
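Gene Ontology over-representation is commonly tested with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation with hypothetical gene counts; it is an illustration of the test itself, not of any particular tool listed in the resources.

```python
from math import comb


def enrichment_p_value(universe, annotated, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes without replacement from a
    `universe` of genes, of which `annotated` carry the GO term."""
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, selected)


# Hypothetical: 10 genes tested, 4 carry the term, 3 are differentially
# expressed, and 2 of those 3 carry the term.
p_small = enrichment_p_value(universe=10, annotated=4, selected=3, overlap=2)
print(p_small)
```

In practice the test is repeated for every term, so the resulting p-values need multiple-testing correction.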
  36. Workflow Management: Taverna

    - Taverna: http://www.taverna.org.uk/
    - TavernaPBS: http://sourceforge.net/projects/tavernapbs/
  37. Further Reading

    - RNA-Seq:
      - Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77.
      - Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17.
      - Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8.
      - Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98.
      - Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998.
      - Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63.
    - Bowtie/Tophat:
      - Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25.
      - Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11.
    - Cufflinks:
      - Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17), 2325-9.
      - Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578.
      - Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5.
    - DEXSeq:
      - Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html
      - Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.
  38. Online Community Forum and Discussion

    - SEQanswers:
      - http://SEQanswers.com
      - Format: Forum
      - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
    - BioStar:
      - http://biostar.stackexchange.com
      - Format: Q&A
      - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011).
    - Other bioinformatics resources: stephenturner.us/p/edu
  39. DNA Methylation: Importance

    - Occurs most frequently at CpG sites
    - High methylation at promoters ≈ silencing
    - Methylation perturbed in cancer
    - Methylation associated with many other complex diseases: neural, autoimmune, response to env.
    - Mapping DNA methylation → new disease genes & drug targets.
  40. DNA Methylation: Challenges

    - Dynamic and tissue-specific
    - DNA → Collection of cells which vary in 5meC patterns → 5meC pattern is complex.
    - Further, uneven distribution of CpG targets
    - Multiple classes of methods:
      - Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs.
      - Affinity enrichment, count-based: Assay methylation level across many genomic loci.
  41. DNA Methylation: Mapping

    DNA methylation:
    - BS-Seq: Whole-genome bisulfite sequencing
    - RRBS-Seq: Reduced representation bisulfite sequencing
    - BC-Seq: Bisulfite capture sequencing
    - BSPP: Bisulfite specific padlock probes
    - Methyl-Seq: Restriction enzyme based methyl-seq
    - MSCC: Methyl sensitive cut counting
    - HELP-Seq: HpaII fragment enrichment by ligation PCR
    - MCA-Seq: Methylated CpG island amplification
    - MeDIP-Seq: Methylated DNA immunoprecipitation
    - MBP-Seq: Methyl-binding protein sequencing
    - MethylCap-seq: Methylated DNA capture by affinity purification
    - MIRA-Seq: Methylated CpG island recovery assay

    Gene expression:
    - RNA-Seq: High-throughput cDNA sequencing
  42. Methylation: REs and PCR

    - Restriction enzyme digest
      - Isoschizomers HpaII and MspI both recognize the same sequence: 5'-CCGG-3'
      - MspI digests regardless of methylation
      - HpaII only digests at unmethylated sites
    - PCR → gel electrophoresis → Southern blot
    - Pros: Highly sensitive
    - Cons: Low-throughput, high false positive rate because of incomplete digestion (for reasons other than methylation).
  43. Bisulfite sequencing

    - Sodium bisulfite converts unmethylated (but not methylated) C's into U's.
    - This introduces a methylation-specific "SNP".
    - RRBS: library enriched for CpG-dense regions by digesting with MspI.
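The methylation-specific "SNP" can be illustrated by simulating the conversion and then reading methylation back from the converted sequence. This sketch assumes complete conversion, a single strand, and that U is read as T after amplification; the sequence and function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Unmethylated C becomes U, which is read as T after PCR; methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference, converted_read):
    """For each reference C, report True if the read still shows C (methylated)."""
    return {
        i: converted_read[i] == "C"
        for i, base in enumerate(reference)
        if base == "C"
    }


reference = "ACGTCGAC"   # cytosines at positions 1, 4, and 7
methylated = {1}         # only the first cytosine is methylated

read = bisulfite_convert(reference, methylated)
calls = call_methylation(reference, read)
print(read, calls)
```

Every C-to-T difference against the reference marks an unmethylated cytosine, which is why bisulfite reads need aligners that tolerate these mismatches.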
  44. MeDIP-Seq

    - MeDIP-Seq = Methylated DNA immunoprecipitation
    - Uses an antibody against 5-methylcytosine to retrieve methylated fragments from sonicated DNA.
    - Enrichment method = count number of reads
  45. MethylCap-Seq

    - Uses methyl-binding domain (MBD) protein to obtain DNA with similar methylation levels.
    - Also a counting method.
  46. Methylation: Accuracy

    - Bock et al. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
    - MeDIP, MethylCap, RRBS largely concordant with Illumina Infinium assay
  47. Methylation: Bioinformatics Resources

    Resource: purpose (URL)
    - Batman: MeDIP DNA methylation analysis tool (http://td-blade.gurdon.cam.ac.uk/software/batman)
    - BDPC: DNA methylation analysis platform (http://biochem.jacobs-university.de/BDPC)
    - BSMAP: Whole-genome bisulphite sequence mapping (http://code.google.com/p/bsmap)
    - CpG Analyzer: Windows-based program for bisulphite DNA (no URL)
    - CpGcluster: CpG island identification (http://bioinfo2.ugr.es/CpGcluster)
    - CpGFinder: Online program for CpG island identification (http://linux1.softberry.com)
    - CpG Island Explorer: Online program for CpG Island identification (http://bioinfo.hku.hk/cpgieintro.html)
    - CpG Island Searcher: Online program for CpG Island identification (http://cpgislands.usc.edu)
    - CpG PatternFinder: Windows-based program for bisulphite DNA (no URL)
    - CpG Promoter: Large-scale promoter mapping using CpG islands (http://www.cshl.edu/OTT/html/cpg_promoter.html)
    - CpG ratio and GC content Plotter: Online program for plotting the observed:expected ratio of CpG (http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl)
    - CpGviewer: Bisulphite DNA sequencing viewer (http://dna.leeds.ac.uk/cpgviewer)
    - CyMATE: Bisulphite-based analysis of plant genomic DNA (http://www.gmi.oeaw.ac.at/en/cymate-index/)
    - EMBOSS CpGPlot/CpGReport: Online program for plotting CpG-rich regions (http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html)
    - Epigenomics Roadmap: NIH Epigenomics Roadmap Initiative homepage (http://nihroadmap.nih.gov/epigenomics)
    - Epinexus: DNA methylation analysis tools (http://epinexus.net/home.html)
    - MEDME: Software package (using R) for modelling MeDIP experimental data (http://espresso.med.yale.edu/medme)
    - methBLAST: Similarity search program for bisulphite-modified DNA (http://medgen.ugent.be/methBLAST)
    - MethDB: Database for DNA methylation data (http://www.methdb.de)
    - MethPrimer: Primer design for bisulphite PCR (http://www.urogene.org/methprimer)
    - methPrimerDB: PCR primers for DNA methylation analysis (http://medgen.ugent.be/methprimerdb)
    - MethTools: Bisulphite sequence data analysis tool (http://www.methdb.de)
    - MethyCancer Database: Database of cancer DNA methylation data (http://methycancer.psych.ac.cn)
    - Methyl Primer Express: Primer design for bisulphite PCR (http://www.appliedbiosystems.com/)
    - Methylumi: Bioconductor pkg for DNA methylation data from Illumina (http://www.bioconductor.org/packages/bioc/html/)
    - Methylyzer: Bisulphite DNA sequence visualization tool (http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html)
    - mPod: DNA methylation viewer integrated w/ Ensembl genome browser (http://www.compbio.group.cam.ac.uk/Projects/)
    - PubMeth: Database of DNA methylation literature (http://www.pubmeth.org)
    - QUMA: Quantification tool for methylation analysis (http://quma.cdb.riken.jp)
    - TCGA Data Portal: Database of TCGA DNA methylation data (http://cancergenome.nih.gov/dataportal)
  48. Methylation: Further Reading

    - Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14.
    - Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6.
    - Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056.
    - Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6.
    - Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105.
    - Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York.
    - Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203.
    - Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598
  49. Thank you

    Web: bioinformatics.virginia.edu
    E-mail: [email protected]
    Blog: www.GettingGeneticsDone.com
    Twitter: twitter.com/genetics_blog