2014 UVA BME Intro Bioinformatics

Stephen Turner
February 20, 2014

Introduction to bioinformatics lecture for Biomedical Engineering class

Transcript

  1. Introduction to Bioinformatics Stephen D. Turner, Ph.D. Bioinformatics Core Director

    !1 February 20, 2014 Slides available at stephenturner.us/slides
  2. Outline • Bioinformatics origins & definition • Bioinformatics subdisciplines -

    Alignment - Assembly - Metagenomics - Comparative Genomics (case study) - Genetic Epidemiology - Gene Expression - Gene Regulation - Systems Biology • Resources for further learning !2
  3. The Central Dogma • DNA → RNA → Protein • Reverse transcription • RNA silencing

    • Prions • Post-translational modification
  4. DNA provides assembly instructions for proteins • Protein folding determines

    molecular function • Networks of interacting proteins determine tissue/organ function
  5. DNA provides assembly instructions for proteins • Protein folding determines

    molecular function • Networks of interacting proteins determine tissue/organ function • DNA variant analysis • Gene expression analysis • Genome annotation • Epigenetics • miRNA analysis • Quantitative MS proteomics • Pathway analysis • Systems biology • Biomarker identification • Etc.
  6. Bioinformatics Origins • Rooted in sequence analysis • Driven by

    the need to: - Collect - Annotate - Analyze !7
  7. Margaret Dayhoff (1925-1983) • Collected all known protein sequences •

    Published in 1965 • Pioneered algorithm development for - Comparison of protein sequences - Derivation of evolutionary histories from alignments “In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.” !8
  8. M. Dayhoff, February 27 1967 “There is a tremendous amount

    of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.” !10
  9. What is bioinformatics?

    [Timeline figure, 1960-2010. Events shown: Dayhoff Atlas, ARPAnet, Sanger sequencing, GenBank, EBI-EMBL, Internet invented, WWW invented, next-gen sequencing]
  10. Definition • From Wikipedia: Bioinformatics is a branch of biological

    science which deals with the study of methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequence, structure, function, pathways and genetic interactions. It generates new knowledge that is useful in such fields as drug design and development of new software tools to create that knowledge. Bioinformatics also deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, structural biology, software engineering, data mining, image processing, modeling and simulation, discrete mathematics, control and system theory, circuit theory, and statistics. • Our definition: using computer science and statistics to answer biological questions. !14
  11. Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly

    • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !16
  12. Sequence Alignment Example 1 • Analogy: Given a sentence and

    a library of books, find which book the sentence came from. • Grow [cells/tissue/culture], isolate DNA, sequence. • BLAST (Basic Local Alignment Search Tool) • blast.ncbi.nlm.nih.gov !17 ATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTAAACGTCCACAAGTTCTGGATGTACCTTATCTCCTTTCTATCCAGCTTGACTCGTTTCAGAAATTTATCGAGCAAGATCCTGAAGGGCAGTATGGTCTGGAA GCTGCTTTCCGTTCCGTATTCCCGATTCAGAGCTACAGCGGTAATTCCGAGCTGCAATACGTCAGCTACCGCCTTGGCGAACCGGTGTTTGACGTCCAGGAATGTCAAATCCGTGGCGTGACCTATTCCGCACCGCTGCGCGTTAAACTG CGTCTGGTGATCTATGAGCGCGAAGCGCCGGAAGGCACCGTAAAAGACATTAAAGAACAAGAAGTCTACATGGGCGAAATTCCGCTCATGACAGACAACGGTACCTTTGTTATCAACGGTACTGAGCGTGTTATCGTTTCCCAGCTGCAC CGTAGTCCGGGCGTCTTCTTTGACTCCGACAAAGGTAAAACCCACTCTTCGGGTAAAGTGCTGTATAACGCGCGTATCATCCCTTACCGTGGTTCCTGGCTGGACTTCGAATTCGATCCGAAGGACAACCTGTTCGTACGTATCGACCGT CGCCGTAAACTGCCTGCGACCATCATTCTGCGCGCCCTGAACTACACCACAGAGCAGATCCTCGACCTGTTCTTTGAAAAAGTTATCTTTGAAATCCGTGATAACAAGCTGCAGATGGAACTGGTGCCGGAACGCCTGCGTGGTGAAACC GCATCTTTTGACATCGAAGCTAACGGTAAAGTGTACGTAGAAAAAGGCCGCCGTATCACTGCGCGCCACATTCGCCAGCTGGAAAAAGACGACGTCAAACTGATCGAAGTCCCGGTTGAGTACATCGCAGGTAAAGTGGTTGCTAAAGAC TATATTGATGAGTCTACCGGCGAGCTGATCTGCGCAGCGAACATGGAGCTGAGCCTGGATCTGCTGGCTAAGCTGAGCCAGTCTGGTCACAAGCGTATCGAAACGCTGTTCACCAACGATCTGGATCACGGCCCATATATCTCTGAAACC TTACGTGTCGACCCAACTAACGACCGTCTGAGCGCACTGGTAGAAATCTACCGCATGATGCGCCCTGGCGAGCCGCCGACTCGTGAAGCAGCTGAAAGCCTGTTCGAGAACCTGTTCTTCTCCGAAGACCGTTATGACTTGTCTGCGGTT GGTCGTATGAAGTTCAACCGTTCTCTGCTGCGCGAAGAAATCGAAGGTTCCGGTATCCTGAGCAAAGACGACATCATTGATGTTATGAAAAAGCTCATCGATATCCGTAACGGTAAAGGCGAAGTCGATGATATCGACCACCTCGGCAAC CGTCGTATCCGTTCCGTTGGCGAAATGGCGGAAAACCAGTTCCGCGTTGGCCTGGTACGTGTAGAGCGTGCGGTGAAAGAGCGTCTGTCTCTGGGCGATCTGGATACCCTGATGCCTCAGGATATGATCAACGCCAAGCCGATTTCCGCA GCAGTGAAAGAGTTCTTCGGTTCCAGCCAGCTGTCTCAGTTTATGGACCAGAACAACCCGCTGTCTGAGATTACGCACAAACGTCGTATCTCCGCACTCGGCCCAGGCGGTCTGACCCGTGAACGTGCAGGCTTCGAAGTTCGAGACGTA 
CACCCGACTCACTACGGTCGCGTATGTCCAATCGAAACCCCTGAAGGTCCGAACATCGGTCTGATCAACTCTCTGTCCGTGTACGCACAGACTAACGAATACGGCTTCCTTGAGACTCCGTATCGTAAAGTGACCGACGGTGTTGTAACT GACGAAATTCACTACCTGTCTGCTATCGAAGAAGGCAACTACGTTATCGCCCAGGCGAACTCCAACCTGGATGAAGAAGGCCACTTCGTAGAAGACCTGGTAACTTGCCGTAGCAAAGGCGAATCCAGCTTGTTCAGCCGCGACCAGGTT GACTACATGGACGTATCCACCCAGCAGGTGGTATCCGTCGGTGCGTCCCTGATCCCGTTCCTGGAACACGATGACGCCAACCGTGCATTGATGGGTGCGAACATGCAACGTCAGGCCGTTCCGACTCTGCGCGCTGATAAGCCGCTGGTT GGTACTGGTATGGAACGTGCTGTTGCCGTTGACTCCGGTGTAACTGCGGTAGCTAAACGTGGTGGTGTCGTTCAGTACGTGGATGCTTCCCGTATCGTTATCAAAGTTAACGAAGACGAGATGTATCCGGGTGAAGCAGGTATCGACATC TACAACCTGACCAAATACACCCGTTCTAACCAGAACACCTGTATCAACCAGATGCCGTGTGTGTCTCTGGGTGAACCGGTTGAACGTGGCGACGTGCTGGCAGACGGTCCGTCCACCGACCTCGGTGAACTGGCGCTTGGTCAGAACATG CGCGTAGCGTTCATGCCGTGGAATGGTTACAACTTCGAAGACTCCATCCTCGTATCCGAGCGTGTTGTTCAGGAAGACCGTTTCACCACCATCCACATTCAGGAACTGGCGTGTGTGTCCCGTGACACCAAGCTGGGGCCGGAAGAGATC ACCGCTGACATCCCGAACGTGGGTGAAGCTGCGCTCTCCAAACTGGATGAATCCGGTATCGTTTACATTGGTGCGGAAGTGACCGGTGGCGACATTCTGGTTGGTAAGGTAACGCCGAAAGGTGAAACTCAGCTGACCCCAGAAGAAAAA CTGCTGCGTGCGATCTTCGGTGAGAAAGCCTCTGACGTTAAAGACTCTTCTCTGCGCGTACCAAACGGTGTATCCGGTACGGTTATCGACGTTCAGGTCTTTACTCGCGATGGCGTAGAAAAAGACAAACGTGCGCTGGAAATCGAAGAA ATGCAGCTCAAACAGGCGAAGAAAGACCTGTCTGAAGAACTGCAGATCCTCGAAGCTGGTCTGTTCAGCCGTATCCGTGCTGTGCTGGTAGCCGGTGGCGTTGAAGCTGAGAAGCTCGACAAACTGCCGCGCGATCGCTGGCTGGAGCTA GGCCTGACAGACGAAGAGAAACAAAATCAGCTGGAACAGCTGGCTGAGCAGTATGACGAACTGAAACACGAGTTCGAGAAGAAACTCGAAGCGAAACGCCGCAAAATCACCCAGGGCGACGATCTGGCACCGGGCGTGCTGAAGATTGTT AAGGTATATCTGGCGGTTAAACGCCGTATCCAGCCTGGTGACAAGATGGCAGGTCGTCACGGTAACAAGGGTGTAATTTCTAAGATCAACCCGATCGAAGATATGCCTTACGATGAAAACGGTACGCCGGTAGACATCGTACTGAACCCG CTGGGCGTACCGTCTCGTATGAACATCGGTCAGATCCTCGAAACCCACCTGGGTATGGCTGCGAAAGGTATCGGCGACAAGATCAACGCCATGCTGAAACAGCAGCAAGAAGTCGCGAAACTGCGCGAATTCATCCAGCGTGCGTACGAT CTGGGCGCTGACGTTCGTCAGAAAGTTGACCTGAGTACCTTCAGCGATGAAGAAGTTATGCGTCTGGCTGAAAACCTGCGCAAAGGTATGCCAATCGCAACGCCGGTGTTCGACGGTGCGAAAGAAGCAGAAATTAAAGAGCTGCTGAAA 
CTTGGCGACCTGCCGACTTCCGGTCAGATCCGCCTGTACGATGGTCGCACTGGTGAACAGTTCGAGCGTCCGGTAACCGTTGGTTACATGTACATGCTGAAACTGAACCACCTGGTCGACGACAAGATGCACGCGCGTTCCACCGGTTCT TACAGCCTGGTTACTCAGCAGCCGCTGGGTGGTAAGGCACAGTTCGGTGGTCAGCGTTTCGGGGAGATGGAAGTGTGGGCGCTGGAAGCATACGGCGCAGCATACACCCTGCAGGAAATGCTCACCGTTAAGTCTGATGACGTGAACGGT
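The "find which book the sentence came from" analogy can be sketched in code. BLAST itself uses a fast seed-and-extend heuristic; the sketch below swaps in plain Smith-Waterman local alignment, which is slower but exact, simply to show what "best local match" means. The scoring values and the toy library (`book_A`, `book_B`) are assumptions made for illustration and are not from the slides.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_query, aligned_target) for the best local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score reaches zero.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            a.append(query[i - 1]); b.append(target[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            a.append(query[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(target[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


def find_source(query, library):
    """Score the query against every sequence in the library; return the best name and all scores."""
    scores = {name: smith_waterman(query, seq)[0] for name, seq in library.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy library: the "books" to search.
library = {"book_A": "TTTTACGTACGTTTTT", "book_B": "GGGGCCCCGGGGCCCC"}
best_name, all_scores = find_source("ACGTACGT", library)
print(best_name, all_scores)
```

The matrix is quadratic in sequence length, which is why exhaustive alignment does not scale to whole databases and heuristic tools such as BLAST exist.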
  15. Sequence Alignment Example 2 • Illumina HiSeq 2500: - 600,000,000,000

    bases sequenced in single run. - 6,000,000,000 x 100-bp (short) reads with errors. !20
  16. Primary Analysis: Get FASTQ File • Each read is a four-line record: sequence info, sequence, sequence info, quality in ASCII.

    Read 1:
    @HWI-EAS367_0010:3:1:2380:6567#0/1
    TAATTTCCATTCATCATGACAGCCCTCCAGAGGTTAGACAAC
    +HWI-EAS367_0010:3:1:2380:6567#0/1
    GGG?GEG@GB?<EBE>C8E8ECCEEGGDD>CC89AGD8BBA8
    Read 2:
    @HWI-EAS367_0010:3:1:2585:6567#0/1
    ACAGTATTCTGGGGAGGATTAAATTAGATAAACATGCAAGAA
    +HWI-EAS367_0010:3:1:2585:6567#0/1
    EGGGGGGGFECG+CDADDGD4BAB9D<G+G?/?;7E,8CAGG
    … × 6,000,000,000+ reads
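A minimal sketch of reading the four-line FASTQ layout, using the two records from the slide as input. The function name `read_fastq` is an assumption made for illustration; real pipelines would normally rely on an established parser, and this sketch assumes unwrapped records with no blank lines.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ file handle.

    Assumes each record is exactly four lines: @name, sequence, +[name], quality.
    """
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (a blank line also stops this simple sketch)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, optionally repeating the name
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


# The two reads shown on the slide.
EXAMPLE = """@HWI-EAS367_0010:3:1:2380:6567#0/1
TAATTTCCATTCATCATGACAGCCCTCCAGAGGTTAGACAAC
+HWI-EAS367_0010:3:1:2380:6567#0/1
GGG?GEG@GB?<EBE>C8E8ECCEEGGDD>CC89AGD8BBA8
@HWI-EAS367_0010:3:1:2585:6567#0/1
ACAGTATTCTGGGGAGGATTAAATTAGATAAACATGCAAGAA
+HWI-EAS367_0010:3:1:2585:6567#0/1
EGGGGGGGFECG+CDADDGD4BAB9D<G+G?/?;7E,8CAGG
"""

records = list(read_fastq(io.StringIO(EXAMPLE)))
for name, seq, qual in records:
    print(name, len(seq), "bases")
```

Reading by fixed groups of four lines, rather than looking for lines that start with `@`, matters because `@` is also a legal quality character.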
  17. Primary Analysis: Get FASTQ File • Q = -10 log10(p), where p = probability of

    error. • Probability of error = 10%, p = 0.1, Q = 10 • Probability of error = 1%, p = 0.01, Q = 20 • Probability of error = 0.1%, p = 0.001, Q = 30 • Probability of error = 0.01%, p = 0.0001, Q = 40 • Quality between 0-40 (usually) represented as ASCII 33-73.
    Example record:
    @HWI-EAS367_0010:3:1:2380:6567#0/1
    TAATTTCCATTCATCATGACAGCCCTCCAGAGGTTAGACAAC
    +HWI-EAS367_0010:3:1:2380:6567#0/1
    GGG?GEG@GB?<EBE>C8E8ECCEEGGDD>CC89AGD8BBA8
    Quality encodings by ASCII range:
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
    ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
    ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    |                         |    |        |                              |                     |
    33                        59   64       73                             104                   126
    S - Sanger: Phred+33, 41 values (0, 40)
    I - Illumina 1.3: Phred+64, 41 values (0, 40)
    X - Solexa: Solexa+64, 68 values (-5, 62)
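The Phred relationship is easy to check numerically. The sketch below converts between error probability and Q, then decodes a quality string. It assumes the Sanger Phred+33 offset, which fits the example reads because they contain characters below ASCII 59; the function names are illustrative, not from any library.

```python
import math


def error_prob_to_phred(p):
    """Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Inverse of the Phred formula: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual, offset=33):
    """Turn an ASCII quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


for p in (0.1, 0.01, 0.001, 0.0001):
    print(f"p={p:<7} Q={error_prob_to_phred(p):.0f}")

# First read's quality string from the slide, assuming Phred+33.
scores = decode_quality("GGG?GEG@GB?<EBE>C8E8ECCEEGGDD>CC89AGD8BBA8")
print(min(scores), max(scores))
```

Passing `offset=64` would decode Illumina 1.3-style strings instead; using the wrong offset silently shifts every score by 31, so it is worth confirming the encoding before filtering on quality.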
  18. Sequence Alignment Example 2 • Illumina HiSeq 2500: - 600,000,000,000

    bases sequenced in single run. - 6,000,000,000 x 100-bp (short) reads with errors. • BLAST: slow & computationally expensive at this scale. • BWA: Burrows-Wheeler Aligner (fast) • Bowtie: fast, memory-efficient (aligns 25,000,000 35-bp reads per hour per CPU). • Many others... MAQ, Eland, RMAP, SOAP, SHRiMP, BFAST, Mosaik, Novoalign, BLAT, GMAP, GSNAP, MOM, QPalma, SeqMap, VelociMapper, Stampy, mrFAST, etc.
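BWA takes its name from the Burrows-Wheeler transform (BWT), and Bowtie is built on the same idea. The sketch below shows only the transform and its inverse on a short string, using the naive sorted-rotations construction. It is not how BWA or Bowtie are implemented; those tools build a compressed index over the whole reference genome.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for short strings only)."""
    text += end  # unique end marker that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end="$"):
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed, inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes a compact, searchable index of a genome possible.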
  20. Genome Assembly • DNA (or RNA) is randomly fragmented and

    computationally put back together based on overlapping reads and coverage - Coverage = average number of reads that overlap each position in a sequence • Analogy: - Alignment: “Given a sentence and a library of books, find which book the sentence came from.” - Assembly: “Given short sentence fragments with typos, put the paragraphs/books back together.” • Consensus Sequence (Contig), built from overlapping reads:
    ----ACTGATT
    GCTAACT----
    --TAACTGATT
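As a small worked example, the three overlapping reads on this slide can be collapsed into a consensus, with coverage counted at each position. This is a minimal sketch that assumes the reads are already placed relative to each other (gaps written as `-`); a real assembler has to discover those overlaps itself. The function name is illustrative.

```python
from collections import Counter

# The three reads from the slide, already positioned against each other.
reads = ["----ACTGATT",
         "GCTAACT----",
         "--TAACTGATT"]


def consensus_and_coverage(aligned_reads):
    """Return (consensus string, per-position coverage list) for pre-aligned reads."""
    consensus, coverage = [], []
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != "-"]
        coverage.append(len(bases))
        # Majority vote per column; 'N' if no read covers the position.
        consensus.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(consensus), coverage


contig, depth = consensus_and_coverage(reads)
print(contig)
print(depth, "average coverage:", round(sum(depth) / len(depth), 2))
```

Majority voting per column is also what lets higher coverage cancel out random sequencing errors in individual reads.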
  21. Genome Assembly • Reads (with errors):

    isdom, it wds the age of foolisXness
    , it was the worVt of times, it was the
    It was the Gest of times, it was the mor
    mes, it was Ahe age of wisdon, it was th
    It w8s the best of times, it Gas the wor
    mes, it was the age of witdom, it was th
    isdom, it was tle age of foolishness
    , it was the wooorst of timZs, it ws the
    Assembled: It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
    Adapted from slideshare.net/c.titus.brown
  22. Genome Assembly • Alignment: “Given a sentence and a library

    of books, find which book the sentence came from.” • Reference-guided genome assembly: “Given millions of short sentence fragments from a book, and a full copy of the book they came from, put the book back together again.” • De novo genome assembly: “Given millions of short sentence fragments from a single book, without a copy of the book they came from, put the book back together again.” • Metagenome assembly: “Given millions of short sentence fragments taken randomly from thousands of different books, put the separate books back together again.” !28
  23. Genome Assembly !29 Single genome assembly is hard enough… Metagenomes

    are a different ball game! Adapted from slideshare.net/c.titus.brown Avg size bacterial genome: 4,000,000 base pairs One gram soil can contain roughly 1,000,000,000 cells from over 1 million species. ≈ 4,000,000,000,000,000 base pairs in gram of soil! Knight et al. 2012 Nat Biotech.
  25. Metagenomics • Given an environmental sample: - Who’s there? -

    What are they doing? • Sequencing reads (~100bp) can give you rough answers to these questions (family, genus, maybe species). • Need longer contiguous sequences (contigs) to get high-resolution answer - Phylogenetic analysis - Functional analysis: virulence genes, antibiotic susceptibility, … !31 Wooley, John C., Adam Godzik, and Iddo Friedberg. "A primer on metagenomics." PLoS computational biology 6.2 (2010): e1000667.
  27. Comparative Genomics Example • Go to genome.ucsc.edu • Search for

    POLR2A • Turn on some conservation tracks !33
  28. Phylogenetic Trees • Multiple sequence alignments can be used to

    generate phylogenetic trees. - Nucleotide or translated amino acids - Sequence similarity ≈ evolutionary distance !34 http://en.wikipedia.org/wiki/Multiple_sequence_alignment http://en.wikipedia.org/wiki/Bacillus_anthracis
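"Sequence similarity ≈ evolutionary distance" can be made concrete with the simplest distance measure, the p-distance: the fraction of aligned positions at which two sequences differ. The sketch below builds a pairwise distance matrix from a tiny multiple alignment. The sequences and names are hypothetical, and real tree-building usually applies a substitution model on top of raw distances.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences, ignoring gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name_a, name_b): p-distance} for every pair in the alignment."""
    names = list(alignment)
    return {(a, b): p_distance(alignment[a], alignment[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


# Hypothetical toy alignment: a query plus two known sequences.
alignment = {
    "query":   "ACGTACGTAC",
    "known_1": "ACGTACGTAT",
    "known_2": "TCGAACCTAT",
}
for pair, dist in distance_matrix(alignment).items():
    print(pair, round(dist, 2))
```

A matrix like this is the input to distance-based tree methods such as neighbor joining, which place the query next to its closest known relatives.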
  29. Phylogenetic Trees • Have query sequence (your sample of interest)

    • Have known sequences (publicly available database) • Compare query to known • Phylogenetic tree places unknown/query sequence in the context of known organisms. [Figure: phylogenetic tree from Cheung et al. 2011, BMC Res Notes; labels: 2011 German outbreak, 2001 strain, Enteroaggregative E. coli, Enterohemorrhagic E. coli]
  30. Case study: 2011 German E. coli outbreak • E. coli:

    normally commensal, can be pathogenic: - Enteroaggregative (EAEC): persistent diarrhea - Enterohemorrhagic (EHEC): produces Shiga toxin (Stx). • May-June, 2011, Germany: - 4,000 cases of bloody diarrhea - 850 cases of hemolytic-uremic syndrome (HUS) - 50 deaths • Serotype O104:H4 - Normally not associated with high rate of HUS. - But high proportion of patients developed HUS & other complications. - Strain indistinguishable from previous strains based on molecular evidence (serotyping, MLST, PFGE, optical mapping, etc.). • Three groups independently sequenced the outbreak strain.
  31. Case study: 2011 German E. coli outbreak • H. Rohde

    et al., Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4., NEJM 365, 718–24 (2011). • Used benchtop NGS, public data release, rapid crowd-sourcing of analysis to bioinformaticians worldwide: - Genome sequence obtained in 3 days, released with CC-0 license at https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki - 24 hours: genome assembled - 2 days: assigned to existing sequence type. - 5 days: design/release strain-specific primer sequences - 7 days: >24 reports filed on GitHub wiki above • Conclusions: Outbreak strain belonged to EAEC lineage that acquired Stx and antibiotic resistance through horizontal gene transfer.
  32. Case study: 2011 German E. coli outbreak • D. A.

    Rasko et al., Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany., NEJM 365, 709–17 (2011). • Used 3rd-generation NGS (PacBio) to sequence 4 outbreak strains, comparing with >40 other O104:H4 strains. • Conclusions: German outbreak distinguished from other O104:H4 strains because it contained prophage encoding Stx and distinct set of virulence & antibiotic resistance markers, likely acquired via horizontal gene transfer.
  33. Case study: 2011 German E. coli outbreak • A. Mellmann

    et al., Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology., PLoS ONE 6, e22751 (2011). • Used NGS (Ion Torrent) to sequence and complete reference-guided assembly in 62 hours. • Conclusions: - HUS-associated strains carried genes from both EAEC and EHEC. - Model: EAEC and EHEC O104:H4 evolved from common EHEC ancestor, and had stepwise gain/loss of virulence factors, leading to a highly pathogenic hybrid that emerged as the outbreak clone.
  34. Case study: 2011 German E. coli outbreak • Germany initially

    blamed the infection on cucumbers from Spain. • Epidemiological evidence linked O104:H4 to fenugreek sprouts imported from Egypt. • Spain lost ~$200M USD/week. Russia banned import of all fresh vegetables from EU. • Illustrates vulnerability of food products to malevolent tampering, and widespread economic consequences crossing international borders that can occur from even limited product contamination. !41
  36. Genetic Epidemiology • Linkage: finding genetic loci that segregate with

    the disease in families. • Association: finding alleles that co-occur with disease in populations. - Common disease - common variant hypothesis: - Common variants (e.g. >1-5% in the population) contribute to common, complex disease. - Common disease - rare variant hypothesis: - Polymorphisms that cause disease are under purifying selection, and will thus be rare. - Really, it's a mix of both
  37. Candidate Gene Study • Select candidate genes based on: -

    Known biology - Previous linkage/association evidence - Pathways - Evidence from model organisms • Genotype variants (SNPs) in those genes • Statistical association !44 Genotype at position rs12345: A/T Genotype at position rs12345: A/A Genotype at position rs12345: T/T
  38. Genome-Wide Association Study • Genotype >500,000 SNPs • Statistical test

    at each one • Manhattan plot of results • GWAS does not inform: - Which gene affected - How gene function perturbed - How biological function altered !45
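"A statistical test at each one" can take several forms. As one hedged illustration, the sketch below runs a 2x2 allelic chi-square test (1 degree of freedom) for a single SNP and computes a Bonferroni threshold for 500,000 tests. The allele counts are invented for the example, and real GWAS typically use regression models with covariates rather than this bare test.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Chi-square test on a 2x2 table of allele counts; returns (chi2, p_value)."""
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (row_case * row_ctrl * col_a * col_b)
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts of alleles A and T at one SNP in cases and controls.
chi2, p = allelic_chi_square(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60)

n_tests = 500_000
bonferroni_threshold = 0.05 / n_tests  # multiple-testing correction across all SNPs

print(f"chi2={chi2:.2f} p={p:.4g} threshold={bonferroni_threshold:.1e}")
print("genome-wide significant:", p < bonferroni_threshold)
```

A p-value that looks convincing for a single SNP (about 0.005 here) is nowhere near significant once it is one of half a million tests, which is why Manhattan plots are drawn on a -log10(p) scale.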
  40. Advantages of RNA-seq • No reference necessary • Low background

    (no cross-hybridization) • Unlimited dynamic range (FC 9000 Science 320:1344) • Direct counting (microarrays: indirect – hybridization) • Can characterize full transcriptome - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc) - Differential gene expression - Differential coding output - Differential TSS usage - Differential isoform expression !48
  41. RNA-seq challenges • Library construction - Size selection (messenger, small)

    - Strand specificity? • Bioinformatic challenges - Spliced alignment - Transcript deconvolution • Statistical Challenges - Highly variable abundance - Sample size: never, ever, plan n=1 - Normalization (RPKM) ‣ More reads from longer transcripts, higher sequencing depth ‣ Want to compare features of different lengths ‣ Want to compare conditions with different total sequence depth !52
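RPKM (reads per kilobase of transcript per million mapped reads) addresses the two effects listed above by dividing the read count by transcript length and by sequencing depth. A minimal sketch follows; all counts are made up for illustration.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical numbers: a transcript twice as long collects twice the reads,
# and a library sequenced twice as deep collects twice the reads,
# yet all three cases describe the same underlying expression level.
base = rpkm(read_count=1000, transcript_length_bp=2000, total_mapped_reads=20_000_000)
longer = rpkm(read_count=2000, transcript_length_bp=4000, total_mapped_reads=20_000_000)
deeper = rpkm(read_count=2000, transcript_length_bp=2000, total_mapped_reads=40_000_000)

print(base, longer, deeper)
```

The factor of 1e9 is the product of the per-kilobase (1e3) and per-million (1e6) scalings.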
  42. RNA-seq overview • Samples of interest: Condition 1 (normal colon), Condition 2 (colon

    tumor) [Figure: samples of interest → mRNA (AAAAA) → library (TTTTT) → FASTQ reads]
    @HWUSI-EAS100R:6:73:941:1973#0/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATA
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*((((***+))%%%++)(%%%%).1***-
    @HWUSI-EAS100R:6:73:941:1973#0/1
    CATCGACGTAGATCGACTACATGAACTGCTCG
    +HWUSI-EAS100R:6:73:941:1973#0/1
    !''*+(*+!+(*!+*(((***!%%%%!%%(+-
  43. RNA-seq common question #1: Depth • Question: how much sequence

    do I need? • Answer: it’s complicated. • Oversimplified answer: 20-50 million PE reads / sample (mouse/human). • Depends on: - Size & complexity of transcriptome - Application: differential gene expression, transcript discovery - Tissue type, RNA quality, library preparation - Sequencing type: length, paired-end vs single-end, etc. • Find a publication in your field with similar goals. • Good news: ¼ HiSeq lane usually sufficient. !54
  44. RNA-seq common question #2: sample size • Question: How many

    samples should I sequence? • Oversimplified Answer: At least 3 biological replicates per condition. • Depends on: - Sequencing depth - Application - Goals (prioritization, biomarker discovery, etc.) - Effect size, desired power, statistical significance • Find a publication with similar goals !55
  45. RNA-seq common question #3: Workflow • How do I analyze

    the data? • No standards! - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others! - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others. - Reference builds & annotations: UCSC, Entrez, Ensembl - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, Trans-ABySS - Quantification: Cufflinks, RSEM, eXpress, MISO, etc. - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna • Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in pipeline! • Benchmarks - Microarray: Spike-ins (Irizarry) - RNA-Seq: ???, simulation, ???
  46. RNA-seq common question #3: Workflow !57 Eyras et al. Methods

    to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993 Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782
  47. RNA-seq exercises • #1: Examining Gene Expression and Methylation with

    Next-Gen Sequencing • #2: “Galaxy CME Class” • http://stephenturner.us/slides
  48. RNA-seq: Further Reading • RNA-Seq: – Garber, M., Grabherr, M.

    G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77. – Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17. – Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8. – Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98. – Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998. – Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63. • Bowtie/Tophat: – Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25. – Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11. • Cufflinks: – Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA- Seq. Bioinformatics (Oxford, England), 27(17), 2325-9. – Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578. – Trapnell, C., Williams, B. a, Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. 
Nature biotechnology, 28(5), 511-5. • DEXSeq: – Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html. – Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.
  50. How are genes regulated? • Transcription factors (ChIP-seq) • Micro-RNAs

    (RNA-seq) • Chromatin accessibility (DNase-seq) • DNA Methylation (RRBS-seq, MeDIP-seq, etc.) • RNA processing • RNA transport • Translation • Post-translational modification
  51. DNA Methylation: Importance • Occurs most frequently at CpG sites

    • High methylation at promoters ≈ silencing • Methylation perturbed in cancer • Methylation associated with many other complex diseases: neural, autoimmune, response to env. • Mapping DNA methylation — new disease genes & drug targets. !64
  52. DNA Methylation: Challenges • Dynamic and tissue-specific • 5meC pattern

    is complex. • Uneven distribution of CpG targets • Multiple classes of methods: - Bisulfite, sequence-based: Assay methylated target sequences across individual CpGs. - Affinity enrichment, count-based: Assay methylation level across many genomic regions. !65
  53. DNA Methylation: Methods

    BS-Seq: Whole-genome bisulfite sequencing
    RRBS-Seq: Reduced representation bisulfite sequencing
    BC-Seq: Bisulfite capture sequencing
    BSPP: Bisulfite specific padlock probes
    Methyl-Seq: Restriction enzyme based methyl-seq
    MSCC: Methyl sensitive cut counting
    HELP-Seq: HpaII fragment enrichment by ligation PCR
    MCA-Seq: Methylated CpG island amplification
    MeDIP-Seq: Methylated DNA immunoprecipitation
    MBP-Seq: Methyl-binding protein sequencing
    MethylCap-seq: Methylated DNA capture by affinity purification
    MIRA-Seq: Methylated CpG island recovery assay
  54. Methylation: Bioinformatics Resources

    Resource | Purpose | URL
    Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
    BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
    BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
    CpG Analyzer | Windows-based program for bisulphite DNA | -
    CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
    CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
    CpG Island Explorer | Online program for CpG Island identification | http://bioinfo.hku.hk/cpgieintro.html
    CpG Island Searcher | Online program for CpG Island identification | http://cpgislands.usc.edu
    CpG PatternFinder | Windows-based program for bisulphite DNA | -
    CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
    CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
    CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
    CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
    EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
    Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
    Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
    MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
    methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
    MethDB | Database for DNA methylation data | http://www.methdb.de
    MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
    methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
    MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
    MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
    Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
    Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
    Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
    mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
    PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
    QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
    TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal
  55. Methylation: Further Reading • Bock, C., Tomazou, E. M., Brinkman,

    A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14. • Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6. • Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056. • Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6. • Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105. • Kerick, M., Fischer, A., & Schweiger, M.-R. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York. • Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203. • Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598
  57. One gene, one enzyme, one function?

    Jeong, H. et al. (2001). Nature 411:41–42.
    Ptacek, J. et al. (2005). Nature 438:679–684.
    Guimera and Amaral. (2005). Nature 433:895-900.
    Tong, A.H. et al. (2001). Science 294:2364-2368.
    Zhu, X. et al. (2007). Genes & Dev 21:1010-1024.
  58. Pathway Analysis

    • Data is cheap and diverse.
      - Genetic variation: GWAS, next-gen sequencing
      - Gene expression: Microarray, RNA-seq
      - Proteomics: Y2H, CoAP/MS
    • Cellular components interact in a network with other cellular components.
    • Disease is the result of an abnormality in that network.
    • Integrate multiple data types, understand network, understand disease.
  59. Pathway Analysis

    • You’ve done your microarray/RNA-seq experiment
      - You have a list of genes
      - You want to put these into functional context
      - What biological processes are perturbed?
      - What pathways are being dysregulated?
      - Data reduction: hundreds or thousands of genes can be reduced to tens of pathways
      - Identifying active pathways = more explanatory power
    • “Pathway analysis” encompasses many, many techniques:
      - 1st generation: Overrepresentation Analysis (e.g. GO ORA)
      - 2nd generation: Functional Class Scoring (e.g. GSEA)
      - 3rd generation: Pathway Topology (e.g. SPIA)
      - http://stephenturner.us/slides: “Pathway Analysis 2012”
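    The first-generation approach, overrepresentation analysis, can be sketched in a few lines. This is a minimal illustration that assumes the common hypergeometric (one-sided Fisher) formulation; the slides do not say which test a given tool uses. The function names and the toy gene identifiers are hypothetical, and a real analysis would also correct for testing many pathways.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of genes measured in the experiment
    n_pathway:  number of those genes annotated to the pathway
    n_selected: number of genes in the investigator's list
    n_overlap:  number of listed genes that fall in the pathway
    """
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def ora(gene_list, pathway_genes, universe):
    """Overrepresentation p-value for one pathway (hypothetical helper)."""
    universe = set(universe)
    selected = set(gene_list) & universe
    pathway = set(pathway_genes) & universe
    overlap = selected & pathway
    return ora_p_value(len(universe), len(pathway), len(selected), len(overlap))


# Toy example: 20 measured genes, a 5-gene pathway, a 4-gene hit list
# in which 3 hits belong to the pathway (all identifiers are made up).
universe = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
hits = ["g0", "g1", "g2", "g10"]
p = ora(hits, pathway, universe)
print(f"ORA p-value: {p:.4f}")
```

    Restricting both the hit list and the pathway to the measured universe matters: genes that could never have been detected should not count toward the background.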
  60. Pathway Analysis

    • Pathway analysis gives you more biological insight than staring at lists of genes.
    • Pathway analysis is complex, and has many limitations.
    • Pathway analysis is still more an exploratory procedure than a pure statistical endpoint.
    • The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.
  61. Bioinformatics Workshops & Online Training

    http://stephenturner.us/edu
    Regularly updated, comprehensive list of in-person and free online workshops in bioinformatics, programming, statistics, genetics, etc.
  62. Publicly Available Data: NCBI

    • GenBank: http://www.ncbi.nlm.nih.gov/genbank/
      – Collection of all publicly available DNA sequences.
      – Feb 2013: 150,141,354,858 bases from 162,886,727 sequences.
    • NCBI Genomes: http://www.ncbi.nlm.nih.gov/genome/
      – Public repository for sequenced genomes.
      – March 2013: 3,005 eukaryotes, 19,125 prokaryotes, 3,570 viruses.
    • NCBI Taxonomy: http://www.ncbi.nlm.nih.gov/taxonomy
      – Publicly available classification and nomenclature database for all organisms in the public sequence databases.
      – Phylogenetic lineages for >160,000 organisms (est. ~10% of life on the planet)
    • GEO: http://www.ncbi.nlm.nih.gov/geo/
      – Public repository of sequence- and array-based gene expression data, free for the taking.
      – 900,000+ samples, 3,200+ datasets.
    • dbGaP: http://www.ncbi.nlm.nih.gov/gap
      – Public repository for genetic studies.
      – 2,500+ datasets, 100,000+ variables.
    • SRA: http://www.ncbi.nlm.nih.gov/sra
      – Public repository for raw sequencing data from NGS platforms.
      – 3,500,000,000,000,000 bases sequenced.
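    These repositories can also be queried programmatically. The sketch below assumes NCBI's E-utilities `efetch` endpoint, which the slides do not mention, and only builds the request URL without contacting the server. The helper name `efetch_url` is hypothetical and the accession is just an example identifier.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption; not taken from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build an efetch URL for one record (hypothetical helper).

    db:        Entrez database name, e.g. "nuccore" for nucleotide records
    record_id: accession or UID of the record to retrieve
    rettype:   record format, e.g. "fasta" or "gb"
    retmode:   encoding of the response, e.g. "text"
    """
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Example accession chosen for illustration only.
url = efetch_url("nuccore", "NM_000546")
print(url)
```

    Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, would download the record; bulk use should follow NCBI's usage guidelines.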
  63. Publicly Available Data: Databases

    • 2014 Nucleic Acids Research Database Issue
      – http://nar.oxfordjournals.org/content/42/D1.toc
      – 188 articles describing new/updated molecular biology databases.
    • NAR Molecular Biology Database Collection
      – http://www.oxfordjournals.org/nar/database/a/
      – >1,500 molecular biology databases
      – Categories: DNA/RNA/Protein sequences, structures, metabolic/signaling pathways, genes & genomes, human diseases, microarray/other gene expression data, proteomics, organelles, plants, immunological, cell bio, …
  64. Publicly Available Data: Web Servers

    • 2013 NAR Web Server Issue
      – http://nar.oxfordjournals.org/content/41/W1.toc
      – 100 articles/webservers featured
    • Bioinformatics Links Directory
      – http://bioinformatics.ca/links_directory/
      – Includes all the NAR resources above.
      – 1,300 tools, 600 databases, 160 other resources
      – Topics: computer-related, DNA, education, expression, genomics, literature, model organisms, RNA, protein, other molecules, sequence comparison, …
  65. Online Community & Discussion Forum

    • SEQanswers
      - http://SEQanswers.com
      - Twitter: @SEQquestions
      - Format: Forum
      - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
    • BioStar
      - http://biostar.stackexchange.com
      - Twitter: @BioStarQuestion
      - Format: Q&A
      - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216.