Slide 1

Slide 1 text

Introduction to Bioinformatics. Stephen D. Turner, Ph.D., Bioinformatics Core Director. February 20, 2014. Slides available at stephenturner.us/slides

Slide 2

Slide 2 text

Outline • Bioinformatics origins & definition • Bioinformatics subdisciplines - Alignment - Assembly - Metagenomics - Comparative Genomics (case study) - Genetic Epidemiology - Gene Expression - Gene Regulation - Systems Biology • Resources for further learning !2

Slide 3

Slide 3 text

Bioinformatics origins + definition !3

Slide 4

Slide 4 text

The Central Dogma: DNA → RNA → Protein. Also shown: reverse transcription, RNA silencing, prions, post-translational modification.

Slide 5

Slide 5 text

• DNA provides assembly instructions for proteins • Protein folding determines molecular function • Networks of interacting proteins determine tissue/organ function

Slide 6

Slide 6 text

• DNA provides assembly instructions for proteins • Protein folding determines molecular function • Networks of interacting proteins determine tissue/organ function • Related analyses: DNA Variant Analysis, Gene Expression Analysis, Genome Annotation, Epigenetics, miRNA Analysis, Quantitative MS Proteomics, Pathway Analysis, Systems Biology, Biomarker identification, etc.

Slide 7

Slide 7 text

Bioinformatics Origins • Rooted in sequence analysis • Driven by the need to: - Collect - Annotate - Analyze !7

Slide 8

Slide 8 text

Margaret Dayhoff (1925-1983) • Collected all known protein sequences • Published in 1965 • Pioneered algorithm development for - Comparison of protein sequences - Derivation of evolutionary histories from alignments “In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.” !8

Slide 9

Slide 9 text

IBM 7090 !9

Slide 10

Slide 10 text

M. Dayhoff, February 27 1967 “There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.” !10

Slide 11

Slide 11 text

What is bioinformatics? Modified from @drewconway

Slide 12

Slide 12 text

What is bioinformatics? Timeline figure, 1960 to 2010. Labels: Dayhoff Atlas, Sanger Sequencing, GenBank, EBI-EMBL, Next-Gen Sequencing, ARPAnet, Internet invented, WWW invented.

Slide 13

Slide 13 text

!13 Illumina HiSeq X Ten

Slide 14

Slide 14 text

Definition • From Wikipedia: Bioinformatics is a branch of biological science which deals with the study of methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequence, structure, function, pathways and genetic interactions. It generates new knowledge that is useful in such fields as drug design and development of new software tools to create that knowledge. Bioinformatics also deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, structural biology, software engineering, data mining, image processing, modeling and simulation, discrete mathematics, control and system theory, circuit theory, and statistics. • Our definition: using computer science and statistics to answer biological questions. !14

Slide 15

Slide 15 text

Bioinformatics Subdisciplines !15

Slide 16

Slide 16 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !16

Slide 17

Slide 17 text

Sequence Alignment Example 1 • Analogy: Given a sentence and a library of books, find which book the sentence came from. • Grow [cells/tissue/culture], isolate DNA, sequence. • BLAST (Basic Local Alignment Search Tool) • blast.ncbi.nlm.nih.gov !17 ATGGTTTACTCCTATACCGAGAAAAAACGTATTCGTAAGGATTTTGGTAAACGTCCACAAGTTCTGGATGTACCTTATCTCCTTTCTATCCAGCTTGACTCGTTTCAGAAATTTATCGAGCAAGATCCTGAAGGGCAGTATGGTCTGGAA GCTGCTTTCCGTTCCGTATTCCCGATTCAGAGCTACAGCGGTAATTCCGAGCTGCAATACGTCAGCTACCGCCTTGGCGAACCGGTGTTTGACGTCCAGGAATGTCAAATCCGTGGCGTGACCTATTCCGCACCGCTGCGCGTTAAACTG CGTCTGGTGATCTATGAGCGCGAAGCGCCGGAAGGCACCGTAAAAGACATTAAAGAACAAGAAGTCTACATGGGCGAAATTCCGCTCATGACAGACAACGGTACCTTTGTTATCAACGGTACTGAGCGTGTTATCGTTTCCCAGCTGCAC CGTAGTCCGGGCGTCTTCTTTGACTCCGACAAAGGTAAAACCCACTCTTCGGGTAAAGTGCTGTATAACGCGCGTATCATCCCTTACCGTGGTTCCTGGCTGGACTTCGAATTCGATCCGAAGGACAACCTGTTCGTACGTATCGACCGT CGCCGTAAACTGCCTGCGACCATCATTCTGCGCGCCCTGAACTACACCACAGAGCAGATCCTCGACCTGTTCTTTGAAAAAGTTATCTTTGAAATCCGTGATAACAAGCTGCAGATGGAACTGGTGCCGGAACGCCTGCGTGGTGAAACC GCATCTTTTGACATCGAAGCTAACGGTAAAGTGTACGTAGAAAAAGGCCGCCGTATCACTGCGCGCCACATTCGCCAGCTGGAAAAAGACGACGTCAAACTGATCGAAGTCCCGGTTGAGTACATCGCAGGTAAAGTGGTTGCTAAAGAC TATATTGATGAGTCTACCGGCGAGCTGATCTGCGCAGCGAACATGGAGCTGAGCCTGGATCTGCTGGCTAAGCTGAGCCAGTCTGGTCACAAGCGTATCGAAACGCTGTTCACCAACGATCTGGATCACGGCCCATATATCTCTGAAACC TTACGTGTCGACCCAACTAACGACCGTCTGAGCGCACTGGTAGAAATCTACCGCATGATGCGCCCTGGCGAGCCGCCGACTCGTGAAGCAGCTGAAAGCCTGTTCGAGAACCTGTTCTTCTCCGAAGACCGTTATGACTTGTCTGCGGTT GGTCGTATGAAGTTCAACCGTTCTCTGCTGCGCGAAGAAATCGAAGGTTCCGGTATCCTGAGCAAAGACGACATCATTGATGTTATGAAAAAGCTCATCGATATCCGTAACGGTAAAGGCGAAGTCGATGATATCGACCACCTCGGCAAC CGTCGTATCCGTTCCGTTGGCGAAATGGCGGAAAACCAGTTCCGCGTTGGCCTGGTACGTGTAGAGCGTGCGGTGAAAGAGCGTCTGTCTCTGGGCGATCTGGATACCCTGATGCCTCAGGATATGATCAACGCCAAGCCGATTTCCGCA GCAGTGAAAGAGTTCTTCGGTTCCAGCCAGCTGTCTCAGTTTATGGACCAGAACAACCCGCTGTCTGAGATTACGCACAAACGTCGTATCTCCGCACTCGGCCCAGGCGGTCTGACCCGTGAACGTGCAGGCTTCGAAGTTCGAGACGTA 
CACCCGACTCACTACGGTCGCGTATGTCCAATCGAAACCCCTGAAGGTCCGAACATCGGTCTGATCAACTCTCTGTCCGTGTACGCACAGACTAACGAATACGGCTTCCTTGAGACTCCGTATCGTAAAGTGACCGACGGTGTTGTAACT GACGAAATTCACTACCTGTCTGCTATCGAAGAAGGCAACTACGTTATCGCCCAGGCGAACTCCAACCTGGATGAAGAAGGCCACTTCGTAGAAGACCTGGTAACTTGCCGTAGCAAAGGCGAATCCAGCTTGTTCAGCCGCGACCAGGTT GACTACATGGACGTATCCACCCAGCAGGTGGTATCCGTCGGTGCGTCCCTGATCCCGTTCCTGGAACACGATGACGCCAACCGTGCATTGATGGGTGCGAACATGCAACGTCAGGCCGTTCCGACTCTGCGCGCTGATAAGCCGCTGGTT GGTACTGGTATGGAACGTGCTGTTGCCGTTGACTCCGGTGTAACTGCGGTAGCTAAACGTGGTGGTGTCGTTCAGTACGTGGATGCTTCCCGTATCGTTATCAAAGTTAACGAAGACGAGATGTATCCGGGTGAAGCAGGTATCGACATC TACAACCTGACCAAATACACCCGTTCTAACCAGAACACCTGTATCAACCAGATGCCGTGTGTGTCTCTGGGTGAACCGGTTGAACGTGGCGACGTGCTGGCAGACGGTCCGTCCACCGACCTCGGTGAACTGGCGCTTGGTCAGAACATG CGCGTAGCGTTCATGCCGTGGAATGGTTACAACTTCGAAGACTCCATCCTCGTATCCGAGCGTGTTGTTCAGGAAGACCGTTTCACCACCATCCACATTCAGGAACTGGCGTGTGTGTCCCGTGACACCAAGCTGGGGCCGGAAGAGATC ACCGCTGACATCCCGAACGTGGGTGAAGCTGCGCTCTCCAAACTGGATGAATCCGGTATCGTTTACATTGGTGCGGAAGTGACCGGTGGCGACATTCTGGTTGGTAAGGTAACGCCGAAAGGTGAAACTCAGCTGACCCCAGAAGAAAAA CTGCTGCGTGCGATCTTCGGTGAGAAAGCCTCTGACGTTAAAGACTCTTCTCTGCGCGTACCAAACGGTGTATCCGGTACGGTTATCGACGTTCAGGTCTTTACTCGCGATGGCGTAGAAAAAGACAAACGTGCGCTGGAAATCGAAGAA ATGCAGCTCAAACAGGCGAAGAAAGACCTGTCTGAAGAACTGCAGATCCTCGAAGCTGGTCTGTTCAGCCGTATCCGTGCTGTGCTGGTAGCCGGTGGCGTTGAAGCTGAGAAGCTCGACAAACTGCCGCGCGATCGCTGGCTGGAGCTA GGCCTGACAGACGAAGAGAAACAAAATCAGCTGGAACAGCTGGCTGAGCAGTATGACGAACTGAAACACGAGTTCGAGAAGAAACTCGAAGCGAAACGCCGCAAAATCACCCAGGGCGACGATCTGGCACCGGGCGTGCTGAAGATTGTT AAGGTATATCTGGCGGTTAAACGCCGTATCCAGCCTGGTGACAAGATGGCAGGTCGTCACGGTAACAAGGGTGTAATTTCTAAGATCAACCCGATCGAAGATATGCCTTACGATGAAAACGGTACGCCGGTAGACATCGTACTGAACCCG CTGGGCGTACCGTCTCGTATGAACATCGGTCAGATCCTCGAAACCCACCTGGGTATGGCTGCGAAAGGTATCGGCGACAAGATCAACGCCATGCTGAAACAGCAGCAAGAAGTCGCGAAACTGCGCGAATTCATCCAGCGTGCGTACGAT CTGGGCGCTGACGTTCGTCAGAAAGTTGACCTGAGTACCTTCAGCGATGAAGAAGTTATGCGTCTGGCTGAAAACCTGCGCAAAGGTATGCCAATCGCAACGCCGGTGTTCGACGGTGCGAAAGAAGCAGAAATTAAAGAGCTGCTGAAA 
CTTGGCGACCTGCCGACTTCCGGTCAGATCCGCCTGTACGATGGTCGCACTGGTGAACAGTTCGAGCGTCCGGTAACCGTTGGTTACATGTACATGCTGAAACTGAACCACCTGGTCGACGACAAGATGCACGCGCGTTCCACCGGTTCT TACAGCCTGGTTACTCAGCAGCCGCTGGGTGGTAAGGCACAGTTCGGTGGTCAGCGTTTCGGGGAGATGGAAGTGTGGGCGCTGGAAGCATACGGCGCAGCATACACCCTGCAGGAAATGCTCACCGTTAAGTCTGATGACGTGAACGGT

Slide 18

Slide 18 text


Slide 19

Slide 19 text


Slide 20

Slide 20 text

Sequence Alignment Example 2 • Illumina HiSeq 2500: - 600,000,000,000 bases sequenced in a single run. - 6,000,000,000 x 100-bp (short) reads with errors.

Slide 21

Slide 21 text

!21 Cluster Generation / Bridge Amplification Sequencing by Synthesis http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

Slide 22

Slide 22 text

Primary Analysis: Get FASTQ File
@HWI-EAS367_0010:3:1:2380:6567#0/1
TAATTTCCATTCATCATGACAGCCCTCCAGAGGTTAGACAAC
+HWI-EAS367_0010:3:1:2380:6567#0/1
GGG?GEG@GB?C8E8ECCEEGGDD>CC89AGD8BBA8
@HWI-EAS367_0010:3:1:2585:6567#0/1
ACAGTATTCTGGGGAGGATTAAATTAGATAAACATGCAAGAA
+HWI-EAS367_0010:3:1:2585:6567#0/1
EGGGGGGGFECG+CDADDGD4BAB9D
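Each FASTQ record spans four lines: an @-prefixed read name, the bases, a +-prefixed separator, and one quality character per base. A minimal reading sketch in Python follows. The names `FastqRecord` and `read_fastq` are hypothetical helpers, not part of any library, and the toy records in the usage lines are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # read identifier without the leading "@"
    sequence: str   # base calls
    quality: str    # one encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy input (made-up reads, not from the slide).
toy = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+read2\n#I\n")
for record in read_fastq(toy):
    print(record.name, record.sequence, record.quality)
```

Real files are usually gzip-compressed and very large, so a streaming generator like this is preferable to loading everything into memory.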

Slide 23

Slide 23 text

Primary Analysis: Get FASTQ File • Q = -10 log10(p), where p = probability of error. • Probability of error = 10%, p = 0.1, Q = 10 • Probability of error = 1%, p = 0.01, Q = 20 • Probability of error = 0.1%, p = 0.001, Q = 30 • Probability of error = 0.01%, p = 0.0001, Q = 40 • Quality between 0-40 (usually) represented as ASCII 33-73.
@HWI-EAS367_0010:3:1:2380:6567#0/1
TAATTTCCATTCATCATGACAGCCCTCCAGAGGTTAGACAAC
+HWI-EAS367_0010:3:1:2380:6567#0/1
GGG?GEG@GB?C8E8ECCEEGGDD>CC89AGD8BBA8
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger Phred+33, 41 values (0, 40)
I - Illumina 1.3 Phred+64, 41 values (0, 40)
X - Solexa Solexa+64, 68 values (-5, 62)
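The Phred relationship Q = -10 log10(p) and the ASCII offsets in the legend (33 for Sanger, 64 for Illumina 1.3) can be checked with a few lines of Python. This is a sketch with hypothetical function names. It covers only the two Phred encodings, since the Solexa scale uses a different formula.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Inverse of the formula above: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(quality: str, offset: int = 33) -> list:
    """Convert quality characters to integer Phred scores.

    offset=33 is the Sanger encoding, offset=64 the Illumina 1.3 encoding.
    """
    return [ord(character) - offset for character in quality]


for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, round(phred_from_error_probability(p)))

# "I" is ASCII 73, so it decodes to Q=40 under Phred+33.
print(decode_quality_string("!5I"))
```

Choosing the wrong offset shifts every score by 31, which is why it matters to know which instrument and pipeline version produced a file.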

Slide 24

Slide 24 text

Sequence Alignment Example 2 • Illumina HiSeq 2500: - 600,000,000,000 bases sequenced in a single run. - 6,000,000,000 x 100-bp (short) reads with errors. • BLAST is slow & computationally expensive at this scale. • BWA: Burrows-Wheeler Aligner (fast) • Bowtie: fast, memory-efficient (aligns 25,000,000 35-bp reads per hour per CPU). • Many others... MAQ, Eland, RMAP, SOAP, SHRiMP, BFAST, Mosaik, Novoalign, BLAT, GMAP, GSNAP, MOM, QPalma, SeqMap, VelociMapper, Stampy, mrFAST, etc.
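A back-of-the-envelope calculation shows why aligner speed matters at this scale. The sketch below uses only the numbers on the slide. It assumes, optimistically, that the quoted Bowtie rate for 35-bp reads would also hold for 100-bp reads, so treat the result as a rough lower bound.

```python
def total_bases(n_reads: int, read_length_bp: int) -> int:
    """Total sequenced bases for a run of equal-length reads."""
    return n_reads * read_length_bp


def cpu_hours(n_reads: int, reads_per_cpu_hour: int) -> float:
    """CPU time to align all reads at a fixed per-CPU rate."""
    return n_reads / reads_per_cpu_hour


READS_PER_RUN = 6_000_000_000          # from the slide
READ_LENGTH_BP = 100                   # from the slide
BOWTIE_READS_PER_CPU_HOUR = 25_000_000  # from the slide (35-bp reads)

print(total_bases(READS_PER_RUN, READ_LENGTH_BP))          # bases per run
print(cpu_hours(READS_PER_RUN, BOWTIE_READS_PER_CPU_HOUR))  # CPU-hours per run
```

At that rate one run costs about 240 CPU-hours, or ten days on a single CPU, which is why alignment is normally spread across many cores.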

Slide 25

Slide 25 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !25

Slide 26

Slide 26 text

Genome Assembly • DNA (or RNA) is randomly fragmented and computationally put back together based on overlapping reads and coverage - Coverage = average number of reads overlapping each position in a sequence • Analogy: - Alignment: “Given a sentence and a library of books, find which book the sentence came from.” - Assembly: “Given short sentence fragments with typos, put the paragraphs/books back together.”
Consensus Sequence (Contig)
----ACTGATT
GCTAACT----
--TAACTGATT
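Coverage and consensus can be computed directly from the three aligned reads on this slide. The sketch below assumes the reads are already aligned and padded with "-" to equal length. Real assemblers must first find the overlaps, which is the hard part. The function names are illustrative.

```python
from collections import Counter


def column_coverage(aligned_reads, gap="-"):
    """Number of reads covering each alignment column."""
    return [sum(base != gap for base in column) for column in zip(*aligned_reads)]


def consensus(aligned_reads, gap="-"):
    """Majority base per column; 'N' where no read covers the column."""
    result = []
    for column in zip(*aligned_reads):
        bases = Counter(base for base in column if base != gap)
        result.append(bases.most_common(1)[0][0] if bases else "N")
    return "".join(result)


reads = [
    "----ACTGATT",
    "GCTAACT----",
    "--TAACTGATT",
]
coverage = column_coverage(reads)
print(consensus(reads))               # GCTAACTGATT
print(coverage)                       # [1, 1, 2, 2, 3, 3, 3, 2, 2, 2, 2]
print(sum(coverage) / len(coverage))  # average coverage, about 2.09
```

With real data the majority vote is what lets high coverage out-vote individual sequencing errors.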

Slide 27

Slide 27 text

Genome Assembly. Adapted from slideshare.net/c.titus.brown
Fragments with typos:
isdom, it wds the age of foolisXness
, it was the worVt of times, it was the
It was the Gest of times, it was the mor
mes, it was Ahe age of wisdon, it was th
It w8s the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tle age of foolishness
, it was the wooorst of timZs, it ws the
Reassembled text:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

Slide 28

Slide 28 text

Genome Assembly • Alignment: “Given a sentence and a library of books, find which book the sentence came from.” • Reference-guided genome assembly: “Given millions of short sentence fragments from a book, and a full copy of the book they came from, put the book back together again.” • De novo genome assembly: “Given millions of short sentence fragments from a single book, without a copy of the book they came from, put the book back together again.” • Metagenome assembly: “Given millions of short sentence fragments taken randomly from thousands of different books, put the separate books back together again.” !28

Slide 29

Slide 29 text

Genome Assembly. Single genome assembly is hard enough… Metagenomes are a different ball game! Adapted from slideshare.net/c.titus.brown • Average bacterial genome size: 4,000,000 base pairs • One gram of soil can contain roughly 1,000,000,000 cells from over 1 million species. • ≈ 4,000,000,000,000,000 base pairs in a gram of soil! Knight et al. 2012 Nat Biotech.
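The soil estimate is a single multiplication, and comparing it with the 600,000,000,000 bases per HiSeq 2500 run quoted earlier in the deck puts it in perspective. This is only an order-of-magnitude sketch using the slide's round numbers.

```python
AVG_BACTERIAL_GENOME_BP = 4_000_000       # from the slide
CELLS_PER_GRAM_SOIL = 1_000_000_000       # from the slide
BASES_PER_HISEQ_RUN = 600_000_000_000     # from the HiSeq 2500 slide

# Total DNA in one gram of soil, counting every cell's genome.
soil_bp = AVG_BACTERIAL_GENOME_BP * CELLS_PER_GRAM_SOIL

# Runs needed to sequence that many bases once (1x), ignoring
# uneven species abundance and sequencing errors.
runs_for_1x = soil_bp / BASES_PER_HISEQ_RUN

print(soil_bp)             # 4000000000000000
print(round(runs_for_1x))  # 6667
```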

Slide 30

Slide 30 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !30

Slide 31

Slide 31 text

Metagenomics • Given an environmental sample: - Who’s there? - What are they doing? • Sequencing reads (~100bp) can give you rough answers to these questions (family, genus, maybe species). • Need longer contiguous sequences (contigs) to get high-resolution answer - Phylogenetic analysis - Functional analysis: virulence genes, antibiotic susceptibility, … !31 Wooley, John C., Adam Godzik, and Iddo Friedberg. "A primer on metagenomics." PLoS computational biology 6.2 (2010): e1000667.

Slide 32

Slide 32 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !32

Slide 33

Slide 33 text

Comparative Genomics Example • Go to genome.ucsc.edu • Search for POLR2A • Turn on some conservation tracks !33

Slide 34

Slide 34 text

Phylogenetic Trees • Multiple sequence alignments can be used to generate phylogenetic trees. - Nucleotide or translated amino acids - Sequence similarity ≈ evolutionary distance !34 http://en.wikipedia.org/wiki/Multiple_sequence_alignment http://en.wikipedia.org/wiki/Bacillus_anthracis
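The simplest way to turn aligned sequences into a distance is the proportion of differing sites (often called the p-distance). The sketch below is an illustration only. Tree-building software typically applies substitution models that correct for multiple changes at the same site, and the function name and toy sequences here are made up.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of compared alignment columns where the two sequences differ.

    Columns with a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return differing / compared


# Toy aligned sequences (hypothetical).
print(p_distance("ACGTACGT", "ACGTACGA"))  # 0.125
print(p_distance("ACGTACGT", "AC-TACGT"))  # 0.0
```

A matrix of such pairwise distances is the input to distance-based tree methods such as neighbor joining.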

Slide 35

Slide 35 text

Sequence similarity Evolutionary distance !35

Slide 36

Slide 36 text

Phylogenetic Trees • Have query sequence (your sample of interest) • Have known sequences (publicly available database) • Compare query to known* • Phylogenetic tree places unknown/query sequence in the context of known organisms. !36 Cheung et al 2011 BMC Res Notes 2011 German outbreak 2001 strain Enteroaggregative E. coli Enterohemorrhagic E. coli

Slide 37

Slide 37 text

Case study: 2011 German E. coli outbreak • E. coli: normally commensal, can be pathogenic: - Enteroaggregative (EAEC): persistent diarrhea - Enterohemorrhagic (EHEC): produces Shiga toxin (Stx). • May-June, 2011, Germany: - 4,000 cases bloody diarrhea - 850 cases HUS - 50 deaths • Serotype O104:H4 - Normally not associated with high rate of HUS. - But high proportion of patients developed HUS & other complications. - Strain indistinguishable from previous strains based on molecular evidence (serotyping, MLST, PFGE, optical mapping, etc.). • Three groups independently sequenced the outbreak strain. !37

Slide 38

Slide 38 text

Case study: 2011 German E. coli outbreak • H. Rohde et al., Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4., NEJM 365, 718–24 (2011). • Used benchtop NGS, public data release, rapid crowd-sourcing of analysis to bioinformaticians worldwide: - Genome sequence obtained in 3 days, released with CC-0 license at https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki - 24 hours: genome assembled - 2 days: assigned to existing sequence type. - 5 days: design/release strain-specific primer sequences - 7 days: >24 reports filed on GitHub wiki above • Conclusions: Outbreak strain belonged to EAEC lineage that acquired Stx and antibiotic resistance through horizontal gene transfer.

Slide 39

Slide 39 text

Case study: 2011 German E. coli outbreak • D. A. Rasko et al., Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany., NEJM 365, 709–17 (2011). • Used 3rd-generation NGS (PacBio) to sequence 4 outbreak strains, comparing with >40 other O104:H4 strains. • Conclusions: German outbreak distinguished from other O104:H4 strains because it contained prophage encoding Stx and distinct set of virulence & antibiotic resistance markers, likely acquired via horizontal gene transfer.

Slide 40

Slide 40 text

Case study: 2011 German E. coli outbreak • A. Mellmann et al., Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology., PLoS ONE 6, e22751 (2011). • Used NGS (Ion Torrent) to sequence and complete reference-guided assembly in 62 hours. • Conclusions: - HUS-associated strains carried genes from both EAEC and EHEC. - Model: EAEC and EHEC O104:H4 evolved from common EHEC ancestor, and had stepwise gain/loss of virulence factors, leading to a highly pathogenic hybrid that emerged as the outbreak clone.

Slide 41

Slide 41 text

Case study: 2011 German E. coli outbreak • Germany initially blamed the infection on cucumbers from Spain. • Epidemiological evidence linked O104:H4 to fenugreek sprouts imported from Egypt. • Spain lost ~$200M USD/week. Russia banned import of all fresh vegetables from EU. • Illustrates vulnerability of food products to malevolent tampering, and widespread economic consequences crossing international borders that can occur from even limited product contamination. !41

Slide 42

Slide 42 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !42

Slide 43

Slide 43 text

Genetic Epidemiology • Linkage: finding genetic loci that segregate with the disease in families. • Association: finding alleles that co-occur with disease in populations. - Common disease - common variant hypothesis: - Common variants (e.g. allele frequency >1-5% in the population) contribute to common, complex disease. - Common disease - rare variant hypothesis: - Polymorphisms that cause disease are under purifying selection, and will thus be rare. - Really, it's a mix of both

Slide 44

Slide 44 text

Candidate Gene Study • Select candidate genes based on: - Known biology - Previous linkage/association evidence - Pathways - Evidence from model organisms • Genotype variants (SNPs) in those genes • Statistical association !44 Genotype at position rs12345: A/T Genotype at position rs12345: A/A Genotype at position rs12345: T/T
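The "statistical association" step can be illustrated with an allelic chi-square test on a 2x2 table of allele counts (cases vs. controls, allele A vs. allele T). The slide does not name a specific test, so this choice is an assumption, and the counts below are invented. For one degree of freedom the p-value can be computed with the standard library alone.

```python
import math
from collections import Counter


def count_alleles(genotypes):
    """Tally alleles from genotype strings such as 'A/T' (two alleles each)."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype.split("/"))
    return counts


def allelic_chi_square(case_a, case_t, control_a, control_t):
    """Pearson chi-square (1 df, no continuity correction) for a 2x2 table.

    Returns (chi2, p_value).
    """
    n = case_a + case_t + control_a + control_t
    denominator = (
        (case_a + case_t)
        * (control_a + control_t)
        * (case_a + control_a)
        * (case_t + control_t)
    )
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a * control_t - case_t * control_a) ** 2 / denominator
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(count_alleles(["A/T", "A/A", "T/T"]))
# Invented counts: allele A is enriched in cases.
print(allelic_chi_square(case_a=30, case_t=10, control_a=10, control_t=30))
```

In practice, genotype-based or regression models that adjust for covariates such as ancestry are common. The allelic test is shown here only because it is the easiest to follow.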

Slide 45

Slide 45 text

Genome-Wide Association Study • Genotype >500,000 SNPs • Statistical test at each one • Manhattan plot of results • GWAS does not inform: - Which gene affected - How gene function perturbed - How biological function altered !45
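Running a test at each of 500,000 SNPs creates a multiple-testing problem. One simple, conservative correction is Bonferroni: divide the desired significance level by the number of tests. Using Bonferroni and a level of 0.05 are assumptions made for this sketch, since the slide does not specify a correction. A Manhattan plot then shows -log10(p) for each SNP along the genome.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def manhattan_height(p_value: float) -> float:
    """Y-axis value of a SNP on a Manhattan plot."""
    return -math.log10(p_value)


threshold = bonferroni_threshold(0.05, 500_000)
print(threshold)                    # about 1e-07
print(manhattan_height(threshold))  # about 7: SNPs plotted above this line pass
```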

Slide 46

Slide 46 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !46

Slide 47

Slide 47 text

Gene Expression pre-2008: PCR, Microarrays

Slide 48

Slide 48 text

Advantages of RNA-seq • No reference necessary • Low background (no cross-hybridization) • Unlimited dynamic range (FC 9000 Science 320:1344) • Direct counting (microarrays: indirect – hybridization) • Can characterize full transcriptome - mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc) - Differential gene expression - Differential coding output - Differential TSS usage - Differential isoform expression !48

Slide 49

Slide 49 text

Isoform-level data !49

Slide 50

Slide 50 text

Isoform-level data !50

Slide 51

Slide 51 text

Differential Splicing & TSS Use !51

Slide 52

Slide 52 text

RNA-seq challenges • Library construction - Size selection (messenger, small) - Strand specificity? • Bioinformatic challenges - Spliced alignment - Transcript deconvolution • Statistical Challenges - Highly variable abundance - Sample size: never, ever, plan n=1 - Normalization (RPKM) ‣ More reads from longer transcripts, higher sequencing depth ‣ Want to compare features of different lengths ‣ Want to compare conditions with different total sequence depth !52
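RPKM (reads per kilobase of transcript per million mapped reads) addresses both normalization concerns listed above by dividing the read count by transcript length and by sequencing depth. A minimal sketch, with a hypothetical function name and invented counts:

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    Equivalent to read_count / (length in kb) / (mapped reads in millions).
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Invented example: 1,000 reads on a 2 kb transcript, 10 million mapped reads.
print(rpkm(1_000, 2_000, 10_000_000))   # 50.0

# A transcript twice as long with twice the reads has the same RPKM ...
print(rpkm(2_000, 4_000, 10_000_000))   # 50.0

# ... and so does the same transcript sequenced twice as deeply.
print(rpkm(2_000, 2_000, 20_000_000))   # 50.0
```

Count-based differential expression tools generally work from raw counts with their own normalization instead, so RPKM is best treated as a descriptive unit.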

Slide 53

Slide 53 text

RNA-seq overview !53 Condition 1 (normal colon) Condition 2 (colon tumor) Samples of interest AAAAA mRNA AAAAA mRNA TTTTT Library @HWUSI-EAS100R:6:73:941:1973#0/1 GATTTGGGGTTCAAAGCAGTATCGATCAAATA +HWUSI-EAS100R:6:73:941:1973#0/1 !''*((((***+))%%%++)(%%%%).1***- @HWUSI-EAS100R:6:73:941:1973#0/1 CATCGACGTAGATCGACTACATGAACTGCTCG +HWUSI-EAS100R:6:73:941:1973#0/1 !'’*+(*+!+(*!+*(((***!%%%%!%%(+-

Slide 54

Slide 54 text

RNA-seq common question #1: Depth • Question: how much sequence do I need? • Answer: it’s complicated. • Oversimplified answer: 20-50 million PE reads / sample (mouse/human). • Depends on: - Size & complexity of transcriptome - Application: differential gene expression, transcript discovery - Tissue type, RNA quality, library preparation - Sequencing type: length, paired-end vs single-end, etc. • Find a publication in your field with similar goals. • Good news: ¼ HiSeq lane usually sufficient. !54

Slide 55

Slide 55 text

RNA-seq common question #2: sample size • Question: How many samples should I sequence? • Oversimplified Answer: At least 3 biological replicates per condition. • Depends on: - Sequencing depth - Application - Goals (prioritization, biomarker discovery, etc.) - Effect size, desired power, statistical significance • Find a publication with similar goals !55

Slide 56

Slide 56 text

RNA-seq common question #3: Workflow • How do I analyze the data? • No standards! - Unspliced aligners: BWA, Bowtie, Bowtie2, MANY others! - Spliced aligners: STAR, RUM, TopHat, TopHat2-Bowtie1, TopHat2-Bowtie2, GSNAP, MANY others. - Reference builds & annotations: UCSC, Entrez, Ensembl - Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, Trans-ABySS - Quantification: Cufflinks, RSEM, eXpress, MISO, etc. - Differential expression: Cuffdiff, Cuffdiff2, DEGseq, DESeq, edgeR, Myrna • Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in pipeline! • Benchmarks - Microarray: Spike-ins (Irizarry) - RNA-Seq: ???, simulation, ???

Slide 57

Slide 57 text

RNA-seq common question #3: Workflow !57 Eyras et al. Methods to Study Splicing from RNA-Seq. http://dx.doi.org/10.6084/m9.figshare.679993 Turner SD. RNA-seq Workflows and Tools. http://dx.doi.org/10.6084/m9.figshare.662782

Slide 58

Slide 58 text

RNA-seq workflow #1: Differential Gene Expression !58

Slide 59

Slide 59 text

RNA-seq workflow #2: Differential Isoform Expression, Exon Usage !59

Slide 60

Slide 60 text

RNA-seq exercises • #1: Examining Gene Expression and Methylation with Next-Gen Sequencing • #2: “Galaxy CME Class” http://stephenturner.us/slides

Slide 61

Slide 61 text

RNA-seq: Further Reading • RNA-Seq: – Garber, M., Grabherr, M. G., Guttman, M., & Trapnell, C. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods, 8(6), 469-77. – Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 18(9), 1509-17. – Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621-8. – Ozsolak, F., & Milos, P. M. (2011). RNA sequencing: advances, challenges and opportunities. Nature reviews. Genetics, 12(2), 87-98. – Toung, J. M., Morley, M., Li, M., & Cheung, V. G. (2011). RNA-sequence analysis of human B-cells. Genome research, 991-998. – Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics, 10(1), 57-63. • Bowtie/Tophat: – Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), R25. – Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11. • Cufflinks: – Roberts, A., Pimentel, H., Trapnell, C., & Pachter, L. (2011). Identification of novel transcripts in annotated genomes using RNA- Seq. Bioinformatics (Oxford, England), 27(17), 2325-9. – Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-578. – Trapnell, C., Williams, B. a, Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., et al. (2010). 
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5), 511-5. • DEXSeq: – Vignette: http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html. – Pre-pub manuscript: Anders, S., Reyes, A., Huber, W. (2012). Detecting differential usage of exons from RNA-Seq data. Nature Precedings, DOI: 10.1038/npre.2012.6837.2.

Slide 62

Slide 62 text

Subdisciplines • Sequence alignment (DNA, RNA, Protein) • Genome assembly • Metagenomics • Genome annotation • Evolutionary biology / comparative genomics • Analysis of gene expression • Analysis of gene regulation • Genotype-phenotype association • Mutation analysis • Structural biology • Biomarker identification • Pathway analysis / "systems biology" • Literature analysis / text-mining !62

Slide 63

Slide 63 text

How are genes regulated? • Transcription factors (ChIP-seq) • Micro-RNAs (RNA-seq) • Chromatin accessibility (DNase-seq) • DNA Methylation (RRBS-seq, MeDIP-seq, etc.) • RNA processing • RNA transport • Translation • Post-translational modification

Slide 64

Slide 64 text

DNA Methylation: Importance • Occurs most frequently at CpG sites • High methylation at promoters ≈ silencing • Methylation perturbed in cancer • Methylation associated with many other complex diseases: neural, autoimmune, response to environment. • Mapping DNA methylation can point to new disease genes & drug targets.

Slide 65

Slide 65 text

DNA Methylation: Challenges • Dynamic and tissue-specific • 5meC pattern is complex. • Uneven distribution of CpG targets • Multiple classes of methods: - Bisulfite, sequence-based: Assay methylated target sequences across individual CpGs. - Affinity enrichment, count-based: Assay methylation level across many genomic regions. !65

Slide 66

Slide 66 text

DNA Methylation: Methods
BS-Seq: Whole-genome bisulfite sequencing
RRBS-Seq: Reduced representation bisulfite sequencing
BC-Seq: Bisulfite capture sequencing
BSPP: Bisulfite specific padlock probes
Methyl-Seq: Restriction enzyme based methyl-seq
MSCC: Methyl sensitive cut counting
HELP-Seq: HpaII fragment enrichment by ligation PCR
MCA-Seq: Methylated CpG island amplification
MeDIP-Seq: Methylated DNA immunoprecipitation
MBP-Seq: Methyl-binding protein sequencing
MethylCap-seq: Methylated DNA capture by affinity purification
MIRA-Seq: Methylated CpG island recovery assay

Slide 67

Slide 67 text

Methylation: Features & Biases !67

Slide 68

Slide 68 text

Methylation: Bioinformatics Resources
Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG Island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG Island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal

Slide 69

Slide 69 text

Methylation: Further Reading • Bock, C., Tomazou, E. M., Brinkman, A. B., Müller, F., Simmer, F., Gu, H., Jäger, N., et al. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10), 1106-14. • Brinkman, A. B., Simmer, F., Ma, K., Kaan, A., Zhu, J., & Stunnenberg, H. G. (2010). Whole-genome DNA methylation profiling using MethylCap-seq. Methods (San Diego, Calif.), 52(3), 232-6. • Brunner, A. L., Johnson, D. S., Kim, S. W., Valouev, A., Reddy, T. E., et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver, 1044-1056. • Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A., et al. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature methods, 7(2), 133-6. • Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson, B. E., et al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature biotechnology, 28(10), 1097-105. • Kerick, M., Fischer, A., & Schweiger, M.-ruth. (2012). Bioinformatics for High Throughput Sequencing. (N. Rodríguez-Ezpeleta, M. Hackenberg, & A. M. Aransay, Eds.), 151-167. New York, NY: Springer New York. • Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature reviews. Genetics, 11(3), 191-203. • Weber, M., Davies, J. J., Wittig, D., Oakeley, E. J., Haase, M., Lam, W. L., & Schübeler, D. (2005). Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nature genetics, 37(8), 853-62. doi:10.1038/ng1598 !69

Slide 70

Slide 70 text

Subdisciplines
• Sequence alignment (DNA, RNA, Protein)
• Genome assembly
• Metagenomics
• Genome annotation
• Evolutionary biology / comparative genomics
• Analysis of gene expression
• Analysis of gene regulation
• Genotype-phenotype association
• Mutation analysis
• Structural biology
• Biomarker identification
• Pathway analysis / “systems biology”
• Literature analysis / text-mining

Slide 71

Slide 71 text

One gene, one enzyme, one function?
Jeong, H. et al. (2001). Nature 411:41–42.
Ptacek, J. et al. (2005). Nature 438:679–684.
Guimera and Amaral. (2005). Nature 433:895–900.
Tong, A.H. et al. (2001). Science 294:2364–2368.
Zhu, X. et al. (2007). Genes & Dev 21:1010–1024.

Slide 72

Slide 72 text

Pathway Analysis
• Data is cheap and diverse.
  - Genetic variation: GWAS, next-gen sequencing
  - Gene expression: Microarray, RNA-seq
  - Proteomics: Y2H, CoAP/MS
• Cellular components interact in a network with other cellular components.
• Disease is the result of an abnormality in that network.
• Integrate multiple data types to understand the network, and through the network, the disease.
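
To make the "network of interacting components" idea concrete, here is a minimal sketch of how pairwise interaction data (for example, from a Y2H screen) might be stored as an undirected network and ranked by connectivity. The protein names P1 to P5 and the helper names `build_network` and `degree_ranking` are hypothetical, chosen only for illustration; real analyses would typically use a dedicated graph library and curated interaction data.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (a, b) interaction pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_ranking(network):
    """Return (node, degree) pairs, most connected first, ties broken by name."""
    pairs = ((node, len(neighbors)) for node, neighbors in network.items())
    return sorted(pairs, key=lambda item: (-item[1], item[0]))


# Hypothetical toy interaction list; not real data.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
toy_network = build_network(toy_edges)
ranking = degree_ranking(toy_network)

for node, degree in ranking:
    print(node, degree)
```

Highly connected nodes ("hubs") are one simple place to start when asking which components, if perturbed, would disturb the most of the network.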

Slide 73

Slide 73 text

Pathway Analysis
• You’ve done your microarray/RNA-Seq experiment
  - You have a list of genes
  - You want to put these into functional context
  - What biological processes are perturbed?
  - What pathways are being dysregulated?
  - Data reduction: hundreds or thousands of genes can be reduced to tens of pathways
  - Identifying active pathways = more explanatory power
• “Pathway analysis” encompasses many, many techniques:
  - 1st Generation: Overrepresentation Analysis (e.g. GO ORA)
  - 2nd Generation: Functional Class Scoring (e.g. GSEA)
  - 3rd Generation: Pathway Topology (e.g. SPIA)
  - http://stephenturner.us/slides: “Pathway Analysis 2012”
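
As an illustration of the first-generation approach, overrepresentation analysis is commonly framed as a one-sided hypergeometric test: given a gene universe, a pathway, and a list of selected genes, how surprising is the observed overlap? The sketch below assumes that framing; the function names `ora_pvalue` and `overrepresentation` and the gene IDs g1 to g20 are hypothetical, and no multiple-testing correction is applied, which a real analysis across many pathways would need.

```python
from math import comb


def ora_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap)."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def overrepresentation(universe, pathway, hits):
    """Return (overlap size, p-value) for one pathway against one gene list."""
    universe = set(universe)
    pathway = set(pathway) & universe  # ignore genes outside the universe
    hits = set(hits) & universe
    overlap = len(pathway & hits)
    p_value = ora_pvalue(len(universe), len(pathway), len(hits), overlap)
    return overlap, p_value


# Hypothetical toy data: 20 measured genes, a 5-gene pathway, 4 selected genes.
toy_universe = [f"g{i}" for i in range(1, 21)]
toy_pathway = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = ["g1", "g2", "g3", "g10"]

toy_overlap, toy_p = overrepresentation(toy_universe, toy_pathway, toy_hits)
print(toy_overlap, toy_p)
```

The choice of universe matters: restricting it to genes actually measured in the experiment, rather than every gene in the genome, changes the p-value.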

Slide 74

Slide 74 text

Pathway Analysis
• Pathway analysis gives you more biological insight than staring at lists of genes.
• Pathway analysis is complex and has many limitations.
• Pathway analysis is still more an exploratory procedure than a pure statistical endpoint.
• The best conclusions come from viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.

Slide 75

Slide 75 text

Resources

Slide 76

Slide 76 text

Bioinformatics Workshops & Online Training
http://stephenturner.us/edu
Regularly updated, comprehensive list of in-person and free online workshops in bioinformatics, programming, statistics, genetics, etc.

Slide 77

Slide 77 text

Publicly Available Data: NCBI
• GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  – Collection of all publicly available DNA sequences.
  – Feb 2013: 150,141,354,858 bases from 162,886,727 sequences.
• NCBI Genomes: http://www.ncbi.nlm.nih.gov/genome/
  – Public repository for sequenced genomes.
  – March 2013: 3,005 eukaryotes, 19,125 prokaryotes, 3,570 viruses.
• NCBI Taxonomy: http://www.ncbi.nlm.nih.gov/taxonomy
  – Publicly available classification and nomenclature database for all organisms in the public sequence databases.
  – Phylogenetic lineages for >160,000 organisms (an estimated ~10% of life on the planet).
• GEO: http://www.ncbi.nlm.nih.gov/geo/
  – Public repository of sequence- and array-based gene expression data, free for the taking.
  – 900,000+ samples, 3,200+ datasets.
• dbGaP: http://www.ncbi.nlm.nih.gov/gap
  – Public repository for genetic studies.
  – 2,500+ datasets, 100,000+ variables.
• SRA: http://www.ncbi.nlm.nih.gov/sra
  – Public repository for raw sequencing data from NGS platforms.
  – 3,500,000,000,000,000 bases sequenced.
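
Several of these repositories can also be searched programmatically through NCBI's E-utilities. The sketch below only builds ESearch query URLs and does not contact NCBI; the helper name `build_esearch_url` and the example search terms are illustrative assumptions, and anyone running real queries should check NCBI's current usage guidelines (rate limits, API keys) first.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database (no network access here)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Illustrative queries: "sra" is the SRA database, "gds" is GEO DataSets.
sra_url = build_esearch_url("sra", "Homo sapiens[Organism]")
geo_url = build_esearch_url("gds", "breast cancer AND methylation", retmax=5)

print(sra_url)
print(geo_url)
```

Fetching either URL (for example with `urllib.request.urlopen`) should return a JSON document listing matching record IDs, which can then be passed to other E-utilities calls.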

Slide 78

Slide 78 text

Publicly Available Data: Databases
• 2014 Nucleic Acids Research Database Issue
  – http://nar.oxfordjournals.org/content/42/D1.toc
  – 188 articles describing new/updated molecular biology databases.
• NAR Molecular Biology Database Collection
  – http://www.oxfordjournals.org/nar/database/a/
  – >1,500 molecular biology databases
  – Categories: DNA/RNA/Protein sequences, structures, metabolic/signaling pathways, genes & genomes, human diseases, microarray/other gene expression data, proteomics, organelles, plants, immunological, cell bio, …

Slide 79

Slide 79 text

Publicly Available Data: Web Servers
• 2013 NAR Web Server Issue
  – http://nar.oxfordjournals.org/content/41/W1.toc
  – 100 articles/web servers featured
• Bioinformatics Links Directory
  – http://bioinformatics.ca/links_directory/
  – Includes all the NAR resources above.
  – 1,300 tools, 600 databases, 160 other resources
  – Topics: computer-related, DNA, education, expression, genomics, literature, model organisms, RNA, protein, other molecules, sequence comparison, …

Slide 80

Slide 80 text

Online Community & Discussion Forum
• SEQanswers
  - http://SEQanswers.com
  - Twitter: @SEQquestions
  - Format: Forum
  - Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012).
• BioStar
  - http://biostar.stackexchange.com
  - Twitter: @BioStarQuestion
  - Format: Q&A
  - Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216.

Slide 81

Slide 81 text

Web: bioinformatics.virginia.edu
E-Mail: [email protected]
Blog: GettingGeneticsDone.com
Twitter: @genetics_blog
Facebook: facebook.com/UVABioinformaticsCore