Slide 1

Slide 1 text

Introduction to Bioinformatics Stephen Turner, Ph.D. Bioinformatics Core Director [email protected] Slides at bit.ly/phs7070-bioinfo

Slide 2

Slide 2 text

Contact Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog

Slide 3

Slide 3 text

Bioinformatics Origins: Rooted in sequence analysis. Driven by the need to: ● Collect ● Annotate ● Analyze

Slide 4

Slide 4 text

Margaret Dayhoff (1925-1983) ● Collected all known protein structures & sequences ● Published Atlas in 1965 ● Pioneered algorithm development for: ○ Comparing protein sequences ○ Deriving evolutionary history from alignments “In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.”

Slide 5

Slide 5 text

IBM 7090

Slide 6

Slide 6 text

“There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.” M. Dayhoff, February 27, 1967

Slide 7

Slide 7 text

modified from @drewconway

Slide 8

Slide 8 text

Timeline, 1960 to 2010. Bioinformatics: Dayhoff Atlas, Sanger sequencing, GenBank, EBI-EMBL, next-gen sequencing. Computing: ARPAnet, Internet invented, WWW invented.

Slide 10

Slide 10 text

Definition From Wikipedia: Bioinformatics is a branch of biological science which deals with the study of methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequence, structure, function, pathways and genetic interactions. It generates new knowledge that is useful in such fields as drug design and development of new software tools to create that knowledge. Bioinformatics also deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, structural biology, software engineering, data mining, image processing, modeling and simulation, discrete mathematics, control and system theory, circuit theory, and statistics. Our definition: using computer science and statistics to answer biological questions.

Slide 11

Slide 11 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 12

Slide 12 text

Central Dogma DNA RNA Protein Post-translational modification Prions Reverse transcription Methylation RNA Silencing

Slide 13

Slide 13 text

DNA provides assembly instructions for proteins Protein folding determines molecular function Networks of interacting proteins determine tissue/organ function

Slide 14

Slide 14 text

DNA provides assembly instructions for proteins Protein folding determines molecular function Networks of interacting proteins determine tissue/organ function DNA variant analysis Gene expression analysis Genome annotation Epigenetics Pathway analysis Systems biology Biomarker ID'n miRNA analysis Quantitative MS Proteomics

Slide 15

Slide 15 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 16

Slide 16 text

Outbreak: fever, characteristic skin lesions. Culture, isolate DNA, sequence (sanger): GTGAGTAATAATAATTCAAAACTGGAATTTGTACCTAATATACAGCTTAAAGAAGACTTAGGAGCTTTTAGCTATAAAGTCCAACTTTCT CCTGTAGAAAAAGGTATGGCTCATATCCTTGGTAACTCTATTAGAAGGGTTTTATTATCTTCACTATCAGGTGCATCTATAATTAAAGTA AACATCGCTAATGTACTACATGAGTATTCTACTTTAGAAGATGTAAAAGAAGATGTTGTTGAAATTGTTTCTAATTTGAAAAAGGTTGCG ATAAAGCTTGATACAGGTATAGATAGACTAGATTTAGAACTATCTGTAAATAAATCAGGTGTAGTTAGCGCTGGAGATTTTAAGACGACT CAAGGTGTAGAAATAATAAATAAAGATCAGCCAATAGCTACTTTGACAAACCAAAGAGCATTTAGCTTAACTGCTACAGTGAGTGTAGGT AGAAATGTCGGAATACTTTCTGCGATACCAACCGAGCTTGAGAGAGTTGGTGATATAGCTGTAGATGCTGATTTTAATCCTATTAAAAGA GTTGCTTTTGAGGTTTTTGATAATGGTGATAGTGAAACTTTAGAAGTATTTGTAAAGACAAATGGTACTATAGAACCACTAGCAGCTGTT ACGAAAGCTTTAGAGTATTTCTGTGAGCAAATATCAGTATTTGTATCTCTAAGAGTACCTAGTAATGGTAAAACAGGTGATGTATTAATA GATTCTAATATTGATCCTATCCTTCTTAAGCCGATTGATGATTTAGAGCTAACTGTCAGATCATCTAACTGTCTGCGTGCAGAAAACATT AAGTATCTTGGTGATTTGGTACAGTATTCTGAATCACAGCTTATGAAGATACCTAACTTAGGTAAGAAATCTCTCAATGAGATCAAACAA ATTTTAATAGATAATAACTTGTCTCTAGGTGTCCAAATTGACAATTTTAGAGAGCTAGTTGAAGGAAAATAA Sequence alignment, example 1

Slide 17

Slide 17 text

Sequence alignment, example 1 ● BLAST (Basic Local Alignment Search Tool) ● Go to blast.ncbi.nlm.nih.gov ● Click "Nucleotide BLAST" (blastn) ● Under "Choose Search Set", click the "Others" button, then search the entire nr/nt collection (you don't yet know which organism the sequence came from) ● Query: the sequence from the previous slide (GTGAGTAATAATAATTC...GGAAAATAA)
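
BLAST is a fast heuristic for local alignment. The sketch below is not BLAST itself: it is the exact dynamic-programming method (Smith-Waterman) that local-alignment heuristics approximate, shown only to make "local alignment" concrete. The scoring values (match, mismatch, gap) and the subject sequence are illustrative assumptions, not BLAST defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    Scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# First 17 bases of the outbreak sequence from the slide.
query = "GTGAGTAATAATAATTC"
# Hypothetical database sequence, made up for this example.
subject = "CCCGTGAGTAATAATAAGG"
print(smith_waterman_score(query, subject))
```

This runs in time proportional to the product of the two sequence lengths, which is why database-scale searches rely on heuristics such as BLAST instead.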

Slide 20

Slide 20 text

Sequence alignment, example 2 ● Illumina HiSeq 2500: ○ 600,000,000,000 bases sequenced in a single run. ○ 6,000,000,000 x 100-bp (short) reads ● BLAST is far too slow at this scale. ● BWA: Burrows-Wheeler Aligner (fast) ● Bowtie: fast, memory-efficient (aligns 25,000,000 35-bp reads per hour per CPU). ● Many others: MAQ, Eland, RMAP, SOAP, SHRiMP, BFAST, Mosaik, Novoalign, BLAT, GMAP, GSNAP, MOM, QPalma, SeqMap, VelociMapper, Stampy, mrFAST, etc.
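
The numbers on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the 16-core machine is a hypothetical assumption, and because the Bowtie rate is quoted for 35-bp reads, applying it to 100-bp reads gives an optimistic lower bound rather than a benchmark.

```python
# Figures taken from the slide.
BASES_PER_RUN = 600_000_000_000
READ_LENGTH_BP = 100
BOWTIE_READS_PER_CPU_HOUR = 25_000_000  # quoted for 35-bp reads


def reads_per_run(bases, read_length):
    """Number of reads produced by one run."""
    return bases // read_length


def cpu_hours(n_reads, reads_per_cpu_hour):
    """CPU-hours needed to align n_reads at the given rate."""
    return n_reads / reads_per_cpu_hour


n_reads = reads_per_run(BASES_PER_RUN, READ_LENGTH_BP)      # 6,000,000,000 reads
hours = cpu_hours(n_reads, BOWTIE_READS_PER_CPU_HOUR)       # 240 CPU-hours
cores = 16                                                  # hypothetical machine
print(f"{n_reads:,} reads, {hours:.0f} CPU-hours, {hours / cores:.0f} h on {cores} cores")
```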

Slide 21

Slide 21 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 22

Slide 22 text

Comparative Genomics example ● Go to genome.ucsc.edu ● Search for POLR2A ● Turn on some conservation tracks

Slide 23

Slide 23 text

Sequence similarity Evolutionary distance

Slide 24

Slide 24 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 25

Slide 25 text

Genetic Epidemiology Epidemiology: the study of the patterns, causes, and effects of health and disease conditions in defined populations. Genetic epidemiology: the study of genetic factors in determining health and disease in families and populations.

Slide 26

Slide 26 text

DNA provides assembly instructions for proteins Protein folding determines molecular function Networks of interacting proteins determine tissue/organ function

Slide 27

Slide 27 text

Genetic epidemiology ● Linkage: finding genetic loci that segregate with the disease in families. ● Association: finding alleles that co-occur with disease in populations. ○ Common disease - common variant hypothesis: ■ Common variants (e.g. >1-5% frequency in the population) contribute to common, complex disease. ○ Common disease - rare variant hypothesis: ■ Polymorphisms that cause disease are under purifying selection, and will thus be rare. ○ Really, it's a mix of both.

Slide 28

Slide 28 text

Candidate gene study ● Select candidate genes based on: ○ Known biology ○ Previous linkage/association evidence ○ Pathways ○ Evidence from model organisms ● Genotype variants (SNPs) in those genes ● Statistical association Genotype at position rs12345: A/T Genotype at position rs12345: A/A Genotype at position rs12345: T/T

Slide 29

Slide 29 text

Genome-wide association study ● Genotype >500,000 SNPs ● Statistical test at each one ● Manhattan plot of results ● GWAS does not inform: ○ Which gene affected ○ How gene function perturbed ○ How biological function altered
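
As a rough illustration of "statistical test at each one", the sketch below runs a Pearson chi-square test on a 2x2 table of allele counts (cases vs. controls) for a single SNP, then computes a Bonferroni threshold for 500,000 tests. The allelic test and the Bonferroni correction are common choices but are assumptions here; the slides do not say which test is used, and the counts are made up.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up allele counts for one SNP: cases carry allele A more often.
chi2, p = allelic_chi_square(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60)
print(chi2, p)  # chi2 = 8.0, p is about 0.0047

# Bonferroni threshold for 500,000 independent tests at alpha = 0.05.
threshold = 0.05 / 500_000
print(p < threshold)  # False: nominally significant, but not genome-wide
```

The example shows why a Manhattan plot needs a stringent significance line: a p-value that looks convincing for one SNP does not survive correction for half a million tests.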

Slide 30

Slide 30 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 31

Slide 31 text

Gene expression pre-2008 PCR Microarrays

Slide 32

Slide 32 text

RNA sequencing (RNA-seq) ● Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor) ● Isolate RNAs ● Generate cDNA, fragment, size select, add linkers ● Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence ● Align to genome ● Downstream analysis. Image: www.bioinformatics.ca

Slide 33

Slide 33 text

RNA-seq advantages ● No reference necessary ● Low background (no cross-hybridization) ● Unlimited dynamic range (FC 9000 Science 320:1344) ● Direct counting (microarrays: indirect – hybridization) ● Can characterize full transcriptome ○ mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc) ○ Differential gene expression ○ Differential coding output ○ Differential TSS usage ○ Differential isoform expression

Slide 34

Slide 34 text

Isoform level data

Slide 35

Slide 35 text

Isoform level data

Slide 36

Slide 36 text

Differential splicing & TSS use

Slide 37

Slide 37 text

RNA-seq challenges ● Library construction ○ Size selection (messenger, small) ○ Strand specificity? ● Bioinformatic challenges ○ Spliced alignment ○ Transcript deconvolution ● Statistical Challenges ○ Highly variable abundance ○ Sample size: never, ever, plan n=1 ● Normalization (RPKM) ○ Compare features of different lengths ○ Compare conditions with different sequence depth
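
The RPKM normalization named above (reads per kilobase of transcript per million mapped reads) can be written in one line. The sketch below is a minimal illustration with made-up counts, showing how RPKM corrects for both feature length and sequencing depth.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Made-up numbers for illustration.
short_gene = rpkm(read_count=500, feature_length_bp=2_000, total_mapped_reads=10_000_000)
# A gene twice as long collects twice the reads at the same expression level.
long_gene = rpkm(read_count=1_000, feature_length_bp=4_000, total_mapped_reads=10_000_000)
# The same gene in a library sequenced twice as deeply.
deep_library = rpkm(read_count=1_000, feature_length_bp=2_000, total_mapped_reads=20_000_000)

print(short_gene, long_gene, deep_library)  # all three are 25.0
```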

Slide 38

Slide 38 text

Common question #1: Depth ● Question: how much sequence do I need? ● Answer: it’s complicated. ● Depends on: ○ Size & complexity of transcriptome ○ Application: differential gene expression, transcript discovery, aberrant splicing, etc. ○ Tissue type, RNA quality, library preparation ○ Sequencing type: length, single-/paired-end, etc. ● Find publication in your field w/ similar goals. ● Good news: 1 GA or ½ HiSeq lane is sufficient for most applications

Slide 39

Slide 39 text

Common question #2: Sample Size ● Question: How many samples should I sequence? ● Oversimplified Answer: At least 3 biological replicates per condition. ● Depends on: ○ Sequencing depth ○ Application ○ Goals (prioritization, biomarker discovery, etc.) ○ Effect size, desired power, statistical significance ● Find a publication with similar goals

Slide 40

Slide 40 text

Common question #3: Workflow ● How do I analyze the data? ● No standards! ○ Unspliced aligners: BWA, Bowtie, Stampy, SHRiMP ○ Spliced aligners: Tophat, MapSplice, SpliceMap, GSNAP, QPALMA ○ Reference builds & annotations: UCSC, Entrez, Ensembl ○ Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, TransABySS ○ Quantification: Cufflinks, RSEM, MISO, ERANGE, NEUMA, Alexa-Seq ○ Differential expression: Cuffdiff, DegSeq, DESeq, EdgeR, Myrna ● Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in pipeline! ● Benchmarks ● Microarray: Spike-ins (Irizarry) ● RNA-Seq: ???, simulation, ???

Slide 41

Slide 41 text

Phases of NGS analysis ● Primary ○ Conversion of raw machine signal into sequence and qualities ● Secondary ○ Alignment of reads to reference genome or transcriptome ○ De novo assembly of reads into contigs ● Tertiary ○ SNP discovery/genotyping ○ Peak discovery/quantification (ChIP, MeDIP) ○ Transcript assembly/quantification (RNA-seq) ● Quaternary ○ Differential expression ○ Enrichment, pathways, correlation, clustering, visualization, etc.

Slide 42

Slide 42 text

Extra credit (not really): RNA-seq http://bit.ly/galaxy-rnaseq ● #1: Learn to use Galaxy: bit.ly/uva-galaxy ● #2: Run through an RNA-seq exercise in 1 hour: ○ Read some background material on RNA-seq ○ Read the TopHat/Cufflinks method paper ○ Get some data (Illumina BodyMap) ○ QC / trim your reads ○ Map to hg19 with TopHat ○ Visualize where reads map ○ Assemble with Cufflinks ○ Differential expression with Cuffdiff

Slide 43

Slide 43 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 44

Slide 44 text

How are genes regulated? ● Transcription factors (ChIP-seq) ● Micro-RNAs (RNA-seq) ● Chromatin accessibility (DNase-seq) ● DNA methylation (RRBS-seq, MeDIP-seq) ● RNA processing ● RNA transport ● Translation ● Post-translational modification

Slide 45

Slide 45 text

Importance of DNA methylation ● Occurs most frequently at CpG sites ● High methylation at promoters ≈ silencing ● Methylation perturbed in cancer ● Methylation associated with many other complex diseases: neural, autoimmune, response to env. ● Mapping DNA methylation → new disease genes & drug targets.
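
Because methylation occurs mostly at CpG sites, a common first step is to locate CpG-rich regions in the sequence itself. The sketch below counts CpG dinucleotides and computes GC content and the observed:expected CpG ratio, using the widely used formula (number of CpG x length) / (number of C x number of G). It describes sequence composition only and says nothing about whether a given site is actually methylated; the function name is hypothetical.

```python
def cpg_stats(seq):
    """Return (cpg_count, gc_content, observed_expected_ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: c * g / n.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return cpg, gc_content, obs_exp


print(cpg_stats("CGCGCG"))  # (3, 1.0, 2.0)
print(cpg_stats("ACGT"))    # (1, 0.5, 4.0)
```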

Slide 46

Slide 46 text

DNA Methylation Challenges ● Dynamic and tissue-specific ● DNA → Collection of cells which vary in 5meC patterns → 5meC pattern is complex. ● Further, uneven distribution of CpG targets ● Multiple classes of methods: ○ Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs. ○ Affinity enrichment, count-based: Assay methylation level across many genomic loci. ● Many methods ● Many algorithms

Slide 47

Slide 47 text

Many methylation methods
DNA methylation:
BS-Seq: Whole-genome bisulfite sequencing
RRBS-Seq: Reduced representation bisulfite sequencing
BC-Seq: Bisulfite capture sequencing
BSPP: Bisulfite specific padlock probes
Methyl-Seq: Restriction enzyme based methyl-seq
MSCC: Methyl sensitive cut counting
HELP-Seq: HpaII fragment enrichment by ligation PCR
MCA-Seq: Methylated CpG island amplification
MeDIP-Seq: Methylated DNA immunoprecipitation
MBP-Seq: Methyl-binding protein sequencing
MethylCap-seq: Methylated DNA capture by affinity purification
MIRA-Seq: Methylated CpG island recovery assay
Gene expression:
RNA-Seq: High-throughput cDNA sequencing

Slide 48

Slide 48 text

Methylation methods: Features & biases

Slide 49

Slide 49 text

Methylation: Bioinformatics Resources
Resource | Purpose | URL
Batman | MeDIP DNA methylation analysis tool | http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC | DNA methylation analysis platform | http://biochem.jacobs-university.de/BDPC
BSMAP | Whole-genome bisulphite sequence mapping | http://code.google.com/p/bsmap
CpG Analyzer | Windows-based program for bisulphite DNA | -
CpGcluster | CpG island identification | http://bioinfo2.ugr.es/CpGcluster
CpGFinder | Online program for CpG island identification | http://linux1.softberry.com
CpG Island Explorer | Online program for CpG island identification | http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher | Online program for CpG island identification | http://cpgislands.usc.edu
CpG PatternFinder | Windows-based program for bisulphite DNA | -
CpG Promoter | Large-scale promoter mapping using CpG islands | http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter | Online program for plotting the observed:expected ratio of CpG | http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer | Bisulphite DNA sequencing viewer | http://dna.leeds.ac.uk/cpgviewer
CyMATE | Bisulphite-based analysis of plant genomic DNA | http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/CpGReport | Online program for plotting CpG-rich regions | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap | NIH Epigenomics Roadmap Initiative homepage | http://nihroadmap.nih.gov/epigenomics
Epinexus | DNA methylation analysis tools | http://epinexus.net/home.html
MEDME | Software package (using R) for modelling MeDIP experimental data | http://espresso.med.yale.edu/medme
methBLAST | Similarity search program for bisulphite-modified DNA | http://medgen.ugent.be/methBLAST
MethDB | Database for DNA methylation data | http://www.methdb.de
MethPrimer | Primer design for bisulphite PCR | http://www.urogene.org/methprimer
methPrimerDB | PCR primers for DNA methylation analysis | http://medgen.ugent.be/methprimerdb
MethTools | Bisulphite sequence data analysis tool | http://www.methdb.de
MethyCancer Database | Database of cancer DNA methylation data | http://methycancer.psych.ac.cn
Methyl Primer Express | Primer design for bisulphite PCR | http://www.appliedbiosystems.com/
Methylumi | Bioconductor pkg for DNA methylation data from Illumina | http://www.bioconductor.org/packages/bioc/html/
Methylyzer | Bisulphite DNA sequence visualization tool | http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod | DNA methylation viewer integrated w/ Ensembl genome browser | http://www.compbio.group.cam.ac.uk/Projects/
PubMeth | Database of DNA methylation literature | http://www.pubmeth.org
QUMA | Quantification tool for methylation analysis | http://quma.cdb.riken.jp
TCGA Data Portal | Database of TCGA DNA methylation data | http://cancergenome.nih.gov/dataportal

Slide 50

Slide 50 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 51

Slide 51 text

One gene, one enzyme, one function? Jeong, H. et al. (2001). Nature 411:41–42. Ptacek, J. et al. (2005). Nature 438:679–684. Guimera and Amaral (2005). Nature 433:895–900. Tong, A.H. et al. (2001). Science 294:2364–2368. Zhu, X. et al. (2007). Genes & Dev 21:1010–1024.

Slide 52

Slide 52 text

Distribution of disease genes Diseases connected if same gene implicated in both. Genes connected if implicated in the same disorder. Goh et al. (2007). PNAS 104:8685.
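
The two network definitions on this slide can be sketched directly from a table of disease-gene associations. The code below is a minimal illustration, not the method of Goh et al.; the disease and gene names are made-up placeholders.

```python
from collections import defaultdict
from itertools import combinations


def disease_network(disease_genes):
    """Connect two diseases if at least one gene is implicated in both.

    Returns {(disease1, disease2): set of shared genes}.
    """
    edges = {}
    for d1, d2 in combinations(sorted(disease_genes), 2):
        shared = disease_genes[d1] & disease_genes[d2]
        if shared:
            edges[(d1, d2)] = shared
    return edges


def gene_network(disease_genes):
    """Connect two genes if they are implicated in the same disorder.

    Returns {(gene1, gene2): set of diseases linking them}.
    """
    edges = defaultdict(set)
    for disease, genes in disease_genes.items():
        for g1, g2 in combinations(sorted(genes), 2):
            edges[(g1, g2)].add(disease)
    return dict(edges)


# Hypothetical associations, for illustration only.
associations = {
    "disease_A": {"GENE1", "GENE2"},
    "disease_B": {"GENE2", "GENE3"},
    "disease_C": {"GENE4"},
}
print(disease_network(associations))
print(gene_network(associations))
```

In this toy input, disease_A and disease_B are linked through GENE2, while disease_C stays isolated because it shares no gene with the others.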

Slide 53

Slide 53 text

Distribution of disease genes Genes connected if implicated in the same disorder. Goh et al. (2007). PNAS 104:8685. Overlay with PPI data: genes contributing to a common disease interact through protein-protein interactions.

Slide 54

Slide 54 text

Distribution of disease genes Seebacher and Gavin (2011). Cell 144:1000-1001. k = degree = # interaction partners ● “Essential” genes ○ Encode hubs ○ Are expressed globally ● “Non-essential” disease genes ○ Do not encode hubs ○ Tissue-specific expression

Slide 55

Slide 55 text

Distribution of disease genes ● Disease genes lie at the functional periphery of cellular networks (Goh PNAS 2007). ● Genes contributing to a common disease interact through protein-protein interactions (Goh PNAS 2007). ● Diseasome analysis: a patient is 2x as likely to develop another disease if that disease shares a gene with the patient’s primary disease (Park et al. 2009. The Impact of Cellular Networks on Disease Comorbidity. Mol Syst Biol 5:262). ● miRNA analysis: connecting diseases whose associated genes are regulated by a common miRNA yields disease-class segregation, e.g. cancers share similar associations at the miRNA level (Lu et al. 2008. An analysis of human microRNA and disease associations. PLoS ONE 3:e3420). Nonrandom placement of disease genes in the interactome!

Slide 56

Slide 56 text

Distribution of disease genes Vidal et al, Cell 2011.

Slide 57

Slide 57 text

Distribution of disease genes ● Data is cheap and diverse. ○ Genetic variation: GWAS, next-gen sequencing ○ Gene expression: Microarray, RNA-seq ○ Proteomics: Y2H, CoAP/MS ● Cellular components interact in a network with other cellular components. ● Disease is the result of an abnormality in that network. ● Integrate multiple data types, understand network, understand disease.

Slide 58

Slide 58 text

Pathway Analysis ● You’ve done your microarray/RNA-Seq experiment ○ You have a list of genes ○ Want to put these into functional context ○ What biological processes are perturbed? ○ What pathways are being dysregulated? ○ Data reduction: hundreds or thousands of genes can be reduced to 10s of pathways ○ Identifying active pathways = more explanatory power ● “Pathway analysis” encompasses many, many techniques: ○ 1st Generation: Overrepresentation Analysis (E.g. GO ORA) ○ 2nd Generation: Functional Class Scoring (e.g. GSEA) ○ 3rd Generation (in development): Pathway Topology (E.g. SPIA) ● http://gettinggeneticsdone.com/2012/03/pathway-analysis-for-high-throughput.html

Slide 59

Slide 59 text

Pathway Analysis: Over-representation analysis ● Many variations on the same theme: statistically evaluate the fraction of genes in a particular pathway that show changes in expression. ● Algorithm: ○ Create input list (e.g. “significant at p<0.05”) ○ For each gene set: ■ Count number of input genes ■ Count number of “background” genes (e.g. all genes on platform). ○ Test each pathway for over-representation of input genes ● Gene set: typically a gene ontology (GO) term.

Slide 60

Slide 60 text

Pathway analysis: Over-representation analysis ● Ontology = formal representation of a knowledge domain. ● Gene ontology = cell biology. ● GO represented by directed acyclic graph (DAG). ○ Terms are nodes, relationships are edges. ○ Parent terms are more general than their child terms. ○ Unlike a simple tree, terms can have multiple parents. Reference: Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15.
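
The DAG structure described above can be sketched as a child-to-parents mapping. The term names below are hypothetical placeholders, not real GO identifiers; the point is that a term with two parents reaches the root by more than one path, so collecting its ancestors needs a graph traversal rather than a single walk up a tree.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges.

    `parents` maps each term to a list of its direct parents (a DAG).
    """
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:  # a term can be reached by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical mini-ontology: "child" has two parents that share one root.
toy_parents = {
    "child": ["parent_1", "parent_2"],
    "parent_1": ["root"],
    "parent_2": ["root"],
    "root": [],
}
print(sorted(ancestors("child", toy_parents)))  # ['parent_1', 'parent_2', 'root']
```

This matters for over-representation analysis because a gene annotated with a specific term is usually also counted under all of that term's more general ancestors.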

Slide 61

Slide 61 text

Pathway analysis: Over-representation analysis ● Algorithm: ○ Create input list (e.g. “significant at p<0.05”) ○ For each gene set: ■ Count number of input genes ■ Count number of “background” genes (e.g. all genes on platform). ○ Test each pathway for over-representation of input genes ● Ex: GO “Purine Ribonucleotide Biosynthetic Process” ○ 1% of input (significant) genes are annotated with this term. ○ 1% of genes on the chip are annotated with this term. ○ Not significantly overrepresented. ● Ex: GO “V(D)J Recombination” ○ 20% of input (significant) genes are annotated with this term. ○ 1% of genes on the chip are annotated with this term. ○ Highly significantly over-represented!
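
One common way to carry out the "test each pathway" step is a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The slides do not name a specific test, so treat this as one reasonable choice. The gene counts below are made-up numbers chosen to mirror the 1% and 20% examples above.

```python
from math import comb


def ora_p_value(n_background, n_in_set, n_input, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least `n_overlap` gene-set members among
    `n_input` genes drawn without replacement from `n_background` genes,
    of which `n_in_set` belong to the gene set.
    """
    total = comb(n_background, n_input)
    upper = min(n_in_set, n_input)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical platform: 10,000 genes, 100 of them (1%) carry the term,
# and 200 genes are on the input (significant) list.
p_same = ora_p_value(10_000, 100, 200, 2)       # 1% of input genes carry the term
p_enriched = ora_p_value(10_000, 100, 200, 40)  # 20% of input genes carry the term
print(p_same, p_enriched)
```

With 1% in both the input list and the background, the p-value is large; with 20% against a 1% background, it is vanishingly small. In practice the p-values would also be corrected for the number of gene sets tested.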

Slide 62

Slide 62 text

Pathway analysis ● Pathway analysis gives you more biological insight than staring at lists of genes. ● Pathway analysis is complex and has many limitations. ● Pathway analysis is still more an exploratory procedure than a pure statistical endpoint. ● The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.

Slide 63

Slide 63 text

Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining

Slide 64

Slide 64 text

Resources: Online community & discussion forum ● SEQanswers ○ http://SEQanswers.com ○ Twitter: @SEQquestions ○ Format: Forum ○ Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012). ● BioStar ○ http://biostar.stackexchange.com ○ Twitter: @BioStarQuestion ○ Format: Q&A ○ Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216.

Slide 65

Slide 65 text

Bioinformatics Core Mission: help scientists publish their work and obtain new funding through service and training.

Slide 66

Slide 66 text

Services ● Gene expression: Microarray Analysis ● Gene expression: RNA-seq Analysis ● Pathway analysis ● DNA Variation (GWAS, NGS) ● DNA Binding / ChIP-Seq ● DNA Methylation ● Metagenomics ● Grant / Manuscript support ● Custom development (computing & stats) ● ... etc.

Slide 67

Slide 67 text

Contact Web: bioinformatics.virginia.edu E-mail: [email protected] Blog: GettingGeneticsDone.com Twitter: @genetics_blog