information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.” M. Dayhoff, February 27, 1967
expert and timely bioinformatics consulting and data analysis.
• Main goals: help you publish and get funding.
  – 1. Service
  – 2. Training
September 10, 2013 bioinformatics.virginia.edu
expression: RNA-seq Analysis
• Pathway analysis
• DNA Variation (GWAS, NGS)
• DNA Binding / ChIP-Seq
• DNA Methylation
• Grant / Manuscript support
• Custom development
and analysis of publicly available data (e.g. GEO, ArrayExpress).
• Preprocessing: background subtraction, summarization, and quantile normalization using the RMA (Robust Multichip Average) expression measure described in Irizarry et al. Biostatistics 4:249-264.
• Quality assessment:
  – Visualization of signal intensity distributions of each array using boxplots and density plots.
  – MA plots to visualize the log intensity ratio (M) against the average intensity (A).
  – Principal components analysis to visualize the overall data (dis)similarity between arrays.
• Analysis:
  – Estimation of fold changes and standard errors using a linear model.
  – Empirical Bayes smoothing of standard errors.
  – Lists of top differentially expressed genes, fold changes, statistical significance, multiple testing correction.
• Visualization:
  – Heatmaps and dendrograms.
  – Volcano plots to visualize statistical significance by fold change.
• Biological context
  – Pathway/Functional Analysis.
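As an illustration of the quantile normalization step only (not the full RMA procedure, which also includes background correction and summarization), here is a minimal Python sketch. The function name `quantile_normalize` and the toy intensities are hypothetical, and ties are broken by probe position for simplicity.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity lists (one list per array).

    Simplified sketch: ties are broken by probe position, not averaged.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Each probe receives the across-array mean for its rank.
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two arrays, three probes each.
toy_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
toy_normalized = quantile_normalize(toy_arrays)
```

After normalization every array has the same distribution of values; only the assignment of values to probes differs between arrays.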
& power calculations for SNP genotype-phenotype association studies
• Data management and quality control
• PCA for population stratification control
• Imputation to a reference population (e.g. HapMap, 1000 Genomes)
• Analysis, interpretation, visualization
• Manuscript preparation
• Grant support (compliance with NIH data sharing policies, methodology for data management, design, analysis, and interpretation)
• Acquisition of publicly available data (dbGaP)

DNA Variation: Next-Gen Sequencing
• Alignment to a reference genome
• Calibration of quality scores and duplicate read removal
• Variant calling
• Variant annotation
• SNP effect prediction
• De novo assembly
• Any of the applicable analysis, interpretation, and visualization services described above for genotyping data.
experiment
  – You have a list of genes
  – Want to put these into functional context
  – What biological processes are perturbed?
  – What pathways are being dysregulated?
  – Data reduction: hundreds or thousands of genes can be reduced to tens of pathways
  – Identifying active pathways = more explanatory power
• “Pathway analysis” encompasses many, many techniques.
  1. 1st Generation: Overrepresentation Analysis (e.g. GO ORA)
  2. 2nd Generation: Functional Class Scoring (e.g. GSEA)
  3. 3rd Generation (in development): Pathway Topology (e.g. SPIA)
• bit.ly/pathway-analysis
theme: statistically evaluates the fraction of genes in a particular pathway that show changes in expression.
• Algorithm:
  1. Create input list (e.g. “significant at p<0.05”)
  2. For each gene set:
     a. Count number of input genes
     b. Count number of “background” genes (e.g. all genes on platform).
  3. Test each pathway for over-representation of input genes
• Gene Set: typically gene ontology (GO) term.
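The slide does not name the test used in step 3. One common choice is the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The symbols below are introduced here for illustration only: $N$ background genes, $K$ of them annotated to the gene set, $n$ input genes, and $k$ input genes annotated to the gene set.

```latex
p \;=\; P(X \ge k) \;=\; \sum_{i=k}^{\min(n,K)}
  \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

A small $p$ means the gene set contains more input genes than expected if the input list had been drawn at random from the background.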
knowledge domain.
• Gene ontology = cell biology.
• GO represented by directed acyclic graph (DAG).
  – Terms are nodes, relationships are edges.
  – Parent terms are more general than their child terms.
  – Unlike a simple tree, terms can have multiple parents.
Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15. doi:10.1038/nrg2363
list (e.g. “significant at p<0.05”)
  2. For each gene set:
     a. Count number of input genes
     b. Count number of “background” genes (e.g. all genes on platform).
  3. Test each pathway for over-representation of input genes
• Ex: GO “Purine Ribonucleotide Biosynthetic Process”
  – 1% of input (significant) genes are annotated with this term.
  – 1% of genes on the chip are annotated with this term.
  – Not significantly over-represented.
• Ex: GO “V(D)J Recombination”
  – 20% of input (significant) genes are annotated with this term.
  – 1% of genes on the chip are annotated with this term.
  – Highly significantly over-represented!
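The two examples above can be reproduced with a short sketch of an over-representation test. The slides give only percentages, so the absolute counts here (10,000 genes on the chip, 200 input genes, 100 genes annotated to each term) are hypothetical, as is the function name `ora_p_value`. The test used is a one-sided hypergeometric test, which is one common choice and is not confirmed by the source.

```python
from math import comb


def ora_p_value(n_background, n_in_set, n_input, n_overlap):
    """One-sided hypergeometric p-value: P(overlap >= n_overlap).

    n_background: genes on the platform
    n_in_set:     background genes annotated to the gene set
    n_input:      genes in the input (significant) list
    n_overlap:    input genes annotated to the gene set
    """
    total = comb(n_background, n_input)
    upper = min(n_input, n_in_set)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_input - i)
        for i in range(n_overlap, upper + 1)
    )
    # Exact integer arithmetic until the final division.
    return tail / total


# Hypothetical counts chosen to match the percentages on the slide.
N_CHIP, N_INPUT, N_TERM = 10_000, 200, 100

# "Purine Ribonucleotide Biosynthetic Process": 1% of input, 1% of chip.
p_purine = ora_p_value(N_CHIP, N_TERM, N_INPUT, n_overlap=2)

# "V(D)J Recombination": 20% of input, 1% of chip.
p_vdj = ora_p_value(N_CHIP, N_TERM, N_INPUT, n_overlap=40)

print(f"purine term p = {p_purine:.3g}")
print(f"V(D)J term  p = {p_vdj:.3g}")
```

With these assumed counts the first term gives an unremarkable p-value and the second an extremely small one, matching the qualitative conclusions on the slide. In practice the p-values would still need multiple testing correction across all gene sets.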
they’re meaningless (e.g. “cellular process”).
• ORA uses genes above a cutoff and discards everything else.
• ORA only uses the number of genes, and ignores their measured changes.
• Two assumptions violated
  – Genes are independent (NOT! Coexpression, interaction, etc.).
  – Pathways are independent (violated by definition in a DAG).
individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes can also have significant effects.
• General Algorithm:
  1. Compute gene-level statistic (e.g. fold change, Student’s t).
  2. Aggregate gene-level statistics for all genes in pathway into single pathway-level statistic.
  3. Assess significance with permutation.
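The three steps of the general algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not any particular published method: gene-level statistics are taken as given, the pathway-level statistic is the mean, and significance is assessed by permuting gene labels (drawing random gene sets of the same size). The function name `fcs_gene_permutation` and the toy data are hypothetical.

```python
import random


def fcs_gene_permutation(gene_stats, gene_set, n_perm=1000, seed=0):
    """Functional class scoring sketch.

    gene_stats: dict mapping gene -> gene-level statistic (step 1, given)
    gene_set:   set of genes in the pathway
    Returns (pathway_statistic, permutation_p_value).
    """
    rng = random.Random(seed)
    genes = list(gene_stats)
    members = [g for g in genes if g in gene_set]
    if not members:
        raise ValueError("gene set has no measured genes")

    # Step 2: aggregate gene-level statistics (here: the mean).
    observed = sum(gene_stats[g] for g in members) / len(members)

    # Step 3: gene-label permutation, i.e. random sets of the same size.
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        null_stat = sum(gene_stats[g] for g in sample) / len(sample)
        if abs(null_stat) >= abs(observed):
            hits += 1

    # Add-one correction keeps the p-value strictly above zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical toy data: 100 genes with small statistics, except ten
# pathway genes that share a coordinated shift.
toy_stats = {f"g{i}": ((i % 5) - 2) * 0.1 for i in range(100)}
toy_set = {f"g{i}" for i in range(10)}
for gene in toy_set:
    toy_stats[gene] = 1.5

score, p_value = fcs_gene_permutation(toy_stats, toy_set)
```

Permuting gene labels ignores gene-gene correlation. Permuting phenotype labels preserves it, but requires the full expression matrix rather than precomputed gene-level statistics.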
  a) Rank genes by their expression difference
  b) For each Gene Set*:
     i. Compute cumulative sum over ranked genes
        1. Increase sum when gene is in set, decrease otherwise
        2. Magnitude of increment depends on gene-phenotype correlation
     ii. Record the maximum deviation from zero as Enrichment Score (ES)
2. Assess significance
  a) Permute phenotype (or gene labels) 1000 times
  b) Compute ES for each permutation (empirical null).
  c) Compare ES for actual data to distribution of ES from permuted data.
  d) Normalize ES by accounting for gene set size (NES)
  e) Control multiple testing by calculating FDR for each NES
• * Gene sets: Come from MSigDB
  – http://www.broadinstitute.org/gsea/msigdb/index.jsp
  – MSigDB is a collection of annotated gene sets for use with GSEA software.
  – Positional, curated, computationally predicted, GO.
  – Curated: KEGG, Reactome, STKE, etc.
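The running-sum enrichment score described in step b) can be sketched as below. This is a simplified illustration, not the GSEA software itself. It assumes genes are already ranked, that the increment for a gene in the set is proportional to the absolute value of its ranking score raised to a `weight` exponent, and that the decrement for every other gene is constant. The function and variable names are hypothetical, and the permutation, normalization, and FDR steps are not covered.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running-sum enrichment score (ES) for one gene set.

    ranked_genes: genes ordered from most up- to most down-regulated
    scores:       dict mapping gene -> ranking score (e.g. correlation)
    gene_set:     set of genes to test
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, genes")

    # Total weight of the hits, so the increments sum to 1.
    hit_total = sum(
        abs(scores[g]) ** weight for g, hit in zip(ranked_genes, in_set) if hit
    )
    if hit_total == 0:
        raise ValueError("all genes in the set have a score of zero")

    running = 0.0
    es = 0.0
    for gene, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(scores[gene]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        # ES is the running-sum value with the largest deviation from zero.
        if abs(running) > abs(es):
            es = running
    return es


# Hypothetical toy ranking of six genes.
toy_scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": -1.0, "e": -2.0, "f": -3.0}
toy_ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)

es_top = enrichment_score(toy_ranked, toy_scores, {"a", "b"})
es_bottom = enrichment_score(toy_ranked, toy_scores, {"e", "f"})
```

A gene set concentrated at the top of the ranking gives a positive ES, and one concentrated at the bottom gives a negative ES. Repeating the calculation on permuted data would supply the empirical null described in step 2.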
  – Genes are independent
  – Pathways are independent
• Only consider number/magnitude of genes, and ignore other information in databases:
  – Directionality of the interaction
  – Nature of the interaction (activation, inhibition, etc.).
  – Where the interaction occurs (nucleus, cytoplasm, etc.).
topology.
• Computes two orthogonal p-values:
  – pNDE: Number of Differentially Expressed genes (e.g. like ORA).
  – pPERT: degree of perturbation
• pG is overall p-value (pNDE and pPERT combined)
• pGFDR is overall FDR-corrected p-value
  – pNDE: 6.5e-9
  – pPERT: 0.29
  – pGFDR: 1.2e-6
  – Conclusion: many differentially expressed genes, but pathway may not be badly perturbed.
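To see how the two p-values above combine into pG, here is a minimal sketch. It assumes the two p-values are independent and combines them through their product c = pNDE × pPERT, using pG = c − c·ln(c), the combination described in the SPIA publication. The function name `combine_spia_p` is hypothetical. The FDR-corrected value (pGFDR) depends on all pathways tested, so it cannot be reproduced from a single pathway.

```python
import math


def combine_spia_p(p_nde, p_pert):
    """Combine two independent p-values via their product.

    For independent uniform p-values, P(p1 * p2 <= c) = c - c * ln(c).
    """
    c = p_nde * p_pert
    if c <= 0.0:
        return 0.0
    return c - c * math.log(c)


# Values from the example pathway on the slide.
p_g = combine_spia_p(6.5e-9, 0.29)
print(f"pG = {p_g:.3g}")
```

For the example pathway this gives a pG of roughly 4e-8: the combined evidence is driven almost entirely by pNDE, consistent with the slide's conclusion that many genes are differentially expressed while the perturbation evidence is weak.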
need arbitrary “cutoff” e.g. top 500, or p<0.05, etc.
• True topology is dependent on type of cell due to cell-specific gene expression profiles.
• Tissue-specific topology is rarely available and fragmented in databases, even if it’s fully understood.
• Other general limitations of pathway analysis:
  – E.g. RNA-seq studies have found that >90% of human genes are alternatively spliced.
  – Different transcripts can have different or opposing functions.
• Incomplete/inaccurate annotations.
• As of Oct 2007, 95% of GO annotations were inferred electronically (i.e. not manually curated).
• Missing condition- and cell-specific information.
• Methodological challenge: lack of benchmarks.
Pathway analysis gives you more biological insight than staring at lists of genes. Pathway analysis is complex, and has many limitations. Pathway analysis is still more of an exploratory procedure than a pure statistical endpoint. The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.
Asthma compared to normal state.
• Question: Do the data support the involvement of immune/inflammatory responses and viral infection in the acute asthma attack?
• Tasks:
  – View Canonical pathways that contain significant numbers of genes from this dataset.
  – Overlay a Function/Disease state that shows how key signaling pathways for fighting off respiratory infections overlap with asthmatic inflammation.
  – Overlay Biomarkers that identify genes in the infection signaling pathway that are also used as diagnostic and efficacy indicators for asthma treatments.
  – Search the Ingenuity Knowledge Base for literature references that support your findings.
  – Investigate a “weird” finding…