Slide 1

Slide 1 text

Bioinformatics-as-a-Service: Applications, Opportunities, and Challenges with Large-Scale -Omics Data
Stephen D. Turner, Ph.D.
Bioinformatics Core Director
[email protected]
bioinformatics.virginia.edu
Slides available at: stephenturner.us/slides

Slide 2

Slide 2 text

Today’s Talk
1.  My Background: Genetics, Statistics, & Bioinformatics
2.  Bioinformatics: Origins & Contemporary Applications
3.  Bioinformatics Core: Staying Relevant; Research Vignettes
October 10, 2013 bioinformatics.virginia.edu

Slide 3

Slide 3 text

JMU 2002-2006
•  Gene expression in amphibian tail development
•  Biosymposium 2006 slides: stephenturner.us/slides

Slide 4

Slide 4 text

Grad School: 5 years in 5 minutes
•  Ph.D. Human Genetics, M.S. Applied Statistics
   –  Research: Genetic Epidemiology
   –  Working Hypothesis: common disease, common variant
•  Lipids:
   –  Risk factors for CVD
      •  1 mg/dL ↑ LDL = 1% ↑ risk for CV event.
      •  1 mg/dL ↓ HDL = 6% ↑ risk for CV event.
   –  Therapeutic targets
   –  Easy to phenotype
   –  Heritable (HDL ~70% heritable!)
   –  Finding genetic factors associated with HDL = new biology = new treatments.

Slide 5

Slide 5 text

eMERGE
•  The electronic Medical Records and GEnomics Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies.
•  Many phenotypes - no “ascertainment” necessary.
•  Multiple sites = replication, joint analysis.
•  One of the goals: assessment. Will this work?

Slide 6

Slide 6 text

Genome-Wide Association Study
SNP (say “snip”) = Single Nucleotide Polymorphism. A common variant (mutation) in the population. Millions exist throughout the human genome.
Manolio TA. N Engl J Med 2010;363:166-176.

Slide 7

Slide 7 text

Quality Control
•  Marker call rate
•  Sample call rate
•  Mendelian errors
•  Discordant calls
•  Minor allele frequency
•  Hardy-Weinberg equilibrium
•  …
Ancestry: PCA
Relatedness: used linear mixed effects model
$$ y_i = \mu + \sum_j \beta_j c_{ji} + G_i + e_i, \qquad G \sim MVN(0, \sigma_e^2 \Phi) $$
$$ \hat{e}_i = y_i - \Big( \hat{\mu} + \sum_j \hat{\beta}_j c_{ji} + \hat{G}_i \Big) $$
$$ \hat{e}_i = \mu + \beta g_i + e_i $$
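
Three of the QC metrics listed on this slide (call rate, minor allele frequency, Hardy-Weinberg equilibrium) can be sketched in a few lines. This is only an illustration under assumptions that are not on the slide: genotypes are coded 0/1/2 as copies of one allele, `None` marks a missing call, and the function names are hypothetical. No pass/fail thresholds are shown because the slide does not give any.

```python
def call_rate(genotypes):
    """Fraction of non-missing calls; None marks a missing genotype (assumed coding)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Genotypes coded 0/1/2 = copies of one allele; missing calls are skipped."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chi_square(genotypes):
    """Chi-square statistic comparing observed genotype counts
    with Hardy-Weinberg expectations (1 degree of freedom)."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)      # frequency of the counted allele
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(k) for k in (0, 1, 2)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Toy marker: 10 samples, one missing call (made-up data)
marker = [0, 0, 0, 0, 1, 1, 1, 1, 2, None]
print(call_rate(marker), minor_allele_frequency(marker), hwe_chi_square(marker))
```

The same `call_rate` function works per marker (across samples) or per sample (across markers), depending on which slice of the genotype matrix is passed in.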

Slide 8

Slide 8 text

GWAS: HDL-Cholesterol
[Figure: pathway diagram. Labels: Peripheral Cell Lipid Source; ABCA1; FC; CE; LCAT; Peripheral Cell Lipid Destination; LIPC (TG→FFA); LIPG (PL→FFA); LPL (TG→FFA); TG; CE; CETP; Hepatobiliary Elimination]

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Epistasis
[Figure labels: B/_; E/_ | b/b; E/_ | _/_; e/e]

Slide 11

Slide 11 text

GxG Interaction
•  “Missing heritability” may be found in GxG interactions.
   –  B. Maher 2008 Nature
   –  T. Manolio, F. Collins, et al. Nature 2009
   –  T. Manolio 2010 NEJM
   –  Mouse, E. coli, S. cerevisiae
•  GxG hard to test:
   –  GWAS: 1.25×10^11 tests
   –  Computationally difficult
   –  Multiple testing
•  Limit GxG tests to:
   –  SNPs with large main effect
   –  GxG with biological relevance
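
The size of the exhaustive pairwise search can be checked with a short calculation. The panel size of 500,000 SNPs and the 0.05 error rate below are assumptions (the slide gives neither), chosen because that panel size reproduces the order of magnitude quoted above.

```python
import math

n_snps = 500_000   # assumed GWAS panel size; not stated on the slide
alpha = 0.05       # assumed family-wise error rate

n_pairs = math.comb(n_snps, 2)    # every unordered SNP pair tested once
bonferroni = alpha / n_pairs      # per-test threshold under Bonferroni correction

print(f"pairwise tests: {n_pairs:.3e}")
print(f"Bonferroni threshold: {bonferroni:.1e}")
```

Under these assumptions the per-test threshold is several orders of magnitude stricter than for a single-SNP scan, which is one motivation for restricting the search as the slide suggests.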

Slide 12

Slide 12 text

Results
•  Main effects of each SNP in each dataset reduce HDL.
•  Interaction effect raises HDL.
   –  Joint effect is nonlinear.
   –  Epistasis – heterogeneity, not synergy.
•  LPL mediates the release of FFA and TG from HDL particles.
•  ABCA1 shuttles FC into HDL particles during intravascular remodeling.

SNP 1   Gene 1   SNP 2       Gene 2   MF β1   MF β2   MF β3   MF P    BioVU β1   BioVU β2   BioVU β3   BioVU P
rs253   LPL      rs2515614   ABCA1    -       -       +       0.006   -          -          +          0.001
rs253   LPL      rs2472509   ABCA1    -       -       +       0.006   -          -          +          0.001
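
The sign pattern in the table (β1 and β2 negative, β3 positive) can be illustrated with a toy two-SNP interaction model, HDL = b0 + b1·SNP1 + b2·SNP2 + b3·SNP1·SNP2. This sketch assumes a 0/1 carrier coding for each SNP, which may differ from the coding used in the actual analysis; with that coding the least-squares coefficients are simple contrasts of the four genotype-cell means. The function name and all HDL values are made up for illustration.

```python
from statistics import mean


def fit_two_snp_interaction(records):
    """records: iterable of (snp1, snp2, hdl), with snp1 and snp2 coded 0/1.

    For the saturated model hdl = b0 + b1*snp1 + b2*snp2 + b3*snp1*snp2,
    the least-squares coefficients equal these contrasts of the cell means.
    """
    cells = {(a, b): [] for a in (0, 1) for b in (0, 1)}
    for snp1, snp2, hdl in records:
        cells[(snp1, snp2)].append(hdl)
    m = {key: mean(values) for key, values in cells.items()}
    b0 = m[(0, 0)]
    b1 = m[(1, 0)] - m[(0, 0)]
    b2 = m[(0, 1)] - m[(0, 0)]
    b3 = m[(1, 1)] - m[(1, 0)] - m[(0, 1)] + m[(0, 0)]
    return b0, b1, b2, b3


# Made-up HDL values (mg/dL), chosen only to reproduce the sign pattern - - +
toy = [
    (0, 0, 52), (0, 0, 48),
    (1, 0, 46), (1, 0, 44),
    (0, 1, 47), (0, 1, 45),
    (1, 1, 47), (1, 1, 45),
]
b0, b1, b2, b3 = fit_two_snp_interaction(toy)
additive_prediction = b0 + b1 + b2     # what a purely additive model expects
print(b1, b2, b3, additive_prediction)
```

In the toy data each SNP alone lowers HDL, but carriers of both sit above the additive expectation, so the joint effect is not the sum of the two main effects.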

Slide 13

Slide 13 text

Results
•  Main effects of each SNP in each dataset reduce HDL.
•  Interaction effect coefficient is positive.
   –  Joint effect is nonlinear.
   –  Epistasis – heterogeneity, not synergy.
•  LPL mediates the release of FFA and TG from HDL particles.
•  ABCA1 shuttles FC into HDL particles during intravascular remodeling.

SNP 1   Gene 1   SNP 2       Gene 2   MF β1   MF β2   MF β3   MF P    BioVU β1   BioVU β2   BioVU β3   BioVU P
rs253   LPL      rs2515614   ABCA1    -       -       +       0.006   -          -          +          0.001
rs253   LPL      rs2472509   ABCA1    -       -       +       0.006   -          -          +          0.001

[Figure: pathway diagram. Labels: Peripheral Cell Lipid Source; ABCA1; FC; CE; LCAT; Peripheral Cell Lipid Destination; LIPC (TG→FFA); LIPG (PL→FFA); LPL (TG→FFA); TG; CE; CETP; Hepatobiliary Elimination]

Slide 14

Slide 14 text

Grammatical Evolution of Neural Networks
•  Turner SD, Ritchie MD, Bush WS. Conquering the Needle-in-a-Haystack: How Correlated Input Variables Beneficially Alter the Fitness Landscape for Neural Networks. Lec Notes Comp Sci. 5483:80-91 (2009).
•  Turner SD, Dudek SM, Ritchie MD. Grammatical Evolution of Neural Networks for Discovering Epistasis among Quantitative Trait Loci. Lec Notes Comp Sci. 6023:86-97 (2010).
•  Holzinger ER, Buchanan C, Turner SD, Dudek SM, Torstenson ES, Ritchie MD. Initialization Parameter Sweep in ATHENA: Optimizing Neural Networks for Detecting Gene-Gene Interactions in the Presence of Small Main Effects. Genetic and Evolutionary Computation Conference – GECCO 2010: 203-210. ACM Press (2010).
•  Turner SD, Dudek SM, Ritchie MD. Incorporating Domain Knowledge into Evolutionary Computing for Discovering Gene-Gene Interaction. 11th Int’l Conference on Parallel Problem Solving From Nature (PPSN), Lecture Notes in Computer Science. 6238(I): 394-403 (2010).
•  Turner SD, Dudek SM, Ritchie MD. ATHENA: A Knowledge-Based Hybrid Backpropagation-Grammatical Evolution Neural Network Algorithm for Discovering Epistasis among Quantitative Trait Loci. BMC BioData Mining. 3:5 (2010).
[Figure: neural network diagram with inputs x, summation nodes Σ, and output y]

Slide 15

Slide 15 text

Postdoc: Obesity Epidemiology
[Figure: obesity prevalence maps for 1990, 1999, and 2009. Legend: No Data; <10%; 10%–14%; 15%–19%; 20%–24%; 25%–29%; ≥30%]

Slide 16

Slide 16 text

Obesity
[Figure: obesity prevalence maps for 1990, 1999, and 2009. Legend: No Data; <10%; 10%–14%; 15%–19%; 20%–24%; 25%–29%; ≥30%]
Central obesity
Liver Fat

Slide 17

Slide 17 text

Obesity Central obesity Liver Fat DXA ($$) MRI ($$$)

Slide 18

Slide 18 text

Obesity Central obesity Liver Fat DXA ($$) MRI ($$$) Biomarkers ($) BMI WHR Lipidomics Adipokines Cytokines

Slide 19

Slide 19 text

Study Design
•  MEC: >215,000 adults
•  5 ethnic groups: AA, JA, H, NHW, NH.
•  30 JA, 30 NHW postmenopausal women
•  Anthro data + 60 biomarkers for adipocytokines, inflammation, insulin resistance & lipid profile.
•  Random Forest to predict:
   –  Total body fat (DXA)
   –  Trunk:periphery fat ratio (DXA)
   –  Hepatic adiposity (MRI)

Slide 20

Slide 20 text

Random Forest: Results
•  Automatic variable selection using RF: model explains fat distribution better with biomarkers (vs anthro alone).
•  Important biomarkers varied by trait.
•  RF >>> linear regression.
PLoS ONE Aug 2012 7(8):e43502
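
For readers unfamiliar with the method, the core Random Forest idea (bootstrap resampling, a random choice of features, and averaging over many trees) can be sketched with the standard library alone. This is a deliberately tiny stand-in and not the model used in the study: each "tree" here is a single split (a stump) on one randomly chosen feature, and all names and data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Best single split on one feature by squared error.
    Returns (feature, threshold, left_mean, right_mean)."""
    values = sorted(set(row[feature] for row in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for row, y in zip(rows, targets) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, targets) if row[feature] > threshold]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:                       # feature is constant in this sample
        overall = mean(targets)
        return feature, float("inf"), overall, overall
    return feature, best[1], best[2], best[3]


def fit_forest(rows, targets, n_trees=50, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(n_features)          # random feature subset of size 1
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the stump predictions."""
    return mean(lm if row[f] <= t else rm for f, t, lm, rm in forest)


# Made-up data: feature 0 drives the outcome, feature 1 is unrelated filler
rows = [(i / 40, (7 * i % 40) / 40) for i in range(40)]
targets = [10.0 if x0 >= 0.5 else 0.0 for x0, _ in rows]
forest = fit_forest(rows, targets)
high = predict(forest, (0.9, 0.5))
low = predict(forest, (0.1, 0.5))
print(high > low)
```

A real analysis would grow full trees and use an established implementation; the sketch only shows why averaging many randomized trees can pick out the informative variable without an explicit selection step.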

Slide 21

Slide 21 text

Other UHCC projects
•  Obesity biomarkers / RF
•  Rare variant analysis in IGF1
•  GWAS in MEC, pathway analysis
•  Other statistical analysis, pathway analysis, etc.
•  All of the above, and others at UHCC and USC: host genetics x microbiome U01 proposal (funded Aug 2012)

Slide 22

Slide 22 text

Meanwhile…
•  Democratization of next-gen sequencing
•  UHCC role: Bioinformatics-as-a-service
•  Interest in consulting / contract research
What does bioinformatics mean in 2013?

Slide 23

Slide 23 text

Bioinformatics Origins
•  Rooted in sequence analysis
•  Driven by need to:
   -  Collect
   -  Annotate
   -  Analyze

Slide 24

Slide 24 text

Margaret Dayhoff 1925-1983
•  Collected all known protein sequences
•  Published in 1965
•  Pioneered algorithm development for
   –  Comparison of protein sequences
   –  Derivation of evolutionary histories from alignments
“In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.”

Slide 25

Slide 25 text

IBM 7090

Slide 26

Slide 26 text

Margaret Dayhoff 1925-1983
“There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.”
M. Dayhoff, February 27, 1967

Slide 27

Slide 27 text

What is bioinformatics? Modified from @drewconway

Slide 28

Slide 28 text

What is bioinformatics?
[Figure: timeline with axis marks at 1960, 1970, 1980, 1990, 2000, 2010]

Slide 29

Slide 29 text

Between April-October 2012: Cost of a human genome: +$717 (+12%) genome.gov/sequencingcosts

Slide 30

Slide 30 text

After the Gold Rush…
•  Hall, N. “After the Gold Rush”. Genome Biol 2013.
•  What if microscopes got 10x more powerful every year…
   –  Could do the same experiment every few months with the same slide.
   –  Make new discoveries! Publish interesting findings!
•  Not too different from genomics…
   –  Sequence a Human Genome (HGP 2001)
   –  Sequence 1000 human genomes (1000genomes.org)
   –  Sequence 2000 human genomes (1000genomes.org)
   –  Sequence Human Microbiomes (hmpdacc.org)
   –  Sequence Earth (earthmicrobiome.org)

Slide 31

Slide 31 text

After the Gold Rush…
•  What’s possible next year will be the same as what’s possible now.
•  Fresh ideas needed!
•  Stability will be good for us in the end.

Slide 32

Slide 32 text

Challenges in Bioinformatics
•  Data integration (see data integration talk from 2012 ISMB at stephenturner.us/slides)
•  Training: how to make it scalable and sustainable?
•  New technologies: how best to support new and emerging technologies?

Slide 33

Slide 33 text

UVA Bioinformatics Core
•  A centralized resource for providing expert and timely bioinformatics consulting and data analysis.
•  Main goals: help collaborators publish and get funding.
   –  1. Service
   –  2. Training
   –  3. Infrastructure

Slide 34

Slide 34 text

Sample prep Sequencing Raw data Differential expression Gene identification Novel Genes Discoveries …etc. This is the “stuff” we do in the bioinformatics core! Find out what this “stuff” is at bioinformatics.virginia.edu

Slide 35

Slide 35 text

bioinformatics.virginia.edu/services
•  Gene expression: Microarray Analysis
•  Gene expression: RNA-seq Analysis
•  Pathway analysis
•  DNA Methylation
•  DNA Binding / ChIP-Seq
•  DNA Variation
•  Metagenomics
•  Grant / Manuscript support
•  Custom development

Slide 36

Slide 36 text

Bioinformatics in a world of Genome Factories
•  Adaptation to the environment
•  Bundled analysis – easy answers
•  Collaboration
•  Automation vs. innovation
•  Downstream analysis
•  New tech: no pre-built pipelines
•  Training & Infrastructure: help collaborators help themselves!

Slide 37

Slide 37 text

BioConnector (bioconnector.virginia.edu)
•  Partnership between
   –  Bioinformatics core
   –  Health Sciences Library
   –  Div. Clinical Informatics
•  Mission: Get researchers connected to the tools and people they need.
•  Tools:
   –  Galaxy server
   –  VIVO (collaboration)
   –  Wiki (documentation)
   –  CDR
   –  Awesome space

Slide 38

Slide 38 text

Research Vignettes

Slide 39

Slide 39 text

Research Vignette #1: Valeria Mas
•  Kidney transplant health (GFR)
•  Integrate microarray analysis with clinical data using machine learning
•  Made the cover of Transplantation (2012)

Slide 40

Slide 40 text

Research Vignette #2: Gomez/Belyea
•  Mouse model of leukemia

Slide 41

Slide 41 text

Research Vignette #2: Gomez/Belyea
•  Mouse model of leukemia
   –  How does gene knockout result in leukemia?
   –  What are the downstream molecular effects?
•  Gene Expression Microarray: QC, differential gene expression, pathway analysis
•  Results:
   –  KO de-represses a B cell specific gene program
   –  Increased cell-cycle progression
•  Now: looking for mutations in the human gene
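
The slide does not say which pathway-analysis method was used. One common approach, shown here purely as an illustration, is an over-representation test: given a list of differentially expressed genes, how surprising is its overlap with a pathway under random sampling (a hypergeometric tail probability)? The function name and all counts below are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, n_pathway of which belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20 genes measured, 5 in the pathway,
# 5 differentially expressed, 3 of those in the pathway
p = enrichment_p_value(n_universe=20, n_pathway=5, n_selected=5, n_overlap=3)
print(round(p, 4))
```

In practice the test is repeated over many pathways, so the resulting p-values need multiple-testing correction.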

Slide 42

Slide 42 text

Research Vignette #3: Deb Lannigan
•  Personalizing breast cancer chemotherapy
•  Current state of the art: test survival in 2D culture
[Figure labels: Excise cancer cells; Survival assays in 2D culture]

Slide 43

Slide 43 text

Research Vignette #3: Deb Lannigan
•  Goal: Develop 3D culture system to mimic in situ cancer
•  Bioinformatics:
   –  How does a “tumoroid” grown in 3D compare to tumor tissue?
   –  Gene expression profiling of multiple samples, comparing tumoroids to tumors and to cells isolated from tumor margins.

Slide 44

Slide 44 text

Research Vignette #4: U.S. Government
•  Microbial forensics: analysis and interpretation of evidence for attribution of an act of bioterrorism, biocrime, hoax, or inadvertent release of a toxin or biological threat agent.
Figures from Turner et al 2013 Report to DOD, “Harnessing Next-Generation Sequencing Capabilities for Microbial Forensics.”

Slide 45

Slide 45 text

Other Current Projects
•  Metagenomics & Microbial Forensics
•  Microarray analysis
•  RNA-seq
•  MeDIP-seq
•  ChIP-seq
•  GWAS
•  Predictive analysis & machine learning for biomarker discovery
•  Acquisition and Analysis of public data (GEO, SRA, dbGaP, etc.)
•  Grant preparation
•  Literature & database searching for gene expression signatures
•  Pathway analysis: gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
•  Gene ID conversion: gettinggeneticsdone.blogspot.com/2012/03/video-tip-convert-gene-ids-with-biomart.html
•  Array annotation: gettinggeneticsdone.blogspot.com/2012/01/annotating-limma-results-with-gene.html

Slide 46

Slide 46 text

Thank you
Web: bioinformatics.virginia.edu
E-mail: [email protected]
Blog: www.GettingGeneticsDone.com
Twitter: @genetics_blog
Slides available at: stephenturner.us/slides