
2014 ABRF Talk: Bioinformatics Cores and New Technologies

Stephen Turner

June 13, 2013

Transcript

  1. Bioinformatics Cores & New Technologies

    The evolving role of 21st Century Bioinformatics Core Facilities
    Stephen D. Turner, Ph.D.
    Bioinformatics Core Director
    [email protected]
    Twitter: @genetics_blog
    Slides available at: http://dx.doi.org/10.6084/m9.figshare.715242
  2. UVA Bioinformatics Core

    •  Established Oct. 2011.
    •  A centralized resource for providing expert and timely bioinformatics consulting and data analysis.
    •  Main goals: help collaborators publish and get funding.
      –  1. Service
      –  2. Training
      –  3. Infrastructure
    5/29/14 2
  3. bioinformatics.virginia.edu/services

    •  Gene expression: Microarray
      –  Affymetrix, Illumina, Custom
      –  QA, analysis/visualization, pathway analysis, etc.
      –  Deposit in GEO
    •  Gene expression: RNA-seq
      –  Differential gene expression
      –  Differential isoform expression, exon usage, splicing, etc.
      –  http://dx.doi.org/10.6084/m9.figshare.105157
    •  Pathway analysis
      –  GO, GSEA, SPIA, IPA, Oncomine, ...
      –  http://dx.doi.org/10.6084/m9.figshare.105155
    •  DNA Methylation
      –  Infinium chips, MeDIP-seq, etc.
      –  http://dx.doi.org/10.6084/m9.figshare.105157
    •  DNA Binding / ChIP-Seq
      –  Peak calling, differential binding
    •  DNA Variation
      –  GWAS, NGS
    •  Metagenomics
      –  Microbiome
      –  Microbial forensics
    •  Acquisition / analysis of public data
      –  GEO, dbGaP, SRA, ArrayExpress, etc.
      –  Download and upload
    •  Grant / Manuscript support
      –  Letter of support, resources, etc.
    •  Custom development
  4. Bioinformatics Support for New Technologies

    How is new technology changing bioinformatics theory and practice?
    Data. Lots of it.
  5. New Tech / Big Data

    •  3 V’s of Big Data (Laney, 2001)
      –  Volume: how much data?
        •  HiSeq 2500: 600 GB/run, 6 billion reads/run
        •  MiSeq: 8 GB/run, 30 million reads/run
        •  LHC: 6×10^8 collisions/sec/detector; 1 PB/sec/detector
      –  Variety: maybe the most interesting?
        •  Integrating heterogeneous data types
        •  More on this later…
      –  Velocity
        •  Speed of data in/out (Oxford Nanopore real-time sequencing)
        •  Rate of change of the kinds of data you care about
    •  Others?
      –  Veracity: How trustworthy is the data?
      –  Viability: How meaningful is the data we collected?
      –  Value: What can we now do with the data?
  6. After the Gold Rush…

    •  Hall, N. “After the Gold Rush”. Genome Biol 2013.
    •  What if microscopes got 10x more powerful every year…
      –  Could do the same experiment every few months with the same slide.
      –  Make new discoveries! Publish interesting findings!
    •  Not too different from genomics…
      –  Sequence a Human Genome (HGP 2001)
      –  Sequence 1000 human genomes (1000genomes.org)
      –  Sequence 2000 human genomes (1000genomes.org)
      –  Sequence Human Microbiomes (hmpdacc.org)
      –  Sequence Earth (earthmicrobiome.org)
  7. After the Gold Rush…

    •  What’s possible next year will be the same as what’s possible now.
    •  Fresh ideas needed!
    •  Stability will be good for us in the end.
  8. Bioinformatics in a world of Genome Factories

    •  Adaptation to the environment
    •  Bundled analysis – easy answers
    •  Collaboration
    •  Downstream analysis
    •  Automation vs. innovation
    •  New tech: no pre-built pipelines
      –  Open question: how do we continue to support new technologies for our collaborations? (more next)
    •  Training & Infrastructure: help collaborators help themselves! (more later)
  9. How to support new tech?

    •  Not easy – takes lots of time (read: money).
    •  Read. Read a LOT.
      –  Blog post: How to stay current in Bioinformatics/Genomics:
        •  Journals, blogs, forums, Twitter, listservs, etc.
        •  gettinggeneticsdone.blogspot.com/2012/05/how-to-stay-current-in.html
    •  Brute force: get some data and analyze it.
      –  Sequence data: http://www.ncbi.nlm.nih.gov/sra
      –  Gene Expression: http://www.ncbi.nlm.nih.gov/geo/
      –  Code: github.com, sourceforge.net, code.google.com
    •  Continued education for staff:
      –  Journal clubs, symposia, “hack time”
      –  Scientific meetings
      –  Workshops, training, MOOCs: stephenturner.us/p/edu
  10. Infrastructure: BioConnector (bioconnector.virginia.edu)

    •  Partnership between
      –  Bioinformatics core
      –  Health Sciences Library
      –  Div. Clinical Informatics
    •  Mission: Get researchers connected to the tools and people they need.
    •  Tools:
      –  Galaxy server
      –  VIVO (collaboration)
      –  Wiki (documentation)
      –  CDR
      –  Meeting / workspace
  11. Training

    •  Recent events:
      –  March 5, 2013: Introduction to Galaxy: a Web-Based Bioinformatics Toolkit
      –  March 7-8, 2013: Software Skills Bootcamp (Unix, Python, version control, SQL, etc.)
    •  Training announcements:
      –  bioinformatics.virginia.edu/training
      –  stephenturner.us/p/edu
    •  Open questions:
      –  How can we better train the next generation of genomic scientists (not just computational)?
      –  How can this training be supported? Sustainable? Scalable?
  12. Bioinformatics as a Discipline

    •  MacLean & Kamoun 2012 Nat Biotech: “Big Data in Small Places.”
    •  Bioinformatics as a sub-discipline of molecular biology:
      –  Critical for molecular biologists to understand computational biology.
      –  Same brain considers both biology and bioinformatics.
    •  Biologists often approach computational genomics as if rules of experimental design don’t matter.
      –  N=1
      –  No controls or improper controls
      –  False comfort provided by E-scores, P-values, etc.?
  13. Challenge: Data Integration

    •  Genome.gov/GWAStudies
      –  As of 04/19/13, the catalog includes 1571 publications and 9906 SNPs.
    •  GWAS does not inform:
      –  Which gene affected
      –  How gene function perturbed
      –  How biological processes altered
    •  How to integrate other data (expression, epigenetic, proteomic, ENCODE, etc.) to put GWAS in functional context?
  14. Data Integration: Genetic Variation & Gene Expression

    Are DNA variants that are associated with disease also associated with gene expression levels?
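
    The question on this slide is the basic expression QTL (eQTL) test. As a minimal sketch, assuming additive allele-dosage coding (0/1/2 copies of the variant allele) and one expression value per sample, the association can be summarized by a least-squares slope and a correlation. The function name `eqtl_slope` and the toy data are hypothetical and not from the talk; a real analysis would also need covariates, a significance test, and multiple-testing correction.

```python
from math import sqrt


def eqtl_slope(genotypes, expression):
    """Regress expression on allele dosage; return (slope, Pearson r)."""
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least 3 samples")

    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(genotypes, expression))

    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression has no variance")

    slope = sxy / sxx            # expression change per extra allele copy
    r = sxy / sqrt(sxx * syy)    # strength of the linear association
    return slope, r


# Hypothetical toy data: six samples, made-up expression values.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

slope, r = eqtl_slope(genotypes, expression)
print(f"slope per allele: {slope:.3f}, Pearson r: {r:.3f}")
```

    A disease-associated variant with a clearly non-zero slope for a nearby gene is a candidate eQTL, which is one way to start placing a GWAS hit in functional context.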
  15. Data Integration: 4 Dimensions

    Schadt et al. 2009. Network view of disease and compound screening. Nat Rev Drug Discovery 8:286.
    Probabilistic Bayesian Network Integrating:
    1.  Genetic variation
    2.  Gene expression
    3.  Protein-protein interactions
    4.  Transcription factor binding
  16. Data Integration: 6 Dimensions

    1.  Metabolite concentrations
    2.  RNA expression
    3.  DNA Variation
    4.  DNA-protein binding
    5.  Protein-protein interaction
    6.  Protein-metabolite interaction
    •  Metabolites linked to DNA variants (MetQTLs)
    •  MetQTLs co-localize with eQTLs
    •  Using a Bayesian network
      –  Nodes: DNA variation, gene expression, metabolite concentration
      –  Priors: Protein-DNA binding, protein-protein interaction, metabolite-protein interaction
      –  Edges: Inferred relationships → mechanism
    Infer causality
    Zhu J, … Schadt EE. 2012. Stitching together Multiple Data Dimensions Reveals Interacting Metabolomic and Transcriptomic Networks that Modulate Cell Regulation. PLoS Biol.
  17. Data Integration: Mouse Cis-Regulatory Map

    •  RNA-Seq and ChIP-Seq for 6 DNA-binding factors × 19 cell types
      –  ChIP: PolII, H3K4me3, H3K4me1, H3K27ac, P300, CTCF
      –  Adult Tissues: bone marrow, cerebellum, cortex, heart, intestine, kidney, liver, lung, olfactory bulb, placenta, spleen, testis, thymus
      –  Embryonic Tissues: brain, heart, limb, liver
      –  Cell lines: mESCs, MEFs
    •  Found 300,000 cis-reg features
      –  11% mouse genome
      –  70% conserved non-coding sequence
    Shen et al. A map of the cis-regulatory sequences in the mouse genome. Nature, July 2012.
  18. Data Integration: Epigenome & Transcriptome

    •  Zhang JA, Mortazavi A, Williams BA, Wold BJ, Rothenberg EV. Dynamic Transformations of Genome-wide Epigenetic Marking and Transcriptional Control Establish T Cell Identity. Cell 2012.
    •  ChIP-Seq + RNA-Seq in sequential T-cell developmental stages
    •  Changes in gene expression co-occur w/ histone modification at cis-regulatory sites.
  19. Data Integration: ENCODE

    •  Forget about the “80% is functional” dispute.
      –  Graur et al. 2013 Genome Biol. Evol.
      –  Sean Eddy 2013 Curr. Biol.
    •  30 papers
    •  1640 datasets
    •  31 terabytes
    •  Open questions
      –  What are the current gaps in data integration approaches?
      –  How to best utilize the wealth of publicly available data like ENCODE, etc.?
  20. Summary

    •  Bioinformatics moves fast, but we’re entering a period of stability.
    •  Many strategic reasons for having in-house bioinformatics expertise (supporting new tech is one of them).
    •  Training is important – users must understand to appreciate the value you provide.
    •  Data integration – challenge & opportunity.
  21. Other speakers

    •  Bob Settlage – Virginia Bioinformatics Institute – “Keeping Abreast of New Technologies in a Rapidly Advancing Field: a Case Study from a Small Analysis Core”
    •  Natalie Abrams – NCI/CCR Bioinformatics Core, Frederick Nat’l Lab for Cancer Research – “Addressing Validity Issues Associated with NGS Data Analysis”