Century Bioinformatics Core Facilities
Stephen D. Turner, Ph.D.
Bioinformatics Core Director
Twitter: @genetics_blog
Slides available at: http://dx.doi.org/10.6084/m9.figshare.715242
resource for providing expert and timely bioinformatics consulting and data analysis.
• Main goals: help collaborators publish and get funding.
  – 1. Service
  – 2. Training
  – 3. Infrastructure
5/29/14 2
Data (Laney, 2001)
  – Volume: how much data?
    • HiSeq 2500: 600 GB/run, 6 billion reads/run
    • MiSeq: 8 GB/run, 30 million reads/run
    • LHC: 6 x 10^8 collisions/sec/detector; 1 PB/sec/detector
  – Variety: maybe the most interesting?
    • Integrating heterogeneous data types
    • More on this later…
  – Velocity
    • Speed of data in/out (Oxford Nanopore real-time sequencing)
    • Rate of change of the kinds of data you care about
• Others?
  – Veracity: How trustworthy is the data?
  – Viability: How meaningful is the data we collected?
  – Value: What can we now do with the data?
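The volume figures above can be put on a per-read and per-collision basis with a few lines of arithmetic. This is a minimal sketch that takes the slide's numbers at face value and assumes decimal prefixes (1 G = 10^9, 1 P = 10^15); the names `runs` and `per_read` are illustrative only. If the "GB" figures are really gigabases rather than gigabytes (an assumption; the slide does not say), the per-read ratio is simply the average read length.

```python
# Figures as quoted on the slide; units are taken at face value.
runs = {
    "HiSeq 2500": {"gb_per_run": 600, "reads_per_run": 6_000_000_000},
    "MiSeq": {"gb_per_run": 8, "reads_per_run": 30_000_000},
}


def per_read(gb_per_run, reads_per_run):
    """Output per read, in the base unit of the 'GB' figure (decimal giga = 1e9)."""
    return gb_per_run * 1e9 / reads_per_run


for name, run in runs.items():
    print(f"{name}: about {per_read(**run):.0f} units per read")

# LHC: 1 PB/sec spread over 6 x 10^8 collisions/sec, per detector.
lhc_bytes_per_collision = 1e15 / 6e8
print(f"LHC: about {lhc_bytes_per_collision / 1e6:.2f} MB per collision")
```

On these numbers a HiSeq 2500 run works out to 100 units per read and a MiSeq run to roughly 267, while the LHC figure corresponds to roughly 1.67 MB per collision.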
Rush”. Genome Biol 2013.
• What if microscopes got 10x more powerful every year…
  – Could do the same experiment every few months with the same slide.
  – Make new discoveries! Publish interesting findings!
• Not too different from genomics…
  – Sequence a Human Genome (HGP 2001)
  – Sequence 1000 human genomes (1000genomes.org)
  – Sequence 2000 human genomes (1000genomes.org)
  – Sequence Human Microbiomes (hmpdacc.org)
  – Sequence Earth (earthmicrobiome.org)
the environment
• Bundled analysis – easy answers
• Collaboration
• Downstream analysis
• Automation vs. innovation
• New tech: no pre-built pipelines
  – Open question: how do we continue to support new technologies for our collaborations? (more next)
• Training & Infrastructure: help collaborators help themselves! (more later)
lots of time (read: money).
• Read. Read a LOT.
  – Blog post: How to stay current in Bioinformatics/Genomics:
    • Journals, blogs, forums, Twitter, listservs, etc.
    • gettinggeneticsdone.blogspot.com/2012/05/how-to-stay-current-in.html
• Brute force: get some data and analyze it.
  – Sequence data: http://www.ncbi.nlm.nih.gov/sra
  – Gene Expression: http://www.ncbi.nlm.nih.gov/geo/
  – Code: github.com, sourceforge.net, code.google.com
• Continued education for staff:
  – Journal clubs, symposia, “hack time”
  – Scientific meetings
  – Workshops, training, MOOCs: stephenturner.us/p/edu
Health Sciences Library
  – Div. Clinical Informatics
• Mission: Get researchers connected to the tools and people they need.
• Tools:
  – Galaxy server
  – VIVO (collaboration)
  – Wiki (documentation)
  – CDR
  – Meeting / workspace
Galaxy: a Web-Based Bioinformatics Toolkit
  – March 7-8, 2013: Software Skills Bootcamp (Unix, Python, version control, SQL, etc.)
• Training announcements:
  – bioinformatics.virginia.edu/training
  – stephenturner.us/p/edu
• Open questions:
  – How can we better train the next generation of genomic scientists (not just computational)?
  – How can this training be supported? Sustainable? Scalable?
Biotech: “Big Data in Small Places.”
• Bioinformatics as a sub-discipline of molecular biology:
  – Critical for molecular biologists to understand computational biology.
  – Same brain considers both biology and bioinformatics.
• Biologists often approach computational genomics as if rules of experimental design don’t matter.
  – N=1
  – No controls or improper controls
  – False comfort provided by E-scores, P-values, etc.?
catalog includes 1571 publications and 9906 SNPs.
• GWAS does not inform:
  – Which gene affected
  – How gene function perturbed
  – How biological processes altered
• How to integrate other data (expression, epigenetic, proteomic, ENCODE, etc.) to put GWAS in functional context?
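One simple form of the integration asked about above is a join on SNP ID between GWAS hits and a second data type. The sketch below is illustrative only: every identifier and value in it (`rs_example_1`, `GENE_X`, the trait and tissue names, the `annotate` helper) is a made-up placeholder, not a real catalog entry or an existing tool, and real integration would need far more than a key lookup (LD, tissue context, multiple testing, and so on).

```python
# All identifiers and values here are hypothetical placeholders.
gwas_hits = [
    {"snp": "rs_example_1", "trait": "trait A"},
    {"snp": "rs_example_2", "trait": "trait B"},
]

# Hypothetical SNP-to-expression associations, keyed by SNP ID.
expression_assoc = {
    "rs_example_1": [
        {"gene": "GENE_X", "tissue": "liver"},
        {"gene": "GENE_X", "tissue": "blood"},
    ],
    "rs_example_3": [{"gene": "GENE_Y", "tissue": "liver"}],
}


def annotate(hits, assoc):
    """Left-join GWAS hits to expression associations by SNP ID."""
    annotated = []
    for hit in hits:
        # Collect the distinct genes linked to this SNP, if any.
        genes = sorted({a["gene"] for a in assoc.get(hit["snp"], [])})
        annotated.append({**hit, "candidate_genes": genes})
    return annotated


for row in annotate(gwas_hits, expression_assoc):
    print(row["snp"], row["trait"], row["candidate_genes"] or "no expression link")
```

Hits with no match keep an empty gene list rather than being dropped, which makes the unexplained SNPs visible instead of hiding them.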
dispute.
  – Graur et al. 2013 Genome Biol. Evol.
  – Sean Eddy 2013 Curr. Biol.
• 30 papers
• 1640 datasets
• 31 terabytes
• Open questions
  – What are the current gaps in data integration approaches?
  – How to best utilize the wealth of publicly available data like ENCODE, etc.?
of stability.
• Many strategic reasons for having in-house bioinformatics expertise (supporting new tech is one of them).
• Training is important – users must understand to appreciate the value you provide.
• Data integration – challenge & opportunity.
of New Technologies in a Rapidly Advancing Field: a Case Study from a Small Analysis Core”
• Natalie Abrams
  – NCI/CCR Bioinformatics Core, Frederick Nat’l Lab for Cancer Research
  – “Addressing Validity Issues Associated with NGS Data Analysis”