Closing the Gap in Time: From Raw Data to Real Science

bioinfo
July 26, 2012



  1. Closing the Gap in Time: From Raw Data to Real Science. EdgeBio, Gaithersburg, MD
  2. Agenda • Who We Are • Next Generation Challenges – Technical Expertise – Flexibility and Scalability – Sequence to Clinic • How We Facilitate Next Gen Science – Science as a Service • Applications • Real World Examples • XPrize
  3. Who We Are

  4. CLIA #21D2039005 MD State License 1853

  5. Edge BioServ • CLIA Lab • Illumina HiSeq, Life Technologies SOLiD, and Ion Torrent PGM platforms to suit projects of all sizes and timelines • Consultation on experimental design • Sample QC and Library Prep • Data analysis and customized bioinformatics support • Multiple Sample Prep and Target Enrichment options • Automation and Robotics
  6. • Anju Varadarajan, Bioinformatics Engineer – May 2010 – Masters in Bioinformatics from Georgia Tech – Handles data management, quality control, bioinformatics analysis of sequencing projects, and pipeline creation and maintenance – Archon Genomics XPrize project – She WILL be the FIRST to clap after this presentation
    • David Jenkins, Bioinformatics Engineer – August 2011 – Bachelor of Science in Computational Biology from Brown University – Has worked to automate the data delivery and analysis pipelines for the Ion Torrent and ensures EdgeBio is up to date on the latest developments in this rapidly evolving technology – Has presented his work at several local Ion Torrent user group meetings – Archon Genomics XPrize project
    • Karthik Kota, Bioinformatics Engineer – March 2012 – Masters in Bioinformatics from Georgia Tech – Develops and analyzes our resequencing pipeline for whole genome shotgun, exome, or targeted sequencing analysis on Illumina and SOLiD platforms – Archon Genomics XPrize project
    • Vani Rajan, Bioinformatics Engineer – January 2012 – Masters in Bioinformatics from Georgia Institute of Technology – Handles pipeline generation and management as well as data management and quality control for our HiSeq system
    • Phil Dagosto – Senior SW Engineer – Infrastructure, LIMS, Pipeline SW Development – All around nice guy
  7. Edge BioServ Scientific Advisory Board
    • Elaine Mardis, Ph.D. – Co-Director, Genome Sequencing Center, Washington University School of Medicine
    • Sam Levy, Ph.D. – Director of Genome Sciences, Scripps Translational Science Institute, Scripps Genomic Medicine
    • Michael Zody, M.S. – Chief Technologist, Broad Institute
    • Ken Dewar, Ph.D. – Assistant Professor, McGill University and Genome Quebec
    • Steven Salzberg, Ph.D. – Director, Center for Bioinformatics and Computational Biology, University of Maryland
    • Gabor Marth, Ph.D. – Professor of Bioinformatics, Boston College
    • Elliott Margulies, Ph.D. – Investigator, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health
  8. What We Enable Evolving Sequencing Methods to Enable Genomic Research

  9. • Genome – De Novo – Resequencing/Mutation Discovery & Profiling – Exome Sequencing – Copy Number Variation – Ancient DNA
    • RNA-Seq/Whole Transcriptome – mRNA Expression & Discovery – Alternative Splicing – Allele-Specific Expression – microRNA Expression & Discovery
    • Epigenome – Transcriptionally Active Sites – Protein-DNA Interactions – Methylation Analysis
    • Metagenome – Microbial Diversity – Heterogeneous Samples
    Ultra High Throughput + Lower Cost = Broader Applications
  10. Challenges

  11. Challenges Technical Expertise

  12. Experimental Design Considerations • Sequencing Platform in Use • Choice of Library Construction • Depth of coverage • Re$ources • Number of Replicates • Number of Samples and Controls • Etc…
  13. Machines and Vendors GnuBio

  14. NGS Exponential Growth & Infrastructure. Nature Biotechnology, Volume 26, Number 10

  15. Bioinformatics Tools
    * BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
    * Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
    * BWA - Heng Li's BWT Alignment program, a progression from Maq. BWA is a fast, lightweight tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C source.
    * ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
    * Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
    * GenomeMapper - A short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. Created by the 1001 Genomes project. Source for POSIX.
    * GMAP - Genomic Mapping and Alignment Program for mRNA and EST sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.
    * gnumap - The Genomic Next-generation Universal MAPper (gnumap) is designed to accurately map sequence data obtained from next-generation sequencing machines (specifically Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
    * MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina, with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source.
    * MOSAIK - Produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX.
    * MrFAST and MrsFAST - Designed to map short reads generated with the Illumina platform to reference genome assemblies in a fast and memory-efficient manner. Robust to INDELs; MrsFAST has a bisulphite mode. Authors are from the University of Washington. C source.
    * MUMmer - A modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg, most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
    * Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
    * PASS - Supports Illumina, SOLiD and Roche-FLX data formats and allows the user to modulate very finely the sensitivity of the alignments. Spaced-seed initial filter, then NW dynamic algorithm to an SW-like local alignment. Authors are from CRIBI in Italy. Win/Linux.
    * RMAP - Assembles 20-64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL (published in BMC Bioinformatics). POSIX OS required.
    * SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OSs.
    * SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystems' colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
    * Slider - An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC.
    * SOAP - Short Oligonucleotide Alignment Program. A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
    * SSAHA - Sequence Search and Alignment by Hashing Algorithm. A tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
    * SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
    * SWIFT - The SWIFT suite is a software collection for fast index-based sequence comparison. It contains: SWIFT - fast local alignment search, guaranteeing to find epsilon-matches between two sequences; SWIFT BALSAM - a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM).
    * SXOligoSearch - A commercial platform offered by the Malaysian-based Synamatix. Will align Illumina reads against a range of RefSeq RNA or NCBI genome builds for a number of organisms. Web portal. OS independent.
    * Vmatch - A versatile software tool for efficiently solving large-scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
    * Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads produced by next-generation sequencing technology back to the reference genomes, and to carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly, with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
    Courtesy of
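Several of the aligners listed above (Bowtie, BWA, the updated SOAP) are described as using a Burrows-Wheeler transform of the reference. As a rough illustration only, and not a description of how any of those tools is implemented, a naive transform of a short string can be sketched in Python. The function name `bwt` and the `$` end marker are assumptions made for this example; production aligners build the transform from suffix arrays rather than sorting full rotations.

```python
def bwt(text: str, end_marker: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Suitable only for short strings: it materializes every rotation.
    """
    if end_marker in text:
        raise ValueError("text must not contain the end marker")
    s = text + end_marker
    # Build all cyclic rotations of the marked string and sort them.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("banana"))
```

The transform groups identical characters that share a following context, which is what makes the index compact and searchable.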
  16. Challenges Flexibility & Scalability

  17. Flexibility and Scale • Traditionally (CE) – 10 Machines, 30 Days, 1 Microbe, 8X • Now (NGS) – 1 Project • 1 Machine, 2 Weeks, 1 Human, 30X • 0.5 Machine, 1 Week, 16 Exomes, 100X • 0.5 Machine, 1 Week, 6 Transcriptomes, 80X • 1 Machine, 3 Days, 1 Microbe, 100X
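The depth figures on this slide (8X, 30X, 100X) are mean coverage: total sequenced bases divided by the size of the target. A minimal Python sketch of that arithmetic follows. The read length, read count, and 3.1 Gb genome size are illustrative assumptions, not numbers from the talk, and the function names are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Mean depth of coverage = total bases sequenced / target size in bases."""
    if target_size <= 0 or read_length <= 0:
        raise ValueError("target_size and read_length must be positive")
    return num_reads * read_length / target_size


def reads_needed(coverage: float, read_length: int, target_size: int) -> int:
    """Smallest number of reads that reaches the requested mean coverage."""
    if target_size <= 0 or read_length <= 0:
        raise ValueError("target_size and read_length must be positive")
    return math.ceil(coverage * target_size / read_length)


if __name__ == "__main__":
    HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate genome size
    READ_LENGTH_BP = 100             # assumed read length
    print(reads_needed(30, READ_LENGTH_BP, HUMAN_GENOME_BP))
    print(mean_coverage(930_000_000, READ_LENGTH_BP, HUMAN_GENOME_BP))
```

This ignores duplicates, unmapped reads, and uneven capture, so real projects usually budget more reads than the formula suggests.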
  18. Challenges From Sequencer to Clinic

  19. From Sequence to the Clinic • Realizing the full potential for the application of genomic sequencing to routinely assess individual variation of medical relevance • How do we deal with: – Institutional Policies and Ethical Concerns – Standardization of Procedures – Integration of data into the clinical workflow
  20. Challenges

  21. The Sky Isn’t Falling

  22. The Challenge is Building… • 1 • 16 • 96 • 384 • 100,000 • 350,000,000 • Tomorrow?
  23. …and distributing • Distributing the Problem – Exchanging Data – Refreshing Data – Building Repositories – Combining and Leveraging Ontologies – Metadata
  24. How do we avoid the Perfect Storm?

  25. Life Vests? Science as a Service • SaaS • IaaS • PaaS • HaaS

  26. Edge BioServ Services Library Construction Sample Preparation Amplification & Sequence

    Data Analysis Library Preparation Sample Capture Sample QC and Quantification Adaptor ligation Fragment amplification Next Generation Sequencing run Align sequence to reference genome Secondary and Tertiary Analysis Experiment and Project Design Project goals and timelines Number of samples Number of reads per sample Project Workflow
  27. How We Enable Evolving Sequencing Methods to Enable Genomic Research

  28. • Technical experts in NGS and Bioinformatics • Build expertise on the expansive knowledge of our clients • Collate researchers’ needs to build lab and analysis pipelines • Edge Bio: Facilitate the Scientific Process
  29. How We Enable Evolving Sequencing Methods to Enable Genomic Research

  30. Project Design Sample QC Library Construction Amplification & Sequence Bioinformatics

    Edge Bio - Abstraction
  31. Project Design Sample QC Library Construction Amplification & Sequence Bioinformatics

    Edge Bio
  32. Project Design Goals Timeline Platforms Applications Samples/Replicates Bioinformatics Resources

  33. Experimental Design Considerations • Project Goals • Project Timeline • Sequencing Platform Recommendations • Application Recommendations – Exome vs. whole genome sequencing? – rRNA depletion vs. polyA selection? • Choice of Library Construction • Number of Replicates • Number of Samples and Controls • Recommendations on depth of coverage • Re$ources
  34. None
  35. Project Design Sample QC Library Construction Amplification & Sequence Bioinformatics

    Edge Bio
  36. Ongoing R&D to Improve and Enhance our NGS Offering • Gene Panels on PGMs and MiSeq • Alternative DNA Fragmentation • Robotics and Automation
  37. Project Design Sample QC Library Construction Amplification & Sequence Bioinformatics

    Edge Bio
  38. Bioinformatics • Cloud Computing (IaaS, PaaS) – Amazon, Google, Others • NGS Software and Algorithms – Commercial and Open Source • Frameworks – (Cloud) BioLinux, Hadoop and Chef • Data Sharing and Standards (GSC/M5)
  39. None
  40. Questions I’ll Try to Answer… • Who are the XPrize Foundation and EdgeBio? • What is the $10M Archon Genomics XPrize? • Whose genomes will be sequenced? • How will we “score” these genomes? • What is the “Public Phase”? • Why is the “Public Phase” important?
  41. None
  42. • Non-profit organization, creating and managing high-profile, global incentivized competitions to solve the Grand Challenges facing humanity • Prize model leverages philanthropy, making it more efficient • Stimulates research & development investments worth far more than the prize purse • Partnered with top global brands and government, including Google, Qualcomm, Cisco, Shell, NASA and the U.S. Department of Energy • World-class Board of Trustees & benefactors such as: Ratan Tata, James Cameron, Larry Page, J. Craig Venter, Dean Kamen, Jim Gianopulos, and Elon Musk
  43. • $10 million competition • 100 genomes of centenarians • $1,000 per genome in 30 days • 1 error in a million base pairs and fully phased / complete genome
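The accuracy target of 1 error per million base pairs can be restated on the Phred scale, Q = -10 log10(p). The short Python sketch below does that conversion and estimates how many errors such a rate would still allow across a whole genome. The 3.1 Gb genome size and the function names are assumptions for illustration, not part of the prize rules.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error_probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(phred: float) -> float:
    """Convert a Phred-scaled quality score back to an error probability."""
    return 10.0 ** (-phred / 10.0)


def expected_errors(genome_size_bp: int, per_base_error: float) -> float:
    """Expected number of erroneous bases at a uniform per-base error rate."""
    return genome_size_bp * per_base_error


if __name__ == "__main__":
    TARGET_ERROR_RATE = 1e-6          # 1 error in a million base pairs
    HUMAN_GENOME_BP = 3_100_000_000   # assumed approximate genome size
    print(phred_quality(TARGET_ERROR_RATE))
    print(expected_errors(HUMAN_GENOME_BP, TARGET_ERROR_RATE))
```

Under these assumptions the target corresponds to roughly Q60, and a genome meeting it could still carry on the order of a few thousand erroneous bases.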
  44. AGXP Validation Study

  45. AGXP Validation Study Analysis • 3 Major Phases – Technology Comparison and Bias Removal – Fosmid Reconstruction – Software Development
  46. AGXP Validation Study Software • Open Source • Automated and Hosted • Web Interface and RESTful API • GATK and Cloud BioLinux VM • Well-Defined Data Formats (VCF)
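The slide names VCF as the well-defined data format. As a hedged illustration of what consuming it involves, the Python sketch below splits one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is not the AGXP validation software: `Variant`, `parse_vcf_line`, and the sample record are made up for this example, and real work would normally rely on an established VCF library.

```python
from typing import Dict, List, NamedTuple, Optional, Union


class Variant(NamedTuple):
    chrom: str
    pos: int
    id: Optional[str]
    ref: str
    alt: List[str]
    qual: Optional[float]
    filter: str
    info: Dict[str, Union[str, bool]]


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 8:
        raise ValueError("not a VCF data line")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict: Dict[str, Union[str, bool]] = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Keys without '=' are flags; store them as True.
            info_dict[key] = value if sep else True

    return Variant(
        chrom=chrom,
        pos=int(pos),
        id=None if vid == "." else vid,      # '.' means missing
        ref=ref,
        alt=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filter=flt,
        info=info_dict,
    )


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;DB"
    print(parse_vcf_line(record))
```

Genotype columns (FORMAT and per-sample fields) are deliberately left out to keep the sketch short.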
  47. AGXP Validation Study Validation protocol available at

  48. Questions? Thank You