Quick Intro to InterMine within AIP and MTGD - JCVI Research WIP Meeting

Presentation on InterMine and its adoption by the AIP and MTGD projects, given at the Informatics Research WIPS meeting held on 03 November 2014 at the J. Craig Venter Institute, Rockville, MD.

Presented by Vivek Krishnakumar

November 03, 2014

Transcript

  1. InterMine
    Integrated Data Warehouse
    Use Cases: Arabidopsis & Medicago Genome Projects
    Vivek Krishnakumar
    Plant Genomics Group (EUK)
    IFX Research WIPS Meeting, 03 November 2014

  2. Overview
    • Introduction
    • InterMine
    ◦ Integrated data warehouse, extensible data model,
    flexible query system
    ◦ Web and programmatic interface
    ◦ Other InterMine instances
    • Use cases
    ◦ Arabidopsis Information Portal (AIP)
    ◦ Medicago truncatula Genome Database (MTGD)
    • Summary
    ◦ Advantages
    ◦ Caveats

  3. Introduction
    For genome projects that wish to expose their
    data via the web (query, visualize, warehouse)
    to foster scientific collaboration, there are
    several technologies available:
    • JCVI developed software
    ◦ Manatee (backed by an RDBMS)
    • Externally developed software
    ◦ BioMart (federated from various databases)
    ◦ Tripal (powered by Drupal, backed by a CHADO database)
    ◦ InterMine

  4. InterMine
    • Functions as a data warehouse for the integration of complex
    biological data. Integration across data types occurs based on
    a common identifier (e.g. gene primary ID)
    • Uses a flexible and extensible data model, controlled by XML
    files, driven by ontologies (Sequence [SO], Gene [GO], etc.)
    ◦ Genomics, Proteomics, Interactions, Homology,
    Expression, Pathways (and more data types)
    ◦ Parsers for commonly used biological data formats
    ◦ Provides a framework for adding your own data
    • Offers a flexible query system, optimized via precomputed
    tables (no need for schema denormalization)
    Smith, RN. et al. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data
    Bioinformatics (2012) 28 (23): 3163-3165
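The integration step described on this slide can be illustrated with a minimal sketch. It is not InterMine's actual integration code; it only shows the idea of folding records from different sources into one entry per shared identifier. The function name `integrate`, the field name `primary_id`, and all record values are hypothetical.

```python
from collections import defaultdict


def integrate(sources, key="primary_id"):
    """Merge records from several sources into one record per identifier."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            # Records sharing the same identifier are folded into one entry;
            # later sources add new fields but never overwrite existing ones.
            target = merged[record[key]]
            for field, value in record.items():
                target.setdefault(field, value)
    return dict(merged)


# Hypothetical records, for illustration only.
genes = [{"primary_id": "AT1G01010", "symbol": "NAC001"}]
expression = [{"primary_id": "AT1G01010", "tissue": "root", "level": 12.5}]
pathways = [{"primary_id": "AT1G01020", "pathway": "example pathway"}]

warehouse = integrate([genes, expression, pathways])
print(warehouse["AT1G01010"])
```

A real warehouse also has to resolve conflicting values and identifier synonyms, which this sketch deliberately ignores.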

  5. InterMine (contd.)
    • Provides a user-friendly web interface exposing
    powerful features:
    ◦ Analysis of lists (facilitate enrichment studies)
    ◦ Full-featured report pages (one-stop shop)
    ◦ Interactive result tables (sort, filter, summarize)
    ◦ Visual query builder (no need to write SQL!)
    ◦ Quick search and Region-based search
    • Fosters development of external applications
    using data hosted within InterMine via Application
    Programming Interfaces (API):
    ◦ RESTful
    ◦ Perl, Python, Ruby, Java, JavaScript
    Kalderimis, A. et al. InterMine: extensive web services for modern biology
    Nucl. Acids Res. (1 July 2014) 42 (W1): W468-W472
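As a rough sketch of what a call to the RESTful interface might look like, the snippet below builds a query in an XML form and encodes it into a results URL using only the Python standard library. The base URL `https://example.org/mymine` is a placeholder, and the `/service/query/results` path, the `query` and `format` parameters, and the XML attribute names are assumptions based on common InterMine usage; check them against the documentation of the mine being queried. No request is sent.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Placeholder base URL; substitute the address of a real mine.
BASE_URL = "https://example.org/mymine"


def build_query_xml(root_class, views, constraints):
    """Build a query document: output columns plus (path, op, value) constraints."""
    view = " ".join(f"{root_class}.{name}" for name in views)
    query = ET.Element("query", {"model": "genomic", "view": view})
    for path, op, value in constraints:
        ET.SubElement(
            query,
            "constraint",
            {"path": f"{root_class}.{path}", "op": op, "value": value},
        )
    return ET.tostring(query, encoding="unicode")


def build_results_url(base_url, query_xml, fmt="json"):
    """Encode the query into a URL for the (assumed) results endpoint."""
    params = urlencode({"query": query_xml, "format": fmt})
    return f"{base_url}/service/query/results?{params}"


query_xml = build_query_xml(
    "Gene",
    views=["primaryIdentifier", "symbol"],
    constraints=[("symbol", "=", "NAC001")],  # hypothetical gene symbol
)
url = build_results_url(BASE_URL, query_xml)
print(url)
```

The language-specific client libraries listed on the slide wrap this kind of request, so in practice one would normally use them rather than assemble URLs by hand.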

  6. Public “Mines”
    • InterMine supports querying across mines
    for cross-database integration
    • Many data warehouses powered by
    InterMine already exist

  7. Arabidopsis Information Portal (AIP)
    • AIP origins
    ◦ Funded by NSF in response to community needs, following
    termination of funding to TAIR
    • AIP objectives
    ◦ Develop a community web resource that…
    – is sustainable and fundable and community-extensible
    – hosts analysis & visualization tools, user data spaces
    ◦ Federation: integrate diverse data sets from distributed data
    sources; foster development of tools for and by the community
    ◦ Maintenance of the Col-0 gold standard annotation
    • AIP methods
    ◦ Assimilate TAIR data
    ◦ Host an InterMine instance devoted to Arabidopsis (thale cress)
    ◦ Offer and consume RESTful web services
    ◦ Integrate and utilize iPlant resources

  8. ThaleMine
    https://apps.araport.org/thalemine
    • An InterMine interface
    to Arabidopsis genomic
    data
    • Integrates a wide
    variety of data types
    (A-E, H), some of
    which are warehoused
    and others are
    federated via web
    services
    • Embedded elements
    visualizing gene
    structure (JBrowse, not
    shown), interaction
    networks (F),
    expression patterns (G)

  9. Visual Query Builder
    Image created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)

  10. Interactive Result Tables and Region-based Search
    Images created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
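The idea behind a region-based search can be sketched as an interval-overlap filter: given a region string, return every feature on that chromosome whose coordinates overlap it. This is only an illustration, not how InterMine implements the search. The `Chr1:480..600` region syntax, the `Feature` type, and the feature names and coordinates are assumptions made for the example.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_region(text):
    """Parse a region such as 'Chr1:480..600' into (chromosome, start, end)."""
    chromosome, span = text.split(":")
    start, end = (int(part) for part in span.split(".."))
    return chromosome, start, end


def search_region(features, region):
    """Return features on the same chromosome that overlap the region."""
    chromosome, start, end = parse_region(region)
    return [
        feature
        for feature in features
        if feature.chromosome == chromosome
        and feature.start <= end
        and feature.end >= start
    ]


# Hypothetical features, for illustration only.
features = [
    Feature("geneA", "Chr1", 100, 500),
    Feature("geneB", "Chr1", 450, 900),
    Feature("geneC", "Chr2", 100, 500),
]
hits = search_region(features, "Chr1:480..600")
print([feature.name for feature in hits])
```

A linear scan like this is fine for a handful of features; a genome-scale search would rely on an indexed range query in the database instead.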

  11. MedicMine
    http://medicmine.jcvi.org
    • NSF funded project to
    assist with the curation
    of the Medicago
    truncatula Genome
    Assembly and
    Annotation (funding
    ended August 2014)
    • To warehouse the project
    data and keep it available,
    an InterMine interface for
    Medicago was implemented
    (backed by a CHADO
    database)
    • Provides functionality
    similar to that available via
    ThaleMine

  12. Summary
    • Advantages
    ◦ InterMine is a powerful biological data warehouse
    ◦ Performs complex data integration
    ◦ Allows fast and flexible querying
    ◦ Well documented programmatic interface
    ◦ Cookie-cutter, user-friendly web interface
    ◦ Facilitates cross-talk between “mines”
    • Caveats
    ◦ Adding more data requires a full database rebuild (incremental loading
    is not possible) because of the integration step
    • About InterMine:
    ◦ Developed by the Micklem Lab at the University of Cambridge, UK
    ◦ Written in Java, backed by a PostgreSQL database, deployed under Tomcat.
    Documentation and downloads available at http://www.intermine.org

  13. Chris Town, PI
    Lisa McDonald, Education and Outreach Coordinator
    Chris Nelson, PM
    Jason Miller, Co-PI, Technical Lead
    Erik Ferlanti, SE
    Vivek Krishnakumar, BE
    Svetlana Karamycheva, BE
    Eva Huala, Project lead, TAIR
    Bob Muller, Technical lead, TAIR
    Gos Micklem, co-PI
    Sergio Contrino, Software Engineer
    Matt Vaughn, co-PI
    Steve Mock, Advanced Computing Interfaces
    Rion Dooley, Web and Cloud Services
    Matt Hanlon, Web and Mobile Applications
    Maria Kim, BE
    Ben Rosen, BA
