Quick Intro to InterMine within AIP and MTGD - JCVI Research WIP Meeting

Presentation regarding InterMine and its adoption by the AIP and MTGD project, made at the Informatics Research WIPS meeting on 03 November 2014, conducted at J. Craig Venter Institute, Rockville, MD.

Presented by Vivek Krishnakumar

November 03, 2014

Transcript

  1. InterMine Integrated Data Warehouse Use Cases: Arabidopsis & Medicago Genome Projects

    Vivek Krishnakumar, Plant Genomics Group (EUK)
    IFX Research WIPS Meeting, 03 November 2014
  2. Overview

    • Introduction
    • InterMine
      ◦ Integrated data warehouse, Extensible data model, Flexible query system
      ◦ Web and Programmatic Interface
      ◦ Other InterMine instances
    • Use cases
      ◦ Arabidopsis Information Portal (AIP)
      ◦ Medicago truncatula Genome Database (MTGD)
    • Summary
      ◦ Advantages
      ◦ Caveats
  3. Introduction

    For genome projects that wish to expose their data via the web (query, visualize, warehouse) to foster scientific collaboration, there are several technologies available:
    • JCVI-developed software
      ◦ Manatee (backed by an RDBMS)
    • Externally developed software
      ◦ BioMart (federated from various databases)
      ◦ Tripal (powered by Drupal, backed by a CHADO database)
      ◦ InterMine
  4. InterMine

    • Functions as a data warehouse for the integration of complex biological data. Integration across data types occurs based on a common identifier (e.g. gene primary ID)
    • Uses a flexible and extensible data model, controlled by XML files, driven by ontologies (Sequence [SO], Gene [GO], etc.)
      ◦ Genomics, Proteomics, Interactions, Homology, Expression, Pathways (and more data types)
      ◦ Parsers for commonly used biological data formats
      ◦ Provides a framework for adding your own data
    • Offers a flexible query system, optimized via precomputed tables (no need for schema denormalization)

    Smith, RN. et al. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics (2012) 28 (23): 3163-3165
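The idea of integrating data types on a common identifier can be illustrated with a toy sketch. This is not InterMine's loader or its actual merge logic; the `integrate` function, the `primary_id` field name, and the sample records are all hypothetical and exist only to show records from separate sources collapsing into one entry per gene.

```python
from collections import defaultdict


def integrate(sources, key="primary_id"):
    """Merge records from several sources into one record per identifier.

    Toy illustration only: `sources` maps a source name to a list of dicts,
    and every dict is assumed to carry the shared identifier under `key`.
    """
    merged = defaultdict(dict)
    for records in sources.values():
        for record in records:
            identifier = record[key]
            entry = merged[identifier]
            entry[key] = identifier
            for field, value in record.items():
                if field != key:
                    # First source to supply a field wins; later ones do not overwrite.
                    entry.setdefault(field, value)
    return dict(merged)


# Hypothetical records from two data types sharing a gene identifier.
sources = {
    "annotation": [{"primary_id": "GENE0001", "symbol": "abc1"}],
    "expression": [
        {"primary_id": "GENE0001", "tissue": "root"},
        {"primary_id": "GENE0002", "tissue": "leaf"},
    ],
}
warehouse = integrate(sources)
```

Here `warehouse["GENE0001"]` carries both the symbol and the tissue, because both sources used the same identifier.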
  5. InterMine (contd.)

    • Provides a user-friendly web interface exposing powerful features:
      ◦ Analysis of lists (facilitate enrichment studies)
      ◦ Full-featured report pages (one-stop shop)
      ◦ Interactive result tables (sort, filter, summarize)
      ◦ Visual query builder (no need to write SQL!)
      ◦ Quick search and Region-based search
    • Fosters development of external applications using data hosted within InterMine via Application Programming Interfaces (API):
      ◦ RESTful
      ◦ Perl, Python, Ruby, Java, JavaScript

    Kalderimis, A. et al. InterMine: extensive web services for modern biology. Nucl. Acids Res. (1 July 2014) 42 (W1): W468-W472
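As a rough sketch of what a RESTful call might look like, the snippet below only builds a query URL and does not contact any server. It assumes the usual InterMine conventions: a `/service` root under the mine URL, a `query/results` endpoint that takes a PathQuery XML document in a `query` parameter, and a model named `genomic` with `Gene.primaryIdentifier` and `Gene.symbol` attributes. The helper name `build_query_url` and the symbol value are hypothetical; check the target mine's own web service documentation before relying on any of this.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr


def build_query_url(service_root, view, path, op, value, fmt="json"):
    """Build a results URL for a single-constraint query (sketch, no network I/O)."""
    # quoteattr() escapes the value and wraps it in quotes for use as an XML attribute.
    query_xml = (
        f'<query model="genomic" view={quoteattr(" ".join(view))}>'
        f"<constraint path={quoteattr(path)} op={quoteattr(op)} value={quoteattr(value)}/>"
        "</query>"
    )
    return f"{service_root}/query/results?" + urlencode({"query": query_xml, "format": fmt})


# Assumed service root: the mine URL plus the conventional "/service" suffix.
url = build_query_url(
    "https://apps.araport.org/thalemine/service",
    view=["Gene.primaryIdentifier", "Gene.symbol"],
    path="Gene.symbol",
    op="=",
    value="EXAMPLE_SYMBOL",  # hypothetical placeholder, not a real gene symbol
)
```

The language-specific client libraries listed on the slide wrap this kind of request, so hand-building URLs is mainly useful for understanding what travels over the wire.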
  6. Public “Mines”

    • InterMine supports querying across mines for cross-database integration
    • Vast number of warehouses powered by InterMine already exist
  7. Arabidopsis Information Portal (AIP)

    • AIP origins
      ◦ Funded by NSF in response to community needs, following termination of funding to TAIR
    • AIP objectives
      ◦ Develop a community web resource that…
        – is sustainable and fundable and community-extensible
        – hosts analysis & visualization tools, user data spaces
      ◦ Federation: integrate diverse data sets from distributed data sources; foster development of tools for and by the community
      ◦ Maintenance of the Col-0 gold standard annotation
    • AIP methods
      ◦ Assimilate TAIR data
      ◦ Host an InterMine instance devoted to Arabidopsis (thale cress)
      ◦ Offer and consume RESTful web services
      ◦ Integrate and utilize iPlant resources
  8. ThaleMine

    https://apps.araport.org/thalemine
    • An InterMine interface to Arabidopsis genomic data
    • Integrates a wide variety of data types (A-E, H), some of which are warehoused and others are federated via web services
    • Embedded elements visualizing gene structure (JBrowse, not shown), interaction networks (F), expression patterns (G)
  9. Visual Query Builder

    Image created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
  10. Interactive Result Tables; Region-based search

    Images created by Benjamin Rosen (Bioinformatics Analyst, Plant Genomics Group)
  11. MedicMine

    http://medicmine.jcvi.org
    • NSF-funded project to assist with the curation of the Medicago truncatula Genome Assembly and Annotation (funding ended August 2014)
    • To warehouse the project data and keep it available after funding ended, an InterMine interface for Medicago was implemented (backed by a CHADO database)
    • Provides functionality similar to that available via ThaleMine
  12. Summary

    • Advantages
      ◦ InterMine is a powerful biological data warehouse
      ◦ Performs complex data integration
      ◦ Allows fast and flexible querying
      ◦ Well-documented programmatic interface
      ◦ Cookie-cutter, user-friendly web interface
      ◦ Facilitates cross-talk between “mines”
    • Caveats
      ◦ Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step
    • About InterMine:
      ◦ Developed by the Micklem Lab at the University of Cambridge, UK
      ◦ Written in Java, backed by a PostgreSQL database, deployed under Tomcat
      ◦ Documentation and downloads available at http://www.intermine.org
  13. Chris Town, PI; Lisa McDonald, Education and Outreach Coordinator; Chris Nelson, PM; Jason Miller, Co-PI, Technical Lead; Erik Ferlanti, SE; Vivek Krishnakumar, BE; Svetlana Karamycheva, BE; Maria Kim, BE; Ben Rosen, BA

    Eva Huala, Project lead, TAIR; Bob Muller, Technical lead, TAIR; Gos Micklem, co-PI; Sergio Contrino, Software Engineer; Matt Vaughn, co-PI; Steve Mock, Advanced Computing Interfaces; Rion Dooley, Web and Cloud Services; Matt Hanlon, Web and Mobile Applications