Intro to the InterMine Infrastructure - LegFed Meeting

Overview of InterMine infrastructure, presented at the Legume Federation Project Meeting

Presented by Vivek Krishnakumar

April 28, 2015

Transcript

  1. Introduction to InterMine Infrastructure
    Vivek Krishnakumar
    LF Meeting 04/28/2015

  2. InterMine in a nutshell
    • Open-source data warehouse software
    • Integration of complex biological data
    • Parsers for common biological data formats
    • Extensible framework for custom data
    • Cookie-cutter interface, highly customizable
    • Interact using sophisticated web query tools
    • Programmatic access using web-service API

  3. Open-source Project
    • Source code available online
    • Distributed under the GNU LGPL license
    • GitHub Repo: https://github.com/intermine/intermine
    • GitHub Organization: https://github.com/intermine
    Repository layout (intermine/intermine):
    > bio
    > biotestmine
    > config
    > flymine
    > humanmine
    > imbuild
    > intermine
    > testmodel
    .gitignore
    .travis.yml
    LICENSE
    LICENSE.LIBS
    README.md
    RELEASE_NOTES

  4. Richard N. Smith et al. Bioinformatics 2012;28:3163-3165
    InterMine system architecture

  5. InterMine system architecture
    Web Application
    • JavaServer Pages (JSP), HTML, JS, CSS
    • Interfaces with Java Servlets and InterMine (IM) web services
    Web Server
    • Tomcat 7.0.x, serves the Web application ARchive (WAR) file
    • Ant-based build system using the Java SDK
    Database Server
    • PostgreSQL 9.2 or above
    • range query, btree, gist enabled (see the docs linked below)
    http://intermine.readthedocs.org/en/latest/system-requirements/

  6. Data Model Overview
    • Object-oriented data model
    • Divided into classes, their attributes and their relationships; defined in XML
    • Represented as Java classes (pure Java beans), auto-generated from the XML and automatically mapped to tables in the schema
    • Core data model is based on the Sequence Ontology (SO); see bio/core/core.xml and bio/core/genomic_additions.xml
    http://intermine.readthedocs.org/en/latest/data-model/overview/
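
To make the "classes, attributes and relationships defined in XML" idea concrete, here is a minimal Python sketch that reads a small model fragment and lists each class with its fields. The fragment is illustrative only: the Gene and Organism classes, their fields, and the element and attribute names used here (class, attribute, reference, collection, referenced-type) are assumptions modeled on the InterMine documentation, not copied from bio/core/core.xml.

```python
import xml.etree.ElementTree as ET

# Illustrative model fragment (an assumption, not the real core.xml).
MODEL_XML = """
<model name="genomic" package="org.intermine.model.bio">
  <class name="Gene" is-interface="true">
    <attribute name="symbol" type="java.lang.String"/>
    <reference name="organism" referenced-type="Organism"/>
    <collection name="transcripts" referenced-type="Transcript"/>
  </class>
  <class name="Organism" is-interface="true">
    <attribute name="taxonId" type="java.lang.Integer"/>
  </class>
</model>
"""

# The three kinds of field a class can carry in this sketch.
FIELD_KINDS = ("attribute", "reference", "collection")


def describe_model(xml_text):
    """Return {class name: [(field kind, field name), ...]} in document order."""
    root = ET.fromstring(xml_text.strip())
    classes = {}
    for cls in root.findall("class"):
        fields = [(child.tag, child.get("name"))
                  for child in cls if child.tag in FIELD_KINDS]
        classes[cls.get("name")] = fields
    return classes


if __name__ == "__main__":
    for name, fields in describe_model(MODEL_XML).items():
        print(name, fields)
```

In the real system this XML drives Java class and table generation; the sketch only shows how little information is needed to describe a class.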

  7. Data Model Overview
    The model expects standard Java names for classes and attributes:
    • classes: start with an upper-case letter and use CamelCase, with no underscores or spaces.
    • fields (attributes, references, collections): start with a lower-case letter and use lowerCamelCase, with no underscores or spaces.
    http://intermine.readthedocs.org/en/latest/data-model/model/
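
The naming rules above can be approximated with two regular expressions. This is a minimal Python sketch, assuming that "standard Java names" means ASCII letters and digits only; the function names are hypothetical, and the real build may apply additional checks.

```python
import re

# Upper-case first letter, then letters/digits only (no underscores or spaces).
CLASS_NAME = re.compile(r"[A-Z][A-Za-z0-9]*")
# Lower-case first letter, then letters/digits only (no underscores or spaces).
FIELD_NAME = re.compile(r"[a-z][A-Za-z0-9]*")


def is_valid_class_name(name):
    """True if the whole name looks like CamelCase, e.g. SequenceFeature."""
    return CLASS_NAME.fullmatch(name) is not None


def is_valid_field_name(name):
    """True if the whole name looks like lowerCamelCase, e.g. primaryIdentifier."""
    return FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("SequenceFeature", "sequence_feature", "Sequence Feature"):
        print(candidate, is_valid_class_name(candidate))
    for candidate in ("primaryIdentifier", "PrimaryIdentifier", "primary_identifier"):
        print(candidate, is_valid_field_name(candidate))
```

A check like this could be run over a source's additions file before a build, to catch names that would not map cleanly to Java.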

  8. Creating & configuring a mine
    • Build out the scaffold for the mine:
    $ cd git/intermine
    $ bio/scripts/make_mine legumine
    • Configure the data to load and the post-processing steps to run by customizing project.xml
    • Each data element corresponds to a directory under bio/sources/*, which defines the parsers that retrieve the data and encodes the rules for integration
    Repository layout after scaffolding (intermine/intermine):
    > bio
    > biotestmine
    > config
    > flymine
    > legumine
        > dbmodel
        > integrate
        > postprocess
        > webapp
        default.intermine.integrate.properties
        default.intermine.webapp.properties
        project.xml
    > humanmine
    > imbuild
    > intermine
    > testmodel
    .gitignore
    .travis.yml
    LICENSE
    LICENSE.LIBS
    README.md
    RELEASE_NOTES
    http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine

  9. Creating & configuring a mine
    [Figure: excerpt of project.xml]
    http://intermine.readthedocs.org/en/latest/get-started/tutorial/#project-xml
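
As a rough illustration of what project.xml encodes, the Python sketch below parses a small, hypothetical project file and lists its data sources and post-processing steps in order. The overall element layout (sources/source, post-processing/post-process) is an assumption based on the InterMine tutorial; the source types and the src.data.dir paths are invented for illustration, and only the source names and the create-search-index step appear elsewhere in this deck.

```python
import xml.etree.ElementTree as ET

# Hypothetical project.xml for a mine called "legumine" (not the real file).
PROJECT_XML = """
<project type="bio">
  <property name="target.model" value="genomic"/>
  <sources>
    <source name="legumine-gff" type="legumine-gff">
      <property name="src.data.dir" location="/data/legumine/gff"/>
    </source>
    <source name="legumine-chr-fasta" type="fasta">
      <property name="src.data.dir" location="/data/legumine/fasta"/>
    </source>
  </sources>
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="create-search-index"/>
  </post-processing>
</project>
"""


def list_sources(xml_text):
    """Return [(source name, source type), ...] in document order."""
    root = ET.fromstring(xml_text.strip())
    return [(s.get("name"), s.get("type"))
            for s in root.findall("./sources/source")]


def list_post_processes(xml_text):
    """Return post-process step names in the order they are declared."""
    root = ET.fromstring(xml_text.strip())
    return [p.get("name")
            for p in root.findall("./post-processing/post-process")]


if __name__ == "__main__":
    print(list_sources(PROJECT_XML))
    print(list_post_processes(PROJECT_XML))
```

Each source's type is what would point at a directory under bio/sources/, while its name is what gets passed to the build.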

  10. Data Sources and Sets
    • InterMine provides a vast library of data source parsers and loaders, covering data types including, but not restricted to:
    genome sequence (fasta)
    annotation (gff)
    ontology (go, so)
    proteins (uniprot)
    interactions (psi-mi)
    pathway (kegg, reactome)
    homologs (panther, compara, homologene)
    publications (pubmed)
    chado (sequence, stock)
    • Custom sources can be written by following the tutorial at http://intermine.readthedocs.org/en/latest/database/data-sources/custom/ or by referring to code from other mines
    http://intermine.readthedocs.org/en/latest/database/data-sources/library/

  11. Building a mine
    • Each InterMine instance requires 3 PostgreSQL databases:
        ◦ legumine: core db mapping to the data model
        ◦ items-legumine: db for storing intermediate Items during load
        ◦ userprofile-legumine: db for storing user-specific data
    • Running the build requires a special config file in the user's home area, containing db connection params and other mine-specific configs to override:
    ${HOME}/.intermine/legumine.properties
    http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
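
The Python sketch below shows how the three databases might be named in such a properties file and how a script could read them back. The property keys (db.production..., db.common-tgt-items..., db.userprofile-production...) are assumptions modeled on the InterMine tutorial and should be checked against the linked docs; the parser is deliberately simplified and ignores line continuations and ":" separators.

```python
# Hypothetical contents of ${HOME}/.intermine/legumine.properties.
# Key names are assumptions; database names come from the slide above.
PROPERTIES_TEXT = """\
# production (core) database
db.production.datasource.serverName=localhost
db.production.datasource.databaseName=legumine
# intermediate items database
db.common-tgt-items.datasource.serverName=localhost
db.common-tgt-items.datasource.databaseName=items-legumine
# user profile database
db.userprofile-production.datasource.serverName=localhost
db.userprofile-production.datasource.databaseName=userprofile-legumine
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def database_names(props):
    """Map each db alias (e.g. 'production') to its configured database name."""
    prefix, suffix = "db.", ".datasource.databaseName"
    return {key[len(prefix):-len(suffix)]: value
            for key, value in props.items()
            if key.startswith(prefix) and key.endswith(suffix)}


if __name__ == "__main__":
    print(database_names(parse_properties(PROPERTIES_TEXT)))
```

Keeping this file in the home area rather than in the repository means connection details and passwords stay out of version control.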

  12. Model Merging & Data Integration
    Model Merging
    • Each source contributes towards the data model
    • bio/core/core.xml is always used as the base for model merging
    • The ant build-db command consumes the SOURCE_additions.xml
    • Model is used to generate tables, Java classes and the webapp
    http://intermine.readthedocs.org/en/latest/database/database-building/model-merging/
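
The merging idea can be sketched in a few lines of Python: start from the base model and union in the classes and fields that each source's additions file contributes. This is a toy representation (a dict from class name to a set of field names), and the class and field names are hypothetical; real model merging also has to deal with superclasses, references and field types.

```python
def merge_models(base, additions):
    """Return a new model with classes from both inputs and fields unioned per class."""
    merged = {cls: set(fields) for cls, fields in base.items()}
    for cls, fields in additions.items():
        merged.setdefault(cls, set()).update(fields)
    return merged


# Hypothetical stand-in for the base model.
core = {
    "Gene": {"primaryIdentifier", "symbol"},
    "Organism": {"taxonId"},
}
# Hypothetical stand-in for one source's additions file.
gff_additions = {
    "Gene": {"briefDescription"},
    "Exon": {"primaryIdentifier"},
}

if __name__ == "__main__":
    model = merge_models(core, gff_additions)
    for cls in sorted(model):
        print(cls, sorted(model[cls]))
```

Because the base is copied first, merging several sources one after another never mutates the core model.
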
    Data Integration
    • Key(s) for a class of object define equivalence for objects of that class
    • A primary key defines the field(s) used to search for equivalence
    • For objects that share the same primary key, fields are merged and stored as a single object
    http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
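
Primary-key based integration can be illustrated with a small Python sketch: records from different sources that agree on the key fields are folded into one object. The records and field values are made up, and conflict handling is simplified so that the first non-empty value for a field wins; that rule is an assumption of this sketch, not a description of how InterMine resolves conflicts.

```python
def integrate(records, key_fields):
    """Merge records that share the same values for key_fields."""
    merged = {}
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if any(part is None for part in key):
            raise ValueError(f"record lacks a primary key value: {record!r}")
        target = merged.setdefault(key, {})
        for field, value in record.items():
            if value is not None:
                # Simplification: keep the first non-empty value seen.
                target.setdefault(field, value)
    return list(merged.values())


# Hypothetical records for the same gene coming from two sources.
from_gff = {"primaryIdentifier": "gene1", "symbol": "abc1", "length": None}
from_fasta = {"primaryIdentifier": "gene1", "symbol": None, "length": 1200}
other_gene = {"primaryIdentifier": "gene2", "symbol": "xyz2", "length": 900}

if __name__ == "__main__":
    for obj in integrate([from_gff, from_fasta, other_gene], ["primaryIdentifier"]):
        print(obj)
```

Here the two "gene1" records collapse into a single object carrying both the symbol and the length, while "gene2" stays separate.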

  13. Post processing
    • Operations are performed on integrated data
    • Calculate or set fields that are difficult to work with during data loading, because they require 2 or more sources to be loaded already
    • Order of steps is somewhat important
    http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/

  14. Building & deploying a mine
    Two types of build mechanisms:
    • Manual:
    $ cd dbmodel && ant clean build-db ## initialize db
    $ cd ../integrate ## data sources are loaded from integrate/
    $ ant -Dsource=legumine-gff ## load data sources
    $ ant -Dsource=legumine-chr-fasta ## load more sources
    $ cd ../postprocess && ant ## run post-process steps
    $ cd ../webapp ## build mine webapp
    $ ant clean remove-webapp default release-webapp
    • Automated:
    $ ../bio/scripts/project_build -b -v localhost ~/legumine-dump
    http://intermine.readthedocs.org/en/latest/database/database-building/build-script/
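
The manual sequence above can be captured as data, which is handy for a dry run before committing to a long build. The Python sketch below only assembles and prints the commands; it does not execute anything. The function name is hypothetical, and running the per-source loads from the integrate directory is an assumption based on the mine layout shown earlier.

```python
def manual_build_commands(sources):
    """Return [(working directory, command), ...] for a manual mine build."""
    commands = [("dbmodel", "ant clean build-db")]            # initialize db
    commands += [("integrate", f"ant -Dsource={name}")        # load each source
                 for name in sources]
    commands.append(("postprocess", "ant"))                   # post-process steps
    commands.append(("webapp",
                     "ant clean remove-webapp default release-webapp"))
    return commands


if __name__ == "__main__":
    for directory, command in manual_build_commands(
            ["legumine-gff", "legumine-chr-fasta"]):
        print(f"(cd {directory} && {command})")
```

Sources are loaded in the order given, so the list passed in should follow the order intended for the build.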

  15. Lucene based search index
    • The "create-search-index" post-process step runs the database indexing, then zips the index and stores it in the db
    • On (first) webapp load, the index is unpacked
    • By default, all id and text fields are ignored by the indexer
    • Uses the Apache Lucene whitespace analyzer to identify word boundaries
    • Control the temp directory and the classes/fields to be ignored by altering the MINE_NAME/dbmodel/resources/keyword_search.properties file
    http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
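
Why whitespace-based word boundaries matter for biological identifiers can be shown with a short Python approximation. This is not Lucene code: str.split() merely mimics splitting on whitespace, the punctuation-based splitter is a contrast case invented for this sketch, and the sample text is hypothetical.

```python
import re


def whitespace_tokens(text):
    """Split on runs of whitespace only, keeping punctuation inside tokens."""
    return text.split()


def punctuation_tokens(text):
    """Contrast case: split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


if __name__ == "__main__":
    sample = "ABC1.2 heat-shock protein"
    print(whitespace_tokens(sample))
    print(punctuation_tokens(sample))
```

With whitespace splitting, a versioned identifier such as "ABC1.2" stays a single searchable token instead of being broken into "ABC1" and "2".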

  16. Alex Kalderimis et al. Nucl. Acids Res. 2014;42:W468-W472
    InterMine web services
    http://iodocs.labs.intermine.org
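
As a rough sketch of programmatic access, the Python below builds a query document and the URL a client might request, without sending anything over the network. Everything specific here is an assumption to verify against the web service docs: the query XML shape (query, view, constraint), the /query/results path, and the query and format parameters. The base URL is a placeholder, not a real mine.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query_xml(views, constraints, model="genomic"):
    """Build a query document from output columns and (path, op, value) constraints."""
    query = ET.Element("query", {"model": model, "view": " ".join(views)})
    for path, op, value in constraints:
        ET.SubElement(query, "constraint",
                      {"path": path, "op": op, "value": value})
    return ET.tostring(query, encoding="unicode")


def build_results_url(base_url, query_xml, fmt="json"):
    """Compose the request URL; the endpoint path is an assumption."""
    params = urlencode({"query": query_xml, "format": fmt})
    return f"{base_url.rstrip('/')}/query/results?{params}"


if __name__ == "__main__":
    xml_text = build_query_xml(
        ["Gene.symbol", "Gene.organism.name"],
        [("Gene.organism.name", "=", "Medicago truncatula")],
    )
    # Placeholder base URL for a hypothetical mine.
    print(build_results_url("https://example.org/legumine/service", xml_text))
```

Building the URL separately from sending it makes the request easy to inspect and to test offline.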

  17. Federated Authentication
    • Apart from the standard login scheme (username/password), InterMine supports industry-standard OAuth2-based login flows, as implemented by Google, GitHub, Agave, etc.
    • ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure
    • Documentation available here: http://intermine.readthedocs.org/en/latest/webapp/properties/web-properties/#openauth2-settings-aka-openid-connect

  18. Friendly reference mines
    • FlyMine: https://github.com/intermine/intermine/
    • ThaleMine: https://github.com/Arabidopsis-Information-Portal/intermine/
    • MedicMine: https://github.com/jcvi-plant-genomics/intermine/
    • PhytoMine: https://github.com/JoeCarlson/intermine/

  19. Summary
    • Advantages
        ◦ InterMine is a powerful biological data warehouse
        ◦ Performs complex data integration
        ◦ Allows fast and flexible querying
        ◦ Well-documented programmatic interface
        ◦ Cookie-cutter, user-friendly web interface
        ◦ Facilitates cross-talk between "mines"
    • Caveats
        ◦ Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step
    • About InterMine:
        ◦ Developed by the Micklem Lab at the University of Cambridge, UK
        ◦ Written in Java, backed by a PostgreSQL db, deployed under Tomcat. Documentation and downloads available at http://www.intermine.org

  20. Acknowledgments
    • InterMine Team
        ◦ Gos Micklem
        ◦ Julie Sullivan
        ◦ Alex Kalderimis
        ◦ Richard Smith
        ◦ Sergio Contrino
        ◦ Josh Heimbach
        ◦ et al.
    • Araport Team
        ◦ Chris Town
        ◦ Jason Miller
        ◦ Matt Vaughn
        ◦ Maria Kim
        ◦ Svetlana Karamycheva
        ◦ Erik Ferlanti
        ◦ Chia-Yi Cheng
        ◦ Benjamin Rosen
        ◦ Irina Belyaeva

  21. THANK YOU
