Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Intro to the InterMine Infrastructure - LegFed ...

Intro to the InterMine Infrastructure - LegFed Meeting

Overview of InterMine infrastructure, presented at the Legume Federation Project Meeting

Presented by Vivek Krishnakumar

Vivek Krishnakumar

April 28, 2015
Tweet

More Decks by Vivek Krishnakumar

Other Decks in Programming

Transcript

  1. InterMine in a nutshell • Open-source data warehouse software •

    Integration of complex biological data • Parsers for common biological data formats • Extensible framework for custom data • Cookie-cutter interface, highly customizable • Interact using sophisticated web query tools • Programmatic access using web-service API
  2. Open-source Project • Source code available online • Distributed with

    the GNU LGPL license • GitHub Repo: https://github.com/intermine/int ermine • GitHub Organization: https://github.com/intermine intermine / intermine > bio > biotestmine > config > flymine > humanmine > imbuild > intermine > testmodel .gitignore .travis.yml LICENSE LICENSE.LIBS README.md RELEASE_NOTES
  3. InterMine system architecture Web Application • Java Server Pages (JSP),

    HTML, JS, CSS • Interfaces with Java Servlets and IM web-services Web Server • Tomcat 7.0.x, serves Web application ARchive file • ant based build system using Java SDK Database Server • PostgreSQL 9.2 or above • range query, btree, gist enabled (refer docs here) http://intermine.readthedocs.org/en/latest/system-requirements/
  4. Data Model Overview • Object-oriented data model • Divided into

    classes, their attributes and their relationships; defined in XML • Represented as Java classes (pure Java beans); auto-generated from XML, automatically map to tables in schema • Core data model; based on Sequence Ontology (SO); refer: bio/core/core.xml and bio/core/genomic_additions.xml http://intermine.readthedocs.org/en/latest/data-model/overview/
  5. Data Model Overview <?xml version="1.0"?> <model name="example" package="org.intermine.model.bio"> <class name="Protein"

    is-interface="true" extends="SequenceFeature"> <attribute name="name" type="java.lang.String"/> <attribute name="accession" type="java.lang.String"/> <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/> </class> <class name="NewFeature" is-interface="true"> <attribute name="identifier" type="java.lang.String"/> <attribute name="confidence" type="java.lang.Double"/> <reference name="protein" referenced-type="Protein" reverse-reference="features"/> </class> </model> Model expects standard Java names for classes and attributes • classes: start with an upper case letter and be CamelCase, no underscores or spaces. • fields (attributes, references, collections): should start with a lower case letter and be lowerCamelCase, no underscores or spaces. http://intermine.readthedocs.org/en/latest/data-model/model/
  6. Creating & configuring a mine • Build out scaffold for

    mine $ cd git/intermine $ bio/scripts/make_mine legumine • Configure data to load and post-processing steps to run by customizing project.xml • Data <source /> elements correspond to directory under bio/sources/*; defines parsers to retrieve data and encodes rules for integration intermine / intermine > bio > biotestmine > config > flymine > legumine > dbmodel > integrate > postprocess > webapp > default.intermine.integrate.properties > default.intermine.webapp.properties > project.xml > humanmine > imbuild > intermine > testmodel .gitignore .travis.yml LICENSE LICENSE.LIBS README.md RELEASE_NOTES http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine
  7. Creating & configuring a mine <project type="bio"> <property name="target.model" value="genomic"/>

    <property name="source.location" location="../bio/sources/"/> <property name="common.os.prefix" value="common"/> <property name="intermine.properties.file" value="legumine.properties"/> <property name="default.intermine.properties.file" location="../default.intermine.integrate.properties"/> <sources> <source name=”legumine-gff" type="legumine-gff"> <property name="gff3.taxonId" value="3880"/> <property name="gff3.seqDataSourceName" value="LF"/> <property name="gff3.dataSourceName" value="LF"/> <property name="gff3.seqClsName" value="Chromosome"/> <property name="gff3.dataSetTitle" value="Genome Annotation"/> <property name="src.data.dir" location="/path/to/legumine/genome/gff/" /> </source> : : </sources> <post-processing> <post-process name="create-references" /> <post-process name="create-chromosome-locations-and-lengths"/> <post-process name="create-gene-flanking-features" /> : : </post-processing> </project> project.xml http://intermine.readthedocs.org/en/latest/get-started/tutorial/#project-xml
  8. Data Sources and Sets • InterMine provides a vast library

    of data source parsers and loaders, covering data types not restricted to: genome sequence (fasta) annotation (gff) ontology (go, so) proteins (uniprot) interactions (psi-mi) pathway (kegg, reactome) homologs (panther, compara, homologene) publications (pubmed) chado (sequence, stock) • Custom sources can be written by following the tutorial: http://intermine.readthedocs.org/en/latest/database/data- sources/custom/ or by referring to code from other mines http://intermine.readthedocs.org/en/latest/database/data-sources/library/
  9. Building a mine • Each InterMine instance requires 3 PostgreSQL

    databases: ¡ legumine: core db mapping to data model ¡ items-legumine: db for storing intermediate Items during load ¡ userprofile-legumine: db for storing user specific data • Running build requires special config file in the users’ home area, containing db connection params and other mine specific configs to override ${HOME}/.intermine/legumine.properties http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
  10. Model Merging & Data Integration Model Merging • Each source

    contributes towards the data model • bio/core/core.xml is always used as the base for model merging • The ant build-db command consumes the SOURCE_additions.xml • Model is used to generate tables, Java classes and the webapp http://intermine.readthedocs.org/en/latest/database/database-building/model- merging/ Data Integration • Key(s) for class of object defines equivalence for objects of that class • Primary key defines field(s) used to search for equivalence • For objects which share same primary key, fields are merged and stored as single object http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
  11. Post processing • Operations are performed on integrated data •

    Calculate/set fields difficult to work with while data loading, because they require 2 or more sources to be loaded already • Order of steps is somewhat important <post-processing> <post-process name="create-references" /> <post-process name="create-chromosome- locations-and-lengths"/> <post-process name="create-gene-flanking- features" /> <post-process name="do-sources" /> <post-process name="create-intron- features"> <property name="organisms" value="3880"/> </post-process> <post-process name="transfer-sequences"/> <post-process name="populate-child- features"/> <post-process name="create-location-range- index" /> <post-process name="create-overlap-view" /> <post-process name="create-attribute- indexes"/> <post-process name="summarise- objectstore"/> <post-process name="create-search-index"/> </post-processing> http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/
  12. Building & deploying a mine Two types of build mechanisms:

    • Manual: $ cd dbmodel && ant clean build-db ## initialize db $ ant -Dsource=legumine-gff ## load data sources $ ant -Dsource=legumine-chr-fasta ## load more sources $ cd ../postprocess && ant ## run post-process steps $ cd ../webapp ## build mine webapp $ ant clean remove-webapp default release-webapp • Automated: $ ../bio/scripts/project_build -b -v localhost ~/legumine-dump http://intermine.readthedocs.org/en/latest/database/database-building/build-script/
  13. Lucene based search index • Post-process "create-search-index" runs the database

    indexing, zips and stores in db • On webapp (first) load, index is unpacked • By default, all id and text fields are ignored by the indexer • Uses the Apache Lucene whitespace analyzer to identify word boundaries • Control temp directory and classes/fields to be ignored by altering MINE_NAME/dbmodel/resources/keyword_sear ch.properties file http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
  14. Federated Authentication • Apart from the standard login scheme (username/password),

    InterMine supports industry standard OAuth2 based login flows, implemented by Google, GitHub, Agave, etc. • ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure • Documentation available here: http://intermine.readthedocs.org/en/latest/webapp/ properties/web-properties/#openauth2-settings- aka-openid-connect
  15. Friendly reference mines • FlyMine: https://github.com/intermine/intermine/ • ThaleMine: https://github.com/Arabidopsis- Information-Portal/intermine/

    • MedicMine: https://github.com/jcvi-plant- genomics/intermine/ • PhytoMine: https://github.com/JoeCarlson/intermine/
  16. Summary • Advantages ¡ InterMine is a powerful biological data

    warehouse ¡ Performs complex data integration ¡ Allows fast and flexible querying ¡ Well documented programmatic interface ¡ Cookie-cutter, user-friendly web interface ¡ Facilitates cross-talk between “mines” • Caveats ¡ Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step • About InterMine: ¡ Developed by the Micklem Lab at the University of Cambridge, UK ¡ Written in Java, backed by PostgreSQLdb, deployed under Tomcat. Documentation and downloads available at http://www.intermine.org
  17. Acknowledgments • InterMine Team ¡ Gos Micklem ¡ Julie Sullivan

    ¡ Alex Kalderimis ¡ Richard Smith ¡ Sergio Contrino ¡ Josh Heimbach ¡ et al. • Araport Team ¡ Chris Town ¡ Jason Miller ¡ Matt Vaughn ¡ Maria Kim ¡ Svetlana Karamycheva ¡ Erik Ferlanti ¡ Chia-Yi Cheng ¡ Benjamin Rosen ¡ Irina Belyaeva