Intro to the InterMine Infrastructure - LegFed Meeting

Introduction to InterMine Infrastructure
Vivek Krishnakumar
LF Meeting, 04/28/2015

InterMine in a nutshell
• Open-source data warehouse software
• Integration of complex biological data
• Parsers for common biological data formats
• Extensible framework for custom data
• Cookie-cutter interface, highly customizable
• Interact using sophisticated web query tools
• Programmatic access using web-service API

Open-source Project
• Source code available online
• Distributed under the GNU LGPL license
• GitHub Repo: https://github.com/intermine/intermine
• GitHub Organization: https://github.com/intermine

Repository layout (intermine / intermine):
    bio
    biotestmine
    config
    flymine
    humanmine
    imbuild
    intermine
    testmodel
    .gitignore
    .travis.yml
    LICENSE
    LICENSE.LIBS
    README.md
    RELEASE_NOTES

InterMine system architecture
(Figure: Richard N. Smith et al., Bioinformatics 2012;28:3163-3165)

InterMine system architecture
Web Application
• Java Server Pages (JSP), HTML, JS, CSS
• Interfaces with Java Servlets and InterMine (IM) web services
Web Server
• Tomcat 7.0.x, serves the Web application ARchive (WAR) file
• ant-based build system using the Java SDK
Database Server
• PostgreSQL 9.2 or above
• range query, btree, gist enabled (refer to the docs at the link below)
http://intermine.readthedocs.org/en/latest/system-requirements/

Data Model Overview
• Object-oriented data model
• Divided into classes, their attributes and their relationships; defined in XML
• Represented as Java classes (pure Java beans), auto-generated from the XML and automatically mapped to tables in the schema
• Core data model is based on the Sequence Ontology (SO); refer to bio/core/core.xml and bio/core/genomic_additions.xml
http://intermine.readthedocs.org/en/latest/data-model/overview/

Data Model Overview

    <?xml version="1.0"?>
    <model name="example" package="org.intermine.model.bio">
      <class name="Protein" is-interface="true" extends="SequenceFeature">
        <attribute name="name" type="java.lang.String"/>
        <attribute name="accession" type="java.lang.String"/>
        <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
      </class>
      <class name="NewFeature" is-interface="true">
        <attribute name="identifier" type="java.lang.String"/>
        <attribute name="confidence" type="java.lang.Double"/>
        <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
      </class>
    </model>

The model expects standard Java names for classes and attributes:
• classes: should start with an upper case letter and be CamelCase, no underscores or spaces.
• fields (attributes, references, collections): should start with a lower case letter and be lowerCamelCase, no underscores or spaces.
http://intermine.readthedocs.org/en/latest/data-model/model/
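The naming rules above can be checked mechanically. The following is a minimal sketch, not part of InterMine itself: it parses the example model from this slide with Python's standard library and reports names that break the conventions. The function name check_model_names and the two regular expressions are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The example model from the slide, reproduced so the sketch is self-contained.
MODEL_XML = """<?xml version="1.0"?>
<model name="example" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true" extends="SequenceFeature">
    <attribute name="name" type="java.lang.String"/>
    <attribute name="accession" type="java.lang.String"/>
    <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
  </class>
</model>
"""

# Assumed patterns: letters and digits only, so underscores and spaces fail.
CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")
FIELD_NAME = re.compile(r"^[a-z][A-Za-z0-9]*$")


def check_model_names(model_xml):
    """Return a list of naming problems found in a model XML string."""
    problems = []
    root = ET.fromstring(model_xml)
    for cls in root.iter("class"):
        cls_name = cls.get("name", "")
        if not CLASS_NAME.match(cls_name):
            problems.append(f"class '{cls_name}' should be CamelCase")
        for field in cls:
            # Only attributes, references and collections count as fields.
            if field.tag not in ("attribute", "reference", "collection"):
                continue
            field_name = field.get("name", "")
            if not FIELD_NAME.match(field_name):
                problems.append(
                    f"{field.tag} '{cls_name}.{field_name}' should be lowerCamelCase"
                )
    return problems


if __name__ == "__main__":
    for problem in check_model_names(MODEL_XML):
        print(problem)
```

Running the check on a SOURCE_additions.xml file before a build is one possible use; the slide's own example passes without any reported problems.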

Creating & configuring a mine
• Build out the scaffold for the mine:

    $ cd git/intermine
    $ bio/scripts/make_mine legumine

• Configure the data to load and the post-processing steps to run by customizing project.xml
• Each data <source /> element corresponds to a directory under bio/sources/*, which defines parsers to retrieve the data and encodes rules for integration

Repository layout after scaffolding (intermine / intermine):
    bio
    biotestmine
    config
    flymine
    legumine
        dbmodel
        integrate
        postprocess
        webapp
        default.intermine.integrate.properties
        default.intermine.webapp.properties
        project.xml
    humanmine
    imbuild
    intermine
    testmodel
    .gitignore
    .travis.yml
    LICENSE
    LICENSE.LIBS
    README.md
    RELEASE_NOTES
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine

Creating & configuring a mine: project.xml

    <project type="bio">
      <property name="target.model" value="genomic"/>
      <property name="source.location" location="../bio/sources/"/>
      <property name="common.os.prefix" value="common"/>
      <property name="intermine.properties.file" value="legumine.properties"/>
      <property name="default.intermine.properties.file" location="../default.intermine.integrate.properties"/>
      <sources>
        <source name="legumine-gff" type="legumine-gff">
          <property name="gff3.taxonId" value="3880"/>
          <property name="gff3.seqDataSourceName" value="LF"/>
          <property name="gff3.dataSourceName" value="LF"/>
          <property name="gff3.seqClsName" value="Chromosome"/>
          <property name="gff3.dataSetTitle" value="Genome Annotation"/>
          <property name="src.data.dir" location="/path/to/legumine/genome/gff/"/>
        </source>
        :
        :
      </sources>
      <post-processing>
        <post-process name="create-references"/>
        <post-process name="create-chromosome-locations-and-lengths"/>
        <post-process name="create-gene-flanking-features"/>
        :
        :
      </post-processing>
    </project>

http://intermine.readthedocs.org/en/latest/get-started/tutorial/#project-xml
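To see what a project.xml will load and run, the file can be summarised with a few lines of Python. This is only a sketch: the helper name summarise_project is hypothetical, and the embedded XML is a shortened copy of the slide's example with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's project.xml (elided sections omitted).
PROJECT_XML = """<project type="bio">
  <property name="target.model" value="genomic"/>
  <sources>
    <source name="legumine-gff" type="legumine-gff">
      <property name="gff3.taxonId" value="3880"/>
      <property name="gff3.seqClsName" value="Chromosome"/>
      <property name="src.data.dir" location="/path/to/legumine/genome/gff/"/>
    </source>
  </sources>
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="create-chromosome-locations-and-lengths"/>
    <post-process name="create-gene-flanking-features"/>
  </post-processing>
</project>
"""


def summarise_project(project_xml):
    """Return (sources, post_process_steps) from a project.xml string."""
    root = ET.fromstring(project_xml)
    sources = {}
    for source in root.findall("./sources/source"):
        props = {}
        for prop in source.findall("property"):
            # A property carries either a value= or a location= attribute.
            props[prop.get("name")] = prop.get("value") or prop.get("location")
        sources[source.get("name")] = {
            "type": source.get("type"),
            "properties": props,
        }
    # Document order is preserved, which matters for post-processing.
    steps = [
        step.get("name")
        for step in root.findall("./post-processing/post-process")
    ]
    return sources, steps


if __name__ == "__main__":
    sources, steps = summarise_project(PROJECT_XML)
    for name, info in sources.items():
        print(name, info["type"], info["properties"])
    print(steps)
```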

Data Sources and Sets
• InterMine provides a large library of data source parsers and loaders, covering data types including, but not restricted to:
    genome sequence (fasta)
    annotation (gff)
    ontology (go, so)
    proteins (uniprot)
    interactions (psi-mi)
    pathway (kegg, reactome)
    homologs (panther, compara, homologene)
    publications (pubmed)
    chado (sequence, stock)
• Custom sources can be written by following the tutorial at http://intermine.readthedocs.org/en/latest/database/data-sources/custom/ or by referring to code from other mines
http://intermine.readthedocs.org/en/latest/database/data-sources/library/

Building a mine
• Each InterMine instance requires 3 PostgreSQL databases:
    ◦ legumine: core db mapping to the data model
    ◦ items-legumine: db for storing intermediate Items during load
    ◦ userprofile-legumine: db for storing user-specific data
• Running a build requires a special config file in the user's home directory, containing the db connection parameters and other mine-specific settings to override:
    ${HOME}/.intermine/legumine.properties
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
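As a rough illustration of how the three databases map onto the properties file, the sketch below renders the database section for a given mine name. The property prefixes (db.production, db.common-tgt-items, db.userprofile-production) and the datasource key names are recalled from the InterMine tutorial rather than taken from this deck, so treat them as assumptions and check them against the linked documentation. The helper name render_db_properties and the placeholder credentials are hypothetical.

```python
def render_db_properties(mine, host="localhost",
                         user="PSQL_USER", password="PSQL_PWD"):
    """Render the database section of ~/.intermine/<mine>.properties.

    Key names are assumptions recalled from the InterMine tutorial;
    verify them against the documentation before use.
    """
    # Property prefix -> PostgreSQL database name (names from the slide).
    databases = {
        "production": mine,
        "common-tgt-items": f"items-{mine}",
        "userprofile-production": f"userprofile-{mine}",
    }
    lines = []
    for prefix, db_name in databases.items():
        base = f"db.{prefix}.datasource"
        lines.append(f"{base}.serverName={host}")
        lines.append(f"{base}.databaseName={db_name}")
        lines.append(f"{base}.user={user}")
        lines.append(f"{base}.password={password}")
        lines.append("")  # blank line between database sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_db_properties("legumine"))
```

The PSQL_USER and PSQL_PWD values are placeholders to be replaced with real credentials.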

Model Merging & Data Integration
Model Merging
• Each source contributes towards the data model
• bio/core/core.xml is always used as the base for model merging
• The ant build-db command consumes the SOURCE_additions.xml files
• The model is used to generate tables, Java classes and the webapp
http://intermine.readthedocs.org/en/latest/database/database-building/model-merging/
Data Integration
• Key(s) for a class of object define equivalence for objects of that class
• The primary key defines the field(s) used to search for equivalence
• For objects which share the same primary key, fields are merged and stored as a single object
http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
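The key-based integration described above can be pictured with a small sketch. This is a conceptual illustration in plain Python, not InterMine code: objects are dicts, the function name integrate is hypothetical, and conflict handling (the first value seen for a field is kept) is an assumption made to keep the example short. The gene identifiers and field values are made up.

```python
def integrate(objects, key_fields):
    """Merge objects that share the same primary key into one object."""
    merged = {}
    for obj in objects:
        # Two objects are treated as equivalent when all key fields match.
        key = tuple(obj[field] for field in key_fields)
        stored = merged.setdefault(key, {})
        for field, value in obj.items():
            # Assumption: keep the first value seen for each field.
            stored.setdefault(field, value)
    return list(merged.values())


# Two hypothetical sources describing overlapping genes.
gff_genes = [
    {"primaryIdentifier": "gene0001", "length": 1200},
    {"primaryIdentifier": "gene0002", "length": 800},
]
annotation_genes = [
    {"primaryIdentifier": "gene0001", "symbol": "abc1"},
]

genes = integrate(gff_genes + annotation_genes, ["primaryIdentifier"])

if __name__ == "__main__":
    for gene in genes:
        print(gene)
```

Here gene0001 ends up as a single object carrying both the length from the first source and the symbol from the second, while gene0002 is stored unchanged.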

Post processing
• Operations are performed on the integrated data
• Calculate/set fields that are difficult to work with during data loading, because they require 2 or more sources to be loaded already
• Order of steps is somewhat important

    <post-processing>
      <post-process name="create-references"/>
      <post-process name="create-chromosome-locations-and-lengths"/>
      <post-process name="create-gene-flanking-features"/>
      <post-process name="do-sources"/>
      <post-process name="create-intron-features">
        <property name="organisms" value="3880"/>
      </post-process>
      <post-process name="transfer-sequences"/>
      <post-process name="populate-child-features"/>
      <post-process name="create-location-range-index"/>
      <post-process name="create-overlap-view"/>
      <post-process name="create-attribute-indexes"/>
      <post-process name="summarise-objectstore"/>
      <post-process name="create-search-index"/>
    </post-processing>

http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/
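Since the order of steps matters, a quick sanity check can compare a mine's post-processing list against a reference order. The sketch below uses the order shown on this slide as the reference; that is an assumption for illustration, not an authoritative dependency list, and the function name order_violations is hypothetical.

```python
# Reference order taken from the slide; not an official dependency list.
SLIDE_ORDER = [
    "create-references",
    "create-chromosome-locations-and-lengths",
    "create-gene-flanking-features",
    "do-sources",
    "create-intron-features",
    "transfer-sequences",
    "populate-child-features",
    "create-location-range-index",
    "create-overlap-view",
    "create-attribute-indexes",
    "summarise-objectstore",
    "create-search-index",
]


def order_violations(steps, reference=SLIDE_ORDER):
    """Return adjacent (earlier, later) pairs that contradict the reference."""
    rank = {name: position for position, name in enumerate(reference)}
    # Steps missing from the reference are skipped rather than flagged.
    known = [step for step in steps if step in rank]
    violations = []
    for earlier, later in zip(known, known[1:]):
        if rank[earlier] > rank[later]:
            violations.append((earlier, later))
    return violations


if __name__ == "__main__":
    candidate = ["create-search-index", "create-references", "do-sources"]
    for earlier, later in order_violations(candidate):
        print(f"'{earlier}' is listed before '{later}'")
```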

Building & deploying a mine
Two types of build mechanisms:
• Manual:

    $ cd dbmodel && ant clean build-db    ## initialize db
    $ cd ../integrate                     ## sources are loaded from the integrate directory
    $ ant -Dsource=legumine-gff           ## load data sources
    $ ant -Dsource=legumine-chr-fasta     ## load more sources
    $ cd ../postprocess && ant            ## run post-process steps
    $ cd ../webapp                        ## build mine webapp
    $ ant clean remove-webapp default release-webapp

• Automated:

    $ ../bio/scripts/project_build -b -v localhost ~/legumine-dump

http://intermine.readthedocs.org/en/latest/database/database-building/build-script/

Lucene based search index
• The "create-search-index" post-process runs the database indexing, then zips the index and stores it in the db
• On (first) webapp load, the index is unpacked
• By default, all id and text fields are ignored by the indexer
• Uses the Apache Lucene whitespace analyzer to identify word boundaries
• Control the temp directory and the classes/fields to be ignored by altering the MINE_NAME/dbmodel/resources/keyword_search.properties file
http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
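A practical consequence of whitespace analysis is that punctuation stays attached to words and case is preserved. The sketch below imitates that behaviour in plain Python to show the effect on a tiny inverted index; it does not call Lucene, and the function names and sample text are made up for illustration.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace only; no lowercasing, no punctuation removal."""
    return text.split()


def build_index(documents):
    """Build a toy inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in documents.items():
        for token in whitespace_tokens(text):
            index.setdefault(token, set()).add(doc_id)
    return index


# Hypothetical descriptions keyed by object id.
documents = {
    1: "Protein kinase, putative",
    2: "protein kinase family",
}

index = build_index(documents)

if __name__ == "__main__":
    for token in sorted(index):
        print(token, sorted(index[token]))
```

With this kind of tokenization, "kinase," (with the comma) and "kinase" are distinct tokens, as are "Protein" and "protein", which is worth keeping in mind when search results look incomplete.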

InterMine web services
(Figure: Alex Kalderimis et al., Nucl. Acids Res. 2014;42:W468-W472)
http://iodocs.labs.intermine.org

Federated Authentication
• Apart from the standard login scheme (username/password), InterMine supports industry-standard OAuth2-based login flows, as implemented by Google, GitHub, Agave, etc.
• ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure
• Documentation available here: http://intermine.readthedocs.org/en/latest/webapp/properties/web-properties/#openauth2-settings-aka-openid-connect

Friendly reference mines
• FlyMine: https://github.com/intermine/intermine/
• ThaleMine: https://github.com/Arabidopsis-Information-Portal/intermine/
• MedicMine: https://github.com/jcvi-plant-genomics/intermine/
• PhytoMine: https://github.com/JoeCarlson/intermine/

Summary
• Advantages
    ◦ InterMine is a powerful biological data warehouse
    ◦ Performs complex data integration
    ◦ Allows fast and flexible querying
    ◦ Well documented programmatic interface
    ◦ Cookie-cutter, user-friendly web interface
    ◦ Facilitates cross-talk between “mines”
• Caveats
    ◦ Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step
• About InterMine
    ◦ Developed by the Micklem Lab at the University of Cambridge, UK
    ◦ Written in Java, backed by a PostgreSQL db, deployed under Tomcat
    ◦ Documentation and downloads available at http://www.intermine.org

Acknowledgments
• InterMine Team
    ◦ Gos Micklem
    ◦ Julie Sullivan
    ◦ Alex Kalderimis
    ◦ Richard Smith
    ◦ Sergio Contrino
    ◦ Josh Heimbach
    ◦ et al.
• Araport Team
    ◦ Chris Town
    ◦ Jason Miller
    ◦ Matt Vaughn
    ◦ Maria Kim
    ◦ Svetlana Karamycheva
    ◦ Erik Ferlanti
    ◦ Chia-Yi Cheng
    ◦ Benjamin Rosen
    ◦ Irina Belyaeva

THANK YOU
