Tripal within the Arabidopsis Information Portal - PAG XXIII

Araport plans to implement a Chado-backed data warehouse, fronted by Tripal, to serve as our core database. It will track multiple versions of genome annotation (TAIR10, Araport11, etc.), evidentiary data used by our annotation update pipeline, metadata such as publications collated from multiple sources (TAIR, NCBI PubMed and UniProtKB, both curated and unreviewed), and stock/germplasm data linked to AGI loci via their associated polymorphisms.

Presentation on Tripal and its adoption by the Araport project, given at the Tripal Database Network and Initiatives workshop on 11 Jan 2015, held at PAG XXIII in San Diego, CA.

Presented by Vivek Krishnakumar

January 11, 2015

Transcript

  1. araport.org @araport Tripal within the Arabidopsis Information Portal
     Vivek Krishnakumar, J. Craig Venter Institute
     1/11/2015, Tripal Database Network and Initiatives, PAG XXIII, San Diego, CA
  2. Overview
     •  About Araport
     •  Current architecture
     •  Planned implementation
        –  Leverage Chado schema
        –  Accommodate inherited data
        –  Serve as point of integration
        –  Facilitate data sharing via web services
  3. About Araport
     •  Objectives
        –  Develop community web interface
           •  sustainable, fundable and community-extensible
           •  hosts analysis modules, visualization tools, user data spaces
        –  Practice data federation
           •  integrate diverse data sets from distributed sources
           •  consume and expose data via RESTful web services
        –  Maintain “gold standard” Col-0 annotation
           •  assemble tissue-specific transcripts from publicly available RNA-seq datasets
           •  incorporate novel coding and non-coding genes
  4. Araport https://www.araport.org
     •  Explore data
     •  ThaleMine
     •  JBrowse
     •  Science Apps
     •  Search data
     •  Quick Search
     •  BLAST
     •  Raw data downloads
     •  Community
     •  News & Events
     •  Ask a question
     •  Job Postings
     •  Useful Links
  5. Araport Architecture
     [Architecture diagram. Labels: External programs; Portal (www.araport.org); API (api.araport.org); Agave Core (Apps, Jobs, Systems, meta data, user profile); ADAMA (service manage, service enroll); Authentication, metering, logging, versioning, HTTPS, CORS; Computing, Storage, Databases; ThaleMine, JBrowse, Science Apps; InterMine, Tripal, Others; CGI, SOAP, REST.]
  6. Current implementation
     Combination of flat-files and databases
     •  TAIR datasets
     •  Ontologies (GO, PSI)
     •  Interactions (BAR)
     •  Orthologs (Panther)
     Data Mart
     •  InterMine schema, PostgreSQL DB
     •  Indexed and flattened for speed
     •  Rebuilt periodically
     Outputs
     •  ThaleMine WebApp
     •  ThaleMine web services
     Web services: live calls to…
     •  UniProt web services
     •  PubMed web services
     [Diagram labels: Araport warehouse, Araport data mart, InterMine loader, publish]
  7. Planned implementation
     Warehouse
     •  Chado schema, PostgreSQL DB
     •  General purpose but slow
     •  Permanent host for core genomic datasets (assembly, annotation, metadata, etc.)
     Inputs
     •  Genome annotation pipeline
     •  Community curation data
     Outputs
     •  ThaleMine WebApp
     •  ThaleMine web services
     Data Mart
     •  InterMine schema, PostgreSQL DB
     •  Indexed and flattened for speed
     •  Rebuilt periodically
     [Diagram labels: Araport warehouse, Araport data mart, publish]
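The contrast between a general-purpose warehouse and a data mart that is "indexed and flattened for speed" can be sketched as below. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than PostgreSQL, the two source tables are heavily simplified stand-ins for Chado's `feature` and `featureprop` tables, and the `gene_flat` table, the `build_flat_table` function and all sample values are hypothetical. It is not the InterMine loader.

```python
import sqlite3


def build_flat_table():
    """Flatten normalized property rows into one wide, indexed row per gene."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Simplified stand-ins for Chado's feature/featureprop tables.
    # Real Chado stores the property type as a cvterm reference (type_id).
    cur.executescript("""
        CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT);
        CREATE TABLE featureprop (feature_id INTEGER, type TEXT, value TEXT);
    """)
    # Placeholder data, not real annotation.
    cur.execute("INSERT INTO feature VALUES (1, 'AT0G00001')")
    cur.executemany(
        "INSERT INTO featureprop VALUES (?, ?, ?)",
        [(1, "symbol", "SYM1"), (1, "description", "example description")],
    )

    # Data-mart step: pivot properties into columns and index the lookup key.
    cur.executescript("""
        CREATE TABLE gene_flat AS
        SELECT f.uniquename AS locus,
               MAX(CASE WHEN p.type = 'symbol' THEN p.value END) AS symbol,
               MAX(CASE WHEN p.type = 'description' THEN p.value END) AS description
        FROM feature f
        LEFT JOIN featureprop p ON p.feature_id = f.feature_id
        GROUP BY f.uniquename;
        CREATE INDEX gene_flat_locus ON gene_flat (locus);
    """)
    rows = cur.execute(
        "SELECT locus, symbol, description FROM gene_flat ORDER BY locus"
    ).fetchall()
    conn.close()
    return rows
```

Queries against the flattened table need no joins, which is the trade-off the slide describes: the mart is fast to read but must be rebuilt whenever the warehouse changes.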
  8. Chado http://gmod.org/wiki/Chado
     •  Functions as our low-level (core) Araport data warehouse
        –  Preserve legacy datasets with appropriate attributions
        –  Track any new datasets generated (annotation updates, community contributions)
        –  Serve as point of integration and de-duplication of certain data types
        –  Integrate with planned community curation interface
     •  Supports our pursuit of being open-source (and future-proof)
  9. Tripal http://tripal.info
     •  Drupal CMS based modularized framework, exposing a user-friendly interface to Chado
        –  provides standardized loaders for genomic datasets (FASTA, GFF3, GenBank, BLAST, GO, InterProScan, KEGG)
        –  supports building custom templates and materialized views
        –  exposes a well-documented API
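To make concrete what a loader for one of the listed formats has to read, here is a minimal sketch of parsing a single GFF3 feature line into its nine standard columns. It is not Tripal's loader (Tripal is a PHP/Drupal framework); `parse_gff3_line` is a hypothetical helper, and the sample line used with it is made up.

```python
from urllib.parse import unquote

# The first eight tab-separated GFF3 columns; the ninth holds attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))

    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Attributes are semicolon-separated tag=value pairs; values may be
    # comma-separated lists and use URL-style percent escapes.
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record
```

A loader would then map `type` to an ontology term and `ID`/`Parent` attributes to feature relationships before writing to the database; that step is omitted here.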
  10. Integrate data inherited from TAIR
      •  Currently a combination of flat-files and TAIR’s Oracle database
         –  Genome Assembly (TAIR9)
         –  Genome Annotation (TAIR10): genes, pseudogenes, transposons, ncRNAs
         –  Annotation properties: gene symbols, confidence ranking, functional descriptions, curator summary
         –  GO Annotations (TAIR curated data at geneontology.org)
         –  Publications (curated gene → publication relationships)
         –  Variation data: Genetic markers, Polymorphisms (SNPs, TILLING) and T-DNA Insertions
         –  Stock data (lines, clones, germplasm)
      •  Chado-backed Tripal will serve as the core repository for this data
  11. Integrate publication data
      •  Existing sources for publication data
         –  TAIR locus to PubMed ID mapping
         –  NCBI gene2pubmed mapping
         –  UniProt curated Protein to PubMed ID mapping
         –  Publications missing PMIDs and/or DOIs
      •  Chado will act as point of integration
         –  Combine and de-duplicate publication data from 3 sources (more in the future)
         –  Collect and store metadata for publications with and without PMID and/or DOIs
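One way to picture the "combine and de-duplicate" step is the sketch below. It assumes each source yields simple records with optional `pmid`, `doi` and `title` fields; those field names, the source labels and both functions are hypothetical and are not taken from the Araport loaders. Records are keyed by PMID when present, then by DOI, then by a normalized title.

```python
import re


def publication_key(record):
    """Pick the strongest available identifier for a publication record."""
    if record.get("pmid"):
        return ("pmid", str(record["pmid"]))
    if record.get("doi"):
        return ("doi", record["doi"].lower())
    # Fallback for publications lacking both identifiers.
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return ("title", title)


def merge_publications(sources):
    """Merge records from several sources (dict: source name -> list of dicts)."""
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(
                publication_key(record),
                {"pmid": None, "doi": None, "title": None, "sources": set()},
            )
            # Keep the first non-empty value seen for each metadata field.
            for field in ("pmid", "doi", "title"):
                if entry[field] is None and record.get(field):
                    entry[field] = record[field]
            entry["sources"].add(source_name)
    return list(merged.values())
```

This simple keying will not merge a PMID-only record with a DOI-only record for the same paper; a fuller approach would need a cross-reference between identifiers.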
  12. Integrate Stock data
      •  TAIR stock related tables mapped to corresponding Chado counterpart
      •  Custom loaders developed to perform bulk update of Stock information, Phenotypes, Polymorphism data and mappings to AGI locus
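The table-to-table mapping described above might look roughly like the following sketch. The output keys are columns of Chado's `stock` table; the input keys (`stock_name`, `stock_accession`, `stock_type`, `description`, `is_obsolete`) are hypothetical placeholders for TAIR columns, and `tair_stock_to_chado` is an illustrative helper, not one of the custom loaders mentioned on the slide.

```python
def tair_stock_to_chado(row, organism_id, stock_type_ids):
    """Map one TAIR-style stock row (hypothetical column names) to Chado stock columns.

    organism_id    -- Chado organism_id for the species (looked up beforehand)
    stock_type_ids -- dict mapping a stock type label to its cvterm_id
    """
    return {
        "organism_id": organism_id,
        "name": row["stock_name"],
        "uniquename": row["stock_accession"],
        "description": row.get("description") or None,
        "type_id": stock_type_ids[row["stock_type"]],
        "is_obsolete": bool(row.get("is_obsolete", False)),
    }
```

A bulk loader would apply this to every row and insert the results in batches; linking stocks to phenotypes, polymorphisms and AGI loci involves further Chado tables that are left out of this sketch.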
  13. Role of Tripal within Araport
      •  Tripal is under active development, with plans in place to begin developing rational web services (WS) as well as support interoperability
      •  Araport plans to be involved in this working group to satisfy the following needs of our project:
         –  Expose live data from future annotation update pipelines to the community directly via WS
         –  Expose stock data via WS in a standardized manner to Arabidopsis stock centers (both ABRC and NASC) to aid data synchronization
         –  Embrace and support other open-source initiatives
  14. Araport on GitHub
      •  GitHub organization: https://www.github.com/Arabidopsis-Information-Portal
      •  Relevant repositories:
         –  tair-chado-batchflow
         –  chado_pub_loader
         –  pasa-chado-hook
         –  GMOD/Apollo (fork)
  15. Acknowledgements
      •  JCVI Developers
         –  Maria Kim
         –  Irina Belyaeva
         –  Svetlana Karamycheva
      •  Tripal co-PI Stephen Ficklin and development community
      •  TAIR/Phoenix Bio: assistance with data migration
      •  Funding Agencies
  16. Araport Team
      •  Chris Town, PI
      •  Lisa McDonald, Education and Outreach Coordinator
      •  Chris Nelson, Project Manager
      •  Jason Miller, Co-PI, JCVI Technical Lead
      •  Erik Ferlanti, Software Engineer
      •  Vivek Krishnakumar, Bioinf. Engineer
      •  Svetlana Karamycheva, Bioinf. Engineer
      •  Eva Huala, Project lead, TAIR
      •  Bob Muller, Technical lead, TAIR
      •  Gos Micklem, co-PI
      •  Sergio Contrino, Software Engineer
      •  Matt Vaughn, co-PI
      •  Steve Mock, Advanced Computing Interfaces
      •  Rion Dooley, Web and Cloud Services
      •  Matt Hanlon, Web and Mobile Applications
      •  Maria Kim, Bioinf. Engineer
      •  Ben Rosen, Bioinf. Analyst
      •  Joe Stubbs, API Developer Platform
      •  Walter Moreira, API Developer Federation
      •  Chris Jordan, Database Manager
      •  Eleanor Pence, Intern
      •  Chia-Yi Cheng, Bioinf. Analyst
      •  Seth Schobel, Bioinf. Engineer
      •  Irina Belyaeva, Software Engineer
  17. Araport @ PAG XXIII
      •  Tripal Database Network and Initiatives
         –  Session details: Sunday, January 11, 2015, 5:30 PM-5:45 PM, California
         –  Topic(s): W876: Tripal within the Arabidopsis Information Portal
         –  Presenter(s): Vivek Krishnakumar
      •  Arabidopsis Information Portal & IAIC Workshop
         –  Session details: Monday, January 12, 2015, 12:50 PM-3:00 PM, Pacific Salon 6-7 (2nd Floor)
         –  Topic(s): W059: Walkthrough the Araport Web Site; W061: Exposing Web Services for Araport; W062: Developing applications for Araport
         –  Presenter(s): Chia-Yi Cheng, Jason Miller, Matt Vaughn
      •  Computer Demo 2
         –  Session details: Tuesday, January 13, 2015, 12:30 PM, California
         –  Topic(s): C23: Using the Arabidopsis Information Portal
         –  Presenter(s): Jason Miller
      •  GMOD
         –  Session details: Wednesday, January 14, 2015, 11:30 AM, Golden West
         –  Topic(s): W410: JBrowse within the Arabidopsis Information Portal
         –  Presenter(s): Vivek Krishnakumar
      •  Poster Session – Even
         –  Session details: Monday, January 12, 2015, 10:00 AM-11:30 AM, Grand Exhibit Hall
         –  Topic(s): P0790: Data Integration for the Plant Research Community: Araport; P0792: Developing Content for the Arabidopsis Information Portal
         –  Presenter(s): Chia-Yi Cheng, Matt Vaughn