Tripal within the Arabidopsis Information Portal - PAG XXIII

Araport plans to implement a Chado-backed data warehouse, fronted by Tripal, to serve as our core database. It will track multiple versions of genome annotation (TAIR10, Araport11, etc.), evidentiary data used by our annotation update pipeline, metadata such as publications collated from multiple sources (TAIR, NCBI PubMed and UniProtKB, both curated and unreviewed), and stock/germplasm data linked to AGI loci via their associated polymorphisms.

Presentation on Tripal and its adoption by the Araport project, given at the Tripal Database Network and Initiatives workshop on 11 Jan 2015 at PAG XXIII in San Diego, CA.

Presented by Vivek Krishnakumar

January 11, 2015

Transcript

  1. araport.org
    @araport
    Tripal within the Arabidopsis Information Portal
    Vivek Krishnakumar
    J. Craig Venter Institute
    01/11/2015
    Tripal Database Network and Initiatives
    PAG XXIII, San Diego, CA

  2. Overview
    •  About Araport
    •  Current architecture
    •  Planned implementation
        –  Leverage Chado schema
        –  Accommodate inherited data
        –  Serve as point of integration
        –  Facilitate data sharing via web services

  3. About Araport
    •  Objectives
        –  Develop community web interface
            •  sustainable, fundable and community-extensible
            •  hosts analysis modules, visualization tools, user data spaces
        –  Practice data federation
            •  integrate diverse data sets from distributed sources
            •  consume and expose data via RESTful web services
        –  Maintain “gold standard” Col-0 annotation
            •  assemble tissue-specific transcripts from publicly available RNA-seq datasets
            •  incorporate novel coding and non-coding genes

  4. Araport
    https://www.araport.org
    •  Explore data
    •  ThaleMine
    •  JBrowse
    •  Science Apps
    •  Search data
    •  Quick Search
    •  BLAST
    •  Raw data downloads
    •  Community
    •  News & Events
    •  Ask a question
    •  Job Postings
    •  Useful Links

  5. Araport Architecture
    [Architecture diagram. Labels: External programs; Portal (www.araport.org); Science Apps, ThaleMine, JBrowse; API (api.araport.org); Authentication, metering, logging, versioning, HTTPS, CORS; Agave Core (Apps, Jobs, Systems, meta data, user profile); ADAMA (service manage, service enroll); Computing, Storage, Databases; data sources a to f, including InterMine, Tripal and Others, reached via CGI, SOAP and REST.]

  6. Current implementation
    Araport warehouse
        Combination of flat-files and databases
            •  TAIR datasets
            •  Ontologies (GO, PSI)
            •  Interactions (BAR)
            •  Orthologs (Panther)
        Web services
            InterMine loader live calls to…
            •  UniProt web services
            •  PubMed web services
    publish (warehouse → data mart)
    Araport data mart
        Data Mart
            •  InterMine schema, PostgreSQL DB
            •  Indexed and flattened for speed
            •  Rebuilt periodically
        Outputs
            •  ThaleMine WebApp
            •  ThaleMine web services

  7. Planned implementation
    Araport warehouse
        Warehouse
            •  Chado schema, PostgreSQL DB
            •  General purpose but slow
            •  Permanent host for core genomic datasets (assembly, annotation, metadata, etc.)
        Inputs
            •  Genome annotation pipeline
            •  Community curation data
    publish (warehouse → data mart)
    Araport data mart
        Data Mart
            •  InterMine schema, PostgreSQL DB
            •  Indexed and flattened for speed
            •  Rebuilt periodically
        Outputs
            •  ThaleMine WebApp
            •  ThaleMine web services
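
As a rough illustration of the warehouse-to-mart "publish" step on this slide, the sketch below flattens a normalized feature/property layout into one indexed row per locus. It is a minimal sketch under stated assumptions: SQLite stands in for PostgreSQL, the two source tables are a heavily simplified stand-in for Chado's feature and featureprop tables, and the `prop_type` column, the `gene_flat` table and the placeholder symbols are hypothetical names chosen for illustration. It does not reproduce the Araport schema or the InterMine build process.

```python
import sqlite3


def make_warehouse():
    """Create a tiny in-memory 'warehouse' with a normalized layout (simplified, hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            uniquename TEXT UNIQUE
        );
        CREATE TABLE featureprop (
            feature_id INTEGER REFERENCES feature(feature_id),
            prop_type  TEXT,   -- real Chado uses type_id -> cvterm; plain text here for brevity
            value      TEXT
        );
        INSERT INTO feature VALUES (1, 'AT1G01010'), (2, 'AT1G01020');
        INSERT INTO featureprop VALUES
            (1, 'symbol', 'SYM1'),                        -- placeholder values
            (1, 'description', 'placeholder description'),
            (2, 'symbol', 'SYM2');
    """)
    return conn


def build_mart(conn):
    """Rebuild the flattened, indexed table from scratch (the periodic 'publish' step)."""
    conn.executescript("""
        DROP TABLE IF EXISTS gene_flat;
        CREATE TABLE gene_flat AS
        SELECT f.uniquename AS locus,
               MAX(CASE WHEN p.prop_type = 'symbol' THEN p.value END) AS symbol,
               MAX(CASE WHEN p.prop_type = 'description' THEN p.value END) AS description
        FROM feature f
        LEFT JOIN featureprop p ON p.feature_id = f.feature_id
        GROUP BY f.feature_id, f.uniquename;
        CREATE INDEX gene_flat_locus_idx ON gene_flat (locus);
    """)


if __name__ == "__main__":
    conn = make_warehouse()
    build_mart(conn)
    for row in conn.execute("SELECT * FROM gene_flat ORDER BY locus"):
        print(row)
```

Dropping and recreating the flat table on every run mirrors the "rebuilt periodically" bullet: the normalized store stays the source of truth, and the mart is disposable.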

  8. Chado
    •  Functions as our low-level (core) Araport data warehouse
        –  Preserve legacy datasets with appropriate attributions
        –  Track any new datasets generated (annotation updates, community contributions)
        –  Serve as point of integration and de-duplication of certain data types
        –  Integrate with planned community curation interface
    •  Supports our pursuit of being open-source (and future-proof)
    http://gmod.org/wiki/Chado

  9. Tripal
    •  Drupal CMS-based modular framework, exposing a user-friendly interface to Chado
        –  provides standardized loaders for genomic datasets (FASTA, GFF3, GenBank, BLAST, GO, InterProScan, KEGG)
        –  supports building custom templates and materialized views
        –  exposes a well-documented API
    http://tripal.info

  10. Integrate data inherited from TAIR
    •  Currently a combination of flat-files and TAIR’s Oracle database
        –  Genome Assembly (TAIR9)
        –  Genome Annotation (TAIR10): genes, pseudogenes, transposons, ncRNAs
        –  Annotation properties: gene symbols, confidence ranking, functional descriptions, curator summary
        –  GO Annotations (TAIR curated data at geneontology.org)
        –  Publications (curated gene → publication relationships)
        –  Variation data: Genetic markers, Polymorphisms (SNPs, TILLING) and T-DNA Insertions
        –  Stock data (lines, clones, germplasm)
    •  Chado-backed Tripal will serve as the core repository for this data

  11. Integrate with planned Community Curation Interface

  12. Integrate publication data
    •  Existing sources for publication data
        –  TAIR locus to PubMed ID mapping
        –  NCBI gene2pubmed mapping
        –  UniProt curated Protein to PubMed ID mapping
        –  Publications missing PMIDs and/or DOIs
    •  Chado will act as point of integration
        –  Combine and de-duplicate publication data from 3 sources (more in the future)
        –  Collect and store metadata for publications with and without PMIDs and/or DOIs
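
The combine-and-de-duplicate step on this slide could look roughly like the sketch below. It is only an illustration under stated assumptions: the `PubRecord` shape, the key order (PMID, then DOI, then normalized title) and the function names are hypothetical, and the actual Araport loaders may work differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PubRecord:
    """One locus-to-publication link from a single source (hypothetical shape)."""
    source: str                  # e.g. "TAIR", "NCBI", "UniProt"
    locus: str                   # AGI locus identifier
    title: str
    pmid: Optional[str] = None
    doi: Optional[str] = None


def pub_key(rec: PubRecord) -> tuple:
    """Pick the strongest available identifier: PMID, then DOI, then normalized title."""
    if rec.pmid:
        return ("pmid", rec.pmid.strip())
    if rec.doi:
        return ("doi", rec.doi.strip().lower())
    return ("title", " ".join(rec.title.lower().split()))


def merge_publications(records):
    """Collapse records that share a key, keeping every source and locus seen."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(
            pub_key(rec),
            {"title": rec.title, "pmid": rec.pmid, "doi": rec.doi,
             "sources": set(), "loci": set()},
        )
        entry["sources"].add(rec.source)
        entry["loci"].add(rec.locus)
        # Back-fill identifiers that an earlier record lacked.
        entry["pmid"] = entry["pmid"] or rec.pmid
        entry["doi"] = entry["doi"] or rec.doi
    return merged


if __name__ == "__main__":
    # Identifiers below are placeholders, not real PMIDs or DOIs.
    records = [
        PubRecord("TAIR", "AT1G01010", "Example paper", pmid="0000001"),
        PubRecord("NCBI", "AT1G01010", "Example paper", pmid="0000001", doi="10.0000/EXAMPLE"),
        PubRecord("UniProt", "AT1G01020", "A paper without identifiers"),
    ]
    for key, entry in merge_publications(records).items():
        print(key, sorted(entry["sources"]), sorted(entry["loci"]))
```

One known limitation of this keying: a record carrying only a PMID and another carrying only a DOI for the same paper will not be merged, so a real pipeline would need a second pass that links the two identifier types.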

  13. Integrate Stock data
    •  TAIR stock-related tables mapped to corresponding Chado counterparts
    •  Custom loaders developed to perform bulk update of Stock information, Phenotypes, Polymorphism data and mappings to AGI locus
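
To make the table-mapping and bulk-update idea concrete, here is a minimal sketch of a loader that maps source columns onto a stock table and updates existing rows or inserts new ones. Everything specific in it is an assumption for illustration: the source column names in `FIELD_MAP` are hypothetical (they are not TAIR's actual column names), the target table is a cut-down version of Chado's stock table with only `uniquename`, `name` and `description`, and SQLite stands in for PostgreSQL. It is not the custom loader described on the slide.

```python
import sqlite3

# Hypothetical source column names -> columns of a simplified Chado-like stock table.
FIELD_MAP = {
    "accession": "uniquename",
    "display_name": "name",
    "summary": "description",
}


def create_stock_table(conn):
    """Create the simplified target table (real Chado stock has more columns)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stock ("
        "stock_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE, name TEXT, description TEXT)"
    )


def to_stock_row(source_row):
    """Rename source fields to their target columns; missing fields become None."""
    return {target: source_row.get(src) for src, target in FIELD_MAP.items()}


def bulk_update_stocks(conn, source_rows):
    """Update rows that already exist (matched on uniquename), insert the rest."""
    inserted = updated = 0
    for source_row in source_rows:
        row = to_stock_row(source_row)
        cur = conn.execute(
            "UPDATE stock SET name = :name, description = :description "
            "WHERE uniquename = :uniquename",
            row,
        )
        if cur.rowcount:
            updated += 1
        else:
            conn.execute(
                "INSERT INTO stock (uniquename, name, description) "
                "VALUES (:uniquename, :name, :description)",
                row,
            )
            inserted += 1
    conn.commit()
    return inserted, updated


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_stock_table(conn)
    # Placeholder stock records, not real accessions.
    print(bulk_update_stocks(conn, [
        {"accession": "STOCK-1", "display_name": "line A", "summary": "first load"},
    ]))
```

Keeping the column mapping in one dictionary separates the "which table maps to which" decision from the load logic, so the same loop could be reused for other mapped tables.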

  14. Role of Tripal within Araport
    •  Tripal is under active development, with plans in place to begin developing rational web services (WS) as well as support interoperability
    •  Araport plans to be involved in this working group to satisfy the following needs of our project:
        –  Expose live data from future annotation update pipelines to the community directly via WS
        –  Expose stock data via WS in a standardized manner to Arabidopsis stock centers (both ABRC and NASC) to aid data synchronization
        –  Embrace and support other open-source initiatives

  15. Araport on GitHub
    •  GitHub organization:
        https://www.github.com/Arabidopsis-Information-Portal
    •  Relevant repositories:
        –  tair-chado-batchflow
        –  chado_pub_loader
        –  pasa-chado-hook
        –  GMOD/Apollo (fork)

  16. Acknowledgements
    •  JCVI Developers
        –  Maria Kim
        –  Irina Belyaeva
        –  Svetlana Karamycheva
    •  Tripal co-PI Stephen Ficklin and development community
    •  TAIR/Phoenix Bio: assistance with data migration
    •  Funding Agencies

  17. Araport Team
    Chris Town, PI
    Lisa McDonald, Education and Outreach Coordinator
    Chris Nelson, Project Manager
    Jason Miller, Co-PI, JCVI Technical Lead
    Erik Ferlanti, Software Engineer
    Vivek Krishnakumar, Bioinf. Engineer
    Svetlana Karamycheva, Bioinf. Engineer
    Maria Kim, Bioinf. Engineer
    Seth Schobel, Bioinf. Engineer
    Irina Belyaeva, Software Engineer
    Ben Rosen, Bioinf. Analyst
    Chia-Yi Cheng, Bioinf. Analyst
    Eleanor Pence, Intern
    Eva Huala, Project lead, TAIR
    Bob Muller, Technical lead, TAIR
    Gos Micklem, co-PI
    Sergio Contrino, Software Engineer
    Matt Vaughn, co-PI
    Steve Mock, Advanced Computing Interfaces
    Rion Dooley, Web and Cloud Services
    Matt Hanlon, Web and Mobile Applications
    Joe Stubbs, API Developer, Platform
    Walter Moreira, API Developer, Federation
    Chris Jordan, Database Manager

  18. THANK YOU!

  19. Araport @ PAG XXIII
    Session: Tripal Database Network and Initiatives
        Details: Sunday, January 11, 2015, 5:30 PM-5:45 PM, California
        Topic(s): W876: Tripal within the Arabidopsis Information Portal
        Presenter(s): Vivek Krishnakumar

    Session: Arabidopsis Information Portal & IAIC Workshop
        Details: Monday, January 12, 2015, 12:50 PM-3:00 PM, Pacific Salon 6-7 (2nd Floor)
        Topic(s): W059: Walkthrough the Araport Web Site; W061: Exposing Web Services for Araport; W062: Developing applications for Araport
        Presenter(s): Chia-Yi Cheng; Jason Miller; Matt Vaughn

    Session: Computer Demo 2
        Details: Tuesday, January 13, 2015, 12:30 PM, California
        Topic(s): C23: Using the Arabidopsis Information Portal
        Presenter(s): Jason Miller

    Session: GMOD
        Details: Wednesday, January 14, 2015, 11:30 AM, Golden West
        Topic(s): W410: JBrowse within the Arabidopsis Information Portal
        Presenter(s): Vivek Krishnakumar

    Session: Poster Session – Even
        Details: Monday, January 12, 2015, 10:00 AM-11:30 AM, Grand Exhibit Hall
        Topic(s): P0790: Data Integration for the Plant Research Community: Araport; P0792: Developing Content for the Arabidopsis Information Portal
        Presenter(s): Chia-Yi Cheng; Matt Vaughn
