Database Integration to Improve Accessibility to Public High-throughput Sequencing Data

Database Integration to Improve Accessibility to High-Throughput Seq Data

TAZRO OHTA @inutano

What do you imagine when you hear the term “Database”?

Knowledge / Scientific data / Experimental data

Knowledge base / Database / Raw data repository

What kind of data? Next-generation is already out there…

We all need a raw data repository for NGS

We’ve already seen WHY WE NEED it

Reproducibility is what makes science fair.

Two things are required of a data repository…

1: Reliability. Data should be archived correctly, with explicit metadata.
2: Accessibility. Data should be accessible to anyone, without special tricks.

1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
2: Accessibility needs a good interface. Data should be accessible to anyone, without special tricks.

Current Web-interface for DRA http://trace.ddbj.nig.ac.jp/DRASearch

Good: Simple, fast, and no bugs (!)
Challenge: Lack of metadata caused “NOT FOUND”

PROBLEM:

DRASearch can NOT find data without metadata …but the data definitely exist in the repo.

There are too many entries to ask each submitter for more detail, so we implemented a system to make the metadata rich enough

Two sources integrated into DRA (DDBJ Read Archive)

Publications can provide details of the sequencing process, and sequence read quality can serve as an indicator of data quality.
DDBJ Read Archive + PubMed / PMC + Extracted Read Quality

And then: integration makes it possible to implement efficient data search
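As a rough illustration of why integration helps search, the sketch below joins archive metadata, publication text, and a read-quality summary on a shared accession ID, then runs a plain keyword search over the combined text. Everything here is an assumption made for illustration: the `IntegratedEntry` type, the `integrate` and `search` functions, and the toy accessions are hypothetical and do not describe how DBCLS SRA is actually implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IntegratedEntry:
    accession: str
    metadata: str                         # submitter-provided description, may be empty
    publication_text: str = ""            # text taken from a linked paper, if any
    mean_quality: Optional[float] = None  # summary extracted from the reads, if any


def integrate(archive: Dict[str, str],
              publications: Dict[str, str],
              qualities: Dict[str, float]) -> List[IntegratedEntry]:
    """Join the three sources on the accession ID."""
    return [
        IntegratedEntry(
            accession=acc,
            metadata=meta,
            publication_text=publications.get(acc, ""),
            mean_quality=qualities.get(acc),
        )
        for acc, meta in archive.items()
    ]


def search(entries: List[IntegratedEntry], keyword: str) -> List[str]:
    """Return accessions whose metadata or publication text mentions the keyword."""
    needle = keyword.lower()
    return [
        e.accession
        for e in entries
        if needle in e.metadata.lower() or needle in e.publication_text.lower()
    ]


# Toy data, invented for illustration only (hypothetical accession IDs).
archive = {"ACC0001": "", "ACC0002": "human liver RNA-seq"}
publications = {"ACC0001": "ChIP-seq of mouse embryonic stem cells"}
qualities = {"ACC0001": 34.2, "ACC0002": 28.9}

entries = integrate(archive, publications, qualities)

# Searching the submitter metadata alone misses the entry with empty metadata.
metadata_only_hits = [acc for acc, meta in archive.items() if "chip-seq" in meta.lower()]

# Searching the integrated record finds it through the linked publication.
integrated_hits = search(entries, "ChIP-seq")
```

In this toy case the entry with empty metadata is invisible to a metadata-only search but is found once the publication text is attached, which mirrors the “NOT FOUND” problem described for entries without metadata.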

Available via DBCLS SRA http://sra.dbcls.jp/

Power of Integration: Metadata Search http://sra.dbcls.jp/search

83% of sequence reads have an average quality over 30
0.03% of sequence reads have over 50% N content
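The two read-quality figures above can be computed per read along the following lines. This is a minimal sketch under stated assumptions: quality strings are taken to be Phred+33 encoded (as in standard FASTQ), "average quality" is taken to mean the arithmetic mean of per-base Phred scores, and the function names and toy reads are hypothetical. It is not the pipeline used to produce the numbers on the slide.

```python
from typing import Iterable, Tuple


def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Mean per-base Phred score of one read (assumes Phred+33 encoding)."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def n_fraction(sequence: str) -> float:
    """Fraction of bases in one read that are the ambiguous base N."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def summarize(reads: Iterable[Tuple[str, str]],
              min_quality: float = 30.0,
              max_n: float = 0.5) -> Tuple[float, float]:
    """Return (fraction of reads with mean quality above min_quality,
    fraction of reads with N content above max_n)."""
    total = high_quality = high_n = 0
    for sequence, quality_line in reads:
        total += 1
        if mean_quality(quality_line) > min_quality:
            high_quality += 1
        if n_fraction(sequence) > max_n:
            high_n += 1
    if total == 0:
        return 0.0, 0.0
    return high_quality / total, high_n / total


# Toy reads as (sequence, quality string) pairs, invented for illustration.
toy_reads = [
    ("ACGT", "IIII"),  # 'I' encodes Phred 40 under Phred+33
    ("ACGT", "5555"),  # '5' encodes Phred 20 under Phred+33
    ("NNNA", "IIII"),  # 75% N content
]
frac_high_quality, frac_high_n = summarize(toy_reads)
```

Thresholds are left as parameters because the slide states only the cut-offs (30 and 50%), not how boundary cases were treated.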

1: Reliability from papers and data quality. More description brings more proof.
2: Accessibility from text search. Searching the linked publications brings flexibility.

PROBLEM: 2.20% of submitted projects (4429 / 201558) have at least one publication

NIH Data sharing Guideline http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

What is the next step to carry on?

1: Beyond Raw Data. The archive is going to handle alignment data.
2: Analysis Reproducibility. A public repository for analysis pipelines is required.

Databases are for biologists, not for developers.

Thank you! [email protected] http://speakerdeck.com/inutano
