Database Integration to Improve Accessibility to High-Throughput Seq Data
Slide 2
TAZRO OHTA
@inutano
Slide 4
What do you imagine when you hear the term
“Database”?
Slide 8
Knowledge
Scientific data
Experimental data
Slide 9
Knowledge base
Database
Raw Data repository
Slide 11
What kind of data?
Next-generation
is already out there…
Slide 12
We all need
a raw data repo
for
NGS
Slide 13
We’ve already seen
WHY WE NEED
Slide 15
Reproducibility
is what makes science fair.
Slide 16
Two things
required of a data repository are…
Slide 17
1: Reliability
Data should be archived correctly,
with explicit metadata.
2: Accessibility
Data should be accessible to anyone,
without special tricks.
Slide 18
1: Reliability needs curation
Data should be archived correctly,
with explicit metadata.
2: Accessibility needs a good interface
Data should be accessible to anyone,
without special tricks.
Slide 21
Current Web-interface for DRA
http://trace.ddbj.nig.ac.jp/DRASearch
Slide 22
Good:
Simple, fast, and no bugs (!)
Challenge:
Missing metadata causes “NOT FOUND”
Slide 23
PROBLEM:
Slide 24
???
Slide 25
DRASearch can NOT find
data without metadata
…but the data definitely exists in the repo.
Slide 26
Too many entries to ask every submitter;
so we implemented
a system to
make the metadata
rich enough
Slide 27
2 sources
into DRA
DDBJ Read Archive
Slide 28
Publications
can provide details
of the sequencing process.
Sequence read quality
can be an indicator
of data quality.
DDBJ Read Archive
PubMed
PMC
Extracted
Read Quality
Slide 29
And then: integration enables
efficient data search
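A minimal sketch of why integrating publication text can make entries findable. The `Entry` fields, the accessions, and the sample text are all hypothetical and chosen only for illustration; this is not the actual DRASearch or DBCLS SRA implementation.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """A hypothetical archive entry; field names are illustrative only."""
    accession: str
    metadata: str = ""          # submitter-provided description (may be empty)
    publication_text: str = ""  # text taken from a linked paper, if any


def searchable_text(entry, use_publication):
    # Build the text a keyword search runs against.
    parts = [entry.metadata]
    if use_publication:
        parts.append(entry.publication_text)
    return " ".join(parts).lower()


def search(entries, keyword, use_publication=False):
    # Return accessions whose searchable text contains the keyword.
    keyword = keyword.lower()
    return [e.accession for e in entries
            if keyword in searchable_text(e, use_publication)]


entries = [
    Entry("EX000001", metadata="RNA-Seq of mouse liver"),
    # No submitter metadata, but a linked paper describes the experiment.
    Entry("EX000002", publication_text="We performed ChIP-Seq on mouse liver samples"),
]

print(search(entries, "liver"))                        # ['EX000001']
print(search(entries, "liver", use_publication=True))  # ['EX000001', 'EX000002']
```

The second entry is in the repository either way; it only becomes findable once the publication text is part of what the search sees.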
Slide 30
Available via DBCLS SRA
http://sra.dbcls.jp/
Slide 33
Power of Integration: Metadata Search
http://sra.dbcls.jp/search
Slide 36
83%
of seq reads have
an average quality over 30
0.03%
of seq reads have
over 50% N content
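A rough sketch of how two per-read statistics like these could be computed. It assumes Phred+33 quality encoding, a per-read arithmetic mean, and non-empty reads; the slides do not say how the figures were actually derived, and the function names and sample reads are made up.

```python
def mean_phred_quality(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def n_fraction(sequence):
    """Fraction of bases in the read that are the ambiguous base N."""
    return sequence.upper().count("N") / len(sequence)


def summarize(reads, min_quality=30, max_n_fraction=0.5):
    """Return (share of reads with mean quality over min_quality,
    share of reads with N content over max_n_fraction)."""
    total = len(reads)
    high_quality = sum(1 for _, qual in reads
                       if mean_phred_quality(qual) > min_quality)
    mostly_n = sum(1 for seq, _ in reads
                   if n_fraction(seq) > max_n_fraction)
    return high_quality / total, mostly_n / total


# Made-up (sequence, quality) pairs for illustration.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40
    ("ACGTNNAC", "55555555"),  # '5' encodes Phred 20
    ("NNNNNNAC", "IIIIIIII"),  # 75% N content
    ("ACGTACGA", "????????"),  # '?' encodes Phred 30, which is not over 30
]

print(summarize(reads))  # (0.5, 0.25)
```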
Slide 37
1: Reliability from paper/data quality
More description brings more proof.
2: Accessibility from text search
Searching the linked publications brings flexibility.
Slide 38
PROBLEM:
2.20%
of submitted projects have
at least one publication
(4429 / 201558)
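The percentage follows directly from the two counts on the slide. A quick check, with variable names that are only illustrative:

```python
projects_with_publication = 4429
total_projects = 201558

share = projects_with_publication / total_projects
print(f"{share:.2%}")  # 2.20%
```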
Slide 39
NIH Data sharing Guideline
http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx
Slide 41
What is the
next step?
Slide 42
1: Beyond Raw Data
The archive is going to handle alignment data.
2: Analysis Reproducibility
A public repo for analysis pipelines is required.