Database Integration to Improve Accessibility to Public High-throughput Sequencing Data

Slide 1

Slide 1 text

Database Integration to Improve Accessibility to High-Throughput Seq Data

Slide 2

Slide 2 text

TAZRO OHTA @inutano

Slide 4

Slide 4 text

What do you imagine when you hear the term “Database”?

Slide 8

Slide 8 text

Knowledge, Scientific data, Experimental data

Slide 9

Slide 9 text

Knowledge base, Database, Raw data repository

Slide 11

Slide 11 text

What kind of data? Next-generation sequencing data is already out there…

Slide 12

Slide 12 text

We all need a raw data repository for NGS

Slide 13

Slide 13 text

We’ve already seen WHY WE NEED it

Slide 15

Slide 15 text

Reproducibility is what makes science fair.

Slide 16

Slide 16 text

Two things are required of a data repository…

Slide 17

Slide 17 text

1: Reliability. Data should be archived correctly, with explicit metadata.
2: Accessibility. Data should be accessible to anyone, without any special tricks.

Slide 18

Slide 18 text

1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
2: Accessibility needs a good interface. Data should be accessible to anyone, without any special tricks.

Slide 21

Slide 21 text

Current web interface for DRA: http://trace.ddbj.nig.ac.jp/DRASearch

Slide 22

Slide 22 text

Good: simple, fast, and no bugs (!)
Challenge: lack of metadata causes “NOT FOUND” results

Slide 23

Slide 23 text

PROBLEM:

Slide 24

Slide 24 text

???

Slide 25

Slide 25 text

DRASearch can NOT find data without metadata… but the data definitely exist in the repo.

Slide 26

Slide 26 text

There are too many entries to ask every submitter, so we implemented a system that makes the metadata rich enough.

Slide 27

Slide 27 text

Two sources integrated into DRA (DDBJ Read Archive)

Slide 28

Slide 28 text

Publications can provide details of the sequencing process; sequence read quality can serve as an indicator of data quality.
DDBJ Read Archive, PubMed, PMC, Extracted Read Quality

Slide 29

Slide 29 text

And then: integration enables efficient data search
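The idea of integrating sources to make search work can be sketched as follows. This is only an illustration under stated assumptions, not the DBCLS SRA implementation: the accession IDs, field names, record contents, and the helper functions `integrate` and `search` are all hypothetical. It shows an entry with empty metadata that a metadata-only search misses, and that becomes findable once publication text is merged in.

```python
# Hypothetical records: accession IDs, field names, and text are made up
# for illustration and do not come from DRA, PubMed, or PMC.
archive_metadata = {
    "DRX000001": {"title": "Human liver RNA-Seq",
                  "description": "Transcriptome of liver tissue"},
    "DRX000002": {"title": "", "description": ""},  # submitted without usable metadata
}
publication_text = {
    "DRX000002": "We sequenced mouse embryonic stem cells by ChIP-Seq.",
}
read_quality = {
    "DRX000001": {"mean_quality": 34.1},
    "DRX000002": {"mean_quality": 31.7},
}


def integrate(metadata, publications, quality):
    """Merge the three sources into one searchable record per accession."""
    merged = {}
    for accession, fields in metadata.items():
        text_parts = [fields["title"], fields["description"],
                      publications.get(accession, "")]
        merged[accession] = {
            "text": " ".join(text_parts),
            "mean_quality": quality.get(accession, {}).get("mean_quality"),
        }
    return merged


def search(records, keyword, text_of):
    """Return the sorted accessions whose text contains the keyword."""
    keyword = keyword.lower()
    return sorted(acc for acc, rec in records.items()
                  if keyword in text_of(rec).lower())


def metadata_only_text(fields):
    """Text visible to a search that only sees submitted metadata."""
    return fields["title"] + " " + fields["description"]


integrated = integrate(archive_metadata, publication_text, read_quality)
hits_before = search(archive_metadata, "stem cells", metadata_only_text)
hits_after = search(integrated, "stem cells", lambda rec: rec["text"])

print(hits_before)  # []
print(hits_after)   # ['DRX000002']
```

Keeping the read quality value on the same record means a search result can also be filtered or ranked by quality, which is one plausible use of the second source.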

Slide 30

Slide 30 text

Available via DBCLS SRA http://sra.dbcls.jp/

Slide 33

Slide 33 text

Power of Integration: Metadata Search http://sra.dbcls.jp/search

Slide 36

Slide 36 text

83% of sequence reads have an average quality score over 30.
0.03% of sequence reads have over 50% N content.
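How two read-level figures like these could be computed is sketched below. The talk does not say which tool or quality encoding was used, so this assumes Phred+33 (Sanger) quality strings; the function names, the thresholds passed as parameters, and the four toy reads are illustrative only.

```python
def mean_phred(quality_string, offset=33):
    """Average Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def n_fraction(sequence):
    """Fraction of bases in the read that are the ambiguous base N."""
    return sequence.upper().count("N") / len(sequence)


def summarize(reads, min_quality=30, max_n=0.5):
    """Summarize a non-empty list of (sequence, quality_string) pairs.

    Returns the percentage of reads whose mean quality exceeds min_quality
    and the percentage whose N content exceeds max_n.
    """
    total = len(reads)
    high_quality = sum(1 for _, qual in reads if mean_phred(qual) > min_quality)
    mostly_n = sum(1 for seq, _ in reads if n_fraction(seq) > max_n)
    return {
        "high_quality_pct": 100 * high_quality / total,
        "mostly_n_pct": 100 * mostly_n / total,
    }


# Toy reads, made up for illustration.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40
    ("ACGTNNAC", "55555555"),  # '5' encodes Phred 20
    ("NNNNNNAC", "########"),  # '#' encodes Phred 2; 6 of 8 bases are N
    ("ACGTACGA", "IIII????"),  # mean of Phred 40 and 30 is 35
]

summary = summarize(reads)
print(summary)  # {'high_quality_pct': 50.0, 'mostly_n_pct': 25.0}
```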

Slide 37

Slide 37 text

1: Reliability from papers and data quality. More description brings more proof.
2: Accessibility from text search. Searching the included publications brings flexibility.

Slide 38

Slide 38 text

PROBLEM: only 2.20% of submitted projects have at least one publication (4429 / 201558)
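As a quick check of the figure, the share follows directly from the two counts on the slide; the variable names are only for illustration.

```python
# Counts as given on the slide.
projects_with_publication = 4429
submitted_projects = 201558

share_pct = 100 * projects_with_publication / submitted_projects
share_label = f"{share_pct:.2f}%"
print(share_label)  # 2.20%
```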

Slide 39

Slide 39 text

NIH Data sharing Guideline http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

Slide 41

Slide 41 text

What is the next step?

Slide 42

Slide 42 text

1: Beyond raw data. The archive is going to handle alignment data.
2: Analysis reproducibility. A public repository for analysis pipelines is required.
