Database Integration to Improve Accessibility to Public High-throughput Sequencing Data
A Presentation at National Institute of Genetics, Japan Retreat 2014
Tazro Inutano Ohta
July 04, 2014
Transcript
Database Integration to Improve Accessibility to High-Throughput Seq Data
TAZRO OHTA @inutano
What do you imagine with a term “Database”?
Knowledge / Scientific data / Experimental data
Knowledge base / Database / Raw data repository
What kind of data? Next-generation is already out there…
We all need Raw data repo for NGS
We’ve already seen why we need it
Reproducibility is what makes science fair.
Two things are required of a data repository…
1: Reliability. Data should be archived correctly, with explicit metadata.
2: Accessibility. Data should be accessible to anyone, without special tricks.
1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
2: Accessibility needs a good interface. Data should be accessible to anyone, without special tricks.
Current Web-interface for DRA http://trace.ddbj.nig.ac.jp/DRASearch
Good: Simple, fast, and no bugs (!)
Challenge: Lack of metadata causes “NOT FOUND”
PROBLEM:
???
DRASearch can NOT find data that lacks metadata …but the data definitely exists in the repo.
There are too many entries to ask the submitters, so we implemented a system to make the metadata rich enough
2 sources into DRA (DDBJ Read Archive)
Publications can have details of the seq process; seq read quality can be a source of data quality.
DDBJ Read Archive / PubMed, PMC / Extracted read quality
And then: integration enables efficient data search
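As a rough illustration of the integration idea on the slides above, the sketch below merges three per-run sources (archive metadata, publication text, read quality) into one record and searches across the combined text. The record layout, the `integrate` and `search` names, and the `RUN_A` / `RUN_B` identifiers are all hypothetical; this is not how DBCLS SRA is actually implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    """One integrated record per run (hypothetical layout)."""
    accession: str
    metadata: str = ""
    publication_text: str = ""
    mean_quality: Optional[float] = None


def integrate(metadata: Dict[str, str],
              publications: Dict[str, str],
              qualities: Dict[str, float]) -> Dict[str, Entry]:
    """Merge the three sources, keyed by accession."""
    accessions = set(metadata) | set(publications) | set(qualities)
    return {
        acc: Entry(acc,
                   metadata.get(acc, ""),
                   publications.get(acc, ""),
                   qualities.get(acc))
        for acc in accessions
    }


def search(entries: Dict[str, Entry], keyword: str,
           min_quality: Optional[float] = None) -> List[str]:
    """Case-insensitive keyword search over metadata plus publication text."""
    needle = keyword.lower()
    hits = []
    for entry in entries.values():
        text = f"{entry.metadata} {entry.publication_text}".lower()
        if needle not in text:
            continue
        if min_quality is not None and (
                entry.mean_quality is None or entry.mean_quality < min_quality):
            continue
        hits.append(entry.accession)
    return sorted(hits)


# Made-up example data: RUN_A has no submitter metadata at all.
metadata = {"RUN_A": "", "RUN_B": "Homo sapiens RNA-Seq"}
publications = {"RUN_A": "ChIP-Seq of mouse liver"}
qualities = {"RUN_A": 35.0, "RUN_B": 28.0}
entries = integrate(metadata, publications, qualities)
```

In this toy data, `RUN_A` would be invisible to a metadata-only search, but the publication text makes it findable, and the quality value allows an optional filter.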
Available via DBCLS SRA http://sra.dbcls.jp/
Power of Integration: Metadata Search http://sra.dbcls.jp/search
83% of seq reads have an average quality over 30.
0.03% of seq reads have over 50% N content.
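The two figures above can be read as per-read checks: mean base quality over 30, and the share of N bases over 50%. The sketch below shows one way such checks could be computed. It assumes Phred+33 quality encoding and reads given as (sequence, quality string) pairs; the function names are hypothetical, and the slides do not say how the original numbers were produced.

```python
from typing import Iterable, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def n_fraction(sequence: str) -> float:
    """Fraction of bases in the read that are the ambiguous base N."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def summarize(reads: Iterable[Tuple[str, str]],
              min_quality: float = 30.0,
              max_n: float = 0.5) -> Tuple[float, float]:
    """Return (share of reads with mean quality over min_quality,
    share of reads with N content over max_n)."""
    total = good = n_rich = 0
    for sequence, quality in reads:
        total += 1
        if mean_phred(quality) > min_quality:
            good += 1
        if n_fraction(sequence) > max_n:
            n_rich += 1
    if total == 0:
        return 0.0, 0.0
    return good / total, n_rich / total


# Three made-up reads: one high quality, one mostly N, one mediocre.
reads = [("ACGT", "IIII"), ("NNNA", "!!!!"), ("ACGN", "5555")]
good_share, n_rich_share = summarize(reads)
```

The thresholds default to the values quoted on the slide and can be changed per call.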
1: Reliability from paper / data quality: more description brings more proof.
2: Accessibility from text search: searching the included publications brings flexibility.
PROBLEM: Only 2.20% of submitted projects have at least one publication (4429 / 201558)
NIH Data sharing Guideline http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx
What is the next step?
1: Beyond raw data. The archive is going to handle alignment data.
2: Analysis reproducibility. A public repo for analysis pipelines is required.
Databases are for biologists, not for developers.
Thank you!
[email protected]
http://speakerdeck.com/inutano