Database Integration to Improve Accessibility to Public High-throughput Sequencing Data
Tazro Inutano Ohta
July 04, 2014
A Presentation at National Institute of Genetics, Japan Retreat 2014
Transcript
Database Integration to Improve Accessibility to High-Throughput Seq Data
TAZRO OHTA @inutano
What do you imagine when you hear the term “Database”?
Knowledge / Scientific data / Experimental data
Knowledge base / Database / Raw data repository
What kind of data? Next-generation is already out there…
We all need Raw data repo for NGS
We’ve already seen WHY WE NEED
Reproducibility is what makes science fair.
Two things are required of a data repository…
1: Reliability. Data should be archived correctly, with explicit metadata.
2: Accessibility. Data should be accessible to anyone, without any special tricks.
1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
2: Accessibility needs a good interface. Data should be accessible to anyone, without any special tricks.
Current Web-interface for DRA http://trace.ddbj.nig.ac.jp/DRASearch
Good: Simple, fast, and no bugs (!)
Challenge: Lack of metadata caused “NOT FOUND”
PROBLEM:
???
DRASearch can NOT find data without metadata …but the data definitely exist in the repo.
There are too many entries to ask every submitter for more metadata, so we implemented a system to make the metadata rich enough.
Two sources integrated into DRA (DDBJ Read Archive)
Publications can have details of the sequencing process; sequence read quality can be a source of data quality.
DDBJ Read Archive + PubMed / PMC + extracted read quality
And then: integration enables efficient data search
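As a rough illustration of the integration described on these slides, the sketch below joins archive metadata with publication text and read quality per accession, then searches across the combined record. It is only a minimal sketch and not the DBCLS SRA implementation: the `Entry` class, the `integrate` and `search` functions, the accession IDs, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    """One archive entry after integration (hypothetical structure)."""
    accession: str
    metadata: str = ""               # text supplied by the submitter (may be empty)
    publication_text: str = ""       # text taken from a linked publication
    mean_quality: Optional[float] = None  # read quality extracted from the data


def integrate(archive: Dict[str, str],
              publications: Dict[str, str],
              qualities: Dict[str, float]) -> Dict[str, Entry]:
    """Attach publication text and read quality to every archive entry."""
    return {
        acc: Entry(acc, meta, publications.get(acc, ""), qualities.get(acc))
        for acc, meta in archive.items()
    }


def search(entries: Dict[str, Entry], keyword: str,
           min_quality: Optional[float] = None) -> List[str]:
    """Case-insensitive keyword search over metadata plus publication text."""
    kw = keyword.lower()
    hits = []
    for entry in entries.values():
        text = f"{entry.metadata} {entry.publication_text}".lower()
        if kw not in text:
            continue
        if min_quality is not None:
            if entry.mean_quality is None or entry.mean_quality < min_quality:
                continue
        hits.append(entry.accession)
    return sorted(hits)


# Made-up sample data: the first entry has no submitter metadata at all.
archive = {"DRX000001": "", "DRX000002": "Homo sapiens RNA-Seq"}
publications = {"DRX000001": "ChIP-Seq of mouse liver tissue"}
qualities = {"DRX000001": 35.0, "DRX000002": 28.0}

entries = integrate(archive, publications, qualities)
found_by_publication = search(entries, "chip-seq")
found_with_quality = search(entries, "seq", min_quality=30.0)
```

In this toy data the entry without submitter metadata is still found, because the keyword matches the text of its linked publication.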
Available via DBCLS SRA http://sra.dbcls.jp/
Power of Integration: Metadata Search http://sra.dbcls.jp/search
83% of seq reads have an average quality over 30
0.03% of seq reads have over 50% N content
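The two figures above can be computed along the following lines. This is a minimal sketch, assuming per-read averaging and Phred+33 quality encoding; the slides do not say how the values were actually calculated, and the function names and toy reads are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of one read (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def n_fraction(sequence: str) -> float:
    """Fraction of bases in a read that are the ambiguous base N."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def summarize(reads: Iterable[Tuple[str, str]],
              min_quality: float = 30.0,
              max_n: float = 0.5) -> Dict[str, float]:
    """Share of reads with mean quality over min_quality, and share with N content over max_n."""
    reads = list(reads)
    total = len(reads)
    if total == 0:
        return {"pass_quality": 0.0, "n_rich": 0.0}
    passed = sum(1 for _, qual in reads if mean_phred(qual) > min_quality)
    n_rich = sum(1 for seq, _ in reads if n_fraction(seq) > max_n)
    return {"pass_quality": passed / total, "n_rich": n_rich / total}


# Toy (sequence, quality) pairs; 'I' encodes Phred 40 and '5' encodes Phred 20.
toy_reads = [
    ("ACGT", "IIII"),
    ("ACGT", "5555"),
    ("NNNT", "IIII"),
    ("ACGN", "IIII"),
]
summary = summarize(toy_reads)
```

On real data the (sequence, quality) pairs would come from a FASTQ parser rather than a hand-written list.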
1: Reliability from papers and data quality: more description brings more proof.
2: Accessibility from text search: including publications in the search brings flexibility.
PROBLEM: Only 2.20% of submitted projects have at least one publication (4429 / 201558)
NIH Data sharing Guideline http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx
What is the next step to carry on?
1: Beyond raw data. The archive is going to handle alignment data.
2: Analysis reproducibility. A public repo for analysis pipelines is required.
Database is for Biologists not for developers.
Thank you!
[email protected]
http://speakerdeck.com/inutano