Database Integration to Improve Accessibility to Public High-throughput Sequencing Data

A Presentation at National Institute of Genetics, Japan Retreat 2014

Tazro Inutano Ohta

July 04, 2014

Transcript

  1. Database Integration to Improve Accessibility to High-Throughput Seq Data

  2. TAZRO OHTA @inutano

  3. None
  4. What do you imagine when you hear the term “Database”?

  5. None
  6. None
  7. None
  8. Knowledge / Scientific data / Experimental data

  9. Knowledge base / Database / Raw data repository

  10. Knowledge base / Database / Raw data repository

  11. What kind of data? Next-generation is already out there…

  12. We all need a raw data repo for NGS

  13. We’ve already seen WHY WE NEED IT

  14. None
  15. Reproducibility is what makes science fair.

  16. 2 things required for a data repository are…

  17. 1: Reliability. Data should be archived correctly, with explicit metadata.
    2: Accessibility. Data should be accessible to anyone, without special tricks.
  18. 1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
    2: Accessibility needs a good interface. Data should be accessible to anyone, without special tricks.
  19. 1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
    2: Accessibility needs a good interface. Data should be accessible to anyone, without special tricks.
  20. 1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
    2: Accessibility needs a good interface. Data should be accessible to anyone, without special tricks.
  21. Current web interface for DRA: http://trace.ddbj.nig.ac.jp/DRASearch

  22. Good: Simple, fast, and no bugs (!)
    Challenge: Lack of metadata causes “NOT FOUND”
  23. PROBLEM:

  24. ???

  25. DRASearch can NOT find data without metadata …but the data definitely exist in the repo.
  26. Too many to ask submitters, so we implemented a system to make the metadata rich enough
  27. 2 sources into DRA (DDBJ Read Archive)

  28. Publications can have details of the sequencing process; sequence read quality can be a source of data quality.
    DDBJ Read Archive / PubMed / PMC / Extracted Read Quality
  29. And then: integration enables efficient data search

  30. Available via DBCLS SRA http://sra.dbcls.jp/

  31. Available via DBCLS SRA http://sra.dbcls.jp/

  32. Available via DBCLS SRA http://sra.dbcls.jp/

  33. Power of Integration: Metadata Search http://sra.dbcls.jp/search

  34. Power of Integration: Metadata Search http://sra.dbcls.jp/search

  35. Power of Integration: Metadata Search http://sra.dbcls.jp/search

  36. 83% of seq reads have an average quality over 30.
    0.03% of seq reads have over 50% N content.
  37. 1: Reliability from paper/data quality: more description brings more proof.
    2: Accessibility from text search: searching publication text as well brings flexibility.
  38. PROBLEM: Only 2.20% of submitted projects have at least one publication (4429 / 201558)
  39. NIH Data sharing Guideline http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

  40. NIH Data sharing Guideline http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

  41. What is the next step?

  42. 1: Beyond raw data. The archive is going to handle alignment data.
    2: Analysis reproducibility. A public repo for analysis pipelines is required.
  43. 1: Beyond raw data. The archive is going to handle alignment data.
    2: Analysis reproducibility. A public repo for analysis pipelines is required.
  44. Databases are for biologists, not for developers.

  45. Thank you! t.ohta@dbcls.rois.ac.jp http://speakerdeck.com/inutano
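
Slide 36 reports two read-level quality figures: the share of reads with an average quality over 30, and the share of reads with over 50% N content. The talk does not say how these were computed, so the following is only a minimal sketch of how such figures could be derived from FASTQ data. It assumes 4-line FASTQ records and Phred+33 quality encoding; the function names (`read_stats`, `parse_fastq`, `summarize`) and the default thresholds are illustrative, not the tooling behind DBCLS SRA.

```python
def read_stats(sequence, quality_line):
    """Return (mean Phred quality, fraction of N bases) for one read.

    Assumption: quality characters use Phred+33 encoding.
    """
    scores = [ord(ch) - 33 for ch in quality_line]
    mean_quality = sum(scores) / len(scores) if scores else 0.0
    n_fraction = sequence.upper().count("N") / len(sequence) if sequence else 0.0
    return mean_quality, n_fraction


def parse_fastq(lines):
    """Yield (sequence, quality) pairs from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        yield records[i + 1], records[i + 3]


def summarize(lines, min_quality=30, max_n_fraction=0.5):
    """Compute the two fractions named on the slide for a set of reads."""
    total = high_quality = n_rich = 0
    for sequence, quality in parse_fastq(lines):
        mean_quality, n_fraction = read_stats(sequence, quality)
        total += 1
        if mean_quality > min_quality:
            high_quality += 1
        if n_fraction > max_n_fraction:
            n_rich += 1
    return {
        "total_reads": total,
        "high_quality_fraction": high_quality / total if total else 0.0,
        "n_rich_fraction": n_rich / total if total else 0.0,
    }


if __name__ == "__main__":
    # Two made-up reads: one clean, one dominated by N bases.
    example = ["@r1", "ACGT", "+", "IIII", "@r2", "NNNA", "+", "!!!!"]
    print(summarize(example))
```

Attaching numbers like these to each archive entry is one way that extracted read quality could serve as searchable metadata, in the sense of slide 28.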