Next-generation sequencing database to encourage the big data science

The past and future of next-generation sequencing research, as seen from the database
- What should databases do to help researchers?

Big Data Ondo @ Molecular Biology Society of Japan 2013 #mbsj2013

Tazro Inutano Ohta

December 05, 2013

Transcript

  1. The past and future of next-generation sequencing research, as seen from the database: what should databases do to help researchers?
    Database Center for Life Science, Tazro Ohta (大田達郎)
    Now and then: next-generation sequencing database to encourage the big data science
  2. Summary
    • The era when data analysis was the hard part is over
    • Blessings for those who offer up their data
    Summary: stop fretting over NGS data processing; give rewards to open-data scientists
  3. The era when data analysis was the hard part is over. “data processing is not the most annoying part anymore”

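  4. What is hard about large-scale sequencing today?
    • Using public data is now about as common as sequencing by yourself
      • Some parts of the two workflows are shared, others differ
      • Which part of each is the hard one?
    • People "used to" assume that "the data analysis must be the hard part"
      • For in-house sequencing, the experimental design matters more
      • For public data, what matters is how to obtain information about the experimental design
    The hardest part is designing the whole sequencing experiment, both for self-sequencing and for using public sequencing data.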
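  5. Research workflow in a typical sequencing project
    Sampling → library prep → sequencing → QC → mapping/assembly → analysis
    Figure annotations: "cannot be fixed afterwards"; "a technical problem, so it can be managed"
    • From "NGS data analysis is hard" to "with good sequence data, things work out"
      • Tools, methods, and papers are widely available, so the era of hard analysis is ending
      • Computing-resource problems can also be solved with public resources and the like
      • What matters is a well-designed experiment and a high-quality library
    Data processing is over as the main difficulty; it is just a technical part. Researchers must now care about designing the experiment.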
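  6. Workflow for using public sequencing data
    Search → metadata collection → download → decompression → QC → mapping/assembly → analysis (figure labels: on-line, local)
    • Here too, "data quality cannot be rescued by analysis"
      • Judging data quality requires rich metadata such as the experimental conditions
      • The data you need has to be found efficiently among a huge amount of data
      • Large data takes a long time to download and decompress, so nobody wants to draw a "dud"
    Using public data requires retrieving detailed metadata to control the quality of sequencing.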
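  7. A database-side approach to lowering the cost of using public data
    • The data you need can be searched quickly
      • Visualize how much of the data you are after has been registered
    • The metadata needed for analysis can be checked
      • Extract literature information from PubMed and PMC
      • Add read information (number of reads, read length, error rate, etc.)
      • Cut download/decompression costs by avoiding "duds"
      • Skip the QC step by checking quality in advance
    An approach from the database: improving the data search system, with method descriptions from papers as metadata.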
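The read information listed on slide 7 (number of reads, read length, error rate) can be illustrated with a minimal sketch. This is not the DBCLS SRA implementation; it assumes plain four-line FASTQ records with Phred+33 quality encoding, and the function name `summarize_fastq` and the fraction-of-N field are hypothetical additions for illustration.

```python
import io


def summarize_fastq(handle, phred_offset=33):
    """Summarize a FASTQ stream: read count, mean length, mean error rate, N fraction.

    Assumes four lines per record and Phred+33 qualities (both are assumptions).
    """
    n_reads = 0
    total_bases = 0
    n_bases = 0
    error_sum = 0.0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        n_reads += 1
        total_bases += len(seq)
        n_bases += seq.upper().count("N")
        # Phred score Q maps to an error probability of 10^(-Q/10)
        error_sum += sum(10 ** (-(ord(c) - phred_offset) / 10) for c in qual)
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "mean_error_rate": error_sum / total_bases if total_bases else 0.0,
        "n_fraction": n_bases / total_bases if total_bases else 0.0,
    }


# Tiny in-memory example: one good read and one read made only of N
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nNNNN\n+\n!!!!\n")
stats = summarize_fastq(example)
```

Having such numbers attached to each run before download is what lets a user reject a file that turns out to be mostly N without spending a night on the transfer.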
  8. Development of a search system: DBCLS SRA http://sra.dbcls.jp

  9. Providing Sequence Quality via FastQC http://sra.dbcls.jp

  10. "Oh, this data looks good" → download it (overnight) → decompress it (overnight) → take a look → it was all N → \(^o^)/

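  11. Cost reduction achieved by DBCLS SRA
    Search → metadata collection → download → decompression → QC → mapping/assembly → analysis (figure labels: on-line, local)
    • The goal is to "obtain high-quality data that matches the research purpose at minimum cost"
      • Prevent people from endlessly searching for data that does not exist
      • Support "if several equivalent datasets exist, I want to use the better one"
      • Support automated search as well
    “retrieving data that works for one’s study from the public database with minimum effort”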
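The idea on slide 11 of using the better dataset when several equivalent ones exist can be sketched as a simple selection over per-run quality metadata. The record fields (`run`, `condition`, `mean_error_rate`), the run names, and the choice of error rate as the ranking criterion are all hypothetical; the slides do not say how DBCLS SRA ranks runs.

```python
def pick_best_runs(runs):
    """For each experimental condition, keep the run with the lowest mean error rate."""
    best = {}
    for run in runs:
        key = run["condition"]
        if key not in best or run["mean_error_rate"] < best[key]["mean_error_rate"]:
            best[key] = run
    return best


# Hypothetical metadata records, as a search system might return them
candidates = [
    {"run": "RUN_A", "condition": "liver RNA-Seq", "mean_error_rate": 0.012},
    {"run": "RUN_B", "condition": "liver RNA-Seq", "mean_error_rate": 0.003},
    {"run": "RUN_C", "condition": "brain RNA-Seq", "mean_error_rate": 0.020},
]
chosen = pick_best_runs(candidates)
```

The selection only works if quality figures are available as metadata before download, which is the point the slide makes.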
  12. So, did it work out? “And it goes..”

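  13. It did not work out
    • Problems that depend on the amount and variety of data
    • The problem of data being scattered
    • The problem of metadata
    • The problem of supplementary information such as literature
    Not so good: amount and variation of data, data distributed across various public DBs, insufficient quality of metadata, difficulty linking data to publications.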
  14. The amount of data keeps growing and growing http://www.ncbi.nlm.nih.gov/Traces/sra/ 2PB >

  15. The amount of data keeps growing and growing http://trace.ddbj.nig.ac.jp/DRASearch/

  16. The variety of data is growing too http://liorpachter.wordpress.com/seq/ [*-Seq].size > 80

  17. The variety of data is growing too

  18. On the DB side, Study Type is only a rough classification, by study (http://sra.dbcls.jp/trends.html)

  19. Data gets scattered, by study (http://sra.dbcls.jp/trends.html)

  20. TCGA data moved to CGHub http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=history

  21. The metadata problem
    Total = 338,765
    (words.size == 0).size = 92,089
    (words.size > 200).size = 2,184
    Figure label: Sanger Center template
    Volumes of free word field “design description”
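The counts on slide 21 bin the free-text "design description" field by its number of words. A minimal sketch of that kind of tally follows; how the author actually tokenized the field is not stated, so whitespace splitting, the treatment of missing values as empty, and the name `description_length_bins` are assumptions.

```python
def description_length_bins(descriptions, long_threshold=200):
    """Count all, empty (0 words), and long (> long_threshold words) descriptions."""
    # Missing descriptions (None) are treated as empty strings
    word_counts = [len((text or "").split()) for text in descriptions]
    return {
        "total": len(word_counts),
        "empty": sum(1 for n in word_counts if n == 0),
        "long": sum(1 for n in word_counts if n > long_threshold),
    }


# Tiny made-up sample: two empty fields, one short text, one 201-word text
sample = ["", None, "paired-end RNA-Seq of liver tissue", "word " * 201]
bins = description_length_bins(sample)
```

With the figures on the slide, the empty bin is 92,089 of 338,765 records, which is the scale of the metadata gap the talk is pointing at.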
  22. Everything important, PubMed taught me http://sra.dbcls.jp/cgi-bin/publication.cgi

  23. Is "release the data because the paper is out" actually the minority? Or do submitters simply not report it when the paper does come out?
    [Figure: three bar charts (#submission, #sample, #run), each comparing "total" with "publication". Values as transcribed: 115440 and 3059; 194338 and 31787; 376904 and 51202. Percentages shown: 26.5%, 16.4%, 13.6%.]
    Not all released data has an associated paper (or the record is never updated after the first data submission).
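  24. Something has to be done
    • Problems that depend on the amount and variety of data
      • The amount of computation simply increases
      • How fine a granularity of information should be supported?
    • The problem of data being scattered
      • A trade-off between management cost and usage cost
    • The problem of metadata
      • The amount of description varies from submitter to submitter
    • The problem of supplementary information such as literature
      • Often there is no paper in the first place
      • How much detail is written in Materials & Methods?
    “summary of those problems”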
  25. Will it work out? Is there any hope?

  26. Please, let it work out
    • Predicting the future
    • The problem of compression strategy
    • The evolution of sequencing technology is hard to predict
    The other problems: data compression strategy, and estimating the advance of sequencing technology.
  27. The problem of compression strategy. Cochrane, Guy, Charles E. Cook, and Ewan Birney. "The future of DNA sequence archiving." GigaScience 1.1 (2012): 2.
  28. The problem of compression strategy. Cochrane, Guy, Charles E. Cook, and Ewan Birney. "The future of DNA sequence archiving." GigaScience 1.1 (2012): 2.
  29. Predicting the evolution of sequencing technology, and so on: https://www.nanoporetech.com

  30. Predicting the evolution of sequencing technology, and so on: http://gnubio.com

  31. Predicting the evolution of sequencing technology, and so on: http://www.picoseq.com/

  32. Blessings for those who offer up their data. “giving rewards to open-data scientists”

  33. http://www.flickr.com/photos/ogachin/5420953786/

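  34. Toward realizing a "database shrine" system
    • Getting data registered together with high-quality metadata is essential
      • Reducing the burden at data submission time is important
      • Respond flexibly to the changing nature of the data
      • The cooperation of the researchers who submit data is also indispensable
    • "Blessings" for researchers who make high-quality data submissions
      • For now, even metadata quality depends on goodwill
      • It needs to be tied to evaluation, such as papers being cited or grants being won
    Improving the DB ecosystem to make submission with high-quality metadata easy, giving rewards to researchers who made highly cited submissions, etc.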
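  35. Summary
    • The era when data analysis was the hard part is over
      • From now on, the hard part is doing high-quality sequencing
      • People who can do the analysis need to be involved from the experimental design stage
      • For public data, the DB must improve so that submitters write high-quality metadata
    • Blessings for those who offer up their data
      • Goodwill alone has its limits, so incentives to release high-quality data are needed
    Summary: design sequencing projects well to produce highly reusable data; create incentives to submit high-quality metadata.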
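  36. Acknowledgement
    • Everyone who always releases clean data
    • My comrades at DBCLS, DDBJ, and NBDC, working behind the scenes on database projects every day
    • The members of NGS現場の会 (the NGS field meeting), who give me valuable advice and opinions
    • Everyone in the audience who thought "maybe I will cooperate a little"
    • The organizers, Kodama-san and Nakazato-san
    Thank you!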