Next-generation sequencing database to encourage the big data science

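The past and future of next-generation sequencing research, seen from the database side
- What should databases do to help researchers?

Big Data Ondo @ Molecular Biology Society of Japan annual meeting 2013 #mbsj2013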

Tazro Inutano Ohta

December 05, 2013
Transcript

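  1. What is hard about large-scale sequencing today?
    • Using public data is now about as common as sequencing in-house
      • The two workflows share some steps and differ in others
      • Where is each one hard?
    • People used to assume that "the data analysis must be the hard part"
      • For in-house sequencing, the experimental design matters more
      • For public data, what matters is how to obtain information about the experimental design
    The hardest part is designing the whole sequencing experiment, both for in-house sequencing and for using public sequencing data.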
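  2. Research workflow in a typical sequencing project
    Workflow: sampling → library prep → sequencing → QC → mapping/assembly → analysis
    Diagram annotations: "cannot be fixed after the fact"; "a technical problem, so it can be handled"
    • From "next-generation sequencing means hard data analysis" to "with good sequencing, the rest works out"
      • Tools, methods, and papers are widely available, so the era in which analysis is the hard part is ending
      • Computing-resource problems can also be solved, for example with public resources
      • What matters is a well-designed experiment and a high-quality library
    Data processing is now just a technical matter; researchers must instead take care in designing the experiment.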
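  3. Workflow for using public sequencing data
    Workflow: search → metadata collection → download → QC → mapping/assembly → analysis
    Diagram labels: decompress, on-line, local
    • As before, poor data quality cannot be fixed by analysis
      • Judging data quality requires rich metadata, such as the experimental conditions
      • The data needed must be found efficiently among a large volume of data
      • Large data sets take a long time to download and decompress, so nobody wants to pick a "dud"
    Using public data requires retrieving detailed metadata to judge the quality of the sequencing.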
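  4. A database-side approach to lowering the cost of using public data
    • The data needed can be searched for quickly
      • Visualize how much of the target data has been registered
    • The metadata needed for analysis can be checked
      • Extract literature information from PubMed and PMC
      • Add read information (number of reads, read length, error rate, etc.)
      • Cut the cost of download and decompression by avoiding "duds"
      • Skip QC processing by checking quality in advance
    An approach from the database side: improving the data search system by using method descriptions from papers as metadata.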
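As a rough illustration of the idea on slide 4 (checking read statistics before downloading), the following is a minimal sketch. The `RunMetadata` record, its field names, the run identifiers, the thresholds, and all values are hypothetical and are not taken from DBCLS SRA or any real database.

```python
from dataclasses import dataclass


@dataclass
class RunMetadata:
    """Hypothetical per-run record; field names are assumptions for illustration."""
    run_id: str
    read_count: int      # number of reads in the run
    read_length: int     # read length in bases
    error_rate: float    # estimated per-base error rate
    size_gb: float       # compressed download size


def select_runs(runs, min_reads, min_length, max_error_rate):
    """Keep only runs whose read statistics meet all three thresholds."""
    return [
        r for r in runs
        if r.read_count >= min_reads
        and r.read_length >= min_length
        and r.error_rate <= max_error_rate
    ]


# Made-up records standing in for metadata returned by a search.
runs = [
    RunMetadata("RUN_A", 20_000_000, 100, 0.005, 4.2),
    RunMetadata("RUN_B", 500_000, 36, 0.030, 0.1),
    RunMetadata("RUN_C", 35_000_000, 150, 0.008, 9.8),
]

selected = select_runs(runs, min_reads=1_000_000, min_length=75, max_error_rate=0.01)

# Download volume avoided by rejecting runs before fetching them.
avoided_gb = sum(r.size_gb for r in runs if r not in selected)

print([r.run_id for r in selected], avoided_gb)
```

The point of the sketch is only that the filter runs on metadata, so rejected runs are never downloaded or decompressed.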
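  5. Cost reduction achieved with DBCLS SRA
    Workflow: search → metadata collection → DL → QC → mapping/assembly → analysis
    Diagram labels: decompress, QC, on-line, local
    • The goal is to obtain high-quality data that fit the research purpose at minimum cost
      • Prevent users from endlessly searching for data that do not exist
      • Support the wish "if several equivalent data sets exist, I want to use the better one"
      • Support automated search as well
    "Retrieving data that works for one's study from the public database with minimum effort"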
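  6. What could not be solved
    • Problems that depend on the amount and variation of data
    • Problems caused by data being distributed
    • Metadata problems
    • Problems with supplementary information such as literature information
    Not so good: the amount and variation of data, data distributed across various public DBs, insufficient metadata quality, and difficulty linking data to publications.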
  7. Metadata problems
    Volume of the free-text field "design description":
    Total = 338,765
    (words.size == 0).size = 92,089
    (words.size > 200).size = 2,184
    Chart label: SangerCenter template
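The three counts on slide 7 (total entries, entries with an empty description, entries with more than 200 words) could be produced by a tally like the one below. This is a sketch under assumptions: a "word" is taken to be a whitespace-separated token, and the function name and sample strings are made up. The slide does not say how the words were actually split.

```python
def word_count_summary(descriptions, long_threshold=200):
    """Tally entries by the number of whitespace-separated words in each description."""
    counts = [len(text.split()) for text in descriptions]
    return {
        "total": len(counts),
        "empty": sum(1 for n in counts if n == 0),
        "long": sum(1 for n in counts if n > long_threshold),
    }


# Made-up descriptions: two blank, one short, one with 201 words.
sample = [
    "",
    "   ",
    "paired-end RNA-seq of liver tissue",
    "word " * 201,
]

summary = word_count_summary(sample)
print(summary)
```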
  8. Is "release the data once the paper is out" actually the minority? Or the paper comes out but is never reported
    [Figure: three bar charts comparing total entries with entries linked to a publication]
    #submission: total 115440, publication 3059 (26.5%)
    #sample: total 194338, publication 31787 (16.4%)
    #run: total 376904, publication 51202 (13.6%)
    Not all published data have an associated paper (or the entry was never updated after the first data submission).
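The percentages on slide 8 are the share of entries linked to a publication at each level. A minimal sketch of that arithmetic follows, using the counts as they appear in this transcript. Pairing each total with its publication count follows the order of the figures in the transcript, which is an assumption. With these figures the sample and run shares match the 16.4% and 13.6% on the slide, while the submission pair gives about 2.6% rather than 26.5%, so one of the submission figures may have been garbled in extraction.

```python
# (total entries, entries linked to a publication), as transcribed from slide 8.
counts = {
    "submission": (115440, 3059),
    "sample": (194338, 31787),
    "run": (376904, 51202),
}


def publication_share(total, with_publication):
    """Percentage of entries that are linked to a publication."""
    return 100.0 * with_publication / total


shares = {
    level: round(publication_share(total, linked), 1)
    for level, (total, linked) in counts.items()
}

for level, pct in shares.items():
    print(f"{level}: {pct}%")
```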
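  9. Something has to be done
    • Problems that depend on the amount and variation of data
      • The amount of computation simply grows
      • How fine a granularity of information should be supported?
    • Problems caused by data being distributed
      • A trade-off between management cost and usage cost
    • Metadata problems
      • The amount of description varies from submitter to submitter
    • Problems with supplementary information such as literature information
      • Often there is no publication in the first place
      • How much detail is written in Materials & Methods?
    "summary of those problems"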
  10. Problems we hope will work out
    • Predicting the future
      • The problem of compression strategy
      • The evolution of sequencing technology is hard to predict
    The other problems: data compression strategy, and predicting advances in sequencing technology.
  11. The problem of compression strategy
    Cochrane, Guy, Charles E. Cook, and Ewan Birney. "The future of DNA sequence archiving." GigaScience 1.1 (2012): 2.
  12. The problem of compression strategy (continued)
    Cochrane, Guy, Charles E. Cook, and Ewan Birney. "The future of DNA sequence archiving." GigaScience 1.1 (2012): 2.
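  13. Toward a "database shrine" system
    • Getting data submitted together with high-quality metadata is essential
      • Reducing the burden at submission time is important
      • Respond flexibly to the changing nature of the data
      • The cooperation of the researchers who submit data is also indispensable
    • Give "blessings" to researchers who make high-quality submissions
      • At present, metadata quality depends on goodwill
      • Submissions need to count toward evaluation, such as papers being cited or grants being awarded
    Improving the DB ecosystem to make submission with high-quality metadata easy, giving rewards to researchers who make highly cited submissions, etc.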
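  14. Summary
    • The era in which data analysis was the hard part is over
      • From now on, the hard part is producing high-quality sequencing
      • People who can do the analysis need to be involved from the experimental design stage
    • For public data, DBs must improve so that submitters write high-quality metadata
    • Blessings for those who offer up their data
      • Goodwill alone has its limits, so an incentive to release high-quality data is needed
    Summary: well-designed sequencing projects for highly reusable data; create an incentive to submit high-quality metadata.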