Slide 1

Slide 1 text

Next-generation sequencing research as seen from the database: "so far and from now on"
What should databases do to help researchers?
Database Center for Life Science, Tazro Ohta (大田達郎)
Now and then: next-generation sequencing database to encourage the big data science

Slide 2

Slide 2 text

Summary
• The era when data analysis was the hard part is over
• Blessings for the people who offer up their data
Summary: stop worrying about NGS data processing; give rewards to open-data scientists

Slide 3

Slide 3 text

The era when data analysis was the hard part is over
“data processing is not the most annoying part anymore”

Slide 4

Slide 4 text

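What is hard about large-scale sequencing today?
• Using public data is now about as common as sequencing samples yourself
  • The two flows share some steps and differ in others
  • Which steps are hard in each?
• People used to think "the data analysis must be the hard part"
  • For in-house sequencing, experimental design matters more
  • For public data, what matters is how to obtain the experimental design information
The hardest part is designing the whole sequencing experiment, both for in-house sequencing and for using public sequencing data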

Slide 5

Slide 5 text

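Research workflow in a typical sequencing project
Sampling → Library prep → Sequencing → QC → Mapping/Assembly → Analysis
• From "NGS means hard data analysis" to "with good sequence data, things work out"
  • Tools, methods, and papers are widely available, so the era of hard analysis is ending
  • Computing-resource problems can also be solved with public resources and the like
  • What matters is a well-designed experiment and a high-quality library
Diagram labels: "nothing can be done after the fact"; "a technical problem, so it can be managed"
Data processing is now just a technical step; researchers must instead care about designing the experiment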

Slide 6

Slide 6 text

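Flow for using public sequencing data
Search → Metadata collection → Download → QC → Mapping/Assembly → Analysis
• "Data quality cannot be fixed by analysis" holds here too
  • Judging data quality requires rich metadata such as experimental conditions
  • The data you need must be found efficiently among a huge amount of data
  • Large data sets take a long time to download and decompress, so nobody wants to pick a "dud"
Diagram labels: decompress; on-line; local
Using public data requires retrieving detailed metadata to assess the quality of the sequencing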

Slide 7

Slide 7 text

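A database-side approach to lowering the cost of using public data
• The data you need can be searched quickly
  • Visualize "how much of the data I am after has been registered"
• The metadata needed for analysis can be checked
  • Extract literature information from PubMed and PMC
  • Add read information (read count, read length, error rate, etc.)
  • Cut download/decompression cost by avoiding "duds"
  • Skip the QC step by checking quality in advance
An approach from the database side: improving the data search system with method descriptions from papers as metadata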
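The read-level information named on this slide (read count, read length, error rate) can be derived from a FASTQ file roughly as sketched below. This is only an illustration, not how DBCLS SRA computes its values: the function name `read_stats` is hypothetical, and the sketch assumes unwrapped four-line FASTQ records with Phred+33 quality encoding.

```python
import io
from statistics import mean


def read_stats(fastq_text, phred_offset=33):
    """Summarize FASTQ text: read count, mean read length, mean expected error rate.

    Assumes four lines per record (header, sequence, '+', quality) and
    Phred+33 quality characters; both are assumptions of this sketch.
    """
    lengths, error_rates = [], []
    lines = io.StringIO(fastq_text)
    while True:
        header = lines.readline()
        if not header:  # end of input
            break
        seq = lines.readline().strip()
        lines.readline()  # '+' separator line, ignored
        qual = lines.readline().strip()
        if not seq or not qual:
            continue  # skip empty or truncated records
        lengths.append(len(seq))
        # Phred score Q maps to an error probability of 10^(-Q/10).
        probs = [10 ** (-(ord(c) - phred_offset) / 10) for c in qual]
        error_rates.append(mean(probs))
    return {
        "reads": len(lengths),
        "mean_length": mean(lengths) if lengths else 0.0,
        "mean_error_rate": mean(error_rates) if error_rates else 0.0,
    }


# Two toy reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n555555\n"
stats = read_stats(example)
print(stats)
```

Publishing such summaries next to each entry lets a user rule out a "dud" before paying the download and decompression cost.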

Slide 8

Slide 8 text

Developing the search system: DBCLS SRA http://sra.dbcls.jp

Slide 9

Slide 9 text

Providing Sequence Quality via FastQC http://sra.dbcls.jp

Slide 10

Slide 10 text

"Oh, this data looks good" → download it (overnight) → decompress it (overnight) → take a look → it was all N → \(^o^)/

Slide 11

Slide 11 text

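Cost reduction achieved with DBCLS SRA
Search → Metadata collection → DL → QC → Mapping/Assembly → Analysis
• The goal is "obtaining high-quality data that fits the research purpose at minimum cost"
  • Prevent people from "continuing to search for something that does not exist"
  • Support "if several equivalent data sets exist, I want to use the better one"
  • Support automated searching as well
Diagram labels: decompress; QC; on-line; local
“retrieving data that works for one’s study from the public database with minimum effort”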

Slide 12

Slide 12 text

Did it work out? “And it goes..”

Slide 13

Slide 13 text

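It did not work out
• Problems that depend on the amount and variation of data
• The problem of data being scattered
• The problem of metadata
• The problem of supplementary information such as literature information
Not so good: the amount and variation of data, data scattered across various public DBs, insufficient metadata quality, difficulty linking data to publications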

Slide 14

Slide 14 text

The volume of data keeps growing endlessly http://www.ncbi.nlm.nih.gov/Traces/sra/ 2PB >

Slide 15

Slide 15 text

The volume of data keeps growing endlessly http://trace.ddbj.nig.ac.jp/DRASearch/

Slide 16

Slide 16 text

The variation of data is growing too http://liorpachter.wordpress.com/seq/ [*-Seq].size > 80

Slide 17

Slide 17 text

The variation of data is growing too

Slide 18

Slide 18 text

On the DB side, Study Type is only a coarse classification
by study (http://sra.dbcls.jp/trends.html)

Slide 19

Slide 19 text

Data gets scattered
by study (http://sra.dbcls.jp/trends.html)

Slide 20

Slide 20 text

TCGA data moved to CGHub http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=history

Slide 21

Slide 21 text

The problem of metadata
Volumes of free word field “design description”
Total = 338,765
(words.size == 0).size = 92,089
(words.size > 200).size = 2,184
Chart annotation: Sanger Center template
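The three counts on this slide (total entries, entries with an empty description, entries longer than 200 words) could be tallied along the following lines. This is a minimal sketch: the function name `description_length_summary` is hypothetical, and counting words by splitting on whitespace is an assumption, since the slide does not say how words were counted.

```python
def description_length_summary(descriptions, long_threshold=200):
    """Count total, empty, and long free-text descriptions.

    Words are counted by whitespace splitting (an assumption of this sketch);
    a missing value (None) is treated as an empty description.
    """
    word_counts = [len((text or "").split()) for text in descriptions]
    return {
        "total": len(word_counts),
        "empty": sum(1 for n in word_counts if n == 0),
        "long": sum(1 for n in word_counts if n > long_threshold),
    }


# Toy input: two empty entries, one short entry, one entry of 201 words.
sample = ["", None, "paired-end RNA-seq of liver tissue", "word " * 201]
summary = description_length_summary(sample)
print(summary)
```

Run over every "design description" value in the archive, a tally like this gives a rough picture of how unevenly submitters fill in free-text metadata.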

Slide 22

Slide 22 text

Everything important, PubMed taught us http://sra.dbcls.jp/cgi-bin/publication.cgi

Slide 23

Slide 23 text

Is "publishing the data because the paper came out" actually the minority? Or do people not report the paper even when it is out?
Bar charts, total vs. publication (values and percentages as shown on the slide):
#submission: 115440 / 3059 (26.5%)
#sample: 194338 / 31787 (16.4%)
#run: 376904 / 51202 (13.6%)
Not all published data has a paper publication (or the entry is never updated after the first data submission)

Slide 24

Slide 24 text

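Something has to be done
• Problems that depend on the amount and variation of data
  • The amount of computation simply grows
  • How fine a granularity of information should be supported?
• The problem of data being scattered
  • A trade-off between management cost and usage cost
• The problem of metadata
  • The amount of description varies from submitter to submitter
• The problem of supplementary information such as literature information
  • Sometimes there is no publication to begin with
  • How much detail is written in Materials & Methods?
“summary of those problems”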

Slide 25

Slide 25 text

Will it work out? Is there any hope?

Slide 26

Slide 26 text

Please, let it work out
• Predicting the future
  • The compression strategy problem
  • The evolution of sequencing technology is hard to predict
The other problems: data compression strategy, predicting advances in sequencing technology

Slide 27

Slide 27 text

The Compression Strategy problem
Cochrane, Guy, Charles E. Cook, and Ewan Birney. "The future of DNA sequence archiving." GigaScience 1.1 (2012): 2.

Slide 28

Slide 28 text

The Compression Strategy problem
Cochrane, Guy, Charles E. Cook, and Ewan Birney. "The future of DNA sequence archiving." GigaScience 1.1 (2012): 2.

Slide 29

Slide 29 text

The evolution of Sequencing Technology: prediction and so on https://www.nanoporetech.com

Slide 30

Slide 30 text

The evolution of Sequencing Technology: prediction and so on http://gnubio.com

Slide 31

Slide 31 text

The evolution of Sequencing Technology: prediction and so on http://www.picoseq.com/

Slide 32

Slide 32 text

Blessings for the people who offer up their data
“giving rewards to open-data scientists”

Slide 33

Slide 33 text

http://www.flickr.com/photos/ogachin/5420953786/

Slide 34

Slide 34 text

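Toward realizing a "database shrine" system
• Getting data registered together with high-quality metadata is essential
  • Reducing the burden at submission time is important
  • Respond flexibly to the changing nature of the data
  • The cooperation of the researchers who submit data is indispensable too
• "Blessings" for researchers who make high-quality data submissions
  • For now, even metadata quality depends on goodwill
  • Submissions need to be tied to recognition such as papers being cited and grants being awarded
Improving the DB ecosystem to make submission with high-quality metadata easy, giving rewards to researchers who made highly cited submissions, etc.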

Slide 35

Slide 35 text

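Summary
• The era when data analysis was the hard part is over
  • From now on, it is the era when "doing high-quality sequencing is the hard part"
  • People who can do the analysis need to be involved from the experimental design stage
  • For public data, improving the DB is essential so that submitters write high-quality metadata
• Blessings for the people who offer up their data
  • Goodwill alone has its limits, so incentives to publish high-quality data are needed
Summary: well-designed sequencing projects for highly reusable data; create incentives to submit high-quality metadata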

Slide 36

Slide 36 text

"DLOPXMFEHFNFOU • ͍ͭ΋៉ྷͳσʔλΛެ։ͯ͘͠ΕΔΈͳ͞· • σʔλϕʔεϓϩδΣΫτͰ೔ʑ҉༂͢ΔDBCLS, DDBJ, NBDCͷಉࢤͷΈͳ͞· • ༗Γ೉͍ΞυόΠε΍͝ҙݟΛͩ͘͞ΔNGSݱ৔ͷձͷΈͳ͞· • ͪΐͬͱڠྗͯ͠ΈΑ͏͔ͳʁͱࢥͬͯͩͬͨ͘͞ձ৔ͷΈͳ͞· • ΦʔΨφΠβͷͩ͜·͞Μɺͳ͔͟ͱ͞Μ Thank you!