Large Scale Data in Life Science

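Large-Scale Data Processing Study Group: Confronting "Big" Data, at NTT DATA Corporation. Talk title: "Large-scale data in the life sciences: challenges in the field and what comes next"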

Tazro Inutano Ohta

December 08, 2011

Transcript

 1. LARGE SCALE DATA IN LIFE SCIENCE. Large-scale data in the life sciences: challenges in the field and what comes next

 2. Notice: THIS CHAPTER IS INSPIRED BY ybenjo
 3. None
 4. None
 5. None
 6. None
 7. That is certainly true, but

 8. Relational databases will not come up at all today

 9. NoSQL will not come up either

 10. Update( new_suffix )
     {
       current_suffix = active_point
       test_char = last_char in new_suffix
       done = false;
       while ( !done ) {
         if current_suffix ends at an explicit node {
           if the node has no descendant edge starting with test_char
             create new leaf edge starting at the explicit node
           else
             done = true;
         } else {
           if the implicit node's next char isn't test_char {
             split the edge at the implicit node
             create new leaf edge starting at the split in the edge
           } else
             done = true;
         }
         if current_suffix is the empty string
           done = true;
         else
           current_suffix = next_smaller_suffix( current_suffix )
       }
       active_point = current_suffix
     }
     Algorithms and pseudocode will not come up either
 11. I will explain why shortly, so

 12. those of you in front of your screens, please stay calm too (photo by http://www.photoxpress.com/stock-photos/1814937)

 13. I humbly ask for your forgiveness (photo by @meguu)

 14. Let's begin

 15. Large-scale data in Life Science: Contents. Fontin Sans fonts by Jos Buivenga (exljbris). Thank You! -> www.exljbris.com
 16. LARGE SCALE DATA LIFE SCIENCE NOW IS THE NEXT-GENERATION

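 17. About DBCLS: what is a database in the life sciences? LARGE SCALE DATA LIFE SCIENCE: big data in the life sciences, examples and characteristics. NOW IS THE NEXT-GENERATION: "next-generation" data and its problems, archiving and analysis. Versus the "next generation": current state and challenges.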
 18. DBCLS: DATABASE CENTER FOR LIFE SCIENCE

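 19. Research Organization of Information and Systems (an Inter-University Research Institute Corporation), Database Center for Life Science (DBCLS). Belongs to the same organization as the National Institute of Genetics, the National Institute of Informatics, the Institute of Statistical Mathematics, and others. Current location: on the University of Tokyo Asano Campus; organizationally unrelated to the university. Full-time staff: a little over [?]; research assistants: a little over [?]. MEXT commissioned R&D project: Integrated Database Project (H[?]). JST Life Science Database Integration Project, Core Technology Development Program (H[?]). The core institution for integrating natural-science databases in Japan. http://dbcls.rois.ac.jp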
 20. 大田 達郎 / Tazro Inutano Ohta, @iNut. Technical Specialist (specially appointed). Core Technology Development Program: works on developing technologies for making use of large-scale data.

 21. What is a database?

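 22. What is a database in the life sciences? (1) A DB as the place where research results are published: the scale ranges from a single lab or a collaboration up to international consortia. (2) A DB as a public, general-purpose research resource: from genomes and genes to literature information.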

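 23. What is a database in the life sciences? Problems: DBs built independently by individual organizations and projects proliferate; once the project (grant) ends, they are no longer maintained and are left abandoned. -> Curating and integrating them to improve reusability is the role of DBCLS.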

 24. Large-scale data in Life Science

 25. Definition

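 26. What is large-scale data in the life sciences? Examples. Literature information: abstracts of [?]0,000 papers and full text of [?]0,000 papers in PubMed. Population epidemiology data: multiple kinds of data collected as time series for cohorts of hundreds to thousands of individuals. Large-scale nucleotide sequence data: sequence data produced quickly and in bulk by new types of DNA sequencers.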

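 27. What is large-scale data in the life sciences? Definition: there is no definition shared across the field (whoever says it first wins). Compared with earlier data, individual data sets are very large and there is a lot of data. Real-time requirements are low (compared with other fields, for now).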

 28. Characteristics

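 29. Characteristics of life science data in general. The importance of metadata: metadata describing the data is essential for analyzing it. The relationship between those who implement algorithms and tools and those who run them: informatics researchers implement the tools, and biology researchers analyze with them. -> Both points apply directly to large-scale data as well.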

 30. Characteristics of life science data in general. The importance of metadata: analyzing data requires information about the experiment that produced it. Fine-grained case distinctions are often needed, so metadata is costly to manage as well. [Figure: a block of raw ATGC sequence with candidate sources joined by "or" and question marks. Photo by Togopic, licensed under Creative Commons 2.1 JP Attribution]
 31. Characteristics of life science data in general. The importance of metadata: maintaining metadata is important for the reproducibility of data, and it has become one of the major problems for DBs of large-scale data. [Figure: the same block of raw ATGC sequence, now labeled with metadata. Data ID: 000001; organism: mouse; cell: nervous cell; sequencer: 454; date: 2011 12 08. Photo by Togopic, licensed under Creative Commons 2.1 JP Attribution]
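The metadata shown on slide 31 can be pictured as a small record kept next to the sequence file. The following is only a sketch: the class name `SequencingMetadata`, the field names, and the `missing_fields` helper are hypothetical and do not come from the talk or from any DBCLS schema; only the five example values are taken from the slide.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class SequencingMetadata:
    """Hypothetical descriptive record stored alongside one data set."""
    data_id: str
    organism: str
    cell: str
    sequencer: str
    run_date: date


# Assumption: these four text fields are the ones we insist on.
REQUIRED_FIELDS = ("data_id", "organism", "cell", "sequencer")


def missing_fields(record: SequencingMetadata) -> list:
    """Return the names of required text fields that are empty."""
    values = asdict(record)
    return [name for name in REQUIRED_FIELDS if not values[name].strip()]


# Values copied from the slide.
example = SequencingMetadata(
    data_id="000001",
    organism="mouse",
    cell="nervous cell",
    sequencer="454",
    run_date=date(2011, 12, 8),
)
```

Checking for empty fields at submission time is one cheap way to keep such records usable later; the real cost the slide points to is in deciding which fields and vocabularies to require.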
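 32. Characteristics of life science data in general. The relationship between those who implement algorithms and tools and those who run them: it is rare for the person running an analysis to write the core program. There are informatics-oriented biologists ("dry") and experiment-oriented biologists ("wet"). Typically a few dry researchers implement and publish a program, and wet researchers or dry collaborators run it.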

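 33. Characteristics of life science data in general. The relationship between implementers and users of algorithms and tools, problems: tools that do not fit the user's execution environment cannot be used, and errors are hard to deal with when they occur. -> Inevitably there is strong demand for GUI software, web-interface tools, cloud execution environments, and the like.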

 34. A concrete example

 35. Next-generation sequencing data

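 36. What is a next-generation DNA sequencer? A DNA sequencer is an instrument that determines the base sequence of DNA, the genetic material. There are four kinds of nucleotide, written with the four letters A, T, G, C ([?] base = [?] byte). Next-generation DNA sequencers (NGS) are massively parallel: conventional machines output on the order of kilobytes per run, while the new type outputs gigabytes to terabytes. The output is a large number of short, fragmented base sequences (short reads), which cannot be used as they are, so the original sequence has to be reconstructed. NGS is having a large impact on medicine and biology and driving progress: the human genome, which once took years, can now be completed in a few days. This is the era of the personal genome.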
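To make the "four letters" remark concrete: stored as plain ASCII text, each base occupies one byte, but four symbols only need two bits each. The sketch below packs four bases into one byte. This is an aside for illustration, not something the slides describe; the function names `pack` and `unpack` are hypothetical, and real formats also have to handle quality values and ambiguous bases such as N, which this ignores.

```python
# Two bits are enough to tell four symbols apart.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed byte string."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 3]
        for i in range(length)
    )


seq = "ATGCATGCA"
packed = pack(seq)  # 9 bases fit in 3 bytes instead of 9
```

The original length has to be kept separately, because the last byte may be only partly filled.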
 37. The era of the personal genome: 23andme.com

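 38. The era of the personal genome: exome. Profiling of (all genes)*. *Strictly speaking, an exhaustive survey of exons, the functional parts of the regions of genomic DNA that are transcribed. Thanks for the information, @ma_ko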

 39. Data

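 40. NGS data*. Image data (deleted after conversion): [?] TB. Signal intensities: up to [?] TB. Base sequence data (including quality values): intermediate files up to about [?] TB, result files up to about [?] TB. Analysis results: intermediate files up to about [?] TB, result files up to about [?] TB. *For the Illumina HiSeq 2000.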
 41. Heavy

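 42. Problems caused by the size of the data. Transfer is a problem, for example with outsourced sequencing: raw data arrives faster on an HDD sent by courier than over the network. There is no room for backups: use submission to a public database as a substitute backup? An ordinary wet lab does not have that much storage in the first place: "I'm off to Akihabara to buy HDDs." "Again?" The data does not fit into a certain spreadsheet program: "Please send me the results in E(censored)." "What?" "What?" "No, what I mean is..."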
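The "courier beats the network" claim on slide 42 is simple arithmetic, sketched below. The function names are hypothetical, and every number in the usage (data size, link speed, courier time) is an assumption chosen for illustration, since the figures on the slides did not survive in this transcript.

```python
def transfer_days(size_tb: float, bandwidth_mbps: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to send size_tb terabytes over a link.

    bandwidth_mbps is the nominal link speed in megabits per second;
    efficiency (0 to 1) is the fraction of it actually achieved.
    """
    bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86400               # seconds per day


def faster_by_courier(size_tb: float, bandwidth_mbps: float,
                      courier_days: float, efficiency: float = 1.0) -> bool:
    """True if shipping a disk beats sending the data over the link."""
    return courier_days < transfer_days(size_tb, bandwidth_mbps, efficiency)


# Assumed scenario: 10 TB over a 100 Mbps line versus next-day delivery.
days_over_network = transfer_days(10, 100)
ship_it = faster_by_courier(10, 100, courier_days=1.0)
```

Under these assumed numbers the network transfer takes a little over nine days, so the disk wins; with a much faster link or a much smaller data set the comparison flips.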

 43. https://twitter.com/#!/dritoshi/status/121817788200390656 HDD millionaires are appearing one after another

 44. Analyzing the data

 45. Reconstructing the base sequence. Two approaches: de novo assemble and reference alignment. [Figure: short reads from NGS are either assembled de novo or aligned to a reference genome]
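 46. Reconstructing the base sequence: de novo assembly. The sequence is reconstructed by joining short sequences together on the basis of their overlapping parts. However, a complete assembly from short sequences alone is currently difficult. Challenges: the tools published so far have very high memory requirements; the memory required grows in proportion to read length and genome size; something like [?] GB of memory is nowhere near enough. [Figure: short reads from NGS are either assembled de novo or aligned to a reference genome]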
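The idea of "joining short sequences on their overlapping parts" can be sketched with a toy greedy merger. This is an illustration of the principle only, not how the tools named in the talk work (Velvet, for instance, is built on de Bruijn graphs), and it ignores sequencing errors, reverse complements, and repeats. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3) -> list:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs                 # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATGCGT", "GCGTAC", "GTACGA"]
contigs = greedy_assemble(reads)
```

Every round compares all pairs, so the work grows quickly with the number of reads; with hundreds of millions of reads, keeping the reads and their overlap structure in memory is where the slide's memory problem comes from.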
 47. None
 48. de novo assemble tools. Velvet: http://www.ebi.ac.uk/~zerbino/velvet/ SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html Sequence assembly in Wikipedia: http://en.wikipedia.org/wiki/Sequence_assembly

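 49. Reconstructing the base sequence: reference alignment. An already decoded genome sequence is used as the reference, and the sequence is reconstructed on the basis of similarity to it. For human, hundreds of millions of short sequences of about [?] bp are aligned against a genome of [?] GB. Challenges: the amount of computation is large, and because the reference sequence is used, a certain amount of memory has to be reserved. [Figure: short reads from NGS are either assembled de novo or aligned to a reference genome]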
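Reference alignment can be sketched as "index the reference once, then look each read up". The toy below indexes every k-mer of the reference and verifies candidate positions by counting mismatches. It is a simplified seed-and-verify illustration under stated assumptions, not the method of any particular aligner; the function names, `k`, and `max_mismatches` are hypothetical, and insertions and deletions are ignored.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align(read: str, reference: str, index: dict, k: int,
          max_mismatches: int = 1) -> list:
    """Return sorted start positions where the read fits the reference."""
    hits = set()
    for offset in range(len(read) - k + 1):
        # Each k-mer of the read proposes candidate start positions.
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTAGCCGATAGGC"
index = build_index(reference, k=4)
exact = align("TAGCCGAT", reference, index, k=4)
one_off = align("TAGCAGAT", reference, index, k=4)   # one substituted base
```

The index has roughly one entry per reference position, which is why the slide notes that a fixed amount of memory must be reserved for the reference no matter how few reads are aligned.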
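 50. [Figure: Chr1, Chr2, Chr3 assigned to CPU1, CPU2, CPU3] Countermeasure: distribute the work on a multi-core machine by splitting the reference sequence by chromosome and assigning each part to a CPU. Challenge: NGS produces large numbers of similar short sequences, so reads get aligned to the wrong region. This is being addressed as sequencer improvements make reads longer, and by techniques such as reading both ends of a long fragment.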
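The per-chromosome split on slide 50 can be sketched with a worker pool: each chromosome is searched independently and the results are collected at the end. The names below are hypothetical and the search is a plain exact match. A thread pool is used only to keep the sketch portable; for CPU-bound work in CPython one would normally swap in `concurrent.futures.ProcessPoolExecutor`, which has the same interface.

```python
from concurrent.futures import ThreadPoolExecutor


def find_exact(read: str, sequence: str) -> list:
    """All start positions of read in sequence, overlaps included."""
    positions = []
    start = sequence.find(read)
    while start != -1:
        positions.append(start)
        start = sequence.find(read, start + 1)
    return positions


def search_chromosome(name: str, sequence: str, reads) -> tuple:
    """Search one chromosome for every read."""
    return name, {read: find_exact(read, sequence) for read in reads}


def search_all(chromosomes: dict, reads, max_workers: int = 3) -> dict:
    """Hand each chromosome to its own worker and gather the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(search_chromosome, name, seq, reads)
                   for name, seq in chromosomes.items()]
        return dict(f.result() for f in futures)


chromosomes = {"Chr1": "ACGTACGT", "Chr2": "TTTTACGA", "Chr3": "GGGGGGGG"}
result = search_all(chromosomes, ["ACG", "GGG"])
```

In this toy data the read `ACG` matches on two different chromosomes, which is a miniature version of the ambiguity the slide describes: the shorter the read, the more places it fits.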
 51. How it is actually being dealt with

 52. Troubles not yet shooted: the front line, current state and challenges

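 53. Current state of computational countermeasures. Local PC: adequate for species with small genomes or for small read counts, but... PC cluster: distributed processing with Sun Grid Engine and similar, sometimes borrowing another organization's cluster. Cloud: cloud computing environments built on AWS and the like are starting to be offered. Supercomputer: strong at distributed processing, but when the memory allotted per node shrinks the computation cannot run.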

 54. There is never enough memory, and on top of that

 55. there are no dedicated engineers, so

 56. people awaken to supernatural powers, https://twitter.com/#!/dritoshi/status/110559890413600768

 57. or awaken to special abilities, https://twitter.com/#!/dritoshi/status/113546074760822784

 58. or have their spirit tempered. https://twitter.com/#!/dritoshi/status/114675417998311425

 59. With all the machine maintenance, research is out of the question

 60. So what can be done?

 61. Cloud

 62. usegalaxy.org: online bioinformatics analysis http://bcbio.wordpress.com/tag/galaxy/

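 63. Problems with the cloud. Uploading local data takes time: the compute-resource problem is solved, but the transfer problem remains. What about personal information such as medical data: is security adequately ensured? What about cost performance: does it keep up with data volumes that will scale even further from here on?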

 64. How about doing that with Hadoop...

 65. From ITPro: http://itpro.nikkeibp.co.jp/article/NEWS/20110927/369510/ Hitachi feat. National Institute of Genetics

 66. From asahi.com: http://www.asahi.com/digital/bcnnews/BCN201111240007.html INTEC feat. RIKEN GENESIS. Thanks for the information, @yag_ays!

 67. The places that are doing it are, apparently, doing it

 68. To summarize

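 69. Summary. What is big data in the life sciences? There is no definition, but sizes and volumes are larger than before, and it is reaching everyday life through things like personal genomes. Handling the data, such as storage and transfer, is a problem: important data cannot be deleted, and is a motorcycle courier really the only way to transfer it? The required machine specifications are high: the problem is that RAM requirements, not just CPU, are very high. For now people are somehow making do: various approaches such as improving the tools and distributed processing are currently being tried.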

 70. That is all. It was a long talk, but

 71. thank you very much for your attention