Large Scale Data in Life Science

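Large-Scale Data Processing Study Group: Confronting "Big" Data @ NTT DATA Corporation. "Large-scale data in the life sciences: challenges in the field and what comes next"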

Tazro Inutano Ohta

December 08, 2011

Transcript

  1. LARGE SCALE DATA IN LIFE SCIENCE: large-scale data in the life sciences, challenges in the field and what comes next

  2. Disclaimer: this chapter is inspired by ybenjo
  3. None
  4. None
  5. None
  6. None
  7. That is certainly true, but

  8. today there will be no talk of relational databases at all.

  9. No talk of NoSQL, either.

  10. Update( new_suffix ) {
        current_suffix = active_point
        test_char = last_char in new_suffix
        done = false;
        while ( !done ) {
          if current_suffix ends at an explicit node {
            if the node has no descendant edge starting with test_char
              create new leaf edge starting at the explicit node
            else
              done = true;
          } else {
            if the implicit node's next char isn't test_char {
              split the edge at the implicit node
              create new leaf edge starting at the split in the edge
            } else
              done = true;
          }
          if current_suffix is the empty string
            done = true;
          else
            current_suffix = next_smaller_suffix( current_suffix )
        }
        active_point = current_suffix
      }
    No talk of algorithms or pseudocode, either.
  11. I will explain why in due course, so

  12. those of you in front of your screens, please stay calm too. photo by http://www.photoxpress.com/stock-photos/1814937

  13. I humbly ask for your forgiveness. photo by @meguu

  14. Let's begin.

  15. Large-scale data in Life Science: Contents (Fontin Sans fonts by Jos Buivenga (exljbris). Thank You! -> www.exljbris.com)
  16. LARGE SCALE DATA LIFE SCIENCE NOW IS THE NEXT-GENERATION

  17. LARGE SCALE DATA LIFE SCIENCE NOW IS THE NEXT-GENERATION
    About DBCLS: what databases are in the life sciences
    Big data in the life sciences: examples and characteristics
    "Next-generation" data and its problems: archiving and analysis
    Versus the "next generation": current state and challenges
  18. DBCLS: DATABASE CENTER FOR LIFE SCIENCE

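  19. Database Center for Life Science (DBCLS), Research Organization of Information and Systems (an Inter-University Research Institute Corporation)
    Belongs to the same organization as the National Institute of Genetics, the National Institute of Informatics, the Institute of Statistical Mathematics, and others
    Current location: on the University of Tokyo's Asano Campus (organizationally unrelated to the university)
    Just over […] full-time staff and just over […] research assistants
    MEXT-commissioned R&D program: Integrated Database Project (H…)
    JST Life Science Database Integration Project, Core Technology Development Program (H…)
    The core institution in Japan for integrating databases in the natural sciences
    http://dbcls.rois.ac.jp
  20. Tazro Inutano Ohta @iNut
    Technical Specialist
    Core Technology Development Program: works on developing technology for using large-scale data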

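  21. What is a database?

  22. What databases are in the life sciences
    DBs as the place where research results are published
      Scale varies, from single labs and collaborations to international consortia
    DBs as public, general-purpose research resources
      Everything from genomes and genes to literature information

  23. What databases are in the life sciences: problems
    DBs built independently by individual organizations and projects proliferate
    When a project (grant) ends, its DB is no longer maintained and is left abandoned
    -> DBCLS's role is to curate and integrate them and to make them more reusable

  24. Large-scale data in Life Science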

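  25. Definition

  26. What is large-scale data in the life sciences? Examples
    Literature information
      […]0,000 abstracts and […]0,000 full-text articles in PubMed
    Population epidemiology data
      Multiple time-series measurements from cohorts of hundreds to thousands of individuals
    Large-scale base sequence data
      Sequence data generated rapidly and in bulk by new types of DNA sequencer

  27. What is large-scale data in the life sciences? Definition
    There is no field-wide definition (ittamongachi: whoever says it first wins)
    Compared with before, individual data sets are very large and there are many of them
    Real-time requirements are low (compared with other fields, for now)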

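  28. Characteristics

  29. Characteristics of life-science data in general
    The importance of metadata
      Metadata describing the data is indispensable for analyzing it
    The relationship between those who implement algorithms and tools and those who run them
      Informatics researchers implement the tools; biology researchers run analyses with them
    -> All of this applies equally to large-scale data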

  30. Characteristics of life-science data in general: the importance of metadata
    Analyzing data requires information about the experiment that produced it
    Fine-grained case distinctions are often needed, so metadata is also costly to manage
    [Figure: a block of unannotated raw sequence (ATGCATGC...) beside photos of candidate sources, "or ... or ... ?"] photo by Togopic, Licensed under CreativeCommons 2.1 JP Attribution

  31. Characteristics of life-science data in general: the importance of metadata
    Maintaining metadata is essential for the reproducibility of data
    This has become one of the major problems for DBs of large-scale data
    [Figure: the same block of raw sequence, now labeled] Data ID : 000001 organism : mouse cell : nervous cell sequencer : 454 date : 2011 12 08 photo by Togopic, Licensed under CreativeCommons 2.1 JP Attribution
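The contrast between slides 30 and 31 can be sketched in a few lines of Python: the same sequence string is uninterpretable on its own and usable once a metadata record travels with it. The field names follow the example record on slide 31; the `SequenceRecord` class, the `missing_metadata` helper, and the choice to treat every field as mandatory are assumptions made for this illustration, not part of the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Keys taken from the example record on the slide. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = ("data_id", "organism", "cell", "sequencer", "date")


@dataclass
class SequenceRecord:
    """A sequence plus the experimental metadata needed to interpret it."""
    sequence: str
    metadata: Dict[str, str] = field(default_factory=dict)


def missing_metadata(record: SequenceRecord) -> List[str]:
    """Return the required metadata keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not record.metadata.get(key)]


bare = SequenceRecord(sequence="ATGCATGCATGC")
annotated = SequenceRecord(
    sequence="ATGCATGCATGC",
    metadata={
        "data_id": "000001",
        "organism": "mouse",
        "cell": "nervous cell",
        "sequencer": "454",
        "date": "2011-12-08",
    },
)

print(missing_metadata(bare))       # every required key is reported
print(missing_metadata(annotated))  # nothing is reported
```

A check like this is cheap to write; the cost the slides point to lies in collecting and maintaining the values themselves.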
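  32. Characteristics of life-science data in general: the relationship between tool implementers and tool users
    The people running an analysis rarely write the core program themselves
    Informatics-oriented biologists = dry; experimental biologists = wet
      Typically a few dry researchers implement and release a program,
      and wet researchers or dry collaborators run it

  33. Characteristics of life-science data in general: tool implementers and tool users, the problems
    Tools that do not fit the user's environment cannot be used
    Errors are hard to deal with when they occur
    -> Inevitably, demand is high for GUI software, web-interface tools,
       cloud execution environments, and the like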

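  34. A concrete example

  35. Next-generation sequencing data

  36. What is a next-generation DNA sequencer?
    DNA sequencer: an instrument that determines the base sequence of DNA, the genetic material
      Four kinds of base, written with the four letters A, T, G, C (base = byte)
    Next-generation DNA sequencer (NGS): massively parallel
      Conventional instruments output about […] KB per run; the new ones output […] GB to […] TB
    Large numbers of fragmented short sequences (short reads)
      These cannot be used as they are, so the original sequence has to be reconstructed
    Bringing major impact and progress to medicine and biology
      The human genome, which once took […] years, is now done in a few days: the era of personal genomes
  37. The era of personal genomes: 23andme.com

  38. The era of personal genomes: exome
    Profiling of all genes*
    *More precisely, an exhaustive survey of exons, the functional parts of the transcribed regions of genomic DNA
    Thanks for the information, @ma_ko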

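  39. Data

  40. NGS data (for an Illumina HiSeq 2000)
    Image data (deleted after conversion)
      […] TB
    Signal intensities
      ~[…] TB
    Base sequence data (including quality values)
      Intermediate files: up to about […] TB
      Result files: up to about […] TB
    Analysis results
      Intermediate files: up to about […] TB
      Result files: up to about […] TB
  41. Heavy.

  42. Problems caused by the size of the data
    Transfer is a problem, for example with outsourced sequencing
      Shipping HDDs by courier is faster than sending the raw data over the network
    There is no room for backups
      Use submission to a public database as a substitute for backup?
    An ordinary wet lab does not have that much storage in the first place
      "I'm off to Akihabara to buy HDDs." "Again?"
    The data does not fit in a certain spreadsheet program
      "Please send me the results in E(censored)." "Huh?" "Huh?" "No, like I said, …"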
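The claim that a courier beats the network is simple arithmetic. A minimal sketch follows, assuming decimal units and entirely hypothetical figures (a 2 TB data set, a 24-hour delivery, and two nominal link speeds); the talk itself gives no numbers, and `transfer_hours` is a name made up for this example.

```python
def transfer_hours(size_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link of link_mbps megabits/s.

    Uses decimal units (1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s).
    `efficiency` is the fraction of the nominal bandwidth actually achieved.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


# Assumed figures, for illustration only.
DATA_TB = 2.0
COURIER_HOURS = 24.0  # hypothetical door-to-door delivery time

for mbps in (100, 1000):
    hours = transfer_hours(DATA_TB, mbps)
    verdict = "courier wins" if hours > COURIER_HOURS else "network wins"
    print(f"{DATA_TB} TB at {mbps} Mbps: {hours:.1f} h ({verdict})")
```

Under these assumptions the break-even point depends entirely on the link: the same disk that loses to a fast dedicated line wins comfortably against a slower or shared one, and the gap widens as the data set grows.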

  43. https://twitter.com/#!/dritoshi/status/121817788200390656 HDD millionaires are being born one after another

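  44. Data analysis

  45. Reconstructing the base sequence
    Two approaches: de novo assemble and reference alignment
    [Diagram: short reads from NGS -> de novo assemble; short reads from NGS + reference genome -> reference alignment]
  46. Reconstructing the base sequence: de novo assembly
    Reconstruct the sequence by joining short reads on the basis of where they overlap one another
    However, complete assembly from short reads alone is currently difficult
    Issues
      Currently available tools have very high memory requirements
      Required memory grows in proportion to read length and genome size
      […] GB of memory is nowhere near enough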
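The idea of joining reads where they overlap can be illustrated with a toy greedy merger. This is only a sketch of the overlap principle under simplifying assumptions (error-free reads, a single strand, no repeats); it is not how production assemblers work, and tools such as Velvet and SOAPdenovo are built on de Bruijn graphs instead. The function names and the sample reads are made up for the example.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the remaining contigs; more than one contig means the reads
    could not be joined into a single sequence.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
print(greedy_assemble(reads))
```

Even this toy compares every pair of sequences on every round, which hints at why real tools, working on hundreds of millions of reads, need so much memory and time.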
  47. None
  48. de novo assemble tools: Velvet http://www.ebi.ac.uk/~zerbino/velvet/ SOAPdenovo http://soap.genomics.org.cn/soapdenovo.html sequence assembly in wikipedia http://en.wikipedia.org/wiki/Sequence_assembly

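  49. Reconstructing the base sequence: reference alignment
    Reconstruct the sequence on the basis of homology, using an already-decoded genome sequence as the reference
    For human, hundreds of millions of short sequences of about […] bp are placed against a […] GB genome
    Issues
      Computationally expensive
      Because the reference sequence is used, a certain amount of memory has to be set aside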
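Reference alignment can be sketched as sliding each read along the reference and keeping the positions where it fits. The brute-force version below (substitutions only, no gaps, no index) is an illustration of the idea, not the method used by real aligners, which rely on indexed data structures to cope with the scale described above. The reference string and the reads are invented for the example.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read: str, reference: str, max_mismatches: int = 1) -> List[int]:
    """Return every 0-based reference position where `read` fits.

    Brute force: slide the read along the reference and keep positions with
    at most `max_mismatches` substitutions. No gaps, no index, no scoring.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(pos)
    return hits


reference = "ACGTTAGCCGATACGTTAGG"
print(align_read("GCCGAT", reference))   # a read with a single placement
print(align_read("ACGTTAG", reference))  # a read from a repeated region
```

The work is proportional to reference length times read length for every read, which is the "computationally expensive" part; the second read also shows that a short sequence from a repeated region fits in more than one place.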
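  50. Countermeasure: distributed processing on a multi-core machine
    Split the reference sequence by chromosome and assign each piece to a CPU (Chr1 -> CPU1, Chr2 -> CPU2, Chr3 -> CPU3)
    Issue
      NGS produces large numbers of similar short sequences, so reads get aligned to the wrong region
      This is being resolved as sequencer improvements lengthen reads, and by techniques such as reading both ends of a long fragment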
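Splitting the reference by chromosome and handing each piece to a worker can be sketched as follows. The chromosome names match the slide, but the sequences, the reads, and the helper names are made up, and exact string matching stands in for a real aligner. A thread pool is used only to keep the sketch short and portable; that is an assumption of this example, and CPU-bound work in practice would more likely use separate processes or separate cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Toy reference split by chromosome; names follow the slide, sequences are made up.
REFERENCE: Dict[str, str] = {
    "Chr1": "ACGTTAGCCGATAC",
    "Chr2": "GGATCCGATTACAG",
    "Chr3": "TTGACGTTAGCAAT",
}


def find_exact(read: str, sequence: str) -> List[int]:
    """All 0-based start positions of exact matches of `read` in `sequence`."""
    hits, start = [], sequence.find(read)
    while start != -1:
        hits.append(start)
        start = sequence.find(read, start + 1)
    return hits


def align_on_chromosome(task: Tuple[str, str, List[str]]) -> List[Tuple[str, str, int]]:
    """Align every read against one chromosome; returns (read, chromosome, position)."""
    name, sequence, reads = task
    return [(read, name, pos) for read in reads for pos in find_exact(read, sequence)]


def align_all(
    reads: List[str], reference: Dict[str, str], workers: int = 3
) -> List[Tuple[str, str, int]]:
    """Build one task per chromosome and hand the tasks to a pool of workers."""
    tasks = [(name, seq, reads) for name, seq in reference.items()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chromosome = list(pool.map(align_on_chromosome, tasks))
    return sorted(hit for hits in per_chromosome for hit in hits)


print(align_all(["GATTACA", "ACGTTAG"], REFERENCE))
```

Each worker only ever sees its own chromosome, so a read that matches similar sequence on two chromosomes is reported twice; deciding which placement is right is exactly the ambiguity the slide raises.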
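  51. How are people actually coping?

  52. Troubles not yet shooted: the front line, current state and challenges

  53. Current state of computational countermeasures
    Local PC
      Sufficient for species with small genomes, or depending on the number of reads, but…
    PC cluster
      Distributed processing with Sun Grid Engine and the like; sometimes borrowed from another organization
    Cloud
      Cloud computing environments built on AWS and similar services are starting to be offered
    Supercomputer
      Strong at distributed processing, but the computation fails when the memory allocated per node shrinks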

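  54. No amount of memory is ever enough, and on top of that

  55. there are no dedicated engineers, so

  56. people awaken to superpowers, https://twitter.com/#!/dritoshi/status/110559890413600768

  57. awaken to special abilities, https://twitter.com/#!/dritoshi/status/113546074760822784

  58. or get their minds toughened up. https://twitter.com/#!/dritoshi/status/114675417998311425

  59. With all the machine maintenance, there is no time for research.

  60. So what should we do?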

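  61. The cloud

  62. usegalaxy.org: online bioinformatics analysis http://bcbio.wordpress.com/tag/galaxy/

  63. Problems with the cloud
    Uploading local data takes time
      The compute-resource problem is solved, but the transfer problem remains
    What about personal information such as medical data?
      Is security adequately assured?
    What about cost performance?
      Does it keep pace with data volumes that will scale even further?

  64. "Couldn't you do that with Hadoop…"

  65. From ITPro http://itpro.nikkeibp.co.jp/article/NEWS/20110927/369510/ Hitachi feat. the National Institute of Genetics

  66. From asahi.com http://www.asahi.com/digital/bcnnews/BCN201111240007.html INTEC feat. RIKEN GENESIS. Thanks for the information, @yag_ays!

  67. Apparently the places that do it are already doing it.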

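  68. To summarize

  69. Summary
    What is big data in the life sciences?
      There is no definition, but it is larger in size and volume than before, and it is showing up close to home, as with personal genomes
    Handling the data (storage, transfer, and so on) is a problem
      Important data cannot be deleted; is a motorcycle courier the only way to transfer it?
    Computers need high specs
      The problem is the very high demand for RAM, not just CPU
    For now, people are somehow making do
      Various approaches, such as improving tools and distributed processing, are being tried

  70. That is all. It was a long talk, but

  71. thank you very much for your attention.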