Upgrade to Pro — share decks privately, control downloads, hide ads and more …

OSC-Hokkaido-2018-hayabusa

Hiroshi
July 07, 2018

 OSC-Hokkaido-2018-hayabusa

This is the presentation material for OSC Hokkaido 2018

Hiroshi

July 07, 2018
Tweet

More Decks by Hiroshi

Other Decks in Research

Transcript

  1. طଘͷղܾࡦ • HadoopΤίγεςϜʢSpark, Impala, Hive, …ʣ • OSSʢElasticsearch + Kibana,

    fluentd, …ʣ • ঎༻ϓϩμΫτʢSplunk, VMware Loginsight, …ʣ • Ϋϥ΢υαʔϏεʢGoogle BigQuery, Treasure Data, …ʣ !7
  2. େ͖ͳ໰୊ • ϩάͷߏ଄Խ͕Ͱ͖ͳ͍ • ػࡐʹ౷Ұੑ͕ͳ͍ɾ࠷৽ͷϑΝʔϜ͗ͯ͢৘ใ͕ͳ͍ • ετϦʔϛϯάॲཧ͕೉͍͠ྲྀྔ • ϩάͷྲྀྔ͕ଟ͗ͯ͢ॲཧ͕௥͍͔ͭͳ͍ •

    όονॲཧ͕௥͍͔ͭͳ͍ • όονॲཧ͕ࢦఆ࣌ؒ಺ʹऴΘΒͳ͍ • ෼ࢄॲཧγεςϜ͕ෳࡶ͗͢Δ • ؅ཧίετ͕ലେ !8
  3. SearchEngine • GNU ParallelΛ༻͍ͯSQLite3ϑΝΠϧ΁ฒྻݕࡧΛ͔͚Δ $ parallel sqlite3 ::: target files

    ::: “select count(*) from xxx where logs match ‘keyword’;” • ݕࡧ݁ՌΛUNIXύΠϓϥΠϯΛ༻͍ͯɺawk΍countίϚϯυͰूܭ $ parallel sqlite3 ::: target files ::: “select count(*) from xxx where logs match ‘keyword’;” | awk ‘{m+=$1} END{print m;}’ SeachEngine !13
  4. શจݕࡧੑೳ • Apache SparkͱͷൺֱʢελϯυΞϩϯ؀ڥʣ • Apache SparkͱͷൺֱʢSpark x 3୆ +

    HDFS vs Hayabusa x 1୆ʣ Hayabusa͕ ̐ഒߴ଎ Hayabusa͕ 27ഒߴ଎
  5. Hayabusaͷ໰୊఺ • ελϯυΞϩϯ؀ڥ • ੑೳΛ্͛Δʹ͸εέʔϧΞοϓ͔͠ͳ͍ • εέʔϧΞοϓίετ • ෼ࢄॲཧγεςϜͱͷࠩ •

    ن໛͕େ͖͘ͳΕ͹෼ࢄॲཧγεςϜͷॲཧ଎౓͸଎͘ͳΔ • Hayabusa͸͍͔ͭੑೳ͕ൈ͔ΕΔ !16
  6. GNU ParallelͷϦϞʔτ࣮ߦ • ཧ૝ : GNU ParallelͷϦϞʔτ࣮ߦΛར༻͢Ε͹෼ࢄ࣮ߦ͸Մೳ $ time parallel

    —controlmaster -S host1,host2,host3 sqlite3 ::: … • ݱ࣮ : sshͷΦʔόϔου͕͔͔Γॲཧ͕஗Ԇ ϗετ͕૿͑Δͱॲཧ͕࣌ؒ૿͑Δ
  7. ఏҊख๏ • ෼ࢄݕࡧ • ࣮ߦ͢ΔݕࡧॲཧΛRPCͱͯ͠Hayabusa΁ૹΓࠐΉ • ݁ՌΛRPCͷϨεϙϯεͱͯ͠ड͚औΓूܭ͢Δ • ฒྻ஝ੵ •

    શͯͷϗετ΁ಉҰͷϦΫΤετ͕ಧ͍ͯ΋ಉ݁͡ՌΛฦ͢Α͏ʹ͢Δ • ࣄલʹશॲཧϗετ΁ͱϩάσʔλΛෳ੡͢Δ !20
  8. ෼ࢄݕࡧ • RPC • Producer / ConsumerϞσϧͷ࠾༻ • ࣮૷ •

    ZeroMQͷPush / Pullύλʔϯ • ϦΫΤετͷϩʔυόϥϯε • Push / Pullύλʔϯ͸ۉҰʹϦΫΤετΛϗετ΁෼഑͢Δ ZeroMQͷPush / Pullύλʔϯ !24
  9. ॲཧϦΫΤετ • ϦΫΤετ $ parallel sqlite3 ::: target files :::

    “select count(*) from xxx where logs match ‘keyword’;” | awk ‘{m+=$1} END{print m;}’ ੨ࣈ : GNU ParallelͷίϚϯυΛ֤ॲཧϗετ΁ૹΓࠐΉ ੺ࣈ : ΫϥΠΞϯτϗετͰ·ͱΊ͋͛Δ !26
  10. ࣮ݧ؀ڥ • Amazon Web Service (AWS) • EC2Πϯελϯε : c4.4xlarge

    • vCPU : Xeon E5-2666 v3 @ 2.90GHz x 16 cores • ϝϞϦ : 30GB • σΟεΫʢEBSʣ : SSD 8GB (OS) + SSD 50GB (Data) • OS : Ubuntu 16.04.3 LTS (Xenial Xerus) !29
  11. ෼ࢄݕࡧ • ݕࡧͷ৚݅ • 1೔෼ͷσʔλʹରͯ͠100ճϦΫΤετΛ࣮ߦ͢Δ • 1೔෼ͷσʔλϑΝΠϧ਺͸60෼ʢ60ϑΝΠϧʣ x 24࣌ؒ =

    1,440ϑΝΠϧ • 1ϑΝΠϧ͋ͨΓͷϨίʔυ਺͸10ສ݅ʢ1,440 x 10ສʹ1ԯ4400ສϨίʔυʣ • 100ճ෼ͷϦΫΤετͰ͸144ԯϨίʔυ͕ର৅ͱͳΔ • ࣮ߦ͢ΔSQLจ͸ҎԼͰશจݕࡧͱΧ΢ϯτ • select count(*) from syslog where logs match ‘keyword’; !30
  12. ݁Ռͷ·ͱΊ • ॲཧੑೳ • ϗετ10୆ͷ৔߹ : ϗετ1୆ͷ10ഒૣ͘ͳΔʢ249ඵ -> 39ඵʣ •

    ϗετ10୆ͰϫʔΧ਺Λ૿Ճ : ૯ϫʔΧ਺10ʙ160Ͱ 249ඵ -> 6.8ඵ • ϨίʔυΛϑϧεΩϟϯˍશจݕࡧͨ݁͠Ռ • 144ԯϨίʔυ͔ΒඞཁͳσʔλΛൈ͖ग़͢ͷʹ໿6.8ඵ·Ͱߴ଎Խ • 10୆ͷϗετͰ໿36ഒͷߴ଎ԽΛ࣮ݱ !34
  13. Amazon Elastic MapReduceͱͷൺֱ • Amazon EMR : Πϯελϯε͸Hayabusaͱಉ͡c4.4xlarge • ߏ੒͸1Ϛελʔϊʔυ

    + 10 ίΞϊʔυ • σʔλ΁ͷΞΫηε • EMR͔ΒAmazon S3΁μΠϨΫτʹ
 ΞΫηε • શจݕࡧͷํ๏ • ϚελʔϊʔυͷPySpark͔Βߦ͏ JNQPSUUJNF GSPNQZTQBSLTRMJNQPSU42-$POUFYU TRM$POUFYU42-$POUFYU TD  MJOFTTDUFYU'JMF TBCFXPSLTTECFODINBSLMPH pMFTLL MPH  MJOFTDBDIF  GPSJJOSBOHF   TUBSUUJNFUJNF <MJOFTpMUFS MBNCEBTOPDJO T DPVOU GPSJJOSBOHF  >FMBQTFE@UJNFUJNFUJNF  TUBSUQSJOUFMBQTFE@UJNF 1Z4QBSLͰ࣮ߦ͢Δίʔυ
  14. ݕࡧͷεέʔϧΞ΢τ • 144ԯ͔ΒඞཁͳσʔλΛൈ͖ग़͢ͷʹ໿6.8ඵ·Ͱߴ଎Խ • 2೥લͷBigQueryͷϑϧεΩϟϯ͕120ԯϨίʔυͰ໿5ඵ • 10୆ͷϗετͰ໿36ഒͷߴ଎ԽΛ࣮ݱ • BigQuery͸Կඦ୆ɺԿઍ୆ͷϗετ͕ಉ࣌ʹಈ͍͍ͯΔ͔ෆ໌ •

    Amazon Elastic MapReduceͱͷൺֱ • 10୆ͷߏ੒Ͱ໿17ഒHayabusaͷํ͕ߴ଎ʹશจݕࡧՄೳ • γεςϜͷίετΛߟ͑ͨ৔߹ • ϦʔζφϒϧͰߴੑೳͳ෼ࢄݕࡧॲཧ͕࣮ݱͰ͖ͨ !38
  15. ஝ੵͷฒྻԽ • syslogͷෳ੡ͷ໰୊఺ • େྔͷσʔλʢύέοτʣͷෳ੡ͰଳҬΛѹഭ͢Δ • ຊདྷͰ͋Ε͹HDFSͷΑ͏ʹ෼ࢄϑΝΠϧγεςϜΛ࢖͏΂͖ • ϝλσʔλػߏΛܦ༝ͯ͠σʔλʹΞΫηε͢ΔͨΊຊ࣭తʹ஗͘ͳΔ •

    ෼ࢄϑΝΠϧγεςϜ͸ͱͯ΋೉͍͠ʢҰͭͷݚڀ෼໺ʣ • γϯϓϧ͞ͷ௥ٻͷ݁Ռ • อ࣋σʔλ͕ػثͷނোͰফࣦͨ͠ͱͯ͠΋ෳ੡͕࢒ΔɾނোػΛ֎͚ͩ͢ • ෼ࢄϑΝΠϧγεςϜͷΑ͏ʹ࠶഑ஔॲཧ͕ෆཁ !39
  16. γϯϓϧͳઃܭʹΑΔӡ༻ͷ؆ུԽ • ෼ࢄݕࡧ • Procedure / ConsumerϞσϧͰ࣮ݱ • ϓϩηε࣮ߦεέδϡʔϥ͸GNU Parallelʹґଘ

    • ෳࡶͳ෼ࢄγεςϜΛ࢖Θͳ͍ར఺ • τϥϒϧ೺Ѳͷߴ଎Խ • γεςϜӡ༻ෛՙͷܰݮ !40
  17. ଞͷγεςϜͱͷൺֱ • શจݕࡧͰApache Sparkͱൺֱͨ͠ • Elasticsearchͱͷൺֱ͸ʁ • Ͳ͏΍ͬͯൺ΂Δʁ • Τϯδϯͷ଎౓ʁʢElasticsearch͸ͱͯ΋ૣ͍ʣ

    • ݺͼग़͠APIͷ଎౓͸Ճຯ͢ΔʁʢREST APIݺͼग़͠͸ͱͯ΋஗͍ʣ • Write & Read • ॻ͖ͳ͕ΒಡΈࠐΜͩ৔߹͸ʁ