Upgrade to Pro — share decks privately, control downloads, hide ads and more …

データ基盤を作るのに使うようなサービスは単発で使うだけでも便利 / Manage to look over unmanaged data with fully managed services

データ基盤を作るのに使うようなサービスは単発で使うだけでも便利 / Manage to look over unmanaged data with fully managed services

F9fa9b6225d62d8c0559e993906ca4f5?s=128

ABE Satoru

January 23, 2019
Tweet

Transcript

  1. σʔλج൫Λ࡞Δͷʹ࢖͏Α͏ͳαʔϏε͸ ୯ൃͰ࢖͏͚ͩͰ΋ศར ŠάϩʔεϋοΫͷจ຺Λఴ͑ͯ Hatena Engineer Seminar #11 2019-01-23 @ Tokyo

    / 2019-01-30 @ Kyoto id:polamjag
  2. whoami • id:polamjag ϙ ϥ Ϝ δ ϟ ά •

    ͸ͯͳϒϩά • ΞϓϦέʔγϣϯΤϯδχΞ
  3. ࠓ೔࿩͢͜ͱ • େ౷Ұσʔλج൫ͷΑ͏ͳଘࡏ͕ແ͍ঢ়ଶͰ΋ɺ
 σʔλج൫Λ࡞Δͷʹ࢖͏Α͏ͳαʔϏε͸
 ΞυϗοΫͳ෼ੳʹ୯ൃͰ࢖͑Δ͚ͩͰ΋ศར • Google BigQuery ͱ Amazon

    Athena Ͱ
 ·ͣ͸໨ઌͷ໰୊͔Βղܾ
  4. ࠓ೔࿩͞ͳ͍͜ͱ • ͍͢͝σʔλج൫Λθϩ͔Β࡞͍ͬͯ͘ํ๏ • ͍͢͝σʔλج൫Λ׆༻͢Δํ๏ • ࣮ࡍͷࢪࡦͦͷ΋ͷʹ͍ͭͯ • ਺ࣈͷݟํɺଊ͑ํ

  5. άϩʔεϋοΫ https://trends.google.com/trends/explore?date=all&geo=JP&q=άϩʔεϋοΫ

  6. ͱ͍͑͹਺ࣈ

  7. (ผʹάϩʔεϋοΫʹݶΒͣ) ͪΐͬͱ͜͏͍͏਺ࣈΛ
 ݟͯΈ͍ͨΜ͚ͩͲɺͱ͍͏ͱ͖

  8. (ผʹάϩʔεϋοΫʹݶΒͣ) ͪΐͬͱ͜͏͍͏਺ࣈΛ
 ݟͯΈ͍ͨΜ͚ͩͲɺͱ͍͏ͱ͖ αʔϏεͷσʔλϕʔε͔Βूܭ ϩάϑΝΠϧ܈͔Βूܭ

  9. αʔϏεͷσʔλϕʔε͔Βूܭ ϩάϑΝΠϧ܈͔Βूܭ (͸ͯͳϒϩά͸ MySQL ͳͷͰͦΕΛલఏʹ)

  10. • SQL Λී௨ʹୟ͘ • ॻࣺͯεΫϦϓτΛॻ͍ͯूܭͨ͠ͷΛεϓϨο υγʔτʹΞοϓϩʔυ • ͦ͏͍͏ͷΛόονͱͯ͠ఆظ࣮ߦ͢Δ • ……

  11. • SQL Λී௨ʹୟ͘ • ෳ਺ͷσʔλϕʔεʹ·͕ͨͬͯूܭ͍ͨ͠? • SELECT COUNT(*) ʹ͔͔࣌ؒΔΑ͏ͳςʔϒϧͰ΋ͱʹ͔͘ू ܭ͍ͨ͠?

    • ूܭΫΤϦΛߴ଎Խ͢ΔͨΊʹ௥ՃͰ index ཉ͍͚͠Ͳ
 ͱΓ͋͑ͣݟͯΈ͍ͨϨϕϧͷूܭͷͨΊʹ ALTER TABLE Ͱ͖Δ? • ྺ࢙తܦҢͰγϦΞϥΠζ͞Εͨσʔλ͕ೖͬͯΔΧϥϜ͕͋Δ ͷΛΫΤϦ͍ͨ͠?
  12. • SQL Λී௨ʹୟ͘ • ୟ͍ͨ݁Ռͷڞ༗΍ՄࢹԽͱ͔Ͳ͏͢Δ? • ॻࣺͯεΫϦϓτΛॻ͍ͯूܭͨ͠ͷΛεϓϨου γʔτʹΞοϓϩʔυ • ͦ͏͍͏ͷΛόονͱͯ͠ఆظ࣮ߦ͢Δ

    • ͦͷεϓϨουγʔτ͸୭͕ϝϯςφϯε͢Δ? • …………
  13. !!! BigQuery !!! • ͏·͍͍҆଎͍ۜͷ஄ؙ • సૹʹ Embulk Λ࢖͍ɺ DSL

    ͷதͰσʔλΛ௚͔ͯ͠Β
 όϧΫϩʔσΟϯά͢Δ • ඞཁͳσʔλશ෦ूΊ͖ͯͨΒ౰વ JOIN ͳͲͰ͖Δ • σʔλྔ͕͍ͨͨ͜͠ͱͳ͍[ͲΕ͘Β͍?]ͳΒߟ͑ͳ͍ͱ͍ ͚ͳ͍͜ͱ͸Ұ੾ͳ͍
  14. ΛϫϯγϣοτͰ࢖͏ • Embulk ͰσʔλૹΔ • 1ճ͖ΓͳΒߟ͑ͳ͍ͱ͍͚ͳ͍͜ͱ͸গͳ͍ • (ఆظ࣮ߦɾߋ৽ͷ͘͠ΈɺΠϯΫϦϝϯλϧϩʔυɺ……) • BigQuery

    ͰͳΜΒ͔ूܭ͢Δ • Google εϓϨουγʔτʹΤΫεϙʔτ͢ΔϘλϯ • σʔλϙʔλϧ (Data Studio) ʹಥͬࠐΉ
  15. BigQuery ʹΠϯϙʔτ͠ʹ͍͘ςʔϒϧ • ೔෇෼ׂͷ෼ׂςʔϒϧΛ࢖͍ʹ͍͘ςʔϒϧ • Ϩίʔυ͕ INSERT ͞Εͨ࣌ࠁɺΛද͢ΧϥϜ͕ແ͍ • ্Ґ֓೦ͷͦ͏͍͏৘ใΛ

    JOIN ͖ͯͯͦ͠ΕͰύʔςΟγϣ ϯ͢Δ? • αʔϏεͷྺ࢙͕ 4000 ೔ (≒ 10.96 ೥) Ҏ্͋Δ • צͰύʔςΟγϣϯͷִؒΛؒҾ͘? • Α͘Α͘ߟ͑ͨΒຊ౰͸·ͩύʔςΟγϣχϯάඞཁͳ͍ͱ͔
  16. Google BigQuery ྉۚ • ౦ژϦʔδϣϯͳΒετϨʔδ୯Ձ͸ S3 ΑΓ͍҆ • 0.023USD /

    GB / ݄ • ΫΤϦྉۚ͸ 8.55 USD / 1 TB • ࠓͷҝସϨʔτͩͱ͍͍ͩͨ 1 GB ͋ͨΓ 1 ԁऑ • ͪΐͬͱͨ͠࢖͍ํͳΒ΄ͱΜͲޡࠩϨϕϧ
  17. ͜͜·Ͱͷ·ͱΊ • αʔϏεͷσʔλʹରͯ͠ϔϏʔͳूܭΛͨ͘͠ ͳͬͨͱ͖ͷಀ͛ಓͱͯ͠ͷ Google BigQuery Λ ͝঺հ͠·ͨ͠

  18. αʔϏεͷσʔλϕʔε͔Βूܭ ϩάϑΝΠϧ܈͔Βूܭ

  19. ͦͷϩά͸ࠓͲ͜ʹ • Ͳ͔͜ͷαʔό্? • ͦΕͦ͜ BigQuery ͱ͔ RedShift ͱ͔? •

    Amazon S3 ʹஔ͔Ε͍ͯΔ? • ࣗ෼ͷ৔߹͸ݟ͔ͨͬͨ΋ͷ͕ S3 ʹஔ͔Εͯ ͍ͨ
  20. S3 ʹஔ͍ͯ͋ΔͳΒ Amazon Athena Ͱ • S3 ʹஔ͍ͯ͋ΔϑΝΠϧΛΫΤϦͰ͖ΔϚωʔδυαʔϏε • த਎͸

    Presto (https://prestodb.io/) • Presto ΍ Hive ͷ஌͕ࣝ͋Δͱ࢖͍͜ͳͤΔͱ͍͏͜ͱ͕ଟ͍ • ϑϧϚωʔδυͳͷͰɺूܭͷͨΊʹετϨʔδͱωοτϫʔΫ ڧΊͷϚγϯΛ༻ҙͨ͠Γɺͱ͍ͬͨ͜ͱΛ͠ͳ͍͍ͯ͘ • ͓΋ΉΖʹ S3 ͷϑΝΠϧΛࢦఆͯ͠ CREATE TABLE ͯ͠ΫΤ Ϧ͢Δ͚ͩ
  21. • ೚ҙͷϑΥʔϚοτΛύʔεͯ͠ SQL ͰΫΤϦ • CSV/TSV, JSONL, ֤छΧϥϜφ (ྻࢤ޲) ϑΝΠϧͳͲ

    ͸ඪ४αϙʔτ • LTSV ͸ਖ਼نදݱͰύʔεͯ͠ΫΤϦͰ͖Δ • → ਖ਼نදݱͰύʔεͯ͠஋Λൈ͖ग़ͤΔϑΝΠϧͳ ΒԿͰ΋ΫΤϦͰ͖Δ • ύʔε͠ͳ͍͜ͱͰ grep as a service ͱͯ͠׆༻
  22. -- LTSV Λ Athena ͰΫΤϦ͢ΔͨΊʹύʔεͯ͠ΈΑ͏ CREATE EXTERNAL TABLE IF NOT

    EXISTS `sugoi-service-logs` ( `time` string, `host` string, `method` string, `path` string, `taken_sec` float, `ua` string, `referer` string ) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '^time:([^\\t]*)\\thost:([^\\t]*)\\tmethod:([^\\t]*)\\turi: ([^\\t]*)\\ttaken_sec:([^\\t]*)\\tua:([^\\t]*)\\treferer:([^\\t]*)$' ) LOCATION 's3://your-logs-bucket/sugoi-service-logs/' TBLPROPERTIES ('has_encrypted_data'='false')
  23. -- LTSV Λ Athena ͰΫΤϦ͢ΔͨΊʹύʔεͯ͠ΈΑ͏ CREATE EXTERNAL TABLE IF NOT

    EXISTS `sugoi-service-logs` ( `time` string, `host` string, `method` string, `path` string, `taken_sec` float, `ua` string, `referer` string ) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '^time:([^\\t]*)\\thost:([^\\t]*)\\tmethod:([^\\t]*)\\turi: ([^\\t]*)\\ttaken_sec:([^\\t]*)\\tua:([^\\t]*)\\treferer:([^\\t]*)$' ) LOCATION 's3://your-logs-bucket/sugoi-service-logs/' TBLPROPERTIES ('has_encrypted_data'='false')
  24. -- LTSV Λ Athena ͰΫΤϦ͢ΔͨΊʹύʔεͯ͠ΈΑ͏ CREATE EXTERNAL TABLE IF NOT

    EXISTS `sugoi-service-logs` ( `time` string, `host` string, `method` string, `path` string, `taken_sec` float, `ua` string, `referer` string ) PARTITIONED BY (date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = '1', 'input.regex' = '^time:([^\\t]*)\\thost:([^\\t]*)\\tmethod:([^\\t]*)\\turi: ([^\\t]*)\\ttaken_sec:([^\\t]*)\\tua:([^\\t]*)\\treferer:([^\\t]*)$' ) LOCATION 's3://your-logs-bucket/sugoi-service-logs/' TBLPROPERTIES ('has_encrypted_data'='false') ……
  25. ৘ใ • S3 ʹͲΜͲΜϩάΛอଘ͢Δͱ͖ɺHive ܗࣜͷσΟϨΫτϦͰ อଘͯ͠ΔͱύʔςΟγϣχϯά͕ࣗಈͰߦΘΕͯศར • ྫ: s3://bucket/somelogs/year=2019/month=1/ day=23/engineer_seminar_001.log

    • ͦ͏ͳͬͯͳͯ͘΋ symlink ͰͳΜͱ͔Ͱ͖Δͱ͍͏࿩୊΋͋Δ • https://dev.classmethod.jp/cloud/aws/create-athena-table- with-symboliclink/
  26. Amazon Athena ஋ஈ • εΩϟϯͨ͠༰ྔʹରͯ͠՝ۚ (5 USD / 1 TB)

    • gzip ѹॖࡁϑΝΠϧ΋ͦͷ··ΫΤϦͰ͖Δ • ྻࢤ޲Ͱͳ͍ϑΝΠϧʹରͯ͠ΫΤϦ͢ΔͳΒɺूܭ ͢ΔΧϥϜΛߜͬͯ΋ྉۚઅ໿ʹ͸ͳΒͳ͍ • ج൫ͱͯ͠׆༻͍ͯ͘͠ײ͡ͳΒɺ
 ྻࢤ޲ϑΝΠϧʹม׵͢ΔϑΣʔζ͕ͳ͍ͱݫͦ͠͏
  27. ͜͜·Ͱͷ·ͱΊ • Amazon Athena Λ࢖͏͜ͱͰɺ
 S3 ্ʹ͋ΔͳΒ͍͍ͩͨͲΜͳϩάͰ΋ͱ͘ʹ
 લ४උ͍͖ͤͣͳΓूܭͰ͖Δ

  28. • όϧΫϩʔυͯ͠͠·͑͹ͬͪ͜ͷ΋ͷͱ͍͏
 ؾ࣋ͪͰര଎ूܭͰ͖Δάοζ • όϧΫϩʔυ͸ؾ͕ॏ͍ͱ͍͏ϩάϑΝΠϧͰ΋ɺ
 ϑΝΠϧͷ͋Γ͔Λઃఆ͢Δ͚ͩͰɺ
 Կ΋४උΛ͍͖ͤͣͳΓूܭͰ͖Δάοζ • ໨ઌͷ໰୊͸݁ߏ͢͹΍͘ղܾͰ͖ΔΑ͏ʹ

  29. • ໨ઌͷ໰୊Λղܾ͠ଓ͚ͯΔͱී௨ʹج൫ͬΆ͍ ΋ͷ͕ཉ͘͠ͳͬͯ͘Δ͠ɺ͜͏ͳͬͯΔͱྑ͍ ͔ͳͱ͍͏ͷ͕ݟ͑ͯ͘Δ • ج൫͕͋Δͷʹӽͨ͜͠ͱ͸ͳ͍ • ͍Ζ͍Ζࢼ͢͜ͱͰσʔλͱͷ޲͖߹͍ํͷ
 Πϝʔδ͕๲ΒΜͩͷ͸ྑ͔ͬͨ