Upgrade to Pro — share decks privately, control downloads, hide ads and more …

データ基盤を作るのに使うようなサービスは単発で使うだけでも便利 / Manage to look over unmanaged data with fully managed services

polamjag
January 23, 2019

データ基盤を作るのに使うようなサービスは単発で使うだけでも便利 / Manage to look over unmanaged data with fully managed services

polamjag

January 23, 2019
Tweet

More Decks by polamjag

Other Decks in Programming

Transcript

  1. σʔλج൫Λ࡞Δͷʹ࢖͏Α͏ͳαʔϏε͸
    ୯ൃͰ࢖͏͚ͩͰ΋ศར
    ŠάϩʔεϋοΫͷจ຺Λఴ͑ͯ
    Hatena Engineer Seminar #11
    2019-01-23 @ Tokyo / 2019-01-30 @ Kyoto
    id:polamjag

    View full-size slide

  2. whoami
    • id:polamjag
    ϙ ϥ Ϝ δ ϟ ά
    • ͸ͯͳϒϩά
    • ΞϓϦέʔγϣϯΤϯδχΞ

    View full-size slide

  3. ࠓ೔࿩͢͜ͱ
    • େ౷Ұσʔλج൫ͷΑ͏ͳଘࡏ͕ແ͍ঢ়ଶͰ΋ɺ

    σʔλج൫Λ࡞Δͷʹ࢖͏Α͏ͳαʔϏε͸

    ΞυϗοΫͳ෼ੳʹ୯ൃͰ࢖͑Δ͚ͩͰ΋ศར
    • Google BigQuery ͱ Amazon Athena Ͱ

    ·ͣ͸໨ઌͷ໰୊͔Βղܾ

    View full-size slide

  4. ࠓ೔࿩͞ͳ͍͜ͱ
    • ͍͢͝σʔλج൫Λθϩ͔Β࡞͍ͬͯ͘ํ๏
    • ͍͢͝σʔλج൫Λ׆༻͢Δํ๏
    • ࣮ࡍͷࢪࡦͦͷ΋ͷʹ͍ͭͯ
    • ਺ࣈͷݟํɺଊ͑ํ

    View full-size slide

  5. άϩʔεϋοΫ
    https://trends.google.com/trends/explore?date=all&geo=JP&q=άϩʔεϋοΫ

    View full-size slide

  6. ͱ͍͑͹਺ࣈ

    View full-size slide

  7. (ผʹάϩʔεϋοΫʹݶΒͣ)
    ͪΐͬͱ͜͏͍͏਺ࣈΛ

    ݟͯΈ͍ͨΜ͚ͩͲɺͱ͍͏ͱ͖

    View full-size slide

  8. (ผʹάϩʔεϋοΫʹݶΒͣ)
    ͪΐͬͱ͜͏͍͏਺ࣈΛ

    ݟͯΈ͍ͨΜ͚ͩͲɺͱ͍͏ͱ͖
    αʔϏεͷσʔλϕʔε͔Βूܭ
    ϩάϑΝΠϧ܈͔Βूܭ

    View full-size slide

  9. αʔϏεͷσʔλϕʔε͔Βूܭ
    ϩάϑΝΠϧ܈͔Βूܭ
    (͸ͯͳϒϩά͸ MySQL ͳͷͰͦΕΛલఏʹ)

    View full-size slide

  10. • SQL Λී௨ʹୟ͘
    • ॻࣺͯεΫϦϓτΛॻ͍ͯूܭͨ͠ͷΛεϓϨο
    υγʔτʹΞοϓϩʔυ
    • ͦ͏͍͏ͷΛόονͱͯ͠ఆظ࣮ߦ͢Δ
    • ……

    View full-size slide

  11. • SQL Λී௨ʹୟ͘
    • ෳ਺ͷσʔλϕʔεʹ·͕ͨͬͯूܭ͍ͨ͠?
    • SELECT COUNT(*) ʹ͔͔࣌ؒΔΑ͏ͳςʔϒϧͰ΋ͱʹ͔͘ू
    ܭ͍ͨ͠?
    • ूܭΫΤϦΛߴ଎Խ͢ΔͨΊʹ௥ՃͰ index ཉ͍͚͠Ͳ

    ͱΓ͋͑ͣݟͯΈ͍ͨϨϕϧͷूܭͷͨΊʹ ALTER TABLE Ͱ͖Δ?
    • ྺ࢙తܦҢͰγϦΞϥΠζ͞Εͨσʔλ͕ೖͬͯΔΧϥϜ͕͋Δ
    ͷΛΫΤϦ͍ͨ͠?

    View full-size slide

  12. • SQL Λී௨ʹୟ͘
    • ୟ͍ͨ݁Ռͷڞ༗΍ՄࢹԽͱ͔Ͳ͏͢Δ?
    • ॻࣺͯεΫϦϓτΛॻ͍ͯूܭͨ͠ͷΛεϓϨου
    γʔτʹΞοϓϩʔυ
    • ͦ͏͍͏ͷΛόονͱͯ͠ఆظ࣮ߦ͢Δ
    • ͦͷεϓϨουγʔτ͸୭͕ϝϯςφϯε͢Δ?
    • …………

    View full-size slide

  13. !!! BigQuery !!!
    • ͏·͍͍҆଎͍ۜͷ஄ؙ
    • సૹʹ Embulk Λ࢖͍ɺ DSL ͷதͰσʔλΛ௚͔ͯ͠Β

    όϧΫϩʔσΟϯά͢Δ
    • ඞཁͳσʔλશ෦ूΊ͖ͯͨΒ౰વ JOIN ͳͲͰ͖Δ
    • σʔλྔ͕͍ͨͨ͜͠ͱͳ͍[ͲΕ͘Β͍?]ͳΒߟ͑ͳ͍ͱ͍
    ͚ͳ͍͜ͱ͸Ұ੾ͳ͍

    View full-size slide

  14. ΛϫϯγϣοτͰ࢖͏
    • Embulk ͰσʔλૹΔ
    • 1ճ͖ΓͳΒߟ͑ͳ͍ͱ͍͚ͳ͍͜ͱ͸গͳ͍
    • (ఆظ࣮ߦɾߋ৽ͷ͘͠ΈɺΠϯΫϦϝϯλϧϩʔυɺ……)
    • BigQuery ͰͳΜΒ͔ूܭ͢Δ
    • Google εϓϨουγʔτʹΤΫεϙʔτ͢ΔϘλϯ
    • σʔλϙʔλϧ (Data Studio) ʹಥͬࠐΉ

    View full-size slide

  15. BigQuery ʹΠϯϙʔτ͠ʹ͍͘ςʔϒϧ
    • ೔෇෼ׂͷ෼ׂςʔϒϧΛ࢖͍ʹ͍͘ςʔϒϧ
    • Ϩίʔυ͕ INSERT ͞Εͨ࣌ࠁɺΛද͢ΧϥϜ͕ແ͍
    • ্Ґ֓೦ͷͦ͏͍͏৘ใΛ JOIN ͖ͯͯͦ͠ΕͰύʔςΟγϣ
    ϯ͢Δ?
    • αʔϏεͷྺ࢙͕ 4000 ೔ (≒ 10.96 ೥) Ҏ্͋Δ
    • צͰύʔςΟγϣϯͷִؒΛؒҾ͘?
    • Α͘Α͘ߟ͑ͨΒຊ౰͸·ͩύʔςΟγϣχϯάඞཁͳ͍ͱ͔

    View full-size slide

  16. Google BigQuery ྉۚ
    • ౦ژϦʔδϣϯͳΒετϨʔδ୯Ձ͸ S3 ΑΓ͍҆
    • 0.023USD / GB / ݄
    • ΫΤϦྉۚ͸ 8.55 USD / 1 TB
    • ࠓͷҝସϨʔτͩͱ͍͍ͩͨ 1 GB ͋ͨΓ 1 ԁऑ
    • ͪΐͬͱͨ͠࢖͍ํͳΒ΄ͱΜͲޡࠩϨϕϧ

    View full-size slide

  17. ͜͜·Ͱͷ·ͱΊ
    • αʔϏεͷσʔλʹରͯ͠ϔϏʔͳूܭΛͨ͘͠
    ͳͬͨͱ͖ͷಀ͛ಓͱͯ͠ͷ Google BigQuery Λ
    ͝঺հ͠·ͨ͠

    View full-size slide

  18. αʔϏεͷσʔλϕʔε͔Βूܭ
    ϩάϑΝΠϧ܈͔Βूܭ

    View full-size slide

  19. ͦͷϩά͸ࠓͲ͜ʹ
    • Ͳ͔͜ͷαʔό্?
    • ͦΕͦ͜ BigQuery ͱ͔ RedShift ͱ͔?
    • Amazon S3 ʹஔ͔Ε͍ͯΔ?
    • ࣗ෼ͷ৔߹͸ݟ͔ͨͬͨ΋ͷ͕ S3 ʹஔ͔Εͯ
    ͍ͨ

    View full-size slide

  20. S3 ʹஔ͍ͯ͋ΔͳΒ Amazon Athena Ͱ
    • S3 ʹஔ͍ͯ͋ΔϑΝΠϧΛΫΤϦͰ͖ΔϚωʔδυαʔϏε
    • த਎͸ Presto (https://prestodb.io/)
    • Presto ΍ Hive ͷ஌͕ࣝ͋Δͱ࢖͍͜ͳͤΔͱ͍͏͜ͱ͕ଟ͍
    • ϑϧϚωʔδυͳͷͰɺूܭͷͨΊʹετϨʔδͱωοτϫʔΫ
    ڧΊͷϚγϯΛ༻ҙͨ͠Γɺͱ͍ͬͨ͜ͱΛ͠ͳ͍͍ͯ͘
    • ͓΋ΉΖʹ S3 ͷϑΝΠϧΛࢦఆͯ͠ CREATE TABLE ͯ͠ΫΤ
    Ϧ͢Δ͚ͩ

    View full-size slide

  21. • ೚ҙͷϑΥʔϚοτΛύʔεͯ͠ SQL ͰΫΤϦ
    • CSV/TSV, JSONL, ֤छΧϥϜφ (ྻࢤ޲) ϑΝΠϧͳͲ
    ͸ඪ४αϙʔτ
    • LTSV ͸ਖ਼نදݱͰύʔεͯ͠ΫΤϦͰ͖Δ
    • → ਖ਼نදݱͰύʔεͯ͠஋Λൈ͖ग़ͤΔϑΝΠϧͳ
    ΒԿͰ΋ΫΤϦͰ͖Δ
    • ύʔε͠ͳ͍͜ͱͰ grep as a service ͱͯ͠׆༻

    View full-size slide

  22. -- LTSV Λ Athena ͰΫΤϦ͢ΔͨΊʹύʔεͯ͠ΈΑ͏
    CREATE EXTERNAL TABLE IF NOT EXISTS `sugoi-service-logs` (
    `time` string,
    `host` string,
    `method` string,
    `path` string,
    `taken_sec` float,
    `ua` string,
    `referer` string
    )
    PARTITIONED BY (date string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    'serialization.format' = '1',
    'input.regex' = '^time:([^\\t]*)\\thost:([^\\t]*)\\tmethod:([^\\t]*)\\turi:
    ([^\\t]*)\\ttaken_sec:([^\\t]*)\\tua:([^\\t]*)\\treferer:([^\\t]*)$'
    ) LOCATION 's3://your-logs-bucket/sugoi-service-logs/'
    TBLPROPERTIES ('has_encrypted_data'='false')

    View full-size slide

  23. -- LTSV Λ Athena ͰΫΤϦ͢ΔͨΊʹύʔεͯ͠ΈΑ͏
    CREATE EXTERNAL TABLE IF NOT EXISTS `sugoi-service-logs` (
    `time` string,
    `host` string,
    `method` string,
    `path` string,
    `taken_sec` float,
    `ua` string,
    `referer` string
    )
    PARTITIONED BY (date string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    'serialization.format' = '1',
    'input.regex' = '^time:([^\\t]*)\\thost:([^\\t]*)\\tmethod:([^\\t]*)\\turi:
    ([^\\t]*)\\ttaken_sec:([^\\t]*)\\tua:([^\\t]*)\\treferer:([^\\t]*)$'
    ) LOCATION 's3://your-logs-bucket/sugoi-service-logs/'
    TBLPROPERTIES ('has_encrypted_data'='false')

    View full-size slide

  24. -- LTSV Λ Athena ͰΫΤϦ͢ΔͨΊʹύʔεͯ͠ΈΑ͏
    CREATE EXTERNAL TABLE IF NOT EXISTS `sugoi-service-logs` (
    `time` string,
    `host` string,
    `method` string,
    `path` string,
    `taken_sec` float,
    `ua` string,
    `referer` string
    )
    PARTITIONED BY (date string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
    'serialization.format' = '1',
    'input.regex' = '^time:([^\\t]*)\\thost:([^\\t]*)\\tmethod:([^\\t]*)\\turi:
    ([^\\t]*)\\ttaken_sec:([^\\t]*)\\tua:([^\\t]*)\\treferer:([^\\t]*)$'
    ) LOCATION 's3://your-logs-bucket/sugoi-service-logs/'
    TBLPROPERTIES ('has_encrypted_data'='false')
    ……

    View full-size slide

  25. ৘ใ
    • S3 ʹͲΜͲΜϩάΛอଘ͢Δͱ͖ɺHive ܗࣜͷσΟϨΫτϦͰ
    อଘͯ͠ΔͱύʔςΟγϣχϯά͕ࣗಈͰߦΘΕͯศར
    • ྫ: s3://bucket/somelogs/year=2019/month=1/
    day=23/engineer_seminar_001.log
    • ͦ͏ͳͬͯͳͯ͘΋ symlink ͰͳΜͱ͔Ͱ͖Δͱ͍͏࿩୊΋͋Δ
    • https://dev.classmethod.jp/cloud/aws/create-athena-table-
    with-symboliclink/

    View full-size slide

  26. Amazon Athena ஋ஈ
    • εΩϟϯͨ͠༰ྔʹରͯ͠՝ۚ (5 USD / 1 TB)
    • gzip ѹॖࡁϑΝΠϧ΋ͦͷ··ΫΤϦͰ͖Δ
    • ྻࢤ޲Ͱͳ͍ϑΝΠϧʹରͯ͠ΫΤϦ͢ΔͳΒɺूܭ
    ͢ΔΧϥϜΛߜͬͯ΋ྉۚઅ໿ʹ͸ͳΒͳ͍
    • ج൫ͱͯ͠׆༻͍ͯ͘͠ײ͡ͳΒɺ

    ྻࢤ޲ϑΝΠϧʹม׵͢ΔϑΣʔζ͕ͳ͍ͱݫͦ͠͏

    View full-size slide

  27. ͜͜·Ͱͷ·ͱΊ
    • Amazon Athena Λ࢖͏͜ͱͰɺ

    S3 ্ʹ͋ΔͳΒ͍͍ͩͨͲΜͳϩάͰ΋ͱ͘ʹ

    લ४උ͍͖ͤͣͳΓूܭͰ͖Δ

    View full-size slide

  28. • όϧΫϩʔυͯ͠͠·͑͹ͬͪ͜ͷ΋ͷͱ͍͏

    ؾ࣋ͪͰര଎ूܭͰ͖Δάοζ
    • όϧΫϩʔυ͸ؾ͕ॏ͍ͱ͍͏ϩάϑΝΠϧͰ΋ɺ

    ϑΝΠϧͷ͋Γ͔Λઃఆ͢Δ͚ͩͰɺ

    Կ΋४උΛ͍͖ͤͣͳΓूܭͰ͖Δάοζ
    • ໨ઌͷ໰୊͸݁ߏ͢͹΍͘ղܾͰ͖ΔΑ͏ʹ

    View full-size slide

  29. • ໨ઌͷ໰୊Λղܾ͠ଓ͚ͯΔͱී௨ʹج൫ͬΆ͍
    ΋ͷ͕ཉ͘͠ͳͬͯ͘Δ͠ɺ͜͏ͳͬͯΔͱྑ͍
    ͔ͳͱ͍͏ͷ͕ݟ͑ͯ͘Δ
    • ج൫͕͋Δͷʹӽͨ͜͠ͱ͸ͳ͍
    • ͍Ζ͍Ζࢼ͢͜ͱͰσʔλͱͷ޲͖߹͍ํͷ

    Πϝʔδ͕๲ΒΜͩͷ͸ྑ͔ͬͨ

    View full-size slide