Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Embulk / Presto / Sparkを用いたETL事情

24b18d2f8d7ceb0fbd521053993fba9e?s=47 kimutyam
February 15, 2018

Embulk / Presto / Sparkを用いたETL事情

PLAZMA OSS Day: TD Tech Talk 2018

24b18d2f8d7ceb0fbd521053993fba9e?s=128

kimutyam

February 15, 2018
Tweet

More Decks by kimutyam

Other Decks in Technology

Transcript

  1. PLAZMA OSS Day: TD Tech Talk 2018 @kimutyam 2018/02/15 Embulk

    / Presto / Spark Λ༻͍ͨETLࣄ৘
  2. ໊લ໦ଜজ޺ 5XJUUFS!LJNVUZBN ձࣾηϓςʔχɾΦϦδφϧ ηϓςʔχͷσʔλج൫ߏஙϓϩδΣΫτΛاըɾ্ཱͪ͛ɻ ϓϩδΣΫτͰ͸ओʹΞʔΩςΫνϟߏஙɾ಺෦඼࣭؅ཧΛݗҾ &NCVML%JHEBH5SFBTVSF%BUB4DBMB%%%*$0/*9 Profile

  3.  "SDIJUFDUVSF  )PX5P&5-  %BUB$POOFDUPSPO5SFBTVSF8PSLqPX  &NCVMLPO%JHEBH  %BUB$POOFDUPS1SFTUPPO5SFBTVSF8PSLqPX

     4QBSL%BUB$POOFDUPS  %FTJHO$PODFQU  4VNNBSZ Agenda
  4. 1. Architecture

  5. Overview 1MB[NB%# %BUB5BOL DynamoDB and so on… Lamdba EC2 RDS

    for DataMart RDS for DataMart S3 for Image/Video EC2 S3 for Data Lake Treasure Data for DWH EMR Internal System for Data Store API Gateway RDS
  6. Contents scope to talk about.. 1MB[NB%# and so on… S3

    for Data Lake Treasure Data for DWH EMR Internal System for Data Store
  7. 6ODPOUSPMMBCMF &YUFSOBM%BUB   ഔମࣾͷ޿ࠂߏ଄޿ࠂϨϙʔτӡ༻ϩά  SE1BSUZ"E4FSWFSͷ࣮੷ Business Restriction $POUSPMMBCMF

    *OUFSOBM%BUB   ૊৫ɾਓͷجຊ৘ใ  ސ٬৘ใ  ঎گ  ΞϓϦέʔγϣϯϩά and so on… ηϓςʔχͰ͸ඇৗʹଟ͘ͷഔମࣾͱSE1BSUZͷ֎෦σʔλΛϏδωεͰར༻ ͠·͢ɻ֎෦σʔλͷϑΥʔϚοτ͸զʑͰ͸ίϯτϩʔϧ͕Ͱ͖·ͤΜɻ
  8. Transform(Parse, Filter, Enrich etc..) S3 for Data Lake Treasure Data

    for DWH EMR ଟ͘ͷ֎෦σʔλߏ଄Λ&5-ॲཧͰ σʔλ׆༻ऀʹ؆୯ͳ42-ͰӾཡ͞ ͤΔ&5-ॲཧ ଟ͘ͷ৔߹ɺ5SBOTGPSNʹίετ͕ ͔͔Γ·ͨ͠ɻ
  9. 2. How to ETL

  10. 2-1. Data Connector on Treasure Workflow

  11. Overview S3 for Data Lake Treasure Data for DWH Data

    Connector ࢫຯ &NCVMLͷϗεςΟϯάͰ͋Δ͜ͱ 5SFBTVSF8PSLqPX্Ͱಈ࡞͢Δ͜ͱ  ؆୯ͳ %4-Ͱهड़Ͱ͖Δ͜ͱ Ͱ͖ͳ͍͜ͱ ݱ࣌఺   YNMͷ5SBOTGPSN  ഑ྻͷߦల։ FNCVMLpMUFSFYQBOE@KTPO@BSSBZ͕ཉ͍͠
  12. IN / OUT { "user": { "id": 1, "name": "kimutyam",

    "age": 29 }, "request": { "params": { "user_id": 1 } } } id name age 1 kimutyam 29 In Out
  13. ETL with Data Connector filters: - type: add_time to_column: name:

    time from_value: value: 0 unix_timestamp_unit: sec - type: expand_json json_column_name: record root: "$." expanded_columns: - {name: "user.id", type: long} - {name: "user.name", type: string} - {name: "user.age", type: long} out: type: td in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: "/foo/bar/user.json" parser: type: json charset: UTF-8 newline: LF ※ εϥΠυͷ౎߹্ɺ෼ׂ
  14. 2-2. Embulk on Digdag

  15. Overview S3 for Data Lake Treasure Data for DWH ࢫຯ

     044Ͱଟ͘ͷΤίγεςϜ͕ଘࡏ͢Δ ֦ுੑߴ   044ʹͳ͚Ε͹ϓϥάΠϯͭ͘Ε͹͍͍  ؆୯ͳ %4-Ͱهड़Ͱ͖Δ͜ͱ ೉఺  %JHEBH4FSWFSʹෛՙ͕͔͔Δ  FNCVMLFYFDVUPSNBQSFEVDFͰ෼ࢄ؀ڥʹ ಀ͕͢ํ๏͸͋Δ
  16. IN / OUT user_id friend_id name 2 1 Kimutyam 2

    3 john { "user_id" : 2, "friends" : [ { "friend_id": 1, "name": "kimutyam" }, { "friend_id": 3, "name": “john" }] } ഑ྻΛߦʹల։ In Out
  17. ETL with Embulk in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com

    access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: /foo/bar/friends.json parser: type: json filters: - type: expand_json json_column_name: record root: $. expanded_columns: - {name: user_id, type: long} keep_expanding_json_column: true - type: expand_json_array json_column_name: record root: $.friends expanded_column_name: friends - type: expand_json json_column_name: friends root: $. expanded_columns: - {name: friend_id, type: long, path: id} - {name: name, type: string} - type: remove_columns remove: ["record"] - type: add_time to_column: name: time type: timestamp from_value: value: 0 ※ εϥΠυͷ౎߹্ɺ෼ׂ out: type: td apikey: bar database: hoge table: foo
  18. [FYI] Develop Embulk Plugin with Scala (1) S3 for Data

    Lake Treasure Data for DWH FNCVMLQBSTFSUXJUUFS@BET@TUBUT $POUSJCVUPSLJNVUZBN UXJUUFSBETTUBUTͷυϝΠϯಛԽܕͷ1MVHJO ׆༻Ϣʔεέʔε  ଟஈͷωετͨ͠഑ྻΛಛఆͷ৚݅Ͱ'MBUUFO͍ͨ͠  ಛఆͷ৚݅Ͱෳ਺ͷςʔϒϧʹ0VUQVU͍ͨ͠  FNCVMLQBSTFSUXJUUFS@BET@TUBUTͰύϥϝʔλ Λάϧʔϐϯάͯ͠+40/ܕΛੜ੒  %BUB$POOFDUPSͷ&YQMPEF+TPO$PMVNOͰ 5SFBTVSF%BUBʹൃՐ
  19. [FYI] Develop Embulk Plugin with Scala (2) S3 for Data

    Lake Treasure Data for DWH FNCVMLQBSTFSYQBUI $POUSJCVUPSNBKJ,: 91BUI׬શରԠ +BWBඪ४ͷ9.-ύʔαʔΛར༻͍ͯ͠Δ
  20. 2-3. Data Connector & Presto on Treasure Workflow

  21. Overview Data Connector S3 for Data Lake Treasure Data for

    DWH ࢫຯ  1SFTUP͸5SFBTVSF%BUBϗεςΟϯά  %JHEBH4FSWFSෛՙܰݮ  ࣗલͷ෼ࢄ؀ڥ͕ෆཁ ೉఺  ϫʔΫϑϩʔͷهड़ྔ͕૿͑Δ  1SFTUPΫΤϦ͕೉ಡԽ͢Δ৔߹΋
  22. IN / OUT Id name type value 1 kimutyam Company

    Septeni Original,Inc. 1 kimutyam Language Scala 2 john Country America { "data": [ { "id": 1, "name": "kimutyam", "details": [ { "type": "Company", "value": "Septeni Original,Inc." }, { "type": "Language", "value": "Scala" }, ] }, { "id": 2, "name": "john", "details": [ { "type": "Country", "value": "America" } ] } ] } In Out
  23. Workflow timezone: UTC +load_raw_data: td_load>: raw_data.yml database: foo table: raw

    +transform_raw_data: td>: queries/insert.sql database: foo +drop_raw_table: td_ddl>: database: foo drop_tables: ["raw"] Data ConnectorͰJSONͷੜσʔλΛtd_load PrestoͰJSONղੳ ੜσʔλ࡟আ
  24. ETL with Data Connector filters: - type: add_time to_column: name:

    time from_value: value: 0 unix_timestamp_unit: sec out: type: td in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: “/foo/bar/user_details.json” parser: type: json charset: UTF-8 newline: LF ※ εϥΠυͷ౎߹্ɺ෼ׂ
  25. ETL with Presto SELECT time, id, name, CAST(detail['type'] AS VARCHAR)

    AS type, CAST(detail['value'] AS VARCHAR) AS value FROM details CROSS JOIN UNNEST(details) AS details(detail) INSERT INTO user_details WITH users AS ( SELECT time, CAST(JSON_EXTRACT(payload, '$.data') AS ARRAY<MAP<VARCHAR, JSON>>) AS data FROM raw ), details AS ( SELECT time, CAST(d['id'] AS INTEGER) AS id, CAST(d['name'] AS VARCHAR) AS name, CAST(d['details'] AS ARRAY<MAP<VARCHAR, JSON>>) AS details FROM users CROSS JOIN UNNEST(data) AS data(d) ) ※ εϥΠυͷ౎߹্ɺ෼ׂ
  26. 2-4. Spark & Data Connector

  27. S3 for Data Lake Overview Treasure Data for DWH EMR

    ࢫຯ  ෳ਺ͷग़ྗઌʹ&5-ॲཧ݁ՌΛ0VUQVUͰ͖Δ  %JHEBH4FSWFSͷෛՙ෼ࢄ  ཁ݅ʹԠͯ͡෼ࢄॲཧϑϨʔϜϫʔΫΛબ୒Ͱ͖Δ ೉఺ ݱ࣌఺   ෼ࢄ؀ڥΛࣗલͰ༻ҙ͢Δඞཁ͕͋Δ  ίʔυΛॻ͘ඞཁ͕͋Δ
  28. In / Out <?xml version="1.0" encoding="utf-8"?> <users> <user> <id>1</id> <name>kimutyam</name>

    </user> <user> <id>2</id> <name>john</name> </user> </users> Id name 1 kimutyam 2 john In Out
  29. Workflow timezone: UTC +xml_to_csv: sh>: xml_to_csv.sh +load_csv: td_load>: csv_load.yml database:

    foo table: bar SparkͰData Lake͔ΒETLॲཧΛͯ͠CSVΛ2࣍੥͚ͷS3ʹ Data ConnectorͰ2࣍੥͚ͷS3͔ΒTDʹٵ্͍͛ %JHEBHͷFNS͕BQOPSUIFBTUͰ͸ಈ࡞ͤͣ *TTVF  ୅ΘΓʹsh>: ͰAWS CLI࢖ͬͯىಈɾఀࢭͨ͠
  30. ETL with Spark object SampleApp extends App{ val inputPath =

    “s3://foo/bar/input/“ val outputPath = “s3://foo/bar/output” LoanPatten.using( SparkSession.builder().appName("sample").master("local").getOrCreate() ) { ss => val rowTag = "user" val df = ss.read .format("com.databricks.spark.xml") .option("rowTag", rowTag) .load(inputPath) df.select( "id", "name" ) .toDF( "user_id", "name" ) .coalesce(128) .write .mode(SaveMode.Overwrite) .option("header", true) .option("codec", "gzip") .csv(outputPath) } } object LoanPattern { type Stoppable = {def stop(): Unit} def using[R <: Stoppable, A](session: R)(f: R => A): A = try { f(session) } finally { ignoring(classOf[Throwable]) apply { session.stop() } } }
  31. ETL with Data Connector in: type: s3 bucket: hoge endpoint:

    s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: "/foo/bar/user.csv" parser: type: csv charset: UTF-8 newline: CRLF delimiter: "," quote: '"' escape: '"' skip_header_lines: 1 columns: - {name: user_id, type: long} - {name: name, type: string} out: type: td
  32. 2-5. Design Concept

  33. Overview

  34. 3. Summary

  35. Summary ηϓςʔχɾΦϦδφϧͰߦ͍ͬͯΔ&5-ॲཧʹ͍͓ͭͯ࿩͠·ͨ͠ɻ ηϓςʔχɾΦϦδφϧͰ͸σʔλυϦϒϯͳ૊৫ܦӦʹ஫ྗ͓ͯ͠Γɺ σʔλج൫։ൃ͕த֩ͷͭʹͳ͍ͬͯ·͢ɻ ࠓޙߦ͏͜ͱ σʔλ඼࣭ͷ޲্ σʔλ׆༻ࣄྫͷ֦ு ૊৫಺σʔλͷू໿

  36. ηϓςʔχɾΦϦδφϧͰ͸Πϯλʔωοτ޿ࠂͷσʔλج൫ΛҰॹʹҭͯΔ σʔλΤϯδχΞΛืू͍ͯ͠·͢

  37. ͝੩ௌ͋Γ͕ͱ͏͍͟͝·ͨ͠