
ETL with Embulk / Presto / Spark

kimutyam
February 15, 2018

PLAZMA OSS Day: TD Tech Talk 2018

Transcript

1. Overview: the overall architecture. PlazmaDB, DataTank, DynamoDB, Lambda, EC2, RDS for DataMart, S3 for Image/Video, S3 for Data Lake, Treasure Data for DWH, EMR, Internal System for Data Store, API Gateway, RDS, and so on…
2. Contents: the scope of this talk is S3 for Data Lake, Treasure Data for DWH, EMR, and the Internal System for Data Store (PlazmaDB and so on… are out of scope).
3. Uncontrollable external data (a business restriction): media companies' ad structures, ad reports, operational logs, and results from 3rd-party ad servers. Controllable internal data: basic organization and personnel data, customer data, sales data, application logs, and so on… At Septeni we use a very large amount of external data from media companies and 3rd parties in our business, and the formats of that external data are not under our control.
4. Transform (Parse, Filter, Enrich etc..): S3 for Data Lake, Treasure Data for DWH, EMR. The ETL jobs take many different external data structures and make them queryable with simple SQL for data consumers. In most cases, the Transform step was where the cost went.
5. Overview: S3 for Data Lake, Data Connector, Treasure Data for DWH. Strengths: it is a hosted Embulk; it runs on Treasure Workflow; jobs are written in a simple DSL. What it cannot do (at present): transform XML; expand an array into rows (we want an embulk-filter-expand_json_array).
6. IN / OUT

In:

{
  "user": {
    "id": 1,
    "name": "kimutyam",
    "age": 29
  },
  "request": {
    "params": {
      "user_id": 1
    }
  }
}

Out:

id  name      age
1   kimutyam  29
7. ETL with Data Connector (the config is shown split in two on the slide for layout reasons):

in:
  type: s3
  bucket: hoge
  endpoint: s3-ap-northeast-1.amazonaws.com
  access_key_id: ${secret:s3.access_key_id}
  secret_access_key: ${secret:s3.secret_access_key}
  path_prefix: "/foo/bar/user.json"
  parser:
    type: json
    charset: UTF-8
    newline: LF
filters:
  - type: add_time
    to_column:
      name: time
    from_value:
      value: 0
    unix_timestamp_unit: sec
  - type: expand_json
    json_column_name: record
    root: "$."
    expanded_columns:
      - {name: "user.id", type: long}
      - {name: "user.name", type: string}
      - {name: "user.age", type: long}
out:
  type: td
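A config like this is kicked from Treasure Workflow with the td_load> operator (the same operator that appears on slide 15). A minimal sketch, assuming the config above is saved as user_load.yml; the file, database, and table names are hypothetical:

# Hypothetical workflow snippet; names are placeholders.
timezone: UTC

+load_users:
  td_load>: user_load.yml   # runs the Data Connector (hosted Embulk) job above
  database: foo
  table: users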
8. Overview: S3 for Data Lake, Embulk, Treasure Data for DWH. Strengths: being OSS, a large plugin ecosystem already exists (highly extensible); if a plugin doesn't exist, you can just write one; jobs are written in a simple DSL. Drawbacks: it puts load on the Digdag server; there is a way to escape to a distributed environment with embulk-executor-mapreduce (a sketch follows below).
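For that embulk-executor-mapreduce escape hatch, the executor is swapped in the exec: section of the Embulk config so that tasks run as Hadoop jobs. A rough sketch, assuming a standard Hadoop client setup; the keys below follow the plugin's documentation as I recall it and should be verified against its README:

# Assumed keys; verify against the embulk-executor-mapreduce README.
exec:
  type: mapreduce
  config_files:               # Hadoop client configs (placeholder paths)
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/mapred-site.xml
  state_path: /tmp/embulk/    # where intermediate task state is kept (assumed)
# in:, filters:, and out: stay exactly as in a normal Embulk config.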
9. IN / OUT: expanding the array into rows.

In:

{
  "user_id": 2,
  "friends": [
    { "friend_id": 1, "name": "kimutyam" },
    { "friend_id": 3, "name": "john" }
  ]
}

Out:

user_id  friend_id  name
2        1          kimutyam
2        3          john
10. ETL with Embulk (the config is shown split on the slide for layout reasons):

in:
  type: s3
  bucket: hoge
  endpoint: s3-ap-northeast-1.amazonaws.com
  access_key_id: ${secret:s3.access_key_id}
  secret_access_key: ${secret:s3.secret_access_key}
  path_prefix: /foo/bar/friends.json
  parser:
    type: json
filters:
  - type: expand_json
    json_column_name: record
    root: $.
    expanded_columns:
      - {name: user_id, type: long}
    keep_expanding_json_column: true
  - type: expand_json_array
    json_column_name: record
    root: $.friends
    expanded_column_name: friends
  - type: expand_json
    json_column_name: friends
    root: $.
    expanded_columns:
      - {name: friend_id, type: long, path: id}
      - {name: name, type: string}
  - type: remove_columns
    remove: ["record"]
  - type: add_time
    to_column:
      name: time
      type: timestamp
    from_value:
      value: 0
out:
  type: td
  apikey: bar
  database: hoge
  table: foo
11. [FYI] Develop Embulk Plugin with Scala (1): embulk-parser-twitter_ads_stats (contributor: kimutyam), a plugin specialized for the twitter ads stats domain. Use case: flatten multi-level nested arrays on specific conditions, and output to multiple tables on specific conditions. embulk-parser-twitter_ads_stats groups the parameters and generates JSON-typed columns, and Data Connector's expand-JSON step then fires them into Treasure Data.
12. [FYI] Develop Embulk Plugin with Scala (2): embulk-parser-xpath (contributor: maji-KY). Full XPath support; it uses the Java standard XML parser.
13. Overview: Data Connector, S3 for Data Lake, Treasure Data for DWH, with Presto doing the transform. Strengths: Presto is hosted by Treasure Data; the load on the Digdag server is reduced; no distributed environment of your own is needed. Drawbacks: the workflow takes more lines to write; the Presto queries can become hard to read.
14. IN / OUT

In:

{
  "data": [
    {
      "id": 1,
      "name": "kimutyam",
      "details": [
        { "type": "Company", "value": "Septeni Original,Inc." },
        { "type": "Language", "value": "Scala" }
      ]
    },
    {
      "id": 2,
      "name": "john",
      "details": [
        { "type": "Country", "value": "America" }
      ]
    }
  ]
}

Out:

id  name      type      value
1   kimutyam  Company   Septeni Original,Inc.
1   kimutyam  Language  Scala
2   john      Country   America
15. Workflow: td_load the raw JSON with Data Connector, parse the JSON with Presto, then drop the raw data.

timezone: UTC

+load_raw_data:
  td_load>: raw_data.yml
  database: foo
  table: raw

+transform_raw_data:
  td>: queries/insert.sql
  database: foo

+drop_raw_table:
  td_ddl>:
  database: foo
  drop_tables: ["raw"]
16. ETL with Data Connector (the config is shown split on the slide for layout reasons):

in:
  type: s3
  bucket: hoge
  endpoint: s3-ap-northeast-1.amazonaws.com
  access_key_id: ${secret:s3.access_key_id}
  secret_access_key: ${secret:s3.secret_access_key}
  path_prefix: "/foo/bar/user_details.json"
  parser:
    type: json
    charset: UTF-8
    newline: LF
filters:
  - type: add_time
    to_column:
      name: time
    from_value:
      value: 0
    unix_timestamp_unit: sec
out:
  type: td
17. ETL with Presto (the query is shown split on the slide for layout reasons):

INSERT INTO user_details
WITH users AS (
  SELECT
    time,
    CAST(JSON_EXTRACT(payload, '$.data') AS ARRAY<MAP<VARCHAR, JSON>>) AS data
  FROM raw
),
details AS (
  SELECT
    time,
    CAST(d['id'] AS INTEGER) AS id,
    CAST(d['name'] AS VARCHAR) AS name,
    CAST(d['details'] AS ARRAY<MAP<VARCHAR, JSON>>) AS details
  FROM users
  CROSS JOIN UNNEST(data) AS data(d)
)
SELECT
  time,
  id,
  name,
  CAST(detail['type'] AS VARCHAR) AS type,
  CAST(detail['value'] AS VARCHAR) AS value
FROM details
CROSS JOIN UNNEST(details) AS details(detail)
18. Overview: S3 for Data Lake, EMR, Treasure Data for DWH. Strengths: the ETL results can be output to multiple destinations; the load is spread away from the Digdag server; you can pick a distributed processing framework to match the requirements. Drawbacks (at present): you have to provision the distributed environment yourself; you have to write code.
19. In / Out

In:

<?xml version="1.0" encoding="utf-8"?>
<users>
  <user>
    <id>1</id>
    <name>kimutyam</name>
  </user>
  <user>
    <id>2</id>
    <name>john</name>
  </user>
</users>

Out:

id  name
1   kimutyam
2   john
20. Workflow: Spark runs the ETL over the Data Lake and writes CSV to a secondary (staging) S3 bucket, then Data Connector loads it from that S3 bucket into TD. Digdag's emr> operator did not work in ap-northeast-1 (there is an Issue), so instead we started and stopped the cluster with the AWS CLI via sh>: (a hypothetical sketch follows below).

timezone: UTC

+xml_to_csv:
  sh>: xml_to_csv.sh

+load_csv:
  td_load>: csv_load.yml
  database: foo
  table: bar
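The deck does not show xml_to_csv.sh itself, so here is a hypothetical sketch of that AWS CLI workaround inlined into the workflow. The region comes from the slide; the release label, instance settings, jar path, and class name are placeholders. Combining --steps with --auto-terminate lets the cluster shut itself down once the Spark step finishes:

# All cluster parameters are placeholders; adjust to the real account setup.
+xml_to_csv:
  sh>: >-
    aws emr create-cluster
    --region ap-northeast-1
    --release-label emr-5.11.0
    --applications Name=Spark
    --use-default-roles
    --instance-type m4.xlarge
    --instance-count 3
    --steps Type=Spark,Name=xml_to_csv,Args=[--class,SampleApp,s3://foo/bar/sample-app.jar]
    --auto-terminate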
21. ETL with Spark: read the XML from the Data Lake with spark-xml and write gzipped CSV back to S3.

import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.language.reflectiveCalls
import scala.util.control.Exception.ignoring

object SampleApp extends App {
  val inputPath = "s3://foo/bar/input/"
  val outputPath = "s3://foo/bar/output"

  LoanPattern.using(
    SparkSession.builder().appName("sample").master("local").getOrCreate()
  ) { ss =>
    // Each <user> element becomes one row.
    val rowTag = "user"
    val df = ss.read
      .format("com.databricks.spark.xml")
      .option("rowTag", rowTag)
      .load(inputPath)

    // Rename id to user_id and write gzipped CSV with a header row.
    df.select("id", "name")
      .toDF("user_id", "name")
      .coalesce(128)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("codec", "gzip")
      .csv(outputPath)
  }
}

object LoanPattern {
  type Stoppable = {def stop(): Unit}

  // Loan pattern: lend the session to f and always stop it afterwards,
  // ignoring any exception thrown by stop() itself.
  def using[R <: Stoppable, A](session: R)(f: R => A): A =
    try {
      f(session)
    } finally {
      ignoring(classOf[Throwable]) apply {
        session.stop()
      }
    }
}
22. ETL with Data Connector:

in:
  type: s3
  bucket: hoge
  endpoint: s3-ap-northeast-1.amazonaws.com
  access_key_id: ${secret:s3.access_key_id}
  secret_access_key: ${secret:s3.secret_access_key}
  path_prefix: "/foo/bar/user.csv"
  parser:
    type: csv
    charset: UTF-8
    newline: CRLF
    delimiter: ","
    quote: '"'
    escape: '"'
    skip_header_lines: 1
    columns:
      - {name: user_id, type: long}
      - {name: name, type: string}
out:
  type: td