Slide 1

Slide 1 text

PLAZMA OSS Day: TD Tech Talk 2018 @kimutyam 2018/02/15 Embulk / Presto / Spark Λ༻͍ͨETLࣄ৘

Slide 2

Slide 2 text

໊લ໦ଜজ޺ 5XJUUFS!LJNVUZBN ձࣾηϓςʔχɾΦϦδφϧ ηϓςʔχͷσʔλج൫ߏஙϓϩδΣΫτΛاըɾ্ཱͪ͛ɻ ϓϩδΣΫτͰ͸ओʹΞʔΩςΫνϟߏஙɾ಺෦඼࣭؅ཧΛݗҾ &NCVML%JHEBH5SFBTVSF%BUB4DBMB%%%*$0/*9 Profile

Slide 3

Slide 3 text

"SDIJUFDUVSF )PX5P&5- %BUB$POOFDUPSPO5SFBTVSF8PSLqPX &NCVMLPO%JHEBH %BUB$POOFDUPS1SFTUPPO5SFBTVSF8PSLqPX 4QBSL%BUB$POOFDUPS %FTJHO$PODFQU 4VNNBSZ Agenda

Slide 4

Slide 4 text

1. Architecture

Slide 5

Slide 5 text

Overview 1MB[NB%# %BUB5BOL DynamoDB and so on… Lamdba EC2 RDS for DataMart RDS for DataMart S3 for Image/Video EC2 S3 for Data Lake Treasure Data for DWH EMR Internal System for Data Store API Gateway RDS

Slide 6

Slide 6 text

Contents scope to talk about.. 1MB[NB%# and so on… S3 for Data Lake Treasure Data for DWH EMR Internal System for Data Store

Slide 7

Slide 7 text

6ODPOUSPMMBCMF &YUFSOBM%BUB ഔମࣾͷ޿ࠂߏ଄޿ࠂϨϙʔτӡ༻ϩά SE1BSUZ"E4FSWFSͷ࣮੷ Business Restriction $POUSPMMBCMF *OUFSOBM%BUB ૊৫ɾਓͷجຊ৘ใ ސ٬৘ใ ঎گ ΞϓϦέʔγϣϯϩά and so on… ηϓςʔχͰ͸ඇৗʹଟ͘ͷഔମࣾͱSE1BSUZͷ֎෦σʔλΛϏδωεͰར༻ ͠·͢ɻ֎෦σʔλͷϑΥʔϚοτ͸զʑͰ͸ίϯτϩʔϧ͕Ͱ͖·ͤΜɻ

Slide 8

Slide 8 text

Transform(Parse, Filter, Enrich etc..) S3 for Data Lake Treasure Data for DWH EMR ଟ͘ͷ֎෦σʔλߏ଄Λ&5-ॲཧͰ σʔλ׆༻ऀʹ؆୯ͳ42-ͰӾཡ͞ ͤΔ&5-ॲཧ ଟ͘ͷ৔߹ɺ5SBOTGPSNʹίετ͕ ͔͔Γ·ͨ͠ɻ

Slide 9

Slide 9 text

2. How to ETL

Slide 10

Slide 10 text

2-1. Data Connector on Treasure Workflow

Slide 11

Slide 11 text

Overview S3 for Data Lake Treasure Data for DWH Data Connector ࢫຯ &NCVMLͷϗεςΟϯάͰ͋Δ͜ͱ 5SFBTVSF8PSLqPX্Ͱಈ࡞͢Δ͜ͱ ؆୯ͳ %4-Ͱهड़Ͱ͖Δ͜ͱ Ͱ͖ͳ͍͜ͱ ݱ࣌఺ YNMͷ5SBOTGPSN ഑ྻͷߦల։ FNCVMLpMUFSFYQBOE@KTPO@BSSBZ͕ཉ͍͠

Slide 12

Slide 12 text

IN / OUT { "user": { "id": 1, "name": "kimutyam", "age": 29 }, "request": { "params": { "user_id": 1 } } } id name age 1 kimutyam 29 In Out

Slide 13

Slide 13 text

ETL with Data Connector filters: - type: add_time to_column: name: time from_value: value: 0 unix_timestamp_unit: sec - type: expand_json json_column_name: record root: "$." expanded_columns: - {name: "user.id", type: long} - {name: "user.name", type: string} - {name: "user.age", type: long} out: type: td in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: "/foo/bar/user.json" parser: type: json charset: UTF-8 newline: LF ※ εϥΠυͷ౎߹্ɺ෼ׂ

Slide 14

Slide 14 text

2-2. Embulk on Digdag

Slide 15

Slide 15 text

Overview S3 for Data Lake Treasure Data for DWH ࢫຯ 044Ͱଟ͘ͷΤίγεςϜ͕ଘࡏ͢Δ ֦ுੑߴ 044ʹͳ͚Ε͹ϓϥάΠϯͭ͘Ε͹͍͍ ؆୯ͳ %4-Ͱهड़Ͱ͖Δ͜ͱ ೉఺ %JHEBH4FSWFSʹෛՙ͕͔͔Δ FNCVMLFYFDVUPSNBQSFEVDFͰ෼ࢄ؀ڥʹ ಀ͕͢ํ๏͸͋Δ

Slide 16

Slide 16 text

IN / OUT user_id friend_id name 2 1 Kimutyam 2 3 john { "user_id" : 2, "friends" : [ { "friend_id": 1, "name": "kimutyam" }, { "friend_id": 3, "name": “john" }] } ഑ྻΛߦʹల։ In Out

Slide 17

Slide 17 text

ETL with Embulk in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: /foo/bar/friends.json parser: type: json filters: - type: expand_json json_column_name: record root: $. expanded_columns: - {name: user_id, type: long} keep_expanding_json_column: true - type: expand_json_array json_column_name: record root: $.friends expanded_column_name: friends - type: expand_json json_column_name: friends root: $. expanded_columns: - {name: friend_id, type: long, path: id} - {name: name, type: string} - type: remove_columns remove: ["record"] - type: add_time to_column: name: time type: timestamp from_value: value: 0 ※ εϥΠυͷ౎߹্ɺ෼ׂ out: type: td apikey: bar database: hoge table: foo

Slide 18

Slide 18 text

[FYI] Develop Embulk Plugin with Scala (1) S3 for Data Lake Treasure Data for DWH FNCVMLQBSTFSUXJUUFS@BET@TUBUT $POUSJCVUPSLJNVUZBN UXJUUFSBETTUBUTͷυϝΠϯಛԽܕͷ1MVHJO ׆༻Ϣʔεέʔε ଟஈͷωετͨ͠഑ྻΛಛఆͷ৚݅Ͱ'MBUUFO͍ͨ͠ ಛఆͷ৚݅Ͱෳ਺ͷςʔϒϧʹ0VUQVU͍ͨ͠ FNCVMLQBSTFSUXJUUFS@BET@TUBUTͰύϥϝʔλ Λάϧʔϐϯάͯ͠+40/ܕΛੜ੒ %BUB$POOFDUPSͷ&YQMPEF+TPO$PMVNOͰ 5SFBTVSF%BUBʹൃՐ

Slide 19

Slide 19 text

[FYI] Develop Embulk Plugin with Scala (2) S3 for Data Lake Treasure Data for DWH FNCVMLQBSTFSYQBUI $POUSJCVUPSNBKJ,: 91BUI׬શରԠ +BWBඪ४ͷ9.-ύʔαʔΛར༻͍ͯ͠Δ

Slide 20

Slide 20 text

2-3. Data Connector & Presto on Treasure Workflow

Slide 21

Slide 21 text

Overview Data Connector S3 for Data Lake Treasure Data for DWH ࢫຯ 1SFTUP͸5SFBTVSF%BUBϗεςΟϯά %JHEBH4FSWFSෛՙܰݮ ࣗલͷ෼ࢄ؀ڥ͕ෆཁ ೉఺ ϫʔΫϑϩʔͷهड़ྔ͕૿͑Δ 1SFTUPΫΤϦ͕೉ಡԽ͢Δ৔߹΋

Slide 22

Slide 22 text

IN / OUT Id name type value 1 kimutyam Company Septeni Original,Inc. 1 kimutyam Language Scala 2 john Country America { "data": [ { "id": 1, "name": "kimutyam", "details": [ { "type": "Company", "value": "Septeni Original,Inc." }, { "type": "Language", "value": "Scala" }, ] }, { "id": 2, "name": "john", "details": [ { "type": "Country", "value": "America" } ] } ] } In Out

Slide 23

Slide 23 text

Workflow timezone: UTC +load_raw_data: td_load>: raw_data.yml database: foo table: raw +transform_raw_data: td>: queries/insert.sql database: foo +drop_raw_table: td_ddl>: database: foo drop_tables: ["raw"] Data ConnectorͰJSONͷੜσʔλΛtd_load PrestoͰJSONղੳ ੜσʔλ࡟আ

Slide 24

Slide 24 text

ETL with Data Connector filters: - type: add_time to_column: name: time from_value: value: 0 unix_timestamp_unit: sec out: type: td in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: “/foo/bar/user_details.json” parser: type: json charset: UTF-8 newline: LF ※ εϥΠυͷ౎߹্ɺ෼ׂ

Slide 25

Slide 25 text

ETL with Presto SELECT time, id, name, CAST(detail['type'] AS VARCHAR) AS type, CAST(detail['value'] AS VARCHAR) AS value FROM details CROSS JOIN UNNEST(details) AS details(detail) INSERT INTO user_details WITH users AS ( SELECT time, CAST(JSON_EXTRACT(payload, '$.data') AS ARRAY>) AS data FROM raw ), details AS ( SELECT time, CAST(d['id'] AS INTEGER) AS id, CAST(d['name'] AS VARCHAR) AS name, CAST(d['details'] AS ARRAY>) AS details FROM users CROSS JOIN UNNEST(data) AS data(d) ) ※ εϥΠυͷ౎߹্ɺ෼ׂ

Slide 26

Slide 26 text

2-4. Spark & Data Connector

Slide 27

Slide 27 text

S3 for Data Lake Overview Treasure Data for DWH EMR ࢫຯ ෳ਺ͷग़ྗઌʹ&5-ॲཧ݁ՌΛ0VUQVUͰ͖Δ %JHEBH4FSWFSͷෛՙ෼ࢄ ཁ݅ʹԠͯ͡෼ࢄॲཧϑϨʔϜϫʔΫΛબ୒Ͱ͖Δ ೉఺ ݱ࣌఺ ෼ࢄ؀ڥΛࣗલͰ༻ҙ͢Δඞཁ͕͋Δ ίʔυΛॻ͘ඞཁ͕͋Δ

Slide 28

Slide 28 text

In / Out 1 kimutyam 2 john Id name 1 kimutyam 2 john In Out

Slide 29

Slide 29 text

Workflow timezone: UTC +xml_to_csv: sh>: xml_to_csv.sh +load_csv: td_load>: csv_load.yml database: foo table: bar SparkͰData Lake͔ΒETLॲཧΛͯ͠CSVΛ2࣍੥͚ͷS3ʹ Data ConnectorͰ2࣍੥͚ͷS3͔ΒTDʹٵ্͍͛ %JHEBHͷFNS͕BQOPSUIFBTUͰ͸ಈ࡞ͤͣ *TTVF ୅ΘΓʹsh>: ͰAWS CLI࢖ͬͯىಈɾఀࢭͨ͠

Slide 30

Slide 30 text

ETL with Spark object SampleApp extends App{ val inputPath = “s3://foo/bar/input/“ val outputPath = “s3://foo/bar/output” LoanPatten.using( SparkSession.builder().appName("sample").master("local").getOrCreate() ) { ss => val rowTag = "user" val df = ss.read .format("com.databricks.spark.xml") .option("rowTag", rowTag) .load(inputPath) df.select( "id", "name" ) .toDF( "user_id", "name" ) .coalesce(128) .write .mode(SaveMode.Overwrite) .option("header", true) .option("codec", "gzip") .csv(outputPath) } } object LoanPattern { type Stoppable = {def stop(): Unit} def using[R <: Stoppable, A](session: R)(f: R => A): A = try { f(session) } finally { ignoring(classOf[Throwable]) apply { session.stop() } } }

Slide 31

Slide 31 text

ETL with Data Connector in: type: s3 bucket: hoge endpoint: s3-ap-northeast-1.amazonaws.com access_key_id: ${secret:s3.access_key_id} secret_access_key: ${secret:s3.secret_access_key} path_prefix: "/foo/bar/user.csv" parser: type: csv charset: UTF-8 newline: CRLF delimiter: "," quote: '"' escape: '"' skip_header_lines: 1 columns: - {name: user_id, type: long} - {name: name, type: string} out: type: td

Slide 32

Slide 32 text

2-5. Design Concept

Slide 33

Slide 33 text

Overview

Slide 34

Slide 34 text

3. Summary

Slide 35

Slide 35 text

Summary ηϓςʔχɾΦϦδφϧͰߦ͍ͬͯΔ&5-ॲཧʹ͍͓ͭͯ࿩͠·ͨ͠ɻ ηϓςʔχɾΦϦδφϧͰ͸σʔλυϦϒϯͳ૊৫ܦӦʹ஫ྗ͓ͯ͠Γɺ σʔλج൫։ൃ͕த֩ͷͭʹͳ͍ͬͯ·͢ɻ ࠓޙߦ͏͜ͱ σʔλ඼࣭ͷ޲্ σʔλ׆༻ࣄྫͷ֦ு ૊৫಺σʔλͷू໿

Slide 36

Slide 36 text

ηϓςʔχɾΦϦδφϧͰ͸Πϯλʔωοτ޿ࠂͷσʔλج൫ΛҰॹʹҭͯΔ σʔλΤϯδχΞΛืू͍ͯ͠·͢

Slide 37

Slide 37 text

͝੩ௌ͋Γ͕ͱ͏͍͟͝·ͨ͠