Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Embulk / Presto / Sparkを用いたETL事情

kimutyam
February 15, 2018

Embulk / Presto / Sparkを用いたETL事情

PLAZMA OSS Day: TD Tech Talk 2018

kimutyam

February 15, 2018
Tweet

More Decks by kimutyam

Other Decks in Technology

Transcript

  1. PLAZMA OSS Day: TD Tech Talk 2018
    @kimutyam
    2018/02/15
    Embulk / Presto / Spark
    Λ༻͍ͨETLࣄ৘

    View Slide

  2. ໊લ໦ଜজ޺
    5XJUUFS!LJNVUZBN
    ձࣾηϓςʔχɾΦϦδφϧ
    ηϓςʔχͷσʔλج൫ߏஙϓϩδΣΫτΛاըɾ্ཱͪ͛ɻ
    ϓϩδΣΫτͰ͸ओʹΞʔΩςΫνϟߏஙɾ಺෦඼࣭؅ཧΛݗҾ
    &NCVML%JHEBH5SFBTVSF%BUB4DBMB%%%*$0/*9
    Profile

    View Slide

  3. "SDIJUFDUVSF
    )PX5P&5-
    %BUB$POOFDUPSPO5SFBTVSF8PSLqPX
    &NCVMLPO%JHEBH
    %BUB$POOFDUPS1SFTUPPO5SFBTVSF8PSLqPX
    4QBSL%BUB$POOFDUPS
    %FTJHO$PODFQU
    4VNNBSZ
    Agenda

    View Slide

  4. 1. Architecture

    View Slide

  5. Overview
    1MB[NB%# %BUB5BOL
    DynamoDB
    and so on…
    Lamdba
    EC2
    RDS for DataMart
    RDS for DataMart
    S3 for Image/Video
    EC2
    S3 for Data Lake
    Treasure Data for DWH
    EMR
    Internal System for Data Store
    API Gateway
    RDS

    View Slide

  6. Contents scope to talk about..
    1MB[NB%#
    and so on…
    S3 for Data Lake
    Treasure Data for DWH
    EMR
    Internal System for Data Store

    View Slide

  7. 6ODPOUSPMMBCMF &YUFSOBM%BUB

    ഔମࣾͷ޿ࠂߏ଄޿ࠂϨϙʔτӡ༻ϩά
    SE1BSUZ"E4FSWFSͷ࣮੷
    Business Restriction
    $POUSPMMBCMF *OUFSOBM%BUB

    ૊৫ɾਓͷجຊ৘ใ
    ސ٬৘ใ
    ঎گ
    ΞϓϦέʔγϣϯϩά
    and so on…
    ηϓςʔχͰ͸ඇৗʹଟ͘ͷഔମࣾͱSE1BSUZͷ֎෦σʔλΛϏδωεͰར༻
    ͠·͢ɻ֎෦σʔλͷϑΥʔϚοτ͸զʑͰ͸ίϯτϩʔϧ͕Ͱ͖·ͤΜɻ

    View Slide

  8. Transform(Parse, Filter, Enrich etc..)
    S3 for Data Lake
    Treasure Data for DWH
    EMR
    ଟ͘ͷ֎෦σʔλߏ଄Λ&5-ॲཧͰ
    σʔλ׆༻ऀʹ؆୯ͳ42-ͰӾཡ͞
    ͤΔ&5-ॲཧ
    ଟ͘ͷ৔߹ɺ5SBOTGPSNʹίετ͕
    ͔͔Γ·ͨ͠ɻ

    View Slide

  9. 2. How to ETL

    View Slide

  10. 2-1. Data Connector on Treasure Workflow

    View Slide

  11. Overview
    S3 for Data Lake
    Treasure Data for DWH
    Data Connector
    ࢫຯ
    &NCVMLͷϗεςΟϯάͰ͋Δ͜ͱ
    5SFBTVSF8PSLqPX্Ͱಈ࡞͢Δ͜ͱ
    ؆୯ͳ
    %4-Ͱهड़Ͱ͖Δ͜ͱ
    Ͱ͖ͳ͍͜ͱ ݱ࣌఺

    YNMͷ5SBOTGPSN
    ഑ྻͷߦల։
    [email protected]@BSSBZ͕ཉ͍͠

    View Slide

  12. IN / OUT
    {
    "user": {
    "id": 1,
    "name": "kimutyam",
    "age": 29
    },
    "request": {
    "params": {
    "user_id": 1
    }
    }
    }
    id name age
    1 kimutyam 29
    In Out

    View Slide

  13. ETL with Data Connector
    filters:
    - type: add_time
    to_column:
    name: time
    from_value:
    value: 0
    unix_timestamp_unit: sec
    - type: expand_json
    json_column_name: record
    root: "$."
    expanded_columns:
    - {name: "user.id", type: long}
    - {name: "user.name", type: string}
    - {name: "user.age", type: long}
    out:
    type: td
    in:
    type: s3
    bucket: hoge
    endpoint: s3-ap-northeast-1.amazonaws.com
    access_key_id: ${secret:s3.access_key_id}
    secret_access_key: ${secret:s3.secret_access_key}
    path_prefix: "/foo/bar/user.json"
    parser:
    type: json
    charset: UTF-8
    newline: LF
    ※ εϥΠυͷ౎߹্ɺ෼ׂ

    View Slide

  14. 2-2. Embulk on Digdag

    View Slide

  15. Overview
    S3 for Data Lake
    Treasure Data for DWH
    ࢫຯ
    044Ͱଟ͘ͷΤίγεςϜ͕ଘࡏ͢Δ ֦ுੑߴ

    044ʹͳ͚Ε͹ϓϥάΠϯͭ͘Ε͹͍͍
    ؆୯ͳ
    %4-Ͱهड़Ͱ͖Δ͜ͱ
    ೉఺
    %JHEBH4FSWFSʹෛՙ͕͔͔Δ
    FNCVMLFYFDVUPSNBQSFEVDFͰ෼ࢄ؀ڥʹ
    ಀ͕͢ํ๏͸͋Δ

    View Slide

  16. IN / OUT
    user_id friend_id name
    2 1 Kimutyam
    2 3 john
    {
    "user_id" : 2,
    "friends" : [ {
    "friend_id": 1,
    "name": "kimutyam"
    }, {
    "friend_id": 3,
    "name": “john"
    }]
    }
    ഑ྻΛߦʹల։
    In Out

    View Slide

  17. ETL with Embulk
    in:
    type: s3
    bucket: hoge
    endpoint: s3-ap-northeast-1.amazonaws.com
    access_key_id: ${secret:s3.access_key_id}
    secret_access_key: ${secret:s3.secret_access_key}
    path_prefix: /foo/bar/friends.json
    parser:
    type: json
    filters:
    - type: expand_json
    json_column_name: record
    root: $.
    expanded_columns:
    - {name: user_id, type: long}
    keep_expanding_json_column: true
    - type: expand_json_array
    json_column_name: record
    root: $.friends
    expanded_column_name: friends
    - type: expand_json
    json_column_name: friends
    root: $.
    expanded_columns:
    - {name: friend_id, type: long, path: id}
    - {name: name, type: string}
    - type: remove_columns
    remove: ["record"]
    - type: add_time
    to_column:
    name: time
    type: timestamp
    from_value:
    value: 0
    ※ εϥΠυͷ౎߹্ɺ෼ׂ
    out:
    type: td
    apikey: bar
    database: hoge
    table: foo

    View Slide

  18. [FYI] Develop Embulk Plugin with Scala (1)
    S3 for Data Lake
    Treasure Data for DWH
    [email protected]@TUBUT
    $POUSJCVUPSLJNVUZBN
    UXJUUFSBETTUBUTͷυϝΠϯಛԽܕͷ1MVHJO
    ׆༻Ϣʔεέʔε
    ଟஈͷωετͨ͠഑ྻΛಛఆͷ৚݅Ͱ'MBUUFO͍ͨ͠
    ಛఆͷ৚݅Ͱෳ਺ͷςʔϒϧʹ0VUQVU͍ͨ͠
    [email protected]@TUBUTͰύϥϝʔλ
    Λάϧʔϐϯάͯ͠+40/ܕΛੜ੒
    %BUB$POOFDUPSͷ&YQMPEF+TPO$PMVNOͰ
    5SFBTVSF%BUBʹൃՐ

    View Slide

  19. [FYI] Develop Embulk Plugin with Scala (2)
    S3 for Data Lake
    Treasure Data for DWH
    FNCVMLQBSTFSYQBUI
    $POUSJCVUPSNBKJ,:
    91BUI׬શରԠ
    +BWBඪ४ͷ9.-ύʔαʔΛར༻͍ͯ͠Δ

    View Slide

  20. 2-3. Data Connector & Presto on Treasure Workflow

    View Slide

  21. Overview
    Data Connector
    S3 for Data Lake
    Treasure Data for DWH
    ࢫຯ
    1SFTUP͸5SFBTVSF%BUBϗεςΟϯά
    %JHEBH4FSWFSෛՙܰݮ
    ࣗલͷ෼ࢄ؀ڥ͕ෆཁ
    ೉఺
    ϫʔΫϑϩʔͷهड़ྔ͕૿͑Δ
    1SFTUPΫΤϦ͕೉ಡԽ͢Δ৔߹΋

    View Slide

  22. IN / OUT
    Id name type value
    1 kimutyam Company Septeni Original,Inc.
    1 kimutyam Language Scala
    2 john Country America
    {
    "data": [
    {
    "id": 1,
    "name": "kimutyam",
    "details": [
    {
    "type": "Company",
    "value": "Septeni Original,Inc."
    },
    {
    "type": "Language",
    "value": "Scala"
    },
    ]
    },
    {
    "id": 2,
    "name": "john",
    "details": [
    {
    "type": "Country",
    "value": "America"
    }
    ]
    }
    ]
    }
    In Out

    View Slide

  23. Workflow
    timezone: UTC
    +load_raw_data:
    td_load>: raw_data.yml
    database: foo
    table: raw
    +transform_raw_data:
    td>: queries/insert.sql
    database: foo
    +drop_raw_table:
    td_ddl>:
    database: foo
    drop_tables: ["raw"]
    Data ConnectorͰJSONͷੜσʔλΛtd_load
    PrestoͰJSONղੳ
    ੜσʔλ࡟আ

    View Slide

  24. ETL with Data Connector
    filters:
    - type: add_time
    to_column:
    name: time
    from_value:
    value: 0
    unix_timestamp_unit: sec
    out:
    type: td
    in:
    type: s3
    bucket: hoge
    endpoint: s3-ap-northeast-1.amazonaws.com
    access_key_id: ${secret:s3.access_key_id}
    secret_access_key: ${secret:s3.secret_access_key}
    path_prefix: “/foo/bar/user_details.json”
    parser:
    type: json
    charset: UTF-8
    newline: LF
    ※ εϥΠυͷ౎߹্ɺ෼ׂ

    View Slide

  25. ETL with Presto
    SELECT
    time,
    id,
    name,
    CAST(detail['type'] AS VARCHAR) AS type,
    CAST(detail['value'] AS VARCHAR) AS value
    FROM
    details
    CROSS JOIN UNNEST(details) AS details(detail)
    INSERT INTO user_details
    WITH users AS (
    SELECT
    time,
    CAST(JSON_EXTRACT(payload, '$.data') AS ARRAY>) AS data
    FROM
    raw
    ), details AS (
    SELECT
    time,
    CAST(d['id'] AS INTEGER) AS id,
    CAST(d['name'] AS VARCHAR) AS name,
    CAST(d['details'] AS ARRAY>) AS details
    FROM
    users
    CROSS JOIN
    UNNEST(data) AS data(d)
    )
    ※ εϥΠυͷ౎߹্ɺ෼ׂ

    View Slide

  26. 2-4. Spark & Data Connector

    View Slide

  27. S3 for Data Lake
    Overview
    Treasure Data for DWH
    EMR
    ࢫຯ
    ෳ਺ͷग़ྗઌʹ&5-ॲཧ݁ՌΛ0VUQVUͰ͖Δ
    %JHEBH4FSWFSͷෛՙ෼ࢄ
    ཁ݅ʹԠͯ͡෼ࢄॲཧϑϨʔϜϫʔΫΛબ୒Ͱ͖Δ
    ೉఺ ݱ࣌఺

    ෼ࢄ؀ڥΛࣗલͰ༻ҙ͢Δඞཁ͕͋Δ
    ίʔυΛॻ͘ඞཁ͕͋Δ

    View Slide

  28. In / Out



    1
    kimutyam


    2
    john


    Id name
    1 kimutyam
    2 john
    In Out

    View Slide

  29. Workflow
    timezone: UTC
    +xml_to_csv:
    sh>: xml_to_csv.sh
    +load_csv:
    td_load>: csv_load.yml
    database: foo
    table: bar
    SparkͰData Lake͔ΒETLॲཧΛͯ͠CSVΛ2࣍੥͚ͷS3ʹ
    Data ConnectorͰ2࣍੥͚ͷS3͔ΒTDʹٵ্͍͛
    %JHEBHͷFNS͕BQOPSUIFBTUͰ͸ಈ࡞ͤͣ *TTVF

    ୅ΘΓʹsh>: ͰAWS CLI࢖ͬͯىಈɾఀࢭͨ͠

    View Slide

  30. ETL with Spark
    object SampleApp extends App{
    val inputPath = “s3://foo/bar/input/“
    val outputPath = “s3://foo/bar/output”
    LoanPatten.using(
    SparkSession.builder().appName("sample").master("local").getOrCreate()
    ) { ss =>
    val rowTag = "user"
    val df = ss.read
    .format("com.databricks.spark.xml")
    .option("rowTag", rowTag)
    .load(inputPath)
    df.select(
    "id",
    "name"
    )
    .toDF(
    "user_id",
    "name"
    )
    .coalesce(128)
    .write
    .mode(SaveMode.Overwrite)
    .option("header", true)
    .option("codec", "gzip")
    .csv(outputPath)
    }
    }
    object LoanPattern {
    type Stoppable = {def stop(): Unit}
    def using[R A): A =
    try {
    f(session)
    } finally {
    ignoring(classOf[Throwable]) apply {
    session.stop()
    }
    }
    }

    View Slide

  31. ETL with Data Connector
    in:
    type: s3
    bucket: hoge
    endpoint: s3-ap-northeast-1.amazonaws.com
    access_key_id: ${secret:s3.access_key_id}
    secret_access_key: ${secret:s3.secret_access_key}
    path_prefix: "/foo/bar/user.csv"
    parser:
    type: csv
    charset: UTF-8
    newline: CRLF
    delimiter: ","
    quote: '"'
    escape: '"'
    skip_header_lines: 1
    columns:
    - {name: user_id, type: long}
    - {name: name, type: string}
    out:
    type: td

    View Slide

  32. 2-5. Design Concept

    View Slide

  33. Overview

    View Slide

  34. 3. Summary

    View Slide

  35. Summary
    ηϓςʔχɾΦϦδφϧͰߦ͍ͬͯΔ&5-ॲཧʹ͍͓ͭͯ࿩͠·ͨ͠ɻ
    ηϓςʔχɾΦϦδφϧͰ͸σʔλυϦϒϯͳ૊৫ܦӦʹ஫ྗ͓ͯ͠Γɺ
    σʔλج൫։ൃ͕த֩ͷͭʹͳ͍ͬͯ·͢ɻ
    ࠓޙߦ͏͜ͱ
    σʔλ඼࣭ͷ޲্
    σʔλ׆༻ࣄྫͷ֦ு
    ૊৫಺σʔλͷू໿

    View Slide

  36. ηϓςʔχɾΦϦδφϧͰ͸Πϯλʔωοτ޿ࠂͷσʔλج൫ΛҰॹʹҭͯΔ
    σʔλΤϯδχΞΛืू͍ͯ͠·͢

    View Slide

  37. ͝੩ௌ͋Γ͕ͱ͏͍͟͝·ͨ͠

    View Slide