Upgrade to Pro — share decks privately, control downloads, hide ads and more …

限りなくストリーミングに近いバッチ処理を目指して / #DPCT 20190416

kyontan
April 16, 2019

限りなくストリーミングに近いバッチ処理を目指して / #DPCT 20190416

Data Pipeline Casual Talk Vol.2 (データパイプラインに関する知見をカジュアルに語る会) の発表資料です。
https://dpct.connpass.com/event/121371

kyontan

April 16, 2019
Tweet

More Decks by kyontan

Other Decks in Programming

Transcript

  1. όονॲཧΛ
    ετϦʔϛϯάʹ͍ۙ
    ໨ࢦͯ͠
    ݶΓͳ͘

    View Slide

  2. → 

    !2

    View Slide


  3. !3

    View Slide

  4. → 


    〜 々

    !4

    View Slide

  5. !5

    View Slide

  6. !6


    View Slide



  7. !7

    View Slide

  8. !8

    View Slide

  9. !8

    View Slide


  10. !9

    View Slide



  11. !10

    View Slide

  12. !11

    View Slide

  13. !12





    View Slide

  14. !13


    View Slide

  15. !14

    View Slide

  16. !15

    View Slide

  17. !16



    View Slide

  18. !17


    View Slide

  19. !18


    View Slide

  20. !19

    View Slide

  21. !20

    View Slide

  22. !21
    updated_at id
    id, created_at

    View Slide

  23. !22

    View Slide

  24. !23
    SELECT *

    FROM table

    WHERE updated_at > :last_updated_at

    View Slide

  25. !24
    SELECT *

    FROM table

    WHERE updated_at > :last_updated_at


    View Slide

  26. !25
    updated_at id

    id, created_at

    View Slide

  27. !26

    View Slide

  28. !27


    〜


    View Slide

  29. !28

    View Slide

  30. !29


    〜


    View Slide

  31. !30

    View Slide

  32. !31

    View Slide

  33. !32

    SELECT
    * EXCEPT(rn)
    FROM (
    SELECT
    *,
    row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM (
    SELECT * FROM dataset.diff -- diff
    UNION ALL
    SELECT * FROM dataset.master -- destination
    )
    )
    WHERE rn = 1

    View Slide

  34. bq query --destination_table=dataset.master " "
    !33
    SELECT
    * EXCEPT(rn)
    FROM (
    SELECT
    *,
    row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM (
    SELECT * FROM dataset.diff -- diff
    UNION ALL
    SELECT * FROM dataset.master -- destination
    )
    )
    WHERE rn = 1
    ɾɾɾ

    View Slide

  35. dataset.master 

    !34
    bq query --destination_table=dataset.master "..."
    dataset

    .master

    View Slide

  36. dataset.master 

    !35
    bq query --destination_table=dataset.master "..."
    dataset

    .master

    View Slide

  37. INSERT, UPDATE, DELETE, MERGE


    !36

    View Slide


  38. !37
    MERGE dataset.master T
    USING (
    SELECT * EXCEPT(rn)
    FROM (
    SELECT
    *,
    row_number() over (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM dataset.diff)
    WHERE rn = 1
    ) S
    ON T.id = S.id
    WHEN MATCHED AND T.updated_at < S.updated_at THEN
    UPDATE SET id = S.id, ..., updated_at = S.updated_at
    WHEN NOT MATCHED THEN
    INSERT (id, ..., updated_at) VALUES (id, ..., updated_at)

    View Slide


  39. !38
    MERGE dataset.master T
    USING (
    SELECT * EXCEPT(rn)
    FROM (
    SELECT
    *,
    row_number() over (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM dataset.diff)
    WHERE rn = 1
    ) S
    ON T.id = S.id
    WHEN MATCHED AND T.updated_at < S.updated_at THEN
    UPDATE SET id = S.id, ..., updated_at = S.updated_at
    WHEN NOT MATCHED THEN
    INSERT (id, ..., updated_at) VALUES (id, ..., updated_at)
    MERGE 


    View Slide

  40. _YYYYMMDD
    updated_at

    !39
    INSERT dataset.master (id, ..., updated_at)
    SELECT id, ..., updated_at
    FROM dataset.diff

    View Slide

  41. !40

    View Slide

  42. !41
    {“id”: 37, “name”: “Sana Natori”}
    {“id”: 418, “name” “Yuni Akatsuki”}

    View Slide

  43. !42


    View Slide

  44. !43

    View Slide

  45. !44
    ɾ“2019-01-01 01:02:03 JST”
    ɾ“2019-01-01 01:02:03+0900”
    ɾ“2019-01-01 01:02:03+09:00”
    ɾ“2019-01-01 01:02:03 +0900”
    ɾ“2019-01-01 01:02:03 +09:00”

    View Slide

  46. !45

    View Slide

  47. !46


    ɾgsutil -m \

    -o 'GSUtil:parallel_process_count=X'

    -o 'GSUtil:parallel_thread_count=Y' \

    cp ...
    ɾ gsutil -m cp ...

    View Slide

  48. !47

    View Slide

  49. !48
    → 

    natr: NeAr realTime meRger

    View Slide

  50. !49





    View Slide

  51. !50





    View Slide

  52. !51


    View Slide

  53. !52
    → 




    View Slide

  54. !53

    View Slide

  55. !54


    View Slide

  56. !55

    View Slide

  57. !56

    View Slide


  58. !57

    View Slide

  59. !58

    View Slide

  60. !59


    View Slide

  61. !60

    View Slide

  62. !61






    View Slide