Upgrade to Pro — share decks privately, control downloads, hide ads and more …

限りなくストリーミングに近いバッチ処理を目指して / #DPCT 20190416

限りなくストリーミングに近いバッチ処理を目指して / #DPCT 20190416

Data Pipeline Casual Talk Vol.2 (データパイプラインに関する知見をカジュアルに語る会) の発表資料です。
https://dpct.connpass.com/event/121371

308698f0f90a4a88c8b82432afacef2a?s=128

kyontan

April 16, 2019
Tweet

Transcript

  1. όονॲཧΛ ετϦʔϛϯάʹ͍ۙ ໨ࢦͯ͠ ݶΓͳ͘

  2. → 
 !2

  3. 
 !3

  4. → 
 
 〜 々 
 !4

  5. !5 …

  6. !6 … …

  7. 
 
 !7

  8. !8

  9. !8

  10. … !9

  11. 
 
 !10

  12. !11

  13. !12 
 〜 
 〜 〜

  14. !13 
 


  15. !14 


  16. !15 …

  17. !16 〜 〜 〜

  18. !17 ← →

  19. !18 ! →

  20. !19

  21. !20

  22. !21 updated_at id id, created_at

  23. !22 →

  24. !23 SELECT *
 FROM table
 WHERE updated_at > :last_updated_at

  25. !24 SELECT *
 FROM table
 WHERE updated_at > :last_updated_at 


  26. !25 updated_at id … id, created_at …

  27. !26

  28. !27 
 〜 〜
 〜

  29. !28

  30. !29 
 〜 〜
 〜

  31. !30

  32. !31

  33. !32 
 SELECT * EXCEPT(rn) FROM ( SELECT *, row_number()

    OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn FROM ( SELECT * FROM dataset.diff -- diff UNION ALL SELECT * FROM dataset.master -- destination ) ) WHERE rn = 1
  34. bq query --destination_table=dataset.master " " !33 SELECT * EXCEPT(rn) FROM

    ( SELECT *, row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn FROM ( SELECT * FROM dataset.diff -- diff UNION ALL SELECT * FROM dataset.master -- destination ) ) WHERE rn = 1 ɾɾɾ
  35. dataset.master 
 !34 bq query --destination_table=dataset.master "..." dataset
 .master

  36. dataset.master 
 !35 bq query --destination_table=dataset.master "..." dataset
 .master

  37. INSERT, UPDATE, DELETE, MERGE 〜 → !36

  38. → !37 MERGE dataset.master T USING ( SELECT * EXCEPT(rn)

    FROM ( SELECT *, row_number() over (PARTITION BY id ORDER BY updated_at DESC) AS rn FROM dataset.diff) WHERE rn = 1 ) S ON T.id = S.id WHEN MATCHED AND T.updated_at < S.updated_at THEN UPDATE SET id = S.id, ..., updated_at = S.updated_at WHEN NOT MATCHED THEN INSERT (id, ..., updated_at) VALUES (id, ..., updated_at)
  39. → !38 MERGE dataset.master T USING ( SELECT * EXCEPT(rn)

    FROM ( SELECT *, row_number() over (PARTITION BY id ORDER BY updated_at DESC) AS rn FROM dataset.diff) WHERE rn = 1 ) S ON T.id = S.id WHEN MATCHED AND T.updated_at < S.updated_at THEN UPDATE SET id = S.id, ..., updated_at = S.updated_at WHEN NOT MATCHED THEN INSERT (id, ..., updated_at) VALUES (id, ..., updated_at) MERGE 

  40. _YYYYMMDD updated_at → !39 INSERT dataset.master (id, ..., updated_at) SELECT

    id, ..., updated_at FROM dataset.diff
  41. !40

  42. !41 {“id”: 37, “name”: “Sana Natori”} {“id”: 418, “name” “Yuni

    Akatsuki”}
  43. !42 〜 →

  44. !43

  45. !44 ɾ“2019-01-01 01:02:03 JST” ɾ“2019-01-01 01:02:03+0900” ɾ“2019-01-01 01:02:03+09:00” ɾ“2019-01-01 01:02:03

    +0900” ɾ“2019-01-01 01:02:03 +09:00”
  46. !45

  47. !46 → → ɾgsutil -m \
 -o 'GSUtil:parallel_process_count=X'
 -o 'GSUtil:parallel_thread_count=Y'

    \
 cp ... ɾ gsutil -m cp ...
  48. !47

  49. !48 → 
 natr: NeAr realTime meRger

  50. !49 → → → 〜 →

  51. !50 → → → 〜 →

  52. !51 
 


  53. !52 → 
 〜 
 〜

  54. !53 …

  55. !54 → 


  56. !55

  57. !56 


  58. → !57 〜

  59. !58

  60. !59 
 


  61. !60

  62. !61 ✅ ✅ ✅ ✅ 
 …