Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
BigQuery Schema Migration #bq_sushi
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Naotoshi Seo
April 08, 2016
Technology
6.2k
2
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
BigQuery Schema Migration #bq_sushi
Naotoshi Seo
April 08, 2016
More Decks by Naotoshi Seo
See All by Naotoshi Seo
ZOZOTOWNリプレイス2020
sonots
5
39k
Red Chainer and Cumo: Practical Deep Learning in Ruby at RubyKaigi 2019
sonots
1
4.8k
Introduction of Cumo, and Integration to Red Chainer
sonots
1
1.2k
Implementation of Cumo, a CUDA-aware version of Ruby/Numo
sonots
1
2.1k
Fast Numerical Computing and Deep Learning in Ruby with Cumo
sonots
0
10k
CuPy improvments around memory
sonots
3
1.8k
DeNA AIシステム部におけるクラウドを活用した機械学習基盤の構築
sonots
4
6.4k
Triglav - Data Driven Workflow Tool
sonots
1
4.3k
DeNA流データエンジニアリングの極意
sonots
17
13k
Other Decks in Technology
See All in Technology
AIっぽい文章を採点して人間らしく直すアプリを作ってみた
yama3133
2
110
2026TECHFRESH畢業分享會 - 原生還是跨平台? App 開發踩坑實錄
line_developers_tw
PRO
0
580
新しいVibe Codingと”自走”について
watany
5
270
AI-DLCを活用した高品質・安全なAI駆動開発実践 / AI Driven Development with AI-DLC
yoshidashingo
0
160
MIERUNE JCT 発表資料「宇宙から伊能忠敬ごっこ」
syuchimu
0
200
10倍の生産性を実現するAI駆動並列エージェントのすべて
kumaiu
4
1.3k
EventBridge Connection
_kensh
5
680
Reliability in the Age of AI: Engineering for AI Velocity
rrreeeyyy
0
120
2026TECHFRESH畢業分享會 - Lightning Talk - 資料也要 CI/CD? 用 Airbyte 自動化資料同步
line_developers_tw
PRO
0
570
ChatworkとBPaaS 異なる特性で学んだAI機能開発の ベストプラクティス
kubell_hr
2
3.4k
FDE という解 ― 暗黙知と明示知をつなぐ、伴走型エンジニアリング ―
otanet
0
110
"何を作るか"を任される エンジニアは、どう育つのか
yutaokafuji
1
500
Featured
See All Featured
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
47
8.2k
Pawsitive SEO: Lessons from My Dog (and Many Mistakes) on Thriving as a Consultant in the Age of AI
davidcarrasco
0
160
The B2B funnel & how to create a winning content strategy
katarinadahlin
PRO
1
380
Making Projects Easy
brettharned
120
6.7k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.3k
Accessibility Awareness
sabderemane
1
130
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
160
RailsConf 2023
tenderlove
30
1.5k
Abbi's Birthday
coloredviolet
2
8k
Amusing Abliteration
ianozsvald
1
200
Site-Speed That Sticks
csswizardry
13
1.2k
Transcript
#JH2VFSZͷςʔϒϧΛ .JHSBUF ΧϥϜՃɺআɺ ܕมߋ ͢Δ 2016/04/08 @sonots #bq_sushi 3
ࣗݾհ • ඌར @sonots • DeNA ੳج൫ • Fluentd ίϛολ
• Ruby ίϛολ • ࠷ۙ embulk ۀ • embulk-output-bigquery • embulk-filter-column, etc
• 4݄23ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA
/ Cookpad ͷࣄྫ
ΞδΣϯμ • ฐࣾͰͷ BigQuery ར༻ • εΩʔϚมߋͷඞཁੑ • BigQuery ʹ͓͚ΔεΩʔϚมߋͷࠔ͞
• εΩʔϚมߋͷઓུ
ฐࣾͰͷ BigQuery ར༻ • West (US) Ͱ̍Ҏ্લ͔Βར༻ • JP ͰϘνϘν͍࢝Ί͍ͯΔ
• σʔλҠߦπʔϧ࡞ͬͯΔ • hdfs2bigquery • vertica2bigquery • bigquery2hdfs • bigquery2vertica
ฐࣾͰͷੳۀ • σʔλҠߦ Hadoop/Vertica ӡ༻ج൫νʔϜ • ੳۀΞφϦετ͕ߦ͏ • BigQuery ʹΫΤϦΛ͛ΔͷΞφϦετ
εΩʔϚมߋͷඞཁੑ(1) • ϩάʹΧϥϜ͕Ճ͞Εͨ • ͬͺΓΧϥϜ͕ফ͞Εͨ • ΧϥϜͷܕΛؒҧ͑ͨ • INTEGER ͬΆ͍ͱࢥͬͯͨΒ
11,12 Έ͍ͨͳ ͕ೖͬͯΔߦ͕͋ͬͯ STRING ͡Όͳ͍ͱμϝ ͩͬͨͱ͔͋Δ͋Δ
εΩʔϚมߋͷඞཁੑ(2) • BigQuery ςʔϒϧ໊ϕετϓϥΫςΟε • ςʔϒϧ໊લஔࢺ_ˋY%m%d • ຖ৽͘͠ςʔϒϧΛ࡞Δ • ຖεΩʔϚ࠶ఆٛͷνϟϯε͕͋Δ
• εΩʔϚมߋ͠ͳͯ͘ྑ͍͡ΌΜʁ
εΩʔϚมߋͷඞཁੑ(3) • ̎ͭͷςʔϒϧͰΧϥϜͷܕ͕ҧ͏ͱΤϥʔʂ SELECT name FROM TABLE_DATE_RANGE(data.people_, TIMESTAMP('2014-03-26'), TIMESTAMP('2014-03-27')) WHERE
age >= 35 • Ωϟετ͢Δͱ͍͏ख͋Δ͕ɺੳ࣌ଞͷ͜ ͱʹ಄Λ͍͍ͨͷͰආ͚͍ͨ
εΩʔϚมߋπʔϧͷఏڙ • ͋Δ͖࢟(εΩʔϚ)Λఏࣔ͢Δͱɺ • ΧϥϜͷՃ • ΧϥϜͷআ • ΧϥϜͷܕมߋ •
Λࣗಈผͯ͠ɺεΩʔϚมߋͰ͖ΔΑ͏ʹ ͍ͯ͋͛ͨ͠
BigQuery ʹ͓͚Δ εΩʔϚมߋͷࠔ͞
εΩʔϚมߋͷࠔ • BigQuery ʹ ALTER TABLE ͕ͳ͍ • ΧϥϜՃͷAPI͋Δ •
ΧϥϜআɺܕมߋͷ API ͕ͳ͍ Ͳ͏͢Δ͔ʁͱ͍͏
ΧϥϜՃ • patch_table (or update_table) API ͰͰ͖Δ client.patch_table(project_id, dataset_id, table_id,
{ schema: { fields: [ {name:"time", type:"TIMESTAMP", mode:"NULLABLE"}, {name:"id", type:"INTEGER", mode:"NULLABLE"}, ] } }) google-api-ruby-client
ΧϥϜՃ(ҙ) • ͢Ͱʹ͋ΔεΩʔϚ + Ճ͢ΔΧϥϜ • get_table ͰεΩʔϚΛऔಘͯ͠Ϛʔδͯ͛͠Δ response =
get_table(project_id, dataset_id, table_id) columns = response.schema.fields.map {|col| col.to_h } columns << {name:"id", type:"INTEGER", mode:"NULLABLE"} google-api-ruby-client
patch tables ͷ੍ • Ճͨ͠ΧϥϜඌʹՃ͞ΕΔ • ՃͰ͖Δͷ NULLABLE ·ͨ REPEATED
͚ͩ • mode: REQUIRED ͳΧϥϜ͕ՃͰ͖ͳ͍ • มߋ REQUIRED => NULLABLE ͚ͩ • NULLABLE Λ REQUIRED ʹͰ͖ͳ͍ • REPEATED ʹͰ͖ͳ͍
ΧϥϜআɺܕมߋ
ΧϥϜআɺܕมߋ • ̎ͭͷઓུ • (1) export & filter & load
• (2) select & copy
(1) export & filter & load • gcs ʹ export
• embulk ΒͳʹΒͰ download ͭͭ͠Λม ͢Δ filtering ॲཧΛߦ͏ • BQ ʹ load ͠ͳ͓͢
(1) export & filter & load • ར • ՝ۚ͞Εͳ͍
• ܽ • ҰϩʔΧϧʹμϯϩʔυ্ͯ͛͢͠ ͜ͱʹͳΔͷͰඇৗʹ͍ɻɻɻ
(2) select & copy • insert_job API ʹ query ͱ
destination_table Λࢦఆ insert_job(project_id, { configuration: { query: { query:"SELECT ... FROM [...]", destination_table: { dataset_id: dataset_id, table_id: table_id }, } } }, {})
(2) select & copy ͷྫ SELECT STRING(business_id) AS business_id, STRING(full_address)
AS full_address, schools, BOOLEAN(open) AS open, FROM [dataset_id.table_id] • ΩϟετͰܕมߋ • ੍: ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ • ࢦఆ͠ͳ͔ͬͨΧϥϜআ͞ΕΔ
(2) select & copy • ར • ͍ • ܽ
• ՝ۚ͞ΕΔ
(2) select & copy Λ࠾༻ (1) export & filter &
load ΔͳΒ HDFS͔ΒσʔλLoadΓͯ࣌ؒ͋͠·ΓมΘΒͳ͍ ...
Further dive into select & copy
RECORD ܕͷΩϟετํ๏ SELECT INTEGER(votes.funny) AS votes.funny, INTEGER(votes.useful) AS votes.useful, INTEGER(votes.cool)
AS votes.cool, FROM [dataset_id.table_id] • υοτ۠ΓͰࢦఆ͢Δ
select & copy ͰͷΧϥϜՃ SELECT column1, INTEGER(NULL) AS column2, INTEGER(NULL)
AS (record.column3), FROM [dataset_id.table_id] • INTEGER ܕͷ column2 ΛՃ • RECORD ܕͷ record ΧϥϜͰͳ͘ record_column3 ͱ͍͏໊લͷΧϥϜ͕Ͱ͖Δɻɻɻ • patch table API ͬͨ΄͏͕ྑͦ͞͏ɻɻɻ
mode: REPEATED ΧϥϜͷࢦఆ • SELECT ͰࢦఆͰ͖Δ REPEATED ΧϥϜ̍ͭͩ ͚ɺͳͲͷ੍͕͋ͬͨΓ •
SELECT ͢Δͱߦ͕૿͑ΔͷͰɺREPEATED Χϥ Ϝͷͳ͍ߦ͕૿͑ͨςʔϒϧΛ࡞Δ͜ͱʹͳΔ • Ͳ͏ݫͦ͠͏
Atomic ͳςʔϒϧͷஔ • ௨ৗͷઓུ • มલͷςʔϒϧ => มޙ • atomic
ʹ swap • BigQuery ʹ rename ͕ͳ͍ʂແཧʂʁ • copy ͷ destination_table Λࣗࣗʹࢦఆ • atomic ʹ swap ͞ΕΔʂʂ
·ͱΊ
·ͱΊΔͱ • ΧϥϜͷՃ͕ඞཁͳ߹ɺ·ͣ patch table • ΧϥϜͷআɺ·ͨܕมߋ͕ඞཁͳ߹ɺ ͔ͦ͜Β͞Βʹ select &
copy • copyઌࣗࣗΛࢦఆ͢Δ͜ͱͰ atomic ʹ swap Ͱ͖Δ
੍ • mode: REPEATED ΧϥϜΛѻ͑ͳ͍ • mode: NULLABLE ΧϥϜͷΈՃՄೳ •
ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ
https://github.com/sonots/bigquery_migration/
͍ํ require 'bigquery_migration' config = { json_keyfile: '/path/to/your-project-000.json' dataset: 'your_dataset_name'
table: 'your_table_name' } columns = [ { name: 'string', type: 'STRING' }, { name: 'record', type: 'RECORD', fields: [ { name: 'integer', type: 'INTEGER' }, { name: 'timestamp', type: 'TIMESTAMP' }, ] } ] migrator = BigqueryMigration.new(config) migrator.migrate_table(columns: columns)
• 4݄23ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA
/ Cookpad ͷࣄྫ