Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up
for free
BigQuery Schema Migration #bq_sushi
Naotoshi Seo
April 08, 2016
Technology
2
5.4k
BigQuery Schema Migration #bq_sushi
Naotoshi Seo
April 08, 2016
Tweet
Share
More Decks by Naotoshi Seo
See All by Naotoshi Seo
sonots
4
3.6k
sonots
1
2.4k
sonots
1
570
sonots
0
1.2k
sonots
0
6.9k
sonots
3
1.1k
sonots
4
5.1k
sonots
1
3.2k
sonots
16
11k
Other Decks in Technology
See All in Technology
subroh0508
4
230
go5paopao
8
720
imdigitallab
0
430
xecus
0
190
zak3
1
200
1027kg
0
200
yshr1200
0
170
shoichiron
1
140
miyake
1
430
torisoup
11
5.9k
binarymist
0
1.3k
robcrowley
0
150
Featured
See All Featured
morganepeng
18
1.2k
malarkey
392
61k
deanohume
294
28k
qrush
285
19k
philnash
9
580
rasmusluckow
318
18k
ammeep
655
54k
orderedlist
PRO
328
36k
sugarenia
233
850k
ufuk
56
5.4k
myddelton
109
11k
dougneiner
55
5.4k
Transcript
#JH2VFSZͷςʔϒϧΛ .JHSBUF ΧϥϜՃɺআɺ ܕมߋ ͢Δ 2016/04/08 @sonots #bq_sushi 3
ࣗݾհ • ඌར @sonots • DeNA ੳج൫ • Fluentd ίϛολ
• Ruby ίϛολ • ࠷ۙ embulk ۀ • embulk-output-bigquery • embulk-filter-column, etc
• 4݄23ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA
/ Cookpad ͷࣄྫ
ΞδΣϯμ • ฐࣾͰͷ BigQuery ར༻ • εΩʔϚมߋͷඞཁੑ • BigQuery ʹ͓͚ΔεΩʔϚมߋͷࠔ͞
• εΩʔϚมߋͷઓུ
ฐࣾͰͷ BigQuery ར༻ • West (US) Ͱ̍Ҏ্લ͔Βར༻ • JP ͰϘνϘν͍࢝Ί͍ͯΔ
• σʔλҠߦπʔϧ࡞ͬͯΔ • hdfs2bigquery • vertica2bigquery • bigquery2hdfs • bigquery2vertica
ฐࣾͰͷੳۀ • σʔλҠߦ Hadoop/Vertica ӡ༻ج൫νʔϜ • ੳۀΞφϦετ͕ߦ͏ • BigQuery ʹΫΤϦΛ͛ΔͷΞφϦετ
εΩʔϚมߋͷඞཁੑ(1) • ϩάʹΧϥϜ͕Ճ͞Εͨ • ͬͺΓΧϥϜ͕ফ͞Εͨ • ΧϥϜͷܕΛؒҧ͑ͨ • INTEGER ͬΆ͍ͱࢥͬͯͨΒ
11,12 Έ͍ͨͳ ͕ೖͬͯΔߦ͕͋ͬͯ STRING ͡Όͳ͍ͱμϝ ͩͬͨͱ͔͋Δ͋Δ
εΩʔϚมߋͷඞཁੑ(2) • BigQuery ςʔϒϧ໊ϕετϓϥΫςΟε • ςʔϒϧ໊લஔࢺ_ˋY%m%d • ຖ৽͘͠ςʔϒϧΛ࡞Δ • ຖεΩʔϚ࠶ఆٛͷνϟϯε͕͋Δ
• εΩʔϚมߋ͠ͳͯ͘ྑ͍͡ΌΜʁ
εΩʔϚมߋͷඞཁੑ(3) • ̎ͭͷςʔϒϧͰΧϥϜͷܕ͕ҧ͏ͱΤϥʔʂ SELECT name FROM TABLE_DATE_RANGE(data.people_, TIMESTAMP('2014-03-26'), TIMESTAMP('2014-03-27')) WHERE
age >= 35 • Ωϟετ͢Δͱ͍͏ख͋Δ͕ɺੳ࣌ଞͷ͜ ͱʹ಄Λ͍͍ͨͷͰආ͚͍ͨ
εΩʔϚมߋπʔϧͷఏڙ • ͋Δ͖࢟(εΩʔϚ)Λఏࣔ͢Δͱɺ • ΧϥϜͷՃ • ΧϥϜͷআ • ΧϥϜͷܕมߋ •
Λࣗಈผͯ͠ɺεΩʔϚมߋͰ͖ΔΑ͏ʹ ͍ͯ͋͛ͨ͠
BigQuery ʹ͓͚Δ εΩʔϚมߋͷࠔ͞
εΩʔϚมߋͷࠔ • BigQuery ʹ ALTER TABLE ͕ͳ͍ • ΧϥϜՃͷAPI͋Δ •
ΧϥϜআɺܕมߋͷ API ͕ͳ͍ Ͳ͏͢Δ͔ʁͱ͍͏
ΧϥϜՃ • patch_table (or update_table) API ͰͰ͖Δ client.patch_table(project_id, dataset_id, table_id,
{ schema: { fields: [ {name:"time", type:"TIMESTAMP", mode:"NULLABLE"}, {name:"id", type:"INTEGER", mode:"NULLABLE"}, ] } }) google-api-ruby-client
ΧϥϜՃ(ҙ) • ͢Ͱʹ͋ΔεΩʔϚ + Ճ͢ΔΧϥϜ • get_table ͰεΩʔϚΛऔಘͯ͠Ϛʔδͯ͛͠Δ response =
get_table(project_id, dataset_id, table_id) columns = response.schema.fields.map {|col| col.to_h } columns << {name:"id", type:"INTEGER", mode:"NULLABLE"} google-api-ruby-client
patch tables ͷ੍ • Ճͨ͠ΧϥϜඌʹՃ͞ΕΔ • ՃͰ͖Δͷ NULLABLE ·ͨ REPEATED
͚ͩ • mode: REQUIRED ͳΧϥϜ͕ՃͰ͖ͳ͍ • มߋ REQUIRED => NULLABLE ͚ͩ • NULLABLE Λ REQUIRED ʹͰ͖ͳ͍ • REPEATED ʹͰ͖ͳ͍
ΧϥϜআɺܕมߋ
ΧϥϜআɺܕมߋ • ̎ͭͷઓུ • (1) export & filter & load
• (2) select & copy
(1) export & filter & load • gcs ʹ export
• embulk ΒͳʹΒͰ download ͭͭ͠Λม ͢Δ filtering ॲཧΛߦ͏ • BQ ʹ load ͠ͳ͓͢
(1) export & filter & load • ར • ՝ۚ͞Εͳ͍
• ܽ • ҰϩʔΧϧʹμϯϩʔυ্ͯ͛͢͠ ͜ͱʹͳΔͷͰඇৗʹ͍ɻɻɻ
(2) select & copy • insert_job API ʹ query ͱ
destination_table Λࢦఆ insert_job(project_id, { configuration: { query: { query:"SELECT ... FROM [...]", destination_table: { dataset_id: dataset_id, table_id: table_id }, } } }, {})
(2) select & copy ͷྫ SELECT STRING(business_id) AS business_id, STRING(full_address)
AS full_address, schools, BOOLEAN(open) AS open, FROM [dataset_id.table_id] • ΩϟετͰܕมߋ • ੍: ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ • ࢦఆ͠ͳ͔ͬͨΧϥϜআ͞ΕΔ
(2) select & copy • ར • ͍ • ܽ
• ՝ۚ͞ΕΔ
(2) select & copy Λ࠾༻ (1) export & filter &
load ΔͳΒ HDFS͔ΒσʔλLoadΓͯ࣌ؒ͋͠·ΓมΘΒͳ͍ ...
Further dive into select & copy
RECORD ܕͷΩϟετํ๏ SELECT INTEGER(votes.funny) AS votes.funny, INTEGER(votes.useful) AS votes.useful, INTEGER(votes.cool)
AS votes.cool, FROM [dataset_id.table_id] • υοτ۠ΓͰࢦఆ͢Δ
select & copy ͰͷΧϥϜՃ SELECT column1, INTEGER(NULL) AS column2, INTEGER(NULL)
AS (record.column3), FROM [dataset_id.table_id] • INTEGER ܕͷ column2 ΛՃ • RECORD ܕͷ record ΧϥϜͰͳ͘ record_column3 ͱ͍͏໊લͷΧϥϜ͕Ͱ͖Δɻɻɻ • patch table API ͬͨ΄͏͕ྑͦ͞͏ɻɻɻ
mode: REPEATED ΧϥϜͷࢦఆ • SELECT ͰࢦఆͰ͖Δ REPEATED ΧϥϜ̍ͭͩ ͚ɺͳͲͷ੍͕͋ͬͨΓ •
SELECT ͢Δͱߦ͕૿͑ΔͷͰɺREPEATED Χϥ Ϝͷͳ͍ߦ͕૿͑ͨςʔϒϧΛ࡞Δ͜ͱʹͳΔ • Ͳ͏ݫͦ͠͏
Atomic ͳςʔϒϧͷஔ • ௨ৗͷઓུ • มલͷςʔϒϧ => มޙ • atomic
ʹ swap • BigQuery ʹ rename ͕ͳ͍ʂແཧʂʁ • copy ͷ destination_table Λࣗࣗʹࢦఆ • atomic ʹ swap ͞ΕΔʂʂ
·ͱΊ
·ͱΊΔͱ • ΧϥϜͷՃ͕ඞཁͳ߹ɺ·ͣ patch table • ΧϥϜͷআɺ·ͨܕมߋ͕ඞཁͳ߹ɺ ͔ͦ͜Β͞Βʹ select &
copy • copyઌࣗࣗΛࢦఆ͢Δ͜ͱͰ atomic ʹ swap Ͱ͖Δ
੍ • mode: REPEATED ΧϥϜΛѻ͑ͳ͍ • mode: NULLABLE ΧϥϜͷΈՃՄೳ •
ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ
https://github.com/sonots/bigquery_migration/
͍ํ require 'bigquery_migration' config = { json_keyfile: '/path/to/your-project-000.json' dataset: 'your_dataset_name'
table: 'your_table_name' } columns = [ { name: 'string', type: 'STRING' }, { name: 'record', type: 'RECORD', fields: [ { name: 'integer', type: 'INTEGER' }, { name: 'timestamp', type: 'TIMESTAMP' }, ] } ] migrator = BigqueryMigration.new(config) migrator.migrate_table(columns: columns)
• 4݄23ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA
/ Cookpad ͷࣄྫ