Slide 1

Slide 1 text

#JH2VFSZͷςʔϒϧΛ .JHSBUF ΧϥϜ௥Ճɺ࡟আɺ ܕมߋ ͢Δ 2016/04/08 @sonots #bq_sushi 3

Slide 2

Slide 2 text

ࣗݾ঺հ • ੉ඌ௚ར @sonots • DeNA ෼ੳج൫ • Fluentd ίϛολ • Ruby ίϛολ • ࠷ۙ͸ embulk ۀ • embulk-output-bigquery • embulk-filter-column, etc

Slide 3

Slide 3 text

• 4݄23೔ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA / Cookpad ͷࣄྫ

Slide 4

Slide 4 text

ΞδΣϯμ • ฐࣾͰͷ BigQuery ར༻ • εΩʔϚมߋͷඞཁੑ • BigQuery ʹ͓͚ΔεΩʔϚมߋͷࠔ೉͞ • εΩʔϚมߋͷઓུ

Slide 5

Slide 5 text

ฐࣾͰͷ BigQuery ར༻ • West (US) Ͱ̍೥Ҏ্લ͔Βར༻ • JP Ͱ΋ϘνϘν࢖͍࢝Ί͍ͯΔ • σʔλҠߦπʔϧ࡞ͬͯΔ • hdfs2bigquery • vertica2bigquery • bigquery2hdfs • bigquery2vertica

Slide 6

Slide 6 text

ฐࣾͰͷ෼ੳۀ຿ • σʔλҠߦ΍ Hadoop/Vertica ӡ༻͸ج൫νʔϜ • ෼ੳۀ຿͸ΞφϦετ͕ߦ͏ • BigQuery ʹΫΤϦΛ౤͛Δͷ͸ΞφϦετ

Slide 7

Slide 7 text

εΩʔϚมߋͷඞཁੑ(1) • ϩάʹΧϥϜ͕௥Ճ͞Εͨ • ΍ͬͺΓΧϥϜ͕ফ͞Εͨ • ΧϥϜͷܕΛؒҧ͑ͨ • INTEGER ͬΆ͍ͱࢥͬͯͨΒ 11,12 Έ͍ͨͳ஋ ͕ೖͬͯΔߦ͕͋ͬͯ STRING ͡Όͳ͍ͱμϝ ͩͬͨͱ͔͋Δ͋Δ

Slide 8

Slide 8 text

εΩʔϚมߋͷඞཁੑ(2) • BigQuery ςʔϒϧ໊ϕετϓϥΫςΟε • ςʔϒϧ໊લஔࢺ_ˋY%m%d • ຖ೔৽͘͠ςʔϒϧΛ࡞Δ • ຖ೔εΩʔϚ࠶ఆٛͷνϟϯε͕͋Δ • εΩʔϚมߋ͠ͳͯ͘΋ྑ͍͡ΌΜʁ

Slide 9

Slide 9 text

εΩʔϚมߋͷඞཁੑ(3) • ̎ͭͷςʔϒϧͰΧϥϜͷܕ͕ҧ͏ͱΤϥʔʂ SELECT name FROM TABLE_DATE_RANGE(data.people_, TIMESTAMP('2014-03-26'), TIMESTAMP('2014-03-27')) WHERE age >= 35 • Ωϟετ͢Δͱ͍͏ख͸͋Δ͕ɺ෼ੳ࣌͸ଞͷ͜ ͱʹ಄Λ࢖͍͍ͨͷͰආ͚͍ͨ

Slide 10

Slide 10 text

εΩʔϚมߋπʔϧͷఏڙ • ͋Δ΂͖࢟(εΩʔϚ)Λఏࣔ͢Δͱɺ • ΧϥϜͷ௥Ճ • ΧϥϜͷ࡟আ • ΧϥϜͷܕมߋ • Λࣗಈ൑ผͯ͠ɺεΩʔϚมߋͰ͖ΔΑ͏ʹ ͍ͯ͋͛ͨ͠

Slide 11

Slide 11 text

BigQuery ʹ͓͚Δ εΩʔϚมߋͷࠔ೉͞

Slide 12

Slide 12 text

εΩʔϚมߋͷࠔ೉ • BigQuery ʹ͸ ALTER TABLE ͕ͳ͍ • ΧϥϜ௥ՃͷAPI͸͋Δ • ΧϥϜ࡟আɺܕมߋͷ API ͕ͳ͍ Ͳ͏͢Δ͔ʁͱ͍͏࿩

Slide 13

Slide 13 text

ΧϥϜ௥Ճ • patch_table (or update_table) API ͰͰ͖Δ client.patch_table(project_id, dataset_id, table_id, { schema: { fields: [ {name:"time", type:"TIMESTAMP", mode:"NULLABLE"}, {name:"id", type:"INTEGER", mode:"NULLABLE"}, ] } }) google-api-ruby-client

Slide 14

Slide 14 text

ΧϥϜ௥Ճ(஫ҙ఺) • ͢Ͱʹ͋ΔεΩʔϚ + ௥Ճ͢ΔΧϥϜ • get_table ͰεΩʔϚΛऔಘͯ͠Ϛʔδͯ͠౤͛Δ response = get_table(project_id, dataset_id, table_id) columns = response.schema.fields.map {|col| col.to_h } columns << {name:"id", type:"INTEGER", mode:"NULLABLE"} google-api-ruby-client

Slide 15

Slide 15 text

patch tables ͷ੍໿ • ௥Ճͨ͠ΧϥϜ͸຤ඌʹ௥Ճ͞ΕΔ • ௥ՃͰ͖Δͷ͸ NULLABLE ·ͨ͸ REPEATED ͚ͩ • mode: REQUIRED ͳΧϥϜ͕௥ՃͰ͖ͳ͍ • มߋ͸ REQUIRED => NULLABLE ͚ͩ • NULLABLE Λ REQUIRED ʹͰ͖ͳ͍ • REPEATED ʹ΋Ͱ͖ͳ͍

Slide 16

Slide 16 text

ΧϥϜ࡟আɺܕมߋ

Slide 17

Slide 17 text

ΧϥϜ࡟আɺܕมߋ • ̎ͭͷઓུ • (1) export & filter & load • (2) select & copy

Slide 18

Slide 18 text

(1) export & filter & load • gcs ʹ export • embulk ΍Βͳʹ΍ΒͰ download ͭͭ͠஋Λม ׵͢Δ filtering ॲཧΛߦ͏ • BQ ʹ load ͠ͳ͓͢

Slide 19

Slide 19 text

(1) export & filter & load • ར఺ • ՝ۚ͞Εͳ͍ • ܽ఺ • Ұ౓ϩʔΧϧʹμ΢ϯϩʔυ্ͯ͛͠௚͢ ͜ͱʹͳΔͷͰඇৗʹ஗͍ɻɻɻ

Slide 20

Slide 20 text

(2) select & copy • insert_job API ʹ query ͱ destination_table Λࢦఆ insert_job(project_id, { configuration: { query: { query:"SELECT ... FROM [...]", destination_table: { dataset_id: dataset_id, table_id: table_id }, } } }, {})

Slide 21

Slide 21 text

(2) select & copy ͷྫ SELECT STRING(business_id) AS business_id, STRING(full_address) AS full_address, schools, BOOLEAN(open) AS open, FROM [dataset_id.table_id] • ΩϟετͰܕมߋ • ੍໿: ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ • ࢦఆ͠ͳ͔ͬͨΧϥϜ͸࡟আ͞ΕΔ

Slide 22

Slide 22 text

(2) select & copy • ར఺ • ଎͍ • ܽ఺ • ՝ۚ͞ΕΔ

Slide 23

Slide 23 text

(2) select & copy Λ࠾༻ (1) export & filter & load ΍ΔͳΒ HDFS͔ΒσʔλLoad΍Γ௚ͯ͠΋࣌ؒ͋·ΓมΘΒͳ͍ ...

Slide 24

Slide 24 text

Further dive into select & copy

Slide 25

Slide 25 text

RECORD ܕͷΩϟετํ๏ SELECT INTEGER(votes.funny) AS votes.funny, INTEGER(votes.useful) AS votes.useful, INTEGER(votes.cool) AS votes.cool, FROM [dataset_id.table_id] • υοτ۠੾ΓͰࢦఆ͢Δ

Slide 26

Slide 26 text

select & copy ͰͷΧϥϜ௥Ճ SELECT column1, INTEGER(NULL) AS column2, INTEGER(NULL) AS (record.column3), FROM [dataset_id.table_id] • INTEGER ܕͷ column2 Λ௥Ճ • RECORD ܕͷ record ΧϥϜͰ͸ͳ͘ record_column3 ͱ͍͏໊લͷΧϥϜ͕Ͱ͖Δɻɻɻ • patch table API ࢖ͬͨ΄͏͕ྑͦ͞͏ɻɻɻ

Slide 27

Slide 27 text

mode: REPEATED ΧϥϜͷࢦఆ • SELECT ͰࢦఆͰ͖Δ REPEATED ΧϥϜ͸̍ͭͩ ͚ɺͳͲͷ੍໿͕͋ͬͨΓ • SELECT ͢Δͱߦ਺͕૿͑ΔͷͰɺREPEATED Χϥ Ϝͷͳ͍ߦ਺͕૿͑ͨςʔϒϧΛ࡞Δ͜ͱʹͳΔ • Ͳ͏΋ݫͦ͠͏

Slide 28

Slide 28 text

Atomic ͳςʔϒϧͷஔ׵ • ௨ৗͷઓུ • ม׵લͷςʔϒϧ => ม׵ޙ • atomic ʹ swap • BigQuery ʹ͸ rename ͕ͳ͍ʂແཧʂʁ • copy ͷ destination_table Λࣗ෼ࣗ਎ʹࢦఆ • atomic ʹ swap ͞ΕΔʂʂ

Slide 29

Slide 29 text

·ͱΊ

Slide 30

Slide 30 text

·ͱΊΔͱ • ΧϥϜͷ௥Ճ͕ඞཁͳ৔߹͸ɺ·ͣ patch table • ΧϥϜͷ࡟আɺ·ͨ͸ܕมߋ͕ඞཁͳ৔߹͸ɺ ͔ͦ͜Β͞Βʹ select & copy • copyઌ͸ࣗ෼ࣗ਎Λࢦఆ͢Δ͜ͱͰ atomic ʹ swap Ͱ͖Δ

Slide 31

Slide 31 text

੍໿ • mode: REPEATED ΧϥϜΛѻ͑ͳ͍ • mode: NULLABLE ΧϥϜͷΈ௥ՃՄೳ • ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ

Slide 32

Slide 32 text

׬੒඼ https://github.com/sonots/bigquery_migration/

Slide 33

Slide 33 text

࢖͍ํ require 'bigquery_migration' config = { json_keyfile: '/path/to/your-project-000.json' dataset: 'your_dataset_name' table: 'your_table_name' } columns = [ { name: 'string', type: 'STRING' }, { name: 'record', type: 'RECORD', fields: [ { name: 'integer', type: 'INTEGER' }, { name: 'timestamp', type: 'TIMESTAMP' }, ] } ] migrator = BigqueryMigration.new(config) migrator.migrate_table(columns: columns)

Slide 34

Slide 34 text

• 4݄23೔ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA / Cookpad ͷࣄྫ