Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BigQuery Schema Migration #bq_sushi

BigQuery Schema Migration #bq_sushi

Naotoshi Seo

April 08, 2016
Tweet

More Decks by Naotoshi Seo

Other Decks in Technology

Transcript

  1. ࣗݾ঺հ • ੉ඌ௚ར @sonots • DeNA ෼ੳج൫ • Fluentd ίϛολ

    • Ruby ίϛολ • ࠷ۙ͸ embulk ۀ • embulk-output-bigquery • embulk-filter-column, etc
  2. ฐࣾͰͷ BigQuery ར༻ • West (US) Ͱ̍೥Ҏ্લ͔Βར༻ • JP Ͱ΋ϘνϘν࢖͍࢝Ί͍ͯΔ

    • σʔλҠߦπʔϧ࡞ͬͯΔ • hdfs2bigquery • vertica2bigquery • bigquery2hdfs • bigquery2vertica
  3. ΧϥϜ௥Ճ • patch_table (or update_table) API ͰͰ͖Δ client.patch_table(project_id, dataset_id, table_id,

    { schema: { fields: [ {name:"time", type:"TIMESTAMP", mode:"NULLABLE"}, {name:"id", type:"INTEGER", mode:"NULLABLE"}, ] } }) google-api-ruby-client
  4. ΧϥϜ௥Ճ(஫ҙ఺) • ͢Ͱʹ͋ΔεΩʔϚ + ௥Ճ͢ΔΧϥϜ • get_table ͰεΩʔϚΛऔಘͯ͠Ϛʔδͯ͠౤͛Δ response =

    get_table(project_id, dataset_id, table_id) columns = response.schema.fields.map {|col| col.to_h } columns << {name:"id", type:"INTEGER", mode:"NULLABLE"} google-api-ruby-client
  5. patch tables ͷ੍໿ • ௥Ճͨ͠ΧϥϜ͸຤ඌʹ௥Ճ͞ΕΔ • ௥ՃͰ͖Δͷ͸ NULLABLE ·ͨ͸ REPEATED

    ͚ͩ • mode: REQUIRED ͳΧϥϜ͕௥ՃͰ͖ͳ͍ • มߋ͸ REQUIRED => NULLABLE ͚ͩ • NULLABLE Λ REQUIRED ʹͰ͖ͳ͍ • REPEATED ʹ΋Ͱ͖ͳ͍
  6. (1) export & filter & load • gcs ʹ export

    • embulk ΍Βͳʹ΍ΒͰ download ͭͭ͠஋Λม ׵͢Δ filtering ॲཧΛߦ͏ • BQ ʹ load ͠ͳ͓͢
  7. (1) export & filter & load • ར఺ • ՝ۚ͞Εͳ͍

    • ܽ఺ • Ұ౓ϩʔΧϧʹμ΢ϯϩʔυ্ͯ͛͠௚͢ ͜ͱʹͳΔͷͰඇৗʹ஗͍ɻɻɻ
  8. (2) select & copy • insert_job API ʹ query ͱ

    destination_table Λࢦఆ insert_job(project_id, { configuration: { query: { query:"SELECT ... FROM [...]", destination_table: { dataset_id: dataset_id, table_id: table_id }, } } }, {})
  9. (2) select & copy ͷྫ SELECT STRING(business_id) AS business_id, STRING(full_address)

    AS full_address, schools, BOOLEAN(open) AS open, FROM [dataset_id.table_id] • ΩϟετͰܕมߋ • ੍໿: ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ • ࢦఆ͠ͳ͔ͬͨΧϥϜ͸࡟আ͞ΕΔ
  10. (2) select & copy Λ࠾༻ (1) export & filter &

    load ΍ΔͳΒ HDFS͔ΒσʔλLoad΍Γ௚ͯ͠΋࣌ؒ͋·ΓมΘΒͳ͍ ...
  11. select & copy ͰͷΧϥϜ௥Ճ SELECT column1, INTEGER(NULL) AS column2, INTEGER(NULL)

    AS (record.column3), FROM [dataset_id.table_id] • INTEGER ܕͷ column2 Λ௥Ճ • RECORD ܕͷ record ΧϥϜͰ͸ͳ͘ record_column3 ͱ͍͏໊લͷΧϥϜ͕Ͱ͖Δɻɻɻ • patch table API ࢖ͬͨ΄͏͕ྑͦ͞͏ɻɻɻ
  12. mode: REPEATED ΧϥϜͷࢦఆ • SELECT ͰࢦఆͰ͖Δ REPEATED ΧϥϜ͸̍ͭͩ ͚ɺͳͲͷ੍໿͕͋ͬͨΓ •

    SELECT ͢Δͱߦ਺͕૿͑ΔͷͰɺREPEATED Χϥ Ϝͷͳ͍ߦ਺͕૿͑ͨςʔϒϧΛ࡞Δ͜ͱʹͳΔ • Ͳ͏΋ݫͦ͠͏
  13. Atomic ͳςʔϒϧͷஔ׵ • ௨ৗͷઓུ • ม׵લͷςʔϒϧ => ม׵ޙ • atomic

    ʹ swap • BigQuery ʹ͸ rename ͕ͳ͍ʂແཧʂʁ • copy ͷ destination_table Λࣗ෼ࣗ਎ʹࢦఆ • atomic ʹ swap ͞ΕΔʂʂ
  14. ࢖͍ํ require 'bigquery_migration' config = { json_keyfile: '/path/to/your-project-000.json' dataset: 'your_dataset_name'

    table: 'your_table_name' } columns = [ { name: 'string', type: 'STRING' }, { name: 'record', type: 'RECORD', fields: [ { name: 'integer', type: 'INTEGER' }, { name: 'timestamp', type: 'TIMESTAMP' }, ] } ] migrator = BigqueryMigration.new(config) migrator.migrate_table(columns: columns)