BigQuery Schema Migration #bq_sushi

BigQuery Schema Migration #bq_sushi

Aef92c3acc29ad8543e04135687fc4f1?s=128

Naotoshi Seo

April 08, 2016
Tweet

Transcript

  1. #JH2VFSZͷςʔϒϧΛ .JHSBUF ΧϥϜ௥Ճɺ࡟আɺ ܕมߋ ͢Δ 2016/04/08 @sonots #bq_sushi 3

  2. ࣗݾ঺հ • ੉ඌ௚ར @sonots • DeNA ෼ੳج൫ • Fluentd ίϛολ

    • Ruby ίϛολ • ࠷ۙ͸ embulk ۀ • embulk-output-bigquery • embulk-filter-column, etc
  3. • 4݄23೔ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA

    / Cookpad ͷࣄྫ
  4. ΞδΣϯμ • ฐࣾͰͷ BigQuery ར༻ • εΩʔϚมߋͷඞཁੑ • BigQuery ʹ͓͚ΔεΩʔϚมߋͷࠔ೉͞

    • εΩʔϚมߋͷઓུ
  5. ฐࣾͰͷ BigQuery ར༻ • West (US) Ͱ̍೥Ҏ্લ͔Βར༻ • JP Ͱ΋ϘνϘν࢖͍࢝Ί͍ͯΔ

    • σʔλҠߦπʔϧ࡞ͬͯΔ • hdfs2bigquery • vertica2bigquery • bigquery2hdfs • bigquery2vertica
  6. ฐࣾͰͷ෼ੳۀ຿ • σʔλҠߦ΍ Hadoop/Vertica ӡ༻͸ج൫νʔϜ • ෼ੳۀ຿͸ΞφϦετ͕ߦ͏ • BigQuery ʹΫΤϦΛ౤͛Δͷ͸ΞφϦετ

  7. εΩʔϚมߋͷඞཁੑ(1) • ϩάʹΧϥϜ͕௥Ճ͞Εͨ • ΍ͬͺΓΧϥϜ͕ফ͞Εͨ • ΧϥϜͷܕΛؒҧ͑ͨ • INTEGER ͬΆ͍ͱࢥͬͯͨΒ

    11,12 Έ͍ͨͳ஋ ͕ೖͬͯΔߦ͕͋ͬͯ STRING ͡Όͳ͍ͱμϝ ͩͬͨͱ͔͋Δ͋Δ
  8. εΩʔϚมߋͷඞཁੑ(2) • BigQuery ςʔϒϧ໊ϕετϓϥΫςΟε • ςʔϒϧ໊લஔࢺ_ˋY%m%d • ຖ೔৽͘͠ςʔϒϧΛ࡞Δ • ຖ೔εΩʔϚ࠶ఆٛͷνϟϯε͕͋Δ

    • εΩʔϚมߋ͠ͳͯ͘΋ྑ͍͡ΌΜʁ
  9. εΩʔϚมߋͷඞཁੑ(3) • ̎ͭͷςʔϒϧͰΧϥϜͷܕ͕ҧ͏ͱΤϥʔʂ SELECT name FROM TABLE_DATE_RANGE(data.people_, TIMESTAMP('2014-03-26'), TIMESTAMP('2014-03-27')) WHERE

    age >= 35 • Ωϟετ͢Δͱ͍͏ख͸͋Δ͕ɺ෼ੳ࣌͸ଞͷ͜ ͱʹ಄Λ࢖͍͍ͨͷͰආ͚͍ͨ
  10. εΩʔϚมߋπʔϧͷఏڙ • ͋Δ΂͖࢟(εΩʔϚ)Λఏࣔ͢Δͱɺ • ΧϥϜͷ௥Ճ • ΧϥϜͷ࡟আ • ΧϥϜͷܕมߋ •

    Λࣗಈ൑ผͯ͠ɺεΩʔϚมߋͰ͖ΔΑ͏ʹ ͍ͯ͋͛ͨ͠
  11. BigQuery ʹ͓͚Δ εΩʔϚมߋͷࠔ೉͞

  12. εΩʔϚมߋͷࠔ೉ • BigQuery ʹ͸ ALTER TABLE ͕ͳ͍ • ΧϥϜ௥ՃͷAPI͸͋Δ •

    ΧϥϜ࡟আɺܕมߋͷ API ͕ͳ͍ Ͳ͏͢Δ͔ʁͱ͍͏࿩
  13. ΧϥϜ௥Ճ • patch_table (or update_table) API ͰͰ͖Δ client.patch_table(project_id, dataset_id, table_id,

    { schema: { fields: [ {name:"time", type:"TIMESTAMP", mode:"NULLABLE"}, {name:"id", type:"INTEGER", mode:"NULLABLE"}, ] } }) google-api-ruby-client
  14. ΧϥϜ௥Ճ(஫ҙ఺) • ͢Ͱʹ͋ΔεΩʔϚ + ௥Ճ͢ΔΧϥϜ • get_table ͰεΩʔϚΛऔಘͯ͠Ϛʔδͯ͠౤͛Δ response =

    get_table(project_id, dataset_id, table_id) columns = response.schema.fields.map {|col| col.to_h } columns << {name:"id", type:"INTEGER", mode:"NULLABLE"} google-api-ruby-client
  15. patch tables ͷ੍໿ • ௥Ճͨ͠ΧϥϜ͸຤ඌʹ௥Ճ͞ΕΔ • ௥ՃͰ͖Δͷ͸ NULLABLE ·ͨ͸ REPEATED

    ͚ͩ • mode: REQUIRED ͳΧϥϜ͕௥ՃͰ͖ͳ͍ • มߋ͸ REQUIRED => NULLABLE ͚ͩ • NULLABLE Λ REQUIRED ʹͰ͖ͳ͍ • REPEATED ʹ΋Ͱ͖ͳ͍
  16. ΧϥϜ࡟আɺܕมߋ

  17. ΧϥϜ࡟আɺܕมߋ • ̎ͭͷઓུ • (1) export & filter & load

    • (2) select & copy
  18. (1) export & filter & load • gcs ʹ export

    • embulk ΍Βͳʹ΍ΒͰ download ͭͭ͠஋Λม ׵͢Δ filtering ॲཧΛߦ͏ • BQ ʹ load ͠ͳ͓͢
  19. (1) export & filter & load • ར఺ • ՝ۚ͞Εͳ͍

    • ܽ఺ • Ұ౓ϩʔΧϧʹμ΢ϯϩʔυ্ͯ͛͠௚͢ ͜ͱʹͳΔͷͰඇৗʹ஗͍ɻɻɻ
  20. (2) select & copy • insert_job API ʹ query ͱ

    destination_table Λࢦఆ insert_job(project_id, { configuration: { query: { query:"SELECT ... FROM [...]", destination_table: { dataset_id: dataset_id, table_id: table_id }, } } }, {})
  21. (2) select & copy ͷྫ SELECT STRING(business_id) AS business_id, STRING(full_address)

    AS full_address, schools, BOOLEAN(open) AS open, FROM [dataset_id.table_id] • ΩϟετͰܕมߋ • ੍໿: ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ • ࢦఆ͠ͳ͔ͬͨΧϥϜ͸࡟আ͞ΕΔ
  22. (2) select & copy • ར఺ • ଎͍ • ܽ఺

    • ՝ۚ͞ΕΔ
  23. (2) select & copy Λ࠾༻ (1) export & filter &

    load ΍ΔͳΒ HDFS͔ΒσʔλLoad΍Γ௚ͯ͠΋࣌ؒ͋·ΓมΘΒͳ͍ ...
  24. Further dive into select & copy

  25. RECORD ܕͷΩϟετํ๏ SELECT INTEGER(votes.funny) AS votes.funny, INTEGER(votes.useful) AS votes.useful, INTEGER(votes.cool)

    AS votes.cool, FROM [dataset_id.table_id] • υοτ۠੾ΓͰࢦఆ͢Δ
  26. select & copy ͰͷΧϥϜ௥Ճ SELECT column1, INTEGER(NULL) AS column2, INTEGER(NULL)

    AS (record.column3), FROM [dataset_id.table_id] • INTEGER ܕͷ column2 Λ௥Ճ • RECORD ܕͷ record ΧϥϜͰ͸ͳ͘ record_column3 ͱ͍͏໊લͷΧϥϜ͕Ͱ͖Δɻɻɻ • patch table API ࢖ͬͨ΄͏͕ྑͦ͞͏ɻɻɻ
  27. mode: REPEATED ΧϥϜͷࢦఆ • SELECT ͰࢦఆͰ͖Δ REPEATED ΧϥϜ͸̍ͭͩ ͚ɺͳͲͷ੍໿͕͋ͬͨΓ •

    SELECT ͢Δͱߦ਺͕૿͑ΔͷͰɺREPEATED Χϥ Ϝͷͳ͍ߦ਺͕૿͑ͨςʔϒϧΛ࡞Δ͜ͱʹͳΔ • Ͳ͏΋ݫͦ͠͏
  28. Atomic ͳςʔϒϧͷஔ׵ • ௨ৗͷઓུ • ม׵લͷςʔϒϧ => ม׵ޙ • atomic

    ʹ swap • BigQuery ʹ͸ rename ͕ͳ͍ʂແཧʂʁ • copy ͷ destination_table Λࣗ෼ࣗ਎ʹࢦఆ • atomic ʹ swap ͞ΕΔʂʂ
  29. ·ͱΊ

  30. ·ͱΊΔͱ • ΧϥϜͷ௥Ճ͕ඞཁͳ৔߹͸ɺ·ͣ patch table • ΧϥϜͷ࡟আɺ·ͨ͸ܕมߋ͕ඞཁͳ৔߹͸ɺ ͔ͦ͜Β͞Βʹ select &

    copy • copyઌ͸ࣗ෼ࣗ਎Λࢦఆ͢Δ͜ͱͰ atomic ʹ swap Ͱ͖Δ
  31. ੍໿ • mode: REPEATED ΧϥϜΛѻ͑ͳ͍ • mode: NULLABLE ΧϥϜͷΈ௥ՃՄೳ •

    ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ
  32. ׬੒඼ https://github.com/sonots/bigquery_migration/

  33. ࢖͍ํ require 'bigquery_migration' config = { json_keyfile: '/path/to/your-project-000.json' dataset: 'your_dataset_name'

    table: 'your_table_name' } columns = [ { name: 'string', type: 'STRING' }, { name: 'record', type: 'RECORD', fields: [ { name: 'integer', type: 'INTEGER' }, { name: 'timestamp', type: 'TIMESTAMP' }, ] } ] migrator = BigqueryMigration.new(config) migrator.migrate_table(columns: columns)
  34. • 4݄23೔ൃചʂ • σʔλऩूಛू • Fluentd / Embulk • DeNA

    / Cookpad ͷࣄྫ