Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BigQuery Schema Migration #bq_sushi

BigQuery Schema Migration #bq_sushi

Naotoshi Seo

April 08, 2016
Tweet

More Decks by Naotoshi Seo

Other Decks in Technology

Transcript

  1. #JH2VFSZͷςʔϒϧΛ
    .JHSBUF ΧϥϜ௥Ճɺ࡟আɺ
    ܕมߋ
    ͢Δ
    2016/04/08
    @sonots
    #bq_sushi 3

    View Slide

  2. ࣗݾ঺հ
    • ੉ඌ௚ར @sonots
    • DeNA ෼ੳج൫
    • Fluentd ίϛολ
    • Ruby ίϛολ
    • ࠷ۙ͸ embulk ۀ
    • embulk-output-bigquery
    • embulk-filter-column, etc

    View Slide

  3. • 4݄23೔ൃചʂ
    • σʔλऩूಛू
    • Fluentd / Embulk
    • DeNA / Cookpad ͷࣄྫ

    View Slide

  4. ΞδΣϯμ
    • ฐࣾͰͷ BigQuery ར༻
    • εΩʔϚมߋͷඞཁੑ
    • BigQuery ʹ͓͚ΔεΩʔϚมߋͷࠔ೉͞
    • εΩʔϚมߋͷઓུ

    View Slide

  5. ฐࣾͰͷ BigQuery ར༻
    • West (US) Ͱ̍೥Ҏ্લ͔Βར༻
    • JP Ͱ΋ϘνϘν࢖͍࢝Ί͍ͯΔ
    • σʔλҠߦπʔϧ࡞ͬͯΔ
    • hdfs2bigquery
    • vertica2bigquery
    • bigquery2hdfs
    • bigquery2vertica

    View Slide

  6. ฐࣾͰͷ෼ੳۀ຿
    • σʔλҠߦ΍ Hadoop/Vertica ӡ༻͸ج൫νʔϜ
    • ෼ੳۀ຿͸ΞφϦετ͕ߦ͏
    • BigQuery ʹΫΤϦΛ౤͛Δͷ͸ΞφϦετ

    View Slide

  7. εΩʔϚมߋͷඞཁੑ(1)
    • ϩάʹΧϥϜ͕௥Ճ͞Εͨ
    • ΍ͬͺΓΧϥϜ͕ফ͞Εͨ
    • ΧϥϜͷܕΛؒҧ͑ͨ
    • INTEGER ͬΆ͍ͱࢥͬͯͨΒ 11,12 Έ͍ͨͳ஋
    ͕ೖͬͯΔߦ͕͋ͬͯ STRING ͡Όͳ͍ͱμϝ
    ͩͬͨͱ͔͋Δ͋Δ

    View Slide

  8. εΩʔϚมߋͷඞཁੑ(2)
    • BigQuery ςʔϒϧ໊ϕετϓϥΫςΟε
    • ςʔϒϧ໊લஔࢺ_ˋY%m%d
    • ຖ೔৽͘͠ςʔϒϧΛ࡞Δ
    • ຖ೔εΩʔϚ࠶ఆٛͷνϟϯε͕͋Δ
    • εΩʔϚมߋ͠ͳͯ͘΋ྑ͍͡ΌΜʁ

    View Slide

  9. εΩʔϚมߋͷඞཁੑ(3)
    • ̎ͭͷςʔϒϧͰΧϥϜͷܕ͕ҧ͏ͱΤϥʔʂ
    SELECT
    name
    FROM
    TABLE_DATE_RANGE(data.people_,
    TIMESTAMP('2014-03-26'),
    TIMESTAMP('2014-03-27'))
    WHERE
    age >= 35
    • Ωϟετ͢Δͱ͍͏ख͸͋Δ͕ɺ෼ੳ࣌͸ଞͷ͜
    ͱʹ಄Λ࢖͍͍ͨͷͰආ͚͍ͨ

    View Slide

  10. εΩʔϚมߋπʔϧͷఏڙ
    • ͋Δ΂͖࢟(εΩʔϚ)Λఏࣔ͢Δͱɺ
    • ΧϥϜͷ௥Ճ
    • ΧϥϜͷ࡟আ
    • ΧϥϜͷܕมߋ
    • Λࣗಈ൑ผͯ͠ɺεΩʔϚมߋͰ͖ΔΑ͏ʹ
    ͍ͯ͋͛ͨ͠

    View Slide

  11. BigQuery ʹ͓͚Δ
    εΩʔϚมߋͷࠔ೉͞

    View Slide

  12. εΩʔϚมߋͷࠔ೉
    • BigQuery ʹ͸ ALTER TABLE ͕ͳ͍
    • ΧϥϜ௥ՃͷAPI͸͋Δ
    • ΧϥϜ࡟আɺܕมߋͷ API ͕ͳ͍
    Ͳ͏͢Δ͔ʁͱ͍͏࿩

    View Slide

  13. ΧϥϜ௥Ճ
    • patch_table (or update_table) API ͰͰ͖Δ
    client.patch_table(project_id, dataset_id, table_id, {
    schema: {
    fields: [
    {name:"time", type:"TIMESTAMP", mode:"NULLABLE"},
    {name:"id", type:"INTEGER", mode:"NULLABLE"},
    ]
    }
    })
    google-api-ruby-client

    View Slide

  14. ΧϥϜ௥Ճ(஫ҙ఺)
    • ͢Ͱʹ͋ΔεΩʔϚ + ௥Ճ͢ΔΧϥϜ
    • get_table ͰεΩʔϚΛऔಘͯ͠Ϛʔδͯ͠౤͛Δ
    response = get_table(project_id, dataset_id, table_id)
    columns = response.schema.fields.map {|col| col.to_h }
    columns << {name:"id", type:"INTEGER", mode:"NULLABLE"}
    google-api-ruby-client

    View Slide

  15. patch tables ͷ੍໿
    • ௥Ճͨ͠ΧϥϜ͸຤ඌʹ௥Ճ͞ΕΔ
    • ௥ՃͰ͖Δͷ͸ NULLABLE ·ͨ͸ REPEATED ͚ͩ
    • mode: REQUIRED ͳΧϥϜ͕௥ՃͰ͖ͳ͍
    • มߋ͸ REQUIRED => NULLABLE ͚ͩ
    • NULLABLE Λ REQUIRED ʹͰ͖ͳ͍
    • REPEATED ʹ΋Ͱ͖ͳ͍

    View Slide

  16. ΧϥϜ࡟আɺܕมߋ

    View Slide

  17. ΧϥϜ࡟আɺܕมߋ
    • ̎ͭͷઓུ
    • (1) export & filter & load
    • (2) select & copy

    View Slide

  18. (1) export & filter & load
    • gcs ʹ export
    • embulk ΍Βͳʹ΍ΒͰ download ͭͭ͠஋Λม
    ׵͢Δ filtering ॲཧΛߦ͏
    • BQ ʹ load ͠ͳ͓͢

    View Slide

  19. (1) export & filter & load
    • ར఺
    • ՝ۚ͞Εͳ͍
    • ܽ఺
    • Ұ౓ϩʔΧϧʹμ΢ϯϩʔυ্ͯ͛͠௚͢
    ͜ͱʹͳΔͷͰඇৗʹ஗͍ɻɻɻ

    View Slide

  20. (2) select & copy
    • insert_job API ʹ query ͱ destination_table Λࢦఆ
    insert_job(project_id, {
    configuration: {
    query: {
    query:"SELECT ... FROM [...]",
    destination_table: {
    dataset_id: dataset_id,
    table_id: table_id
    },
    }
    }
    }, {})

    View Slide

  21. (2) select & copy ͷྫ
    SELECT
    STRING(business_id) AS business_id,
    STRING(full_address) AS full_address,
    schools,
    BOOLEAN(open) AS open,
    FROM
    [dataset_id.table_id]
    • ΩϟετͰܕมߋ
    • ੍໿: ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ
    • ࢦఆ͠ͳ͔ͬͨΧϥϜ͸࡟আ͞ΕΔ

    View Slide

  22. (2) select & copy
    • ར఺
    • ଎͍
    • ܽ఺
    • ՝ۚ͞ΕΔ

    View Slide

  23. (2) select & copy
    Λ࠾༻
    (1) export & filter & load ΍ΔͳΒ
    HDFS͔ΒσʔλLoad΍Γ௚ͯ͠΋࣌ؒ͋·ΓมΘΒͳ͍ ...

    View Slide

  24. Further dive into
    select & copy

    View Slide

  25. RECORD ܕͷΩϟετํ๏
    SELECT
    INTEGER(votes.funny) AS votes.funny,
    INTEGER(votes.useful) AS votes.useful,
    INTEGER(votes.cool) AS votes.cool,
    FROM
    [dataset_id.table_id]
    • υοτ۠੾ΓͰࢦఆ͢Δ

    View Slide

  26. select & copy ͰͷΧϥϜ௥Ճ
    SELECT
    column1,
    INTEGER(NULL) AS column2,
    INTEGER(NULL) AS (record.column3),
    FROM
    [dataset_id.table_id]
    • INTEGER ܕͷ column2 Λ௥Ճ
    • RECORD ܕͷ record ΧϥϜͰ͸ͳ͘
    record_column3 ͱ͍͏໊લͷΧϥϜ͕Ͱ͖Δɻɻɻ
    • patch table API ࢖ͬͨ΄͏͕ྑͦ͞͏ɻɻɻ

    View Slide

  27. mode: REPEATED ΧϥϜͷࢦఆ
    • SELECT ͰࢦఆͰ͖Δ REPEATED ΧϥϜ͸̍ͭͩ
    ͚ɺͳͲͷ੍໿͕͋ͬͨΓ
    • SELECT ͢Δͱߦ਺͕૿͑ΔͷͰɺREPEATED Χϥ
    Ϝͷͳ͍ߦ਺͕૿͑ͨςʔϒϧΛ࡞Δ͜ͱʹͳΔ
    • Ͳ͏΋ݫͦ͠͏

    View Slide

  28. Atomic ͳςʔϒϧͷஔ׵
    • ௨ৗͷઓུ
    • ม׵લͷςʔϒϧ => ม׵ޙ
    • atomic ʹ swap
    • BigQuery ʹ͸ rename ͕ͳ͍ʂແཧʂʁ
    • copy ͷ destination_table Λࣗ෼ࣗ਎ʹࢦఆ
    • atomic ʹ swap ͞ΕΔʂʂ

    View Slide

  29. ·ͱΊ

    View Slide

  30. ·ͱΊΔͱ
    • ΧϥϜͷ௥Ճ͕ඞཁͳ৔߹͸ɺ·ͣ patch table
    • ΧϥϜͷ࡟আɺ·ͨ͸ܕมߋ͕ඞཁͳ৔߹͸ɺ
    ͔ͦ͜Β͞Βʹ select & copy
    • copyઌ͸ࣗ෼ࣗ਎Λࢦఆ͢Δ͜ͱͰ atomic ʹ
    swap Ͱ͖Δ

    View Slide

  31. ੍໿
    • mode: REPEATED ΧϥϜΛѻ͑ͳ͍
    • mode: NULLABLE ΧϥϜͷΈ௥ՃՄೳ
    • ܕมߋ͢Δͱ mode: NULLABLE ʹͳΔ

    View Slide

  32. ׬੒඼
    https://github.com/sonots/bigquery_migration/

    View Slide

  33. ࢖͍ํ
    require 'bigquery_migration'
    config = {
    json_keyfile: '/path/to/your-project-000.json'
    dataset: 'your_dataset_name'
    table: 'your_table_name'
    }
    columns = [
    { name: 'string', type: 'STRING' },
    { name: 'record', type: 'RECORD', fields: [
    { name: 'integer', type: 'INTEGER' },
    { name: 'timestamp', type: 'TIMESTAMP' },
    ] }
    ]
    migrator = BigqueryMigration.new(config)
    migrator.migrate_table(columns: columns)

    View Slide

  34. • 4݄23೔ൃചʂ
    • σʔλऩूಛू
    • Fluentd / Embulk
    • DeNA / Cookpad ͷࣄྫ

    View Slide