
[coscup] Reading and modifying the source code of the dbt adapter


https://coscup.org/2023/zh-TW/session/HL88HZ

In order to calculate the cost of each dbt build executed through dbt-bigquery, I started reading and modifying the source code of the dbt adapter. As long as you have a basic grasp of Python syntax, plus a few simple trace-code techniques, such as relying on strings that do not change and on the stack trace produced when an exception occurs, combined with a debugger, you can easily achieve the effect you want.
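
The two trace-code tricks mentioned in the abstract can be sketched in plain Python: grep the installed package for a string that does not change to locate the file, then dump the call stack at that spot to see how execution reaches it. A minimal, self-contained illustration (the function names here are made up for the sketch, not taken from dbt-bigquery):

```python
import traceback

def deep_inside_adapter():
    # Pretend this sits deep inside a library you are tracing.
    # Dump the current call stack instead of guessing the call chain.
    return traceback.format_stack()

def run_model():
    return deep_inside_adapter()

stack = run_model()
# The captured frames reveal that run_model() is on the call path.
print("".join(stack[-2:]))
```

The same effect falls out for free when an exception escapes: the traceback Python prints is this exact stack, which is why a deliberately triggered exception is such a cheap code-reading tool.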


Ching Yi Chan

July 28, 2023

Transcript

  1. 3 :: about me

    Notes on Open Source Projects:
    • Reading how an open-source library implements the Discord Gateway
    • Survey corps of Discord basic examples [1]
    • Survey corps of Discord basic examples [2]
    • Survey corps of Discord basic examples [3]
    • Is DrogonTest ready for the battlefield?
    • Playing with Spring Security [1]: get one part running first
    • Playing with Spring Security [2]: exploring the Access Control features
    • Playing with Spring Security [3]: the last mile, AuthenticationProvider
    • How ktor is built [1]
    • How ktor is built [2]
    • How ktor is built [3]
    • How ktor is built [4]
  2. 4 :: working at a startup company

    Build the quality for your data pipeline
    github.com/InfuseAI/piperider
    https://piperider.io/discord
  3. 5 :: basic idea

    I use PipeRider and dbt to build my data pipeline. Could I know the
    cost of each build?
  4. 6 :: what is a data pipeline?

    [diagram: a five-stage pipeline fed by data from application logs,
    sensor events, and a web crawler]
  5. 7 :: what is a data pipeline?

    [diagram] Raw data collecting
  6. 8 :: what is a data pipeline?

    [diagram] Save to the database. SQL is friendly to Data People
    {scientist, analyst, engineer}. One database as the Single Source Of
    Truth.

    CREATE TABLE FOOBAR …;
    INSERT INTO FOOBAR …;
  7. 9 :: what is a data pipeline?

    [diagram] Build the facts, build the dimensions, get business
    insights. Dimensional modeling treats data as facts and dimensions
    (https://docs.getdbt.com/terms/dimensional-modeling). A specific
    perspective view of the data: JOIN a table or create a view.

    CREATE TABLE XYZ AS SELECT a, b, c FROM FOOBAR;
  8. 10 :: what is the dbt project?

    [diagram] Make transformations versioned and track the data
    dependencies.
  9. 11 :: what is the dbt-adapter project?

    [diagram] All connections are operated by the dbt-adapter: the driver
    of your database.

    jaffle_shop:
      outputs:
        dev:
          dataset: piperdier
          job_execution_timeout_seconds: 300
          job_retries: 1
          location: US
          method: service-account
          priority: interactive
          project: piperdier-lab
          threads: 1
          type: bigquery
          keyfile: /path/to/key.json
      target: dev
  10. 12 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After. Current stage: Knowledge
    • How to do the cost estimation
    • Python syntax and the BigQuery library
  11. 13 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After. Current stage: Set up the baseline for hacking
    • Set up a dbt project
    • Set up the development environment (debugger)
  12. 14 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After. Current stage: Discovery, by trial and
    error, using the power of knowledge
  13. 15 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After: Know the cost per query for BigQuery
    • Proof of the concept
    • Patch or a new dbt-adapter
  14. 17 :: [image-only slide]

  15. 18 :: [image-only slide]

  16. 20 :: Basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query(
        (
            "SELECT name, COUNT(*) as name_count "
            "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
            "WHERE state = 'WA' "
            "GROUP BY name"
        ),
        job_config=job_config,
    )

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))
  17. 22 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Which parts cannot be changed in most cases?
  18. 23 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Importing the module from a package
  19. 24 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Importing the module from a package:

    .
    └── google
        ├── __init__.py
        └── cloud
            ├── __init__.py
            └── bigquery.py
  20. 25 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Invariant SYMBOLs live in the module scope: variables, classes, and functions.
  21. 26 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Which parts cannot be changed in most cases?
  22. 32 :: <demo>

    Handle (Client instance) → raw_execute(client) → execute(…) -> BigQueryAdapterResponse

    @dataclass
    class BigQueryAdapterResponse(AdapterResponse):
        bytes_processed: Optional[int] = None
        bytes_billed: Optional[int] = None
        location: Optional[str] = None
        project_id: Optional[str] = None
        job_id: Optional[str] = None
        slot_ms: Optional[int] = None