
[coscup] Reading and modifying the source code of the dbt adapter


https://coscup.org/2023/zh-TW/session/HL88HZ

In order to calculate the cost of each dbt build executed through dbt-bigquery, I started reading and modifying the source code of the dbt adapter. As long as you have a basic grasp of Python syntax, plus a few simple trace-code techniques, such as relying on strings that do not change and on the stack trace produced when an exception occurs, combined with a debugger, you can easily achieve the effect you want.
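
The two trace-code tricks mentioned in the abstract can be sketched in plain Python: grep the installed package for a string that does not change to locate the file, then dump the call stack at that spot to see how execution reaches it. A minimal, self-contained illustration (the function names here are made up for the sketch, not taken from dbt-bigquery):

```python
import traceback

def deep_inside_adapter():
    # Pretend this sits deep inside a library you are tracing.
    # Dump the current call stack instead of guessing the call chain.
    return traceback.format_stack()

def run_model():
    return deep_inside_adapter()

stack = run_model()
# The captured frames reveal that run_model() is on the call path.
print("".join(stack[-2:]))
```

The same effect falls out for free when an exception escapes: the traceback Python prints is this exact stack, which is why a deliberately triggered exception is such a cheap code-reading tool.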


Ching Yi Chan

July 28, 2023

Transcript

  1. 3 :: about me

    Notes on Open Source Projects:
    • Reading how an open-source library implements the Discord Gateway
    • Survey corps of Discord basic examples [1]
    • Survey corps of Discord basic examples [2]
    • Survey corps of Discord basic examples [3]
    • Is DrogonTest ready for the battlefield?
    • Playing with Spring Security [1]: get one part running first
    • Playing with Spring Security [2]: exploring the Access Control features
    • Playing with Spring Security [3]: the last mile, AuthenticationProvider
    • How ktor is built [1]
    • How ktor is built [2]
    • How ktor is built [3]
    • How ktor is built [4]
  2. 4 :: working at a startup company

    Build the quality for your data pipeline
    github.com/InfuseAI/piperider
    https://piperider.io/discord
  3. 5 :: basic idea

    I use PipeRider and dbt to build my data pipeline. Could I know the
    cost of each build?
  4. 6 :: what is a data pipeline?

    [diagram: a five-stage pipeline fed by data from application logs,
    sensor events, and a web crawler]
  5. 7 :: what is a data pipeline?

    [diagram] Raw data collecting
  6. 8 :: what is a data pipeline?

    [diagram] Save to the database. SQL is friendly to Data People
    {scientist, analyst, engineer}. One database as the Single Source Of
    Truth.

    CREATE TABLE FOOBAR …;
    INSERT INTO FOOBAR …;
  7. 9 :: what is a data pipeline?

    [diagram] Build the facts, build the dimensions, get business
    insights. Dimensional modeling treats data as facts and dimensions
    (https://docs.getdbt.com/terms/dimensional-modeling). A specific
    perspective view of the data: JOIN a table or create a view.

    CREATE TABLE XYZ AS SELECT a, b, c FROM FOOBAR;
  8. 10 :: what is the dbt project?

    [diagram] Make transformations versioned and track the data
    dependencies.
  9. 11 :: what is the dbt-adapter project?

    [diagram] All connections are operated by the dbt-adapter: the driver
    of your database.

    jaffle_shop:
      outputs:
        dev:
          dataset: piperdier
          job_execution_timeout_seconds: 300
          job_retries: 1
          location: US
          method: service-account
          priority: interactive
          project: piperdier-lab
          threads: 1
          type: bigquery
          keyfile: /path/to/key.json
      target: dev
  10. 12 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After. Current stage: Knowledge
    • How to do the cost estimation
    • Python syntax and the BigQuery library
  11. 13 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After. Current stage: Set up the baseline for hacking
    • Set up a dbt project
    • Set up the development environment (debugger)
  12. 14 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After. Current stage: Discovery, by trial and
    error, using the power of knowledge
  13. 15 :: the thought processes to tackle the problem

    I am using the dbt-bigquery adapter. Is it possible to know the cost
    per query? Before → After: Know the cost per query for BigQuery
    • Proof of the concept
    • Patch or a new dbt-adapter
  14. 17 :: [image-only slide]

  15. 18 :: [image-only slide]

  16. 20 :: Basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query(
        (
            "SELECT name, COUNT(*) as name_count "
            "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
            "WHERE state = 'WA' "
            "GROUP BY name"
        ),
        job_config=job_config,
    )

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))
  17. 22 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Which parts cannot be changed in most cases?
  18. 23 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Importing the module from a package
  19. 24 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Importing the module from a package:

    .
    └── google
        ├── __init__.py
        └── cloud
            ├── __init__.py
            └── bigquery.py
  20. 25 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Invariant SYMBOLs live in the module scope: variables, classes, and functions.
  21. 26 :: Different perspectives on basic Python syntax

    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Start the query, passing in the extra configuration.
    query_job = client.query("sql-to-query", job_config=job_config)

    # Make an API request. A dry run query completes immediately.
    print("This query will process {} bytes.".format(query_job.total_bytes_processed))

    Which parts cannot be changed in most cases?
  22. 32 :: <demo>

    Handle (Client instance) → raw_execute(client) → execute(…) -> BigQueryAdapterResponse

    @dataclass
    class BigQueryAdapterResponse(AdapterResponse):
        bytes_processed: Optional[int] = None
        bytes_billed: Optional[int] = None
        location: Optional[str] = None
        project_id: Optional[str] = None
        job_id: Optional[str] = None
        slot_ms: Optional[int] = None