Slide 1

Reading and modifying the source code of the dbt adapter
[email protected]

Slide 2

2 :: about me TWJUG LITE JAVA

Slide 3

3 :: about me — Notes on Open Source Projects
•Reading how an open-source library implements the Discord Gateway
•Discord basic examples survey corps [1]
•Discord basic examples survey corps [2]
•Discord basic examples survey corps [3]
•Is DrogonTest ready for the battlefield?
•Playing with Spring Security [1]: getting one part running first
•Playing with Spring Security [2]: exploring the Access Control features
•Playing with Spring Security [3]: the last mile, AuthenticationProvider
•How is ktor built? [1]
•How is ktor built? [2]
•How is ktor built? [3]
•How is ktor built? [4]

Slide 4

4 :: working at a startup company
Build quality into your data pipeline
github.com/InfuseAI/piperider
https://piperider.io/discord

Slide 5

5 :: basic idea
I use PipeRider and dbt to build my data pipeline. Could I know the cost of each build?

Slide 6

6 :: what is a data pipeline?
[pipeline diagram, stages 1–5] Data from application logs, sensor events, and the web crawler

Slide 7

7 :: what is a data pipeline?
[pipeline diagram, stage 1] Raw data collecting: application logs, sensor events, and the web crawler

Slide 8

8 :: what is a data pipeline?
[pipeline diagram, stage 2] Save to the database. SQL is friendly to data people {scientist, analyst, engineer}. One database as the Single Source of Truth.
CREATE TABLE FOOBAR …;
INSERT INTO FOOBAR …;

Slide 9

9 :: what is a data pipeline?
[pipeline diagram, stage 3] Build the facts. Dimensional modeling treats data as facts and dimensions. https://docs.getdbt.com/terms/dimensional-modeling
CREATE TABLE XYZ AS SELECT a, b, c FROM FOOBAR;
[stages 4–5] Build the dimensions, then get business insights: a specific perspective view of the data (JOIN a table or CREATE VIEW)
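The facts-and-dimensions flow above can be sketched end to end with an in-memory SQLite database. The FOOBAR/XYZ names follow the slide's placeholders; the region_dim view and the sample rows are made up for illustration:

```python
import sqlite3

# Raw data lands in one table (stage 2 of the pipeline).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FOOBAR (a TEXT, b TEXT, c INTEGER)")
con.executemany(
    "INSERT INTO FOOBAR VALUES (?, ?, ?)",
    [("order-1", "TW", 100), ("order-2", "TW", 250), ("order-3", "US", 80)],
)

# Fact table: one row per measurable event (stage 3).
con.execute("CREATE TABLE XYZ AS SELECT a, b, c FROM FOOBAR")

# Dimension: a specific perspective of the data, built as a view (stage 4).
con.execute("CREATE VIEW region_dim AS SELECT DISTINCT b AS region FROM FOOBAR")

# Business insight: revenue per region, via a join (stage 5).
rows = con.execute(
    "SELECT d.region, SUM(f.c) FROM XYZ AS f "
    "JOIN region_dim AS d ON f.b = d.region "
    "GROUP BY d.region ORDER BY d.region"
).fetchall()
print(rows)  # → [('TW', 350), ('US', 80)]
```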

Slide 10

10 :: what is the dbt project?
dbt makes transformations versioned and tracks data dependencies.

Slide 11

11 :: what is the dbt-adapter project?
All connections are operated by the dbt-adapter: the driver of your database.
jaffle_shop:
  outputs:
    dev:
      dataset: piperdier
      job_execution_timeout_seconds: 300
      job_retries: 1
      location: US
      method: service-account
      priority: interactive
      project: piperdier-lab
      threads: 1
      type: bigquery
      keyfile: /path/to/key.json
  target: dev

Slide 12

12 :: the thought processes to tackle the problem
Before: I am using the dbt-bigquery adapter. Is it possible to know the cost per query?
Knowledge:
• How to do the cost estimation
• Python syntax and the BigQuery library

Slide 13

13 :: the thought processes to tackle the problem
Set up the baseline for hacking:
• Set up the dbt project
• Set up the development environment (debugger)

Slide 14

14 :: the thought processes to tackle the problem
Discovery: trial and error, using the power of knowledge

Slide 15

15 :: the thought processes to tackle the problem
After: know the cost per query for BigQuery
• Proof of the concept
• Patch or a new dbt-adapter

Slide 16

16 Let’s GO

Slide 17

17

Slide 18

18

Slide 19

Apply this and get the results https://cloud.google.com/bigquery/docs/estimate-costs 19

Slide 20

20 Basic Python syntax

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Start the query, passing in the extra configuration.
query_job = client.query(
    (
        "SELECT name, COUNT(*) as name_count "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
        "WHERE state = 'WA' "
        "GROUP BY name"
    ),
    job_config=job_config,
)  # Make an API request.

# A dry run query completes immediately.
print("This query will process {} bytes.".format(query_job.total_bytes_processed))
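Once total_bytes_processed is known from a dry run, turning it into a dollar figure is simple arithmetic. A minimal sketch with a hypothetical helper (not part of dbt or google-cloud-bigquery); the per-TiB rate is an assumption — check current BigQuery on-demand pricing for your region, and note BigQuery also bills a 10 MiB minimum per query:

```python
# Hypothetical helper: convert dry-run bytes into an estimated
# on-demand query cost. The default rate is an assumption.
def estimate_query_cost_usd(bytes_processed: int, usd_per_tib: float = 6.25) -> float:
    """Estimate on-demand cost from bytes scanned (1 TiB = 2**40 bytes)."""
    return bytes_processed / 2**40 * usd_per_tib

# One full TiB scanned at the assumed rate:
print(estimate_query_cost_usd(2**40))  # → 6.25
```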

Slide 21

21 Different perspectives on basic Python syntax https://bit.ly/3Q0FEYg

Slide 22

22 Different perspectives on basic Python syntax: which parts cannot be changed in most cases?

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Start the query, passing in the extra configuration.
query_job = client.query("sql-to-query", job_config=job_config)  # Make an API request.

# A dry run query completes immediately.
print("This query will process {} bytes.".format(query_job.total_bytes_processed))

Slide 23

23 Different perspectives on basic Python syntax: import the module from a package

Slide 24

24 Different perspectives on basic Python syntax: import the module from a package

.
└── google
    ├── __init__.py
    └── cloud
        ├── __init__.py
        └── bigquery.py
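The resolution of that import line can be demonstrated by recreating the slide's layout on disk with stub files. The directory names match the slide, but the module contents are placeholders, not the real google-cloud-bigquery package:

```python
import sys
import tempfile
from pathlib import Path

# Build the layout from the slide in a temp dir: google/cloud/bigquery.py
root = Path(tempfile.mkdtemp())
pkg = root / "google" / "cloud"
pkg.mkdir(parents=True)
(root / "google" / "__init__.py").write_text("")
(pkg / "__init__.py").write_text("")
(pkg / "bigquery.py").write_text("NAME = 'stub bigquery module'\n")

# `from google.cloud import bigquery` walks google/ -> cloud/ -> bigquery.py
sys.path.insert(0, str(root))
from google.cloud import bigquery

print(bigquery.NAME)  # → stub bigquery module
```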

Slide 25

25 Different perspectives on basic Python syntax: invariant symbols live in the module scope (variables, classes, and functions)

Slide 26

26 Different perspectives on basic Python syntax: which parts cannot be changed in most cases?
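Because module-scope symbols are just attributes, even the parts you "cannot change" in a library's source can still be rebound at runtime. A small sketch with a hypothetical stand-in module (the names mod/greet/use_module are made up for illustration):

```python
import types

# A stand-in module with one module-scope function.
mod = types.ModuleType("mod")
mod.greet = lambda: "hello"

def use_module(m):
    # Callers look the symbol up through the module at call time...
    return m.greet()

before = use_module(mod)

# ...so rebinding the module-scope attribute changes behavior without
# editing the library's source: the basis of runtime patching.
mod.greet = lambda: "patched"
after = use_module(mod)

print(before, after)  # → hello patched
```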

Slide 27

27 Set up the environment: the debugger can check details at runtime

Slide 28

Set up the script path and the working directory 28

Slide 29

Search for the invariants 29
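One way to search for the invariants is to ask Python where a symbol is defined before setting breakpoints. A sketch using the stdlib inspect module, with json standing in for the adapter package (the real target would be a symbol from dbt-bigquery):

```python
import inspect
import json

# Locate the file and line where a module-scope symbol is defined --
# exactly the information you need to place a breakpoint there.
source_file = inspect.getsourcefile(json.dumps)
lines, start_line = inspect.getsourcelines(json.dumps)

print(source_file, start_line)
```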

Slide 30

Breakpoint Useful call stack 30

Slide 31

31 DEMO

Slide 32

32 execute(…) -> BigQueryAdapterResponse
raw_execute(client)
Handle (Client instance)

@dataclass
class BigQueryAdapterResponse(AdapterResponse):
    bytes_processed: Optional[int] = None
    bytes_billed: Optional[int] = None
    location: Optional[str] = None
    project_id: Optional[str] = None
    job_id: Optional[str] = None
    slot_ms: Optional[int] = None
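How execute() could carry the dry-run numbers back to the caller can be sketched with simplified stand-ins. AdapterResponse and the job object below are stubs for the sketch, not dbt's real classes or a real BigQuery job:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdapterResponse:  # simplified stand-in for dbt's AdapterResponse
    _message: str

@dataclass
class BigQueryAdapterResponse(AdapterResponse):
    bytes_processed: Optional[int] = None
    bytes_billed: Optional[int] = None

def execute(query_job) -> BigQueryAdapterResponse:
    # In the real adapter, query_job comes from client.query(); a stub
    # object exposing the same attribute is enough for this sketch.
    return BigQueryAdapterResponse(
        _message="OK",
        bytes_processed=query_job.total_bytes_processed,
    )

class _StubJob:
    total_bytes_processed = 1024

resp = execute(_StubJob())
print(resp.bytes_processed)  # → 1024
```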

Slide 33

Q & A 33