Slide 1

Slide 1 text

Try using dbtplyr 2024/12/04 Japan.R 2024 (#JapanR) Takuto Kotsubo (@airspace_nobo) It has been automatically translated. Sorry if there are any inaccurate expressions.

Slide 2

Slide 2 text

Self Introduction Have been working as a data scientist for five years now ❏ Recently working in Analytics Engineering team to improve data modeling ❏ Participating in today's event as a staff member of the venue Hobbies ❏ Watching comedy, Futsal

Slide 3

Slide 3 text

Memories Past presentations at Tokyo. ❏ Use R from python @Tokyo. ❏ How to predict the wind direction of tomorrow @Tokyo. This is the second time since Tokyo.R #71 that I have given a presentation in front of Hadley wickham.

Slide 4

Slide 4 text

To the main topic. ❏ Introducing a package that incorporates dplyr notation into a modern data pipeline tool called dbt, which I have recently started using in my work. I think there are many people who have never seen dbt before, so I hope you can understand what it can do!

Slide 5

Slide 5 text

What we want to tell you today ❏ dbt SQL can be written in a dplyr-like way ❏ Combined with Dbt , preprocessing written in R may be replaced with SQL? ❏ dplyr is still amazing.There are endless possibilities.

Slide 6

Slide 6 text

Presentation Flow ❏ Introduction to data modeling ❏ Introduction to dbt ❏ What can dbt solve? ❏ What can dbt solve when writing SQL? ❏ What can be solved with macro ❏ What can be solved with dbtplyr ❏ Summary

Slide 7

Slide 7 text

What is Data Modeling? Data modeling is the process of designing and building logical structures to organize data and make it easier to analyze ❏ Improve the speed and reduce the effort of analysis and querying. Centralize management of intermediate processes Example of a not-so-good situation ❏ All users are individually performing pre-processing->totalization->visualization, etc. on information similar to raw data.

Slide 8

Slide 8 text

What is Data Modeling? Improved condition ❏ Pre-processing common to all users is centralized ❏ No need to follow secret sauce-like information when querying raw data Preprocessi ng/Transfor mation Clean Tables Use here

Slide 9

Slide 9 text

What is dbt? Code that combines SQL and Jinja, a template language used in Python ❏ Easy to implement data modeling and full documentation and visualization of the process ❏ There is a function called "macro", which allows for processing that would be difficult to perform using SQL alone.

Slide 10

Slide 10 text

select id as order_id, user_id as customer_id, order_date, status from {{ source('main', 'raw_orders') }} select id as order_id, user_id as customer_id, order_date, status from "tutorial"."main"."raw_orders" What is dbt? ❏ When you create the query shown on the left as stg_orders.sql and run dbt, the appropriate DB and table are referenced from the configuration. ❏ From the query, you can visualize the dependency of the process "stg_orders was created by referencing raw_orders".

Slide 11

Slide 11 text

What is dbt? Preprocessing and joins of other tables in the same PROJECT can make a nice relationship chart. ❏ Convert raw_payments to stg_payments in the same way ❏ Create a table named orders by joining stg_orders and stg_payments

Slide 12

Slide 12 text

What can dbt solve? With dbt, you can easily implement data modeling based on SQL. ❏ On the other hand, dbt has a feature called "macro" for processing that is difficult to do using SQL alone. ❏ dbtplyr is a package that allows you to use dplyr-like notation in SQL.

Slide 13

Slide 13 text

select avg(num_a) as avg_num_a, avg(num_b) as avg_num_b, avg(num_c) as avg_nu_c, from data data |> summarise(across(starts_with('num'), ~mean(.x, na.rm=TRUE), .names = "avg_{.col}")) Concerns when writing SQL ❏ So it is necessary to write repeatedly when processing multiple columns in a regular manner, since there is no function. ❏ Repetition also increases the possibility of typos (e.g., mistakenly writing "num" as "nu").

Slide 14

Slide 14 text

with order_payments as ( select order_id, sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount, sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount, sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount, sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount, sum(amount) as total_amount What can be solved with macro ❏ When you want to aggregate the data horizontally for each payment_method, you need to write it repeatedly in SQL.

Slide 15

Slide 15 text

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} with order_payments as ( select order_id, {% for payment_method in payment_methods -%} sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount, {% endfor -%} sum(amount) as total_amount What can be solved with macro ❏ dbt's macro can be used to generalize for loop and processing

Slide 16

Slide 16 text

{% macro get_payment_methods() %} ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] {% endmacro %} {% macro calculate_method_amounts(payment_methods, amount_column='amount', payment_method_column='payment_method') %} {% for payment_method in payment_methods -%} sum(case when {{ payment_method_column }} = '{{ payment_method }}' then {{ amount_column }} else 0 end) as {{ payment_method }}_amount, {% endfor -%} {% endmacro %} What can be solved with macro ❏ macro as a function and can be used in other queries (e.g. macro/util.sql)

Slide 17

Slide 17 text

What dbtplyr can solve ❏ dbtplyr is a package designed to implement dplyr-like notation in dbt macro. ❏ A package developed by Emily Riederer , whose X profile also conveys her love of R "Three R's in my last name, but it's not enough #rstats for me!"

Slide 18

Slide 18 text

{% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set cols_num = dbtplyr.starts_with('num_', cols) %} select customer_id, {{ dbtplyr.across(cols_num, "sum({{var}}) as {{var}}_tot") }} from {{ ref('customers') }} group by 1 select customer_id, sum(num_total_orders) as num_total_orders_tot, sum(num_total_amount) as num_total_amount_tot from "tutorial"."main"."customers" group by 1 What dbtplyr can solve ❏ Use across to summarize recurring totals

Slide 19

Slide 19 text

{% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set cols_date = dbtplyr.starts_with('date_', cols) %} select customer_id, {{ dbtplyr.across(cols_date, "cast({{var}} as timestamp) as {{var}}")}} from {{ ref('customers') }} select customer_id, cast(date_first_order as date) as date_first_order, cast(date_most_recent_order as date) as date_most_recent_order from "tutorial"."main"."customers" What dbtplyr can solve ❏ Use across to summarize repetitive conversions

Slide 20

Slide 20 text

{% set cols_date = dbtplyr.starts_with('date_', cols) %} ... where {% for col in cols_date -%} {{col}} >= cast('2018-02-01' as date) or {% endfor %} FALSE ... where date_first_order >= cast('2018-02-01' as date) or date_most_recent_order >= cast('2018-02-01' as date) or FALSE What dbtplyr can solve ❏ Can be used in combination with for loop for where condition

Slide 21

Slide 21 text

SUMMARY ❏ The use of Data Engineering (dbt) × Data Science (dplyr) was introduced. ❏ The process used in SQL is simple and convenient, so dplyr may become popular in the dbt world. ❏ I will write a bit more about it in dbt Advent Calendar 2024 including the dbt installation process!

Slide 22

Slide 22 text

dbtplyr: Bringing Column-Name Contracts from R to dbt - posit::conf(2023) What, exactly, is dbt? Apply a function (or functions) across multiple columns bibliography

Slide 23

Slide 23 text

Appendix: Source Code ❏ https://github.com/takuto0831/japanr_dbtplyr

Slide 24

Slide 24 text

appendix: targets package ❏ Rが生産性を高める 〜データ分析ワークフロー効率化の実践 and others have discussed the data pipeline tool called targets. ❏ Data pipeline + visualization with ggplot and other outputs can be incorporated into the pipeline.