Try using dbtplyr

Try using dbtplyr 2024/12/04 Japan.R 2024 (#JapanR) Takuto Kotsubo (@airspace_nobo)
It has been automatically translated. Sorry if there are any inaccurate expressions.

Self Introduction Have been working as a data scientist for
ﬁve years now ❏ Recently working in Analytics Engineering team to improve data modeling ❏ Participating in today's event as a staff member of the venue Hobbies ❏ Watching comedy, Futsal

Memories Past presentations at Tokyo. ❏ Use R from python
@Tokyo. ❏ How to predict the wind direction of tomorrow @Tokyo. This is the second time since Tokyo.R #71 that I have given a presentation in front of Hadley wickham.

To the main topic. ❏ Introducing a package that incorporates
dplyr notation into a modern data pipeline tool called dbt, which I have recently started using in my work. I think there are many people who have never seen dbt before, so I hope you can understand what it can do!

What we want to tell you today ❏ dbt SQL
can be written in a dplyr-like way ❏ Combined with Dbt , preprocessing written in R may be replaced with SQL? ❏ dplyr is still amazing.There are endless possibilities.

Presentation Flow ❏ Introduction to data modeling ❏ Introduction to
dbt ❏ What can dbt solve? ❏ What can dbt solve when writing SQL? ❏ What can be solved with macro ❏ What can be solved with dbtplyr ❏ Summary

What is Data Modeling? Data modeling is the process of
designing and building logical structures to organize data and make it easier to analyze ❏ Improve the speed and reduce the effort of analysis and querying. Centralize management of intermediate processes Example of a not-so-good situation ❏ All users are individually performing pre-processing->totalization->visualization, etc. on information similar to raw data.

What is Data Modeling? Improved condition ❏ Pre-processing common to
all users is centralized ❏ No need to follow secret sauce-like information when querying raw data Preprocessi ng/Transfor mation Clean Tables Use here

What is dbt? Code that combines SQL and Jinja, a
template language used in Python ❏ Easy to implement data modeling and full documentation and visualization of the process ❏ There is a function called "macro", which allows for processing that would be difﬁcult to perform using SQL alone.

select id as order_id, user_id as customer_id, order_date, status from
{{ source('main', 'raw_orders') }} select id as order_id, user_id as customer_id, order_date, status from "tutorial"."main"."raw_orders" What is dbt? ❏ When you create the query shown on the left as stg_orders.sql and run dbt, the appropriate DB and table are referenced from the conﬁguration. ❏ From the query, you can visualize the dependency of the process "stg_orders was created by referencing raw_orders".

What is dbt? Preprocessing and joins of other tables in
the same PROJECT can make a nice relationship chart. ❏ Convert raw_payments to stg_payments in the same way ❏ Create a table named orders by joining stg_orders and stg_payments

What can dbt solve? With dbt, you can easily implement
data modeling based on SQL. ❏ On the other hand, dbt has a feature called "macro" for processing that is difﬁcult to do using SQL alone. ❏ dbtplyr is a package that allows you to use dplyr-like notation in SQL.

select avg(num_a) as avg_num_a, avg(num_b) as avg_num_b, avg(num_c) as avg_nu_c,
from data data |> summarise(across(starts_with('num'), ~mean(.x, na.rm=TRUE), .names = "avg_{.col}")) Concerns when writing SQL ❏ So it is necessary to write repeatedly when processing multiple columns in a regular manner, since there is no function. ❏ Repetition also increases the possibility of typos (e.g., mistakenly writing "num" as "nu").

with order_payments as ( select order_id, sum(case when payment_method =
'credit_card' then amount else 0 end) as credit_card_amount, sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount, sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount, sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount, sum(amount) as total_amount What can be solved with macro ❏ When you want to aggregate the data horizontally for each payment_method, you need to write it repeatedly in SQL.

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} with
order_payments as ( select order_id, {% for payment_method in payment_methods -%} sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount, {% endfor -%} sum(amount) as total_amount What can be solved with macro ❏ dbt's macro can be used to generalize for loop and processing

{% macro get_payment_methods() %} ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] {% endmacro
%} {% macro calculate_method_amounts(payment_methods, amount_column='amount', payment_method_column='payment_method') %} {% for payment_method in payment_methods -%} sum(case when {{ payment_method_column }} = '{{ payment_method }}' then {{ amount_column }} else 0 end) as {{ payment_method }}_amount, {% endfor -%} {% endmacro %} What can be solved with macro ❏ macro as a function and can be used in other queries (e.g. macro/util.sql)

What dbtplyr can solve ❏ dbtplyr is a package designed
to implement dplyr-like notation in dbt macro. ❏ A package developed by Emily Riederer , whose X proﬁle also conveys her love of R "Three R's in my last name, but it's not enough #rstats for me!"

{% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set
cols_num = dbtplyr.starts_with('num_', cols) %} select customer_id, {{ dbtplyr.across(cols_num, "sum({{var}}) as {{var}}_tot") }} from {{ ref('customers') }} group by 1 select customer_id, sum(num_total_orders) as num_total_orders_tot, sum(num_total_amount) as num_total_amount_tot from "tutorial"."main"."customers" group by 1 What dbtplyr can solve ❏ Use across to summarize recurring totals

{% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set
cols_date = dbtplyr.starts_with('date_', cols) %} select customer_id, {{ dbtplyr.across(cols_date, "cast({{var}} as timestamp) as {{var}}")}} from {{ ref('customers') }} select customer_id, cast(date_first_order as date) as date_first_order, cast(date_most_recent_order as date) as date_most_recent_order from "tutorial"."main"."customers" What dbtplyr can solve ❏ Use across to summarize repetitive conversions

{% set cols_date = dbtplyr.starts_with('date_', cols) %} ... where {%
for col in cols_date -%} {{col}} >= cast('2018-02-01' as date) or {% endfor %} FALSE ... where date_first_order >= cast('2018-02-01' as date) or date_most_recent_order >= cast('2018-02-01' as date) or FALSE What dbtplyr can solve ❏ Can be used in combination with for loop for where condition

SUMMARY ❏ The use of Data Engineering (dbt) × Data
Science (dplyr) was introduced. ❏ The process used in SQL is simple and convenient, so dplyr may become popular in the dbt world. ❏ I will write a bit more about it in dbt Advent Calendar 2024 including the dbt installation process!

dbtplyr: Bringing Column-Name Contracts from R to dbt - posit::conf(2023)
What, exactly, is dbt? Apply a function (or functions) across multiple columns bibliography

Appendix: Source Code ❏ https://github.com/takuto0831/japanr_dbtplyr

appendix: targets package ❏ Rが生産性を高める〜データ分析ワークフロー効率化の実践 and others have discussed
the data pipeline tool called targets. ❏ Data pipeline + visualization with ggplot and other outputs can be incorporated into the pipeline.

Try using dbtplyr

Try using dbtplyr

airspace_nobo

More Decks by airspace_nobo

Featured

Transcript

Try using dbtplyr 2024/12/04 Japan.R 2024 (#JapanR) Takuto Kotsubo (@airspace_nobo)

Self Introduction Have been working as a data scientist for

Memories Past presentations at Tokyo. ❏ Use R from python

To the main topic. ❏ Introducing a package that incorporates

What we want to tell you today ❏ dbt SQL

Presentation Flow ❏ Introduction to data modeling ❏ Introduction to

What is Data Modeling? Data modeling is the process of

What is Data Modeling? Improved condition ❏ Pre-processing common to

What is dbt? Code that combines SQL and Jinja, a

select id as order_id, user_id as customer_id, order_date, status from

What is dbt? Preprocessing and joins of other tables in

What can dbt solve? With dbt, you can easily implement

select avg(num_a) as avg_num_a, avg(num_b) as avg_num_b, avg(num_c) as avg_nu_c,

with order_payments as ( select order_id, sum(case when payment_method =

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} with

{% macro get_payment_methods() %} ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] {% endmacro

What dbtplyr can solve ❏ dbtplyr is a package designed

{% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set

{% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set

{% set cols_date = dbtplyr.starts_with('date_', cols) %} ... where {%

SUMMARY ❏ The use of Data Engineering (dbt) × Data

dbtplyr: Bringing Column-Name Contracts from R to dbt - posit::conf(2023)

Appendix: Source Code ❏ https://github.com/takuto0831/japanr_dbtplyr

appendix: targets package ❏ Rが生産性を高める〜データ分析ワークフロー効率化の実践 and others have discussed