Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Try using dbtplyr

airspace_nobo
December 06, 2024
7

Try using dbtplyr

#Japan.R 2024 のLT資料の英語翻訳バージョン
https://japanr.connpass.com/event/334154/

airspace_nobo

December 06, 2024
Tweet

Transcript

  1. Try using dbtplyr 2024/12/04 Japan.R 2024 (#JapanR) Takuto Kotsubo (@airspace_nobo)

    It has been automatically translated. Sorry if there are any inaccurate expressions.
  2. Self Introduction Have been working as a data scientist for

    five years now ❏ Recently working in Analytics Engineering team to improve data modeling ❏ Participating in today's event as a staff member of the venue Hobbies ❏ Watching comedy, Futsal
  3. Memories Past presentations at Tokyo. ❏ Use R from python

    @Tokyo. ❏ How to predict the wind direction of tomorrow @Tokyo. This is the second time since Tokyo.R #71 that I have given a presentation in front of Hadley wickham.
  4. To the main topic. ❏ Introducing a package that incorporates

    dplyr notation into a modern data pipeline tool called dbt, which I have recently started using in my work. I think there are many people who have never seen dbt before, so I hope you can understand what it can do!
  5. What we want to tell you today ❏ dbt SQL

    can be written in a dplyr-like way ❏ Combined with Dbt , preprocessing written in R may be replaced with SQL? ❏ dplyr is still amazing.There are endless possibilities.
  6. Presentation Flow ❏ Introduction to data modeling ❏ Introduction to

    dbt ❏ What can dbt solve? ❏ What can dbt solve when writing SQL? ❏ What can be solved with macro ❏ What can be solved with dbtplyr ❏ Summary
  7. What is Data Modeling? Data modeling is the process of

    designing and building logical structures to organize data and make it easier to analyze ❏ Improve the speed and reduce the effort of analysis and querying. Centralize management of intermediate processes Example of a not-so-good situation ❏ All users are individually performing pre-processing->totalization->visualization, etc. on information similar to raw data.
  8. What is Data Modeling? Improved condition ❏ Pre-processing common to

    all users is centralized ❏ No need to follow secret sauce-like information when querying raw data Preprocessi ng/Transfor mation Clean Tables Use here
  9. What is dbt? Code that combines SQL and Jinja, a

    template language used in Python ❏ Easy to implement data modeling and full documentation and visualization of the process ❏ There is a function called "macro", which allows for processing that would be difficult to perform using SQL alone.
  10. select id as order_id, user_id as customer_id, order_date, status from

    {{ source('main', 'raw_orders') }} select id as order_id, user_id as customer_id, order_date, status from "tutorial"."main"."raw_orders" What is dbt? ❏ When you create the query shown on the left as stg_orders.sql and run dbt, the appropriate DB and table are referenced from the configuration. ❏ From the query, you can visualize the dependency of the process "stg_orders was created by referencing raw_orders".
  11. What is dbt? Preprocessing and joins of other tables in

    the same PROJECT can make a nice relationship chart. ❏ Convert raw_payments to stg_payments in the same way ❏ Create a table named orders by joining stg_orders and stg_payments
  12. What can dbt solve? With dbt, you can easily implement

    data modeling based on SQL. ❏ On the other hand, dbt has a feature called "macro" for processing that is difficult to do using SQL alone. ❏ dbtplyr is a package that allows you to use dplyr-like notation in SQL.
  13. select avg(num_a) as avg_num_a, avg(num_b) as avg_num_b, avg(num_c) as avg_nu_c,

    from data data |> summarise(across(starts_with('num'), ~mean(.x, na.rm=TRUE), .names = "avg_{.col}")) Concerns when writing SQL ❏ So it is necessary to write repeatedly when processing multiple columns in a regular manner, since there is no function. ❏ Repetition also increases the possibility of typos (e.g., mistakenly writing "num" as "nu").
  14. with order_payments as ( select order_id, sum(case when payment_method =

    'credit_card' then amount else 0 end) as credit_card_amount, sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount, sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount, sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount, sum(amount) as total_amount What can be solved with macro ❏ When you want to aggregate the data horizontally for each payment_method, you need to write it repeatedly in SQL.
  15. {% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} with

    order_payments as ( select order_id, {% for payment_method in payment_methods -%} sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount, {% endfor -%} sum(amount) as total_amount What can be solved with macro ❏ dbt's macro can be used to generalize for loop and processing
  16. {% macro get_payment_methods() %} ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] {% endmacro

    %} {% macro calculate_method_amounts(payment_methods, amount_column='amount', payment_method_column='payment_method') %} {% for payment_method in payment_methods -%} sum(case when {{ payment_method_column }} = '{{ payment_method }}' then {{ amount_column }} else 0 end) as {{ payment_method }}_amount, {% endfor -%} {% endmacro %} What can be solved with macro ❏ macro as a function and can be used in other queries (e.g. macro/util.sql)
  17. What dbtplyr can solve ❏ dbtplyr is a package designed

    to implement dplyr-like notation in dbt macro. ❏ A package developed by Emily Riederer , whose X profile also conveys her love of R "Three R's in my last name, but it's not enough #rstats for me!"
  18. {% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set

    cols_num = dbtplyr.starts_with('num_', cols) %} select customer_id, {{ dbtplyr.across(cols_num, "sum({{var}}) as {{var}}_tot") }} from {{ ref('customers') }} group by 1 select customer_id, sum(num_total_orders) as num_total_orders_tot, sum(num_total_amount) as num_total_amount_tot from "tutorial"."main"."customers" group by 1 What dbtplyr can solve ❏ Use across to summarize recurring totals
  19. {% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set

    cols_date = dbtplyr.starts_with('date_', cols) %} select customer_id, {{ dbtplyr.across(cols_date, "cast({{var}} as timestamp) as {{var}}")}} from {{ ref('customers') }} select customer_id, cast(date_first_order as date) as date_first_order, cast(date_most_recent_order as date) as date_most_recent_order from "tutorial"."main"."customers" What dbtplyr can solve ❏ Use across to summarize repetitive conversions
  20. {% set cols_date = dbtplyr.starts_with('date_', cols) %} ... where {%

    for col in cols_date -%} {{col}} >= cast('2018-02-01' as date) or {% endfor %} FALSE ... where date_first_order >= cast('2018-02-01' as date) or date_most_recent_order >= cast('2018-02-01' as date) or FALSE What dbtplyr can solve ❏ Can be used in combination with for loop for where condition
  21. SUMMARY ❏ The use of Data Engineering (dbt) × Data

    Science (dplyr) was introduced. ❏ The process used in SQL is simple and convenient, so dplyr may become popular in the dbt world. ❏ I will write a bit more about it in dbt Advent Calendar 2024 including the dbt installation process!
  22. dbtplyr: Bringing Column-Name Contracts from R to dbt - posit::conf(2023)

    What, exactly, is dbt? Apply a function (or functions) across multiple columns bibliography
  23. appendix: targets package ❏ Rが生産性を高める 〜データ分析ワークフロー効率化の実践 and others have discussed

    the data pipeline tool called targets. ❏ Data pipeline + visualization with ggplot and other outputs can be incorporated into the pipeline.