PipeRider: Data Reliability Automated

PipeRider ensures data reliability by automating continuous data testing and monitoring.

website: https://www.piperider.io/
GitHub: https://github.com/infuseai/piperider

Liang Bin Hsueh

July 21, 2022

Transcript

  1. Data Reliability Automated
    Liang-Bin Hsueh (hlb), Cofounder and COO of InfuseAI
    PipeRider
    https://www.piperider.io/

  2. About Me
    • Cofounder of InfuseAI

    • ❤ Open Source

    • Reach me

    • Twitter: hlb

    • Facebook: iamhlb

    • LinkedIn: iamhlb

  3. About InfuseAI
    Data Reliability Automated
    Streamline MLOps
    Our Products
    trusted by research institutes and leaders in sectors
    including FSI, manufacturing, and healthcare
    projects our product manages for clients
    GPUs our product manages for clients
    500+
    100+

  4. Unity Software Inc.
    (NASDAQ:U) Q1 2022 Earnings Call Transcript
    “We lost the value of a portion of our data, training data
    due in part to us ingesting bad data from a large
    customer. We estimate the impact to our business of
    approximately $110 million in 2022 …”

  5. Data Reliability
    • Data reliability means data is complete and accurate; it is a crucial foundation for building data trust across the organization

    • Types of Data

    • Operational data: data produced by day-to-day operations, e.g. customer impressions, transaction records, …

    • Analytical data: data used for analysis, e.g. marketing churn, clickthrough rates, impressions by global region, …

    • Operational data runs your business; analytical data manages your business

  6. Fighting with Bad Data
    • Duplicate records
    • Incomplete fields like missing values

    • Inaccurate data entries like “InfuseAI”, “Infuse AI”, “InfuseBI”, “Infuse CI”

    • Incompatible software migrations, such as moving from one database to another

    • Schema or semantic changes
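
    Several of the problems listed above can be spotted with very small checks. The following is a minimal sketch in plain Python, not PipeRider code; the sample records, field names, and helper functions are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical sample records; names and values are illustrative only.
records = [
    {"id": 1, "company": "InfuseAI", "revenue": 100},
    {"id": 2, "company": "Infuse AI", "revenue": None},
    {"id": 1, "company": "InfuseAI", "revenue": 100},
    {"id": 3, "company": "InfuseBI", "revenue": 250},
]


def find_duplicates(rows):
    """Return each row that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def find_missing(rows, field):
    """Return rows whose `field` is absent or None."""
    return [row for row in rows if row.get(field) is None]


def find_unexpected_values(rows, field, allowed):
    """Return the distinct values of `field` that are not in `allowed`."""
    return sorted({row[field] for row in rows if row[field] not in allowed})


duplicates = find_duplicates(records)
missing_revenue = find_missing(records, "revenue")
odd_companies = find_unexpected_values(records, "company", {"InfuseAI"})
```

    Schema or semantic changes are harder to catch this way, since they usually require comparing the data against an earlier snapshot.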

  7. Data Analysis in Reality :(

  8. Data Pipeline
    Thanks to Karen Hsieh [Link to deck]

  9. Thanks to Karen Hsieh [Link to Miro board]
    😍

  10. Data Quality Issue across Data Pipeline

  11. Data Quality Issue across Data Pipeline
    [Diagram: errors happened here (💣) → data affected (👻) ×3 → business impact]

  12. Data Quality Issue across Data Pipeline

  13. Data Reliability Automated
    https://www.piperider.io/

  14. PipeRider: Data Reliability Automated
    https://www.piperider.io/
    • PipeRider ensures data reliability by automating data testing and monitoring

    • Key Features

    • Instant quality assessment in HTML report

    • Report comparison

    • Extensible custom assertions

    • Works with existing dbt projects

    • Automatic test recommendations
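
    To make the idea behind report comparison concrete, here is a minimal sketch in plain Python. It is not PipeRider's implementation or report format; the metric names and the compare_profiles helper are hypothetical. It assumes each run produces a flat summary of a table and lists the metrics that changed between two runs.

```python
def compare_profiles(base, target):
    """Return {metric: (base_value, target_value)} for metrics that differ."""
    changes = {}
    for metric in sorted(set(base) | set(target)):
        before, after = base.get(metric), target.get(metric)
        if before != after:
            changes[metric] = (before, after)
    return changes


# Hypothetical summaries of the same table from two different runs.
base_run = {"row_count": 1000, "null_price": 0, "distinct_id": 1000}
target_run = {"row_count": 1200, "null_price": 35, "distinct_id": 1000}

changes = compare_profiles(base_run, target_run)
for metric, (before, after) in changes.items():
    print(f"{metric}: {before} -> {after}")
```

    A jump in null counts between two runs, as in this made-up example, is the kind of drift that is easy to miss without a side-by-side view.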

  15. How to use
    # install
    $ pip install -U piperider

    # init piperider inside your project
    $ piperider init

    # run and generate report
    $ piperider run

    # compare different reports over time
    $ piperider compare-reports

  16. Assertions
    Column Assertions
    - assert_column_exist
    - assert_column_in_types
    - assert_column_min_in_range
    - assert_column_max_in_range
    - assert_column_not_null
    - assert_column_null
    - assert_column_type
    - assert_column_unique

    Table Assertions
    - assert_row_count_in_range

    Customize assertions by writing plugins:
    https://docs.piperider.io/data-quality-assertions/custom-assertions
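
    As a rough illustration of what checks like these verify, the sketch below reimplements a few of them as standalone Python functions over a list of dicts. The function names mirror the slide, but the signatures, the sample table, and the logic are assumptions made for this example; they are not PipeRider's API or its plugin interface, which is described at the link above.

```python
def assert_column_not_null(rows, column):
    """Pass when no row has None in `column`."""
    return all(row.get(column) is not None for row in rows)


def assert_column_unique(rows, column):
    """Pass when every value in `column` appears exactly once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def assert_column_min_in_range(rows, column, low, high):
    """Pass when the smallest non-null value lies within [low, high]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return bool(values) and low <= min(values) <= high


def assert_row_count_in_range(rows, low, high):
    """Pass when the number of rows lies within [low, high]."""
    return low <= len(rows) <= high


# Hypothetical table; the rows are made up for illustration.
games = [
    {"rank": 1, "name": "Game A", "global_sales": 12.5},
    {"rank": 2, "name": "Game B", "global_sales": 7.0},
    {"rank": 3, "name": "Game C", "global_sales": None},
]

results = {
    "rank is not null": assert_column_not_null(games, "rank"),
    "rank is unique": assert_column_unique(games, "rank"),
    "global_sales is not null": assert_column_not_null(games, "global_sales"),
    "global_sales min in [0, 100]": assert_column_min_in_range(games, "global_sales", 0, 100),
    "row count in [1, 10]": assert_row_count_in_range(games, 1, 10),
}
failed = [name for name, passed in results.items() if not passed]
```

    Returning booleans instead of raising keeps the sketch easy to aggregate into a single pass/fail summary.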

  17. Integrate with dbt
    https://github.com/dbt-labs/dbt-core
    dbt lets data analysts write custom transformations as SQL statements.

    PipeRider integrates with your dbt project, so you can init with zero setup.
    dataSources:
    - name: my_dbt_project
      type: postgres
      dbt:
        profile: my_dbt_project
        projectDir: . # the path to dbt_project.yml
        profilesDir: ~/.dbt # the path to the directory of dbt profiles.yml

  18. Report

  19. Comparison

  20. CI Integration
    GitHub Actions as an example
    https://docs.piperider.io/how-to/github-action

  21. [no text on this slide]

  22. DEMO
    Data Source: Video Games Sales Dataset

  23. Roadmap
    https://github.com/infuseai/piperider
    • better report formats

    • better assertion recommendations

    • serve command to run a local web server

    • more integrations like AWS Redshift

  24. Thank You
    [email protected]
    https://piperider.io
