PipeRider: Data Reliability Automated

PipeRider ensures data reliability by automating continuous data testing and monitoring.

website: https://www.piperider.io/
GitHub: https://github.com/infuseai/piperider

Liang Bin Hsueh

July 21, 2022

Transcript

  1. Data Reliability Automated • Liang-Bin Hsueh (hlb), Cofounder and COO of InfuseAI • PipeRider https://www.piperider.io/
  2. About Me • Cofounder of InfuseAI • ❤ Open Source • Reach me • Twitter: hlb • Facebook: iamhlb • LinkedIn: iamhlb
  3. About InfuseAI • Data Reliability Automated • Streamline MLOps • Our products are trusted by research institutes and leaders in sectors including FSI, manufacturing, and healthcare • 500+ projects our product manages for clients • 100+ GPUs our product manages for clients
  4. Unity Software Inc. (NASDAQ:U) Q1 2022 Earnings Call Transcript • “We lost the value of a portion of our data, training data due in part to us ingesting bad data from a large customer. We estimate the impact to our business of approximately $110 million in 2022 …”
  5. Data Reliability • Data is complete and accurate, and it is a crucial foundation for building data trust across the organization • Types of Data • Operational data: data produced operationally, such as customer impressions, transaction records, … • Analytical data: data used analytically, such as marketing churn, clickthrough rates, impressions by global region, … • Operational data runs your business; analytical data manages your business
  6. Fighting with Bad Data • Duplicate records • Incomplete fields like missing values • Inaccurate data entries like “InfuseAI”, “Infuse AI”, “InfuseBI”, “Infuse CI” • Incompatible software migration like from one database to another • Schema or semantic changes
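The first three kinds of bad data listed on this slide can be illustrated with a minimal Python sketch. It uses only the standard library; the sample rows, field names, and helper functions are hypothetical and are not part of PipeRider.

```python
from collections import Counter
import re

# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"id": 1, "company": "InfuseAI", "country": "TW"},
    {"id": 2, "company": "Infuse AI", "country": None},
    {"id": 2, "company": "Infuse AI", "country": None},
    {"id": 3, "company": "InfuseBI", "country": "US"},
]


def find_duplicates(rows):
    """Return each row that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def count_missing(rows, field):
    """Count rows where the field is absent, None, or an empty string."""
    return sum(1 for r in rows if r.get(field) in (None, ""))


def spelling_variants(rows, field):
    """Group values that become equal once whitespace and case are ignored."""
    groups = {}
    for r in rows:
        value = r.get(field)
        if value is None:
            continue
        key = re.sub(r"\s+", "", value).lower()
        groups.setdefault(key, set()).add(value)
    # Keep only groups that contain more than one distinct spelling.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


print(find_duplicates(rows))
print(count_missing(rows, "country"))
print(spelling_variants(rows, "company"))
```

Normalizing whitespace and case groups “InfuseAI” with “Infuse AI”, but it does not flag a typo such as “InfuseBI”; catching that would need a reference list of valid values or fuzzy matching.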
  7. Data Quality Issue across Data Pipeline • 💣 errors happened here • 👻 data affected • 👻 data affected • 👻 data affected • ⚠ business impact
  8. PipeRider: Data Reliability Automated https://www.piperider.io/ • PipeRider ensures data reliability through automating data testing and monitoring • Key Features • Instant quality assessment in HTML report • Report comparison • Extensible custom assertions • Works with existing dbt projects • Automatic test recommendations
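The idea behind report comparison can be sketched in a few lines of Python: take two profiling summaries of the same table from different runs and list the metrics that changed. The metric names and the `diff_profiles` helper are assumptions for illustration; they do not reflect PipeRider's actual report format or implementation.

```python
def diff_profiles(base, target):
    """Return {metric: (base_value, target_value)} for metrics that differ."""
    changes = {}
    for metric in sorted(set(base) | set(target)):
        before, after = base.get(metric), target.get(metric)
        if before != after:
            changes[metric] = (before, after)
    return changes


# Hypothetical summaries of one table, captured by two separate runs.
base = {"row_count": 1000, "null_emails": 3, "distinct_ids": 1000}
target = {"row_count": 1200, "null_emails": 3, "distinct_ids": 1190}

for metric, (before, after) in diff_profiles(base, target).items():
    print(f"{metric}: {before} -> {after}")
```

In this made-up example the row count grew by 200 while distinct ids grew by only 190, which is the kind of drift (possible duplicate ids) that a side-by-side comparison makes visible.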
  9. How to use
    # install
    $ pip install -U piperider
    # init piperider inside your project
    $ piperider init
    # run and generate report
    $ piperider run
    # compare different reports over time
    $ piperider compare-reports
  10. Assertions
    Column Assertions
    - assert_column_exist
    - assert_column_in_types
    - assert_column_min_in_range
    - assert_column_max_in_range
    - assert_column_not_null
    - assert_column_null
    - assert_column_type
    - assert_column_unique
    Table Assertions
    - assert_row_count_in_range
    Customize assertions by writing plugins:
    https://docs.piperider.io/data-quality-assertions/custom-assertions
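To make concrete what assertions of this kind check, here is a conceptual Python sketch over in-memory rows. The function names loosely mirror the assertion names on the slide, but the signatures, the sample `orders` data, and the logic are assumptions for illustration only; this is not PipeRider's plugin API, which is described at the documentation link above.

```python
def column_not_null(rows, column):
    """True if every row has a non-None value in the column."""
    return all(r.get(column) is not None for r in rows)


def column_unique(rows, column):
    """True if the non-None values in the column are all distinct."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def column_min_in_range(rows, column, low, high):
    """True if the smallest non-None value lies within [low, high]."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return bool(values) and low <= min(values) <= high


def row_count_in_range(rows, low, high):
    """True if the number of rows lies within [low, high]."""
    return low <= len(rows) <= high


# Hypothetical table contents.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 3, "amount": None},
]

results = {
    "order_id not null": column_not_null(orders, "order_id"),
    "order_id unique": column_unique(orders, "order_id"),
    "amount not null": column_not_null(orders, "amount"),
    "amount min in [0, 100]": column_min_in_range(orders, "amount", 0, 100),
    "row count in [1, 1000]": row_count_in_range(orders, 1, 1000),
}
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Each check returns a plain boolean so that results can be collected into a report; a real tool would typically run such checks against a database rather than in-memory rows.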
  11. Integrate with dbt https://github.com/dbt-labs/dbt-core
    dbt enables data analysts to custom-write transformations through SQL statements. PipeRider integrates with your dbt project so you can init with zero setup.
    dataSources:
    - name: my_dbt_project
      type: postgres
      dbt:
        profile: my_dbt_project
        projectDir: .  # the path to dbt_project.yml
        profilesDir: ~/.dbt  # the path to the directory of dbt profiles.yml
    …
  12. Roadmap https://github.com/infuseai/piperider • better report formats • better assertion recommendations • serve command to run a local web server • more integrations like AWS Redshift