Slide 1

Slide 1 text

Data Reliability Automated Liang-Bin Hsueh (hlb), Cofounder and COO of InfuseAI PipeRider https://www.piperider.io/

Slide 2

Slide 2 text

About Me • Cofounder of InfuseAI • ❤ Open Source • Reach me • Twitter: hlb • Facebook: iamhlb • LinkedIn: iamhlb

Slide 3

Slide 3 text

About InfuseAI • Data Reliability Automated • Streamline MLOps • Our products are trusted by research institutes and leaders in sectors including FSI, manufacturing, and healthcare • 500+ projects our product manages for clients • 100+ GPUs our product manages for clients

Slide 4

Slide 4 text

Unity Software Inc. (NASDAQ:U) Q1 2022 Earnings Call Transcript “We lost the value of a portion of our data, training data due in part to us ingesting bad data from a large customer. We estimate the impact to our business of approximately $110 million in 2022 …”

Slide 5

Slide 5 text

Data Reliability • Data reliability means that data is complete and accurate; it is a crucial foundation for building data trust across the organization • Types of Data • Operational data: data produced by day-to-day operations, such as customer impressions and transaction records • Analytical data: data used for analysis, such as marketing churn, clickthrough rates, and impressions by global region • Operational data runs your business; analytical data manages your business

Slide 6

Slide 6 text

Fighting with Bad Data • Duplicate records • Incomplete fields, such as missing values • Inaccurate data entries like “InfuseAI”, “Infuse AI”, “InfuseBI”, “Infuse CI” • Incompatible software migration, such as from one database to another • Schema or semantic changes
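The first three kinds of bad data listed above can be detected with a few lines of code. The following is a minimal sketch using only the Python standard library; the sample rows, the function names, and the 0.8 similarity threshold are assumptions made for illustration and do not come from the talk or from PipeRider.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sample rows, invented for this illustration.
rows = [
    {"id": 1, "company": "InfuseAI", "country": "TW"},
    {"id": 2, "company": "Infuse AI", "country": None},
    {"id": 3, "company": "InfuseBI", "country": "TW"},
    {"id": 1, "company": "InfuseAI", "country": "TW"},  # exact duplicate of row 1
]


def find_duplicates(rows):
    """Return each row that appears more than once (exact match on all fields)."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def count_missing(rows, column):
    """Count rows where the column is absent, None, or an empty string."""
    return sum(1 for r in rows if r.get(column) in (None, ""))


def near_variants(values, canonical, threshold=0.8):
    """Return distinct values that resemble, but do not equal, the canonical spelling."""
    variants = []
    for value in sorted(set(values)):
        if value == canonical:
            continue
        ratio = SequenceMatcher(None, value.lower(), canonical.lower()).ratio()
        if ratio >= threshold:
            variants.append(value)
    return variants


duplicates = find_duplicates(rows)
missing_country = count_missing(rows, "country")
variants = near_variants([r["company"] for r in rows], "InfuseAI")
print(duplicates, missing_country, variants)
```

Fuzzy matching of this kind only flags candidates for review; a threshold that suits one column may produce false positives on another.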

Slide 7

Slide 7 text

Data Analysis in Reality :(

Slide 8

Slide 8 text

Data Pipeline • Thanks to Karen Hsieh [Link to deck]

Slide 9

Slide 9 text

Thanks to Karen Hsieh [Link to Miro board] 😍

Slide 10

Slide 10 text

Data Quality Issue across Data Pipeline

Slide 11

Slide 11 text

Data Quality Issue across Data Pipeline 💣 👻 👻 👻 ⚠ [Diagram labels: errors happened here; data affected (×3); business impact]

Slide 12

Slide 12 text

Data Quality Issue across Data Pipeline ⚠ ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ✅ ✅ ❌

Slide 13

Slide 13 text

Data Reliability Automated https://www.piperider.io/

Slide 14

Slide 14 text

PipeRider: Data Reliability Automated https://www.piperider.io/ • PipeRider ensures data reliability through automating data testing and monitoring • Key Features • Instant quality assessment in HTML report • Report comparison • Extensible custom assertions • Works with existing dbt projects • Automatic test recommendations

Slide 15

Slide 15 text

How to use

    # install
    $ pip install -U piperider

    # init piperider inside your project
    $ piperider init

    # run and generate report
    $ piperider run

    # compare different reports over time
    $ piperider compare-reports

Slide 16

Slide 16 text

Assertions

Column Assertions
- assert_column_exist
- assert_column_in_types
- assert_column_min_in_range
- assert_column_max_in_range
- assert_column_not_null
- assert_column_null
- assert_column_type
- assert_column_unique

Table Assertions
- assert_row_count_in_range

Customize assertions by writing plugins: https://docs.piperider.io/data-quality-assertions/custom-assertions
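To make the intent of a few of these assertions concrete, here is a minimal plain-Python sketch of the kind of check each name suggests. It is not PipeRider's implementation or plugin API: the `check_*` function names, their signatures, and the sample rows are all hypothetical and exist only for illustration.

```python
# Hypothetical rows, loosely modelled on a video game sales table.
rows = [
    {"name": "Game A", "year": 2006, "global_sales": 8.5},
    {"name": "Game B", "year": 2010, "global_sales": 3.2},
    {"name": "Game C", "year": None, "global_sales": 0.4},
]


def check_column_not_null(rows, column):
    """In the spirit of assert_column_not_null: no value in the column is None."""
    return all(r.get(column) is not None for r in rows)


def check_column_unique(rows, column):
    """In the spirit of assert_column_unique: non-null values do not repeat."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def check_column_min_in_range(rows, column, low, high):
    """In the spirit of assert_column_min_in_range: the minimum lies in [low, high]."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return bool(values) and low <= min(values) <= high


def check_row_count_in_range(rows, low, high):
    """In the spirit of assert_row_count_in_range: the row count lies in [low, high]."""
    return low <= len(rows) <= high


results = {
    "name not null": check_column_not_null(rows, "name"),
    "year not null": check_column_not_null(rows, "year"),
    "name unique": check_column_unique(rows, "name"),
    "global_sales min in [0, 1]": check_column_min_in_range(rows, "global_sales", 0, 1),
    "row count in [1, 100]": check_row_count_in_range(rows, 1, 100),
}
for label, passed in results.items():
    print(("PASS" if passed else "FAIL"), label)
```

Each check returns a plain boolean so that results can be collected and reported together instead of stopping at the first failure.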

Slide 17

Slide 17 text

Integrate with dbt https://github.com/dbt-labs/dbt-core

dbt enables data analysts to custom-write transformations through SQL statements. PipeRider integrates with your dbt project so you can init with zero setup.

    dataSources:
    - name: my_dbt_project
      type: postgres
      dbt:
        profile: my_dbt_project
        projectDir: .        # the path to dbt_project.yml
        profilesDir: ~/.dbt  # the path to the directory of dbt profiles.yml
    …

Slide 18

Slide 18 text

Report

Slide 19

Slide 19 text

Comparison
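Conceptually, comparing two reports means diffing two profiles of the same data taken at different times. The sketch below assumes a simplified, hypothetical profile shape (`{column: {metric: value}}`) and a hypothetical `diff_profiles` helper; it is not PipeRider's report format or comparison logic.

```python
def diff_profiles(base, target):
    """Compare two {column: {metric: value}} profiles and list what changed."""
    changes = {
        "added_columns": sorted(set(target) - set(base)),
        "removed_columns": sorted(set(base) - set(target)),
        "metric_changes": {},
    }
    # Only columns present in both profiles can have metric-level changes.
    for column in sorted(set(base) & set(target)):
        metrics = sorted(set(base[column]) | set(target[column]))
        for metric in metrics:
            before = base[column].get(metric)
            after = target[column].get(metric)
            if before != after:
                changes["metric_changes"][(column, metric)] = (before, after)
    return changes


# Hypothetical profiles from two runs over the same table.
base = {
    "name": {"type": "string", "nulls": 0},
    "year": {"type": "integer", "nulls": 2},
}
target = {
    "name": {"type": "string", "nulls": 0},
    "year": {"type": "string", "nulls": 5},
    "platform": {"type": "string", "nulls": 0},
}

changes = diff_profiles(base, target)
print(changes)
```

In this example the diff surfaces both a schema change (`year` switching type, `platform` appearing) and a quality regression (more nulls in `year`), the two kinds of drift a comparison is meant to catch.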

Slide 20

Slide 20 text

CI Integration GitHub Actions as example https://docs.piperider.io/how-to/github-action

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

DEMO Data Source: Video Games Sales Dataset

Slide 23

Slide 23 text

Roadmap https://github.com/infuseai/piperider • better report formats • better assertion recommendations • serve command to run a local web server • more integrations, such as AWS Redshift

Slide 24

Slide 24 text

Thank You [email protected] https://piperider.io