
AI/ML Model Characterization for Performance, Interpretability, Fairness and Reliability

finid
June 28, 2019


Artificial intelligence (AI) has been transformed by machine learning (ML) methodologies, thanks to ML's scalability in the cloud and the vast choice of open-source and commercial off-the-shelf (COTS) solutions. However, the complexity of ML puts AI at the center of discussions about decision fairness, transparency, ethics, and reliability. As these concerns affect the feasibility and adoption of AI in enterprises, there is a need for generalized guidelines or approaches to evaluate ML model performance in the context of these concerns.

This talk will provide a framework that can help determine performance, fairness, and transparency through ML model characterization. Methods and approaches to model characterization will be proposed. After this talk, audiences will have a better understanding of: 1. how to evaluate and understand the limits of ML models built by a data science team; 2. functional knowledge of the relevant key performance indicators (KPIs) that help decision makers in model validation, adoption, or deployment; 3. for ML practitioners and engineers, common field practices for dealing with class imbalance or sampling bias in data.


Transcript

  1. Big Data & AI Conference, Dallas, Texas, June 27–29, 2019. www.BigDataAIconference.com
  2. Disclaimer: Products or services mentioned are for demonstration only, not for endorsement or comparison. Friday, June 28, 2019. KC Tung, PhD. Big Data and AI Conference, Dallas, TX.
  3. Definitions
     • Protected attributes: race, color, national origin, religion, gender (including pregnancy), disability, age, and citizenship status.
     • Bias: a tendency to skew results or lean toward a predetermined ideological or sociological idea based on personal beliefs and experiences.
  4. The Ethics of AI: Fairness, Reliability, Transparency.
  5. Fairness
     • Ensure the model performs consistently for all groups.
     • Difficulty:
       • We expect the model to be fair, yet training data is seldom fair.
       • The definition of fairness varies, depending on your perspective.
  6. Fairness in Model Performance
     • Error rates are not similar for all groups.
     • The majority of training data are Caucasian male.
     • Suggested Remedy 1: Test each group separately (see the sketch below).
     • Suggested Remedy 2: Enhance minority data representation in training.
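     A minimal Python sketch of Remedy 1 (testing each group separately), assuming test-set predictions and ground-truth labels are collected in a pandas DataFrame; the column names here are hypothetical:

         import pandas as pd

         def error_rates_by_group(df, group_col, y_true_col, y_pred_col):
             # Compute a classifier's error rate separately for each group.
             errors = df[y_true_col] != df[y_pred_col]
             return errors.groupby(df[group_col]).mean()

         # Hypothetical usage: one row per test example.
         # results = pd.DataFrame({"group": [...], "label": [...], "prediction": [...]})
         # print(error_rates_by_group(results, "group", "label", "prediction"))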
  7. Fairness by Group Selection Rate
     • Also known as the 4/5ths rule, 80% rule, Adverse Impact Analysis, or Demographic Parity.
     • The highest selection rate serves as a benchmark (see the sketch below).
     • Is each other group's rate within 80% of the benchmark? If not, that group is said to be adversely impacted.
     • Pro: No need to collect and curate protected attributes.
     • Con: Random hiring to meet a quota.

     Group   Selected Pct (%)   Exceeds benchmark (32%)*   Impacted
     A       40                 Yes                        No
     B       35                 Yes                        No
     C       20                 No                         Yes
     * 32% is 4/5ths of the highest selection rate (Group A, 40%).
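     A minimal Python sketch of the 4/5ths check; the selection rates below reproduce the slide's example, where Group C falls below 32% (4/5 of Group A's 40%):

         def adverse_impact(selection_rates, threshold=0.8):
             # Flag groups whose selection rate falls below `threshold` times
             # the highest group's rate (the 4/5ths rule).
             benchmark = max(selection_rates.values())
             return {group: rate < threshold * benchmark
                     for group, rate in selection_rates.items()}

         print(adverse_impact({"A": 0.40, "B": 0.35, "C": 0.20}))
         # {'A': False, 'B': False, 'C': True}  -> Group C is adversely impacted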
  8. Fairness by Feature Selection Rate
     • Equity in outcome.
     • The same number of selections is made at each feature value (see the toy example below).
     • Pro: No feature bias.
     • Con: Favorable to smaller-group members.

     Gender   Selected   Not Selected   Selection Pct (%)
     M        10         90             10
     F        10         72             12
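     A toy Python sketch of equal-count selection; the applicant totals are taken from the table above (10 + 90 male, 10 + 72 female), showing how the same 10 selections yield a higher rate for the smaller group:

         def selection_rates(selected_per_value, totals):
             # Selection percentage per feature value when the same number
             # of selections is made for each value.
             return {v: 100.0 * selected_per_value / n for v, n in totals.items()}

         print(selection_rates(10, {"M": 100, "F": 82}))
         # {'M': 10.0, 'F': 12.19...}  -> the smaller group F is favored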
  9. Fairness by Similarity
     • Similar individuals should be treated in a similar way.
     • But it is not trivial to measure similarity between individuals (one possible consistency check is sketched below).
     • Pro: Relevant to a person.
     • Con: 'Distance' or clustering is abstract and not easy to justify.
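     One possible way to make this concrete, offered here as an assumption rather than a method from the talk, is a consistency score: compare each individual's prediction with those of its nearest neighbors in feature space:

         import numpy as np
         from sklearn.neighbors import NearestNeighbors

         def consistency(X, y_pred, k=5):
             # Mean agreement between each individual's prediction and the
             # predictions of its k nearest neighbors (1.0 = fully consistent).
             nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
             _, idx = nn.kneighbors(X)            # idx[:, 0] is the point itself
             neighbor_preds = y_pred[idx[:, 1:]]  # predictions of the k neighbors
             return float(np.mean(neighbor_preds == y_pred[:, None]))

     The choice of distance metric and of k is exactly the hard-to-justify part the slide warns about.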
  10. Point of Reference for Fairness
     • Groups. Matter of concern: are groups represented in a fair way? Assessment: is the proportion the same as in the general population? Metric: ratio. Protected attributes considered: none.
     • Features. Matter of concern: are all feature values represented equally? Assessment: are M and F selected in equal numbers? Metric: count. Protected attributes considered: yes.
     • Individual. Matter of concern: are similar people treated similarly? Assessment: am I more like the selected or the rejected people? Metric: similarity. Protected attributes considered: varies.
  11. Reliability
     • Ensure model performance is consistent during deployment.
     • Actively monitor for model drift or bias.
  12. Reliability as a Goal for Continuous Improvement
     • Requires an active review and revision process.
     • A human reviewer verifies predictions.
     • Labeled data enables model improvement.
     • Acquiring ground truth usually takes time.
     [Diagram: Train Model → Deploy & Score → Prediction Results → Reviewer → Labeled Data → back to Train Model]
  13. Reliability Monitoring
     [Diagram: Data is split into protected attributes (PA) and data without PA. The model is trained on training data without PA and tested on holdout data. Test results are compared by PA. If results are similar for all PA, the model is deployed and kept under review and monitoring; if not, the model returns to training. A sketch of the gating check follows.]
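     A minimal Python sketch of the gating step in this flow, assuming holdout predictions are joined back to the protected attribute (which was held out of training); column names and the tolerance are hypothetical:

         import pandas as pd

         def similar_across_groups(holdout, pa_col, y_true_col, y_pred_col, tol=0.05):
             # Deploy only if per-group accuracy on the holdout set stays
             # within `tol` of the overall accuracy.
             correct = holdout[y_true_col] == holdout[y_pred_col]
             per_group = correct.groupby(holdout[pa_col]).mean()
             return bool(((per_group - correct.mean()).abs() <= tol).all())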
  14. ML Model Life Cycle
     [Diagram: Azure ML Service architecture. Training data lives in an Azure Storage Blob; a Datastore feeds an Experiment that executes on a compute target (train and test, query and tune) and stores run records. A Docker image is created in Azure Container Registry and the model is registered in the Model Registry. Azure Machine Learning authenticates scoring requests over HTTP (request/response). Phases: Data Engineering and Model Development → Train → Deploy → Continuous Improvement.]
  15. True Positive Rate
     • The true positive rate (TPR) measures the proportion of actual positives that are correctly identified: TPR = TP / (TP + FN).
     • TPR should be measured for each group (i.e., demographic).
     • Differences in TPR across groups indicate possible bias.
     • Also known as sensitivity or recall.
  16. Suggested Metrics for Reviewer
     • Once ground truth is available, compare metrics by group (see the sketch below).
     • Equal opportunity: true positive rates are the same for all groups.
     • Equal odds: true positive and false positive rates are the same for all groups.
     • Useful single metric: mean and variance of TP or FP across groups.

     Classification example (TP, FP reported per cell):
     Gender      Male     Female
     Caucasian   TP, FP   TP, FP
     Colored     TP, FP   TP, FP
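     A minimal Python sketch of these checks, assuming binary (0/1) labels and predictions as NumPy arrays plus a parallel array of group labels:

         import numpy as np

         def rates_by_group(y_true, y_pred, groups):
             # Per-group true positive rate (TPR) and false positive rate (FPR).
             out = {}
             for g in np.unique(groups):
                 m = groups == g
                 t, p = y_true[m], y_pred[m]
                 tpr = p[t == 1].mean() if (t == 1).any() else float("nan")
                 fpr = p[t == 0].mean() if (t == 0).any() else float("nan")
                 out[g] = (tpr, fpr)
             return out

         # Equal opportunity: TPRs match across groups.
         # Equalized odds: both TPRs and FPRs match across groups.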
  17. Transparency
     • Provide explanations of the main elements of AI systems.
     • Be able to explain how decisions are made.
  18. Titanic Sank
     • 2:20 am, April 15, 1912.
     • "Deeply regret advise you Titanic sank this morning fifteenth after collision iceberg resulting serious loss life further particulars later" (Bruce Ismay, Director of the White Star Line)
  19. Titanic Dataset for Survival Prediction
     • Why did this female child not survive?
  20. Attributions to Survival Chances
  21. Individual Explanation
     • This female passenger did not survive. Why? (See the sketch below.)
     • The model's interpretation: she had five siblings, was in the lowest passenger class, and was age 10.
     • Understandable explanation: socioeconomic status is a driver of survival chances.
     • The model helps identify bias in survival chances.
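     A minimal sketch of how such an individual explanation could be produced with LIME (named on the Conclusions slide); X_train, X_test, clf, and the Titanic feature encoding are assumptions, not artifacts from the talk:

         from lime.lime_tabular import LimeTabularExplainer

         # X_train: NumPy array of encoded Titanic features; clf: a fitted classifier.
         explainer = LimeTabularExplainer(
             X_train,
             feature_names=["pclass", "sex", "age", "sibsp", "parch", "fare"],
             class_names=["died", "survived"],
             mode="classification",
         )

         # Explain one passenger, e.g. a 10-year-old girl in 3rd class with 5 siblings.
         exp = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=4)
         print(exp.as_list())  # per-feature contributions for/against survival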
  22. Conclusions
     • The definition of fairness depends on perspective.
     • A model cannot conform to more than a few fairness metrics at the same time.
     • Use AI to discover potential bias and to help with understandable explanations.
     • Human judgement is needed to ensure decision making is fair and reliable.
     • Early technical progress is underway, but much more is needed.
     • Data augmentation techniques help create synthetic data to enhance minority data representation (see the sketch below).
     • Model interpretation state of the art: Local Interpretable Model-Agnostic Explanations (LIME).
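     One common way to act on the data augmentation point, offered as an assumption since the talk does not name a specific tool, is SMOTE from the imbalanced-learn package, which synthesizes new minority-class examples:

         from collections import Counter
         from imblearn.over_sampling import SMOTE

         # X, y: training features and labels with an underrepresented minority class.
         print("before:", Counter(y))
         X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
         print("after: ", Counter(y_resampled))  # classes are now balanced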
  23. Future Work and Directions
     • Decision justification at the individual level.
     • Data augmentation techniques for various cases and data types.
     • A model training framework that is non-discriminatory and not subject to data imbalance.
  24. Thank You