
AI/ML Model Characterization for Performance, Interpretability, Fairness and Reliability

finid
June 28, 2019


Artificial intelligence (AI) has been transformed by machine learning (ML) methodologies, thanks to ML's scalability in the cloud and the vast choice of open-source and commercial off-the-shelf (COTS) solutions. However, the complexity of ML puts AI at the center of discussions about decision fairness, transparency, ethics, and reliability. As these concerns affect the feasibility and adoption of AI in enterprises, there is a need for generalized guidelines or approaches to evaluate ML model performance in the context of these concerns.

This talk will provide a framework that can help determine performance, fairness, and transparency through ML model characterization. Methods and approaches to model characterization will be proposed. After this talk, audiences will have a better understanding of: 1. how to evaluate and understand the limits of ML models built by a data science team; 2. functional knowledge of the relevant key performance indicators (KPIs) that help decision makers in model validation, adoption, or deployment; 3. for ML practitioners and engineers, common field practices for dealing with class imbalance or sampling bias in data.


Transcript

  1. Big Data & AI Conference, Dallas, Texas, June 27–29, 2019. www.BigDataAIconference.com
  2. Disclaimer: Products or services mentioned are for demonstration only, not for endorsement or comparison. Friday, June 28, 2019. KC Tung, PhD. Big Data and AI Conference, Dallas, TX.
  3. Definitions
     • Protected attributes: race, color, national origin, religion, gender (including pregnancy), disability, age, and citizenship status.
     • Bias: a tendency to skew results or lean toward a predetermined ideological or sociological idea based on personal beliefs and experiences.
  4. The Ethics of AI: Fairness, Reliability, Transparency.
  5. Fairness
     • Ensure the model performs consistently for all groups.
     • Difficulty:
       • We expect the model to be fair, yet training data is seldom fair.
       • The definition of fairness varies, depending on your perspective.
  6. Fairness in Model Performance
     • Error rates are not similar for all groups.
     • The majority of training data are Caucasian male.
     • Suggested Remedy 1: Test each group separately (see the sketch below).
     • Suggested Remedy 2: Enhance minority data representation in training.
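     A minimal Python sketch of Remedy 1 (testing each group separately), assuming test-set predictions and ground-truth labels are collected in a pandas DataFrame; the column names here are hypothetical:

         import pandas as pd

         def error_rates_by_group(df, group_col, y_true_col, y_pred_col):
             # Compute a classifier's error rate separately for each group.
             errors = df[y_true_col] != df[y_pred_col]
             return errors.groupby(df[group_col]).mean()

         # Hypothetical usage: one row per test example.
         # results = pd.DataFrame({"group": [...], "label": [...], "prediction": [...]})
         # print(error_rates_by_group(results, "group", "label", "prediction"))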
  7. Fairness by Group Selection Rate
     • Also known as the 4/5ths rule, 80% rule, Adverse Impact Analysis, or Demographic Parity.
     • The highest selection rate serves as a benchmark (see the sketch below).
     • Is each other group's rate within 80% of the benchmark? If not, that group is said to be adversely impacted.
     • Pro: No need to collect and curate protected attributes.
     • Con: Random hiring to meet a quota.

     Group   Selected Pct (%)   Exceeds benchmark (32%)*   Impacted
     A       40                 Yes                        No
     B       35                 Yes                        No
     C       20                 No                         Yes
     * 32% is 4/5ths of the highest selection rate (Group A, 40%).
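     A minimal Python sketch of the 4/5ths check; the selection rates below reproduce the slide's example, where Group C falls below 32% (4/5 of Group A's 40%):

         def adverse_impact(selection_rates, threshold=0.8):
             # Flag groups whose selection rate falls below `threshold` times
             # the highest group's rate (the 4/5ths rule).
             benchmark = max(selection_rates.values())
             return {group: rate < threshold * benchmark
                     for group, rate in selection_rates.items()}

         print(adverse_impact({"A": 0.40, "B": 0.35, "C": 0.20}))
         # {'A': False, 'B': False, 'C': True}  -> Group C is adversely impacted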
  8. Fairness by Feature Selection Rate
     • Equity in outcome.
     • The same number of selections is made at each feature value (see the toy example below).
     • Pro: No feature bias.
     • Con: Favorable to smaller-group members.

     Gender   Selected   Not Selected   Selection Pct (%)
     M        10         90             10
     F        10         72             12
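     A toy Python sketch of equal-count selection; the applicant totals are taken from the table above (10 + 90 male, 10 + 72 female), showing how the same 10 selections yield a higher rate for the smaller group:

         def selection_rates(selected_per_value, totals):
             # Selection percentage per feature value when the same number
             # of selections is made for each value.
             return {v: 100.0 * selected_per_value / n for v, n in totals.items()}

         print(selection_rates(10, {"M": 100, "F": 82}))
         # {'M': 10.0, 'F': 12.19...}  -> the smaller group F is favored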
  9. Fairness by Similarity
     • Similar individuals should be treated in a similar way.
     • But it is not trivial to measure similarity between individuals (one possible consistency check is sketched below).
     • Pro: Relevant to a person.
     • Con: 'Distance' or clustering is abstract and not easy to justify.
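     One possible way to make this concrete, offered here as an assumption rather than a method from the talk, is a consistency score: compare each individual's prediction with those of its nearest neighbors in feature space:

         import numpy as np
         from sklearn.neighbors import NearestNeighbors

         def consistency(X, y_pred, k=5):
             # Mean agreement between each individual's prediction and the
             # predictions of its k nearest neighbors (1.0 = fully consistent).
             nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
             _, idx = nn.kneighbors(X)            # idx[:, 0] is the point itself
             neighbor_preds = y_pred[idx[:, 1:]]  # predictions of the k neighbors
             return float(np.mean(neighbor_preds == y_pred[:, None]))

     The choice of distance metric and of k is exactly the hard-to-justify part the slide warns about.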
  10. Point of Reference for Fairness
     • Groups. Matter of concern: are groups represented in a fair way? Assessment: is the proportion the same as in the general population? Metric: ratio. Protected attributes considered: none.
     • Features. Matter of concern: are all feature values represented equally? Assessment: are M and F selected in equal numbers? Metric: count. Protected attributes considered: yes.
     • Individual. Matter of concern: are similar people treated similarly? Assessment: am I more like the selected or the rejected people? Metric: similarity. Protected attributes considered: varies.
  11. Reliability
     • Ensure model performance is consistent during deployment.
     • Actively monitor for model drift or bias.
  12. Reliability as a Goal for Continuous Improvement
     • Requires an active review and revision process.
     • A human reviewer verifies predictions.
     • Labeled data enables model improvement.
     • Acquiring ground truth usually takes time.
     [Diagram: Train Model → Deploy & Score → Prediction Results → Reviewer → Labeled Data → back to Train Model]
  13. Reliability Monitoring
     [Diagram: Data is split into protected attributes (PA) and data without PA. The model is trained on training data without PA and tested on holdout data. Test results are compared by PA. If results are similar for all PA, the model is deployed and kept under review and monitoring; if not, the model returns to training. A sketch of the gating check follows.]
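     A minimal Python sketch of the gating step in this flow, assuming holdout predictions are joined back to the protected attribute (which was held out of training); column names and the tolerance are hypothetical:

         import pandas as pd

         def similar_across_groups(holdout, pa_col, y_true_col, y_pred_col, tol=0.05):
             # Deploy only if per-group accuracy on the holdout set stays
             # within `tol` of the overall accuracy.
             correct = holdout[y_true_col] == holdout[y_pred_col]
             per_group = correct.groupby(holdout[pa_col]).mean()
             return bool(((per_group - correct.mean()).abs() <= tol).all())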
  14. ML Model Life Cycle
     [Diagram: Azure ML Service architecture. Training data lives in an Azure Storage Blob; a Datastore feeds an Experiment that executes on a compute target (train and test, query and tune) and stores run records. A Docker image is created in Azure Container Registry and the model is registered in the Model Registry. Azure Machine Learning authenticates scoring requests over HTTP (request/response). Phases: Data Engineering and Model Development → Train → Deploy → Continuous Improvement.]
  15. True Positive Rate
     • The true positive rate (TPR) measures the proportion of actual positives that are correctly identified: TPR = TP / (TP + FN).
     • TPR should be measured for each group (i.e., demographic).
     • Differences in TPR across groups indicate possible bias.
     • Also known as sensitivity or recall.
  16. Suggested Metrics for Reviewer
     • Once ground truth is available, compare metrics by group (see the sketch below).
     • Equal opportunity: true positive rates are the same for all groups.
     • Equal odds: true positive and false positive rates are the same for all groups.
     • Useful single metric: mean and variance of TP or FP across groups.

     Classification example (TP, FP reported per cell):
     Gender      Male     Female
     Caucasian   TP, FP   TP, FP
     Colored     TP, FP   TP, FP
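     A minimal Python sketch of these checks, assuming binary (0/1) labels and predictions as NumPy arrays plus a parallel array of group labels:

         import numpy as np

         def rates_by_group(y_true, y_pred, groups):
             # Per-group true positive rate (TPR) and false positive rate (FPR).
             out = {}
             for g in np.unique(groups):
                 m = groups == g
                 t, p = y_true[m], y_pred[m]
                 tpr = p[t == 1].mean() if (t == 1).any() else float("nan")
                 fpr = p[t == 0].mean() if (t == 0).any() else float("nan")
                 out[g] = (tpr, fpr)
             return out

         # Equal opportunity: TPRs match across groups.
         # Equalized odds: both TPRs and FPRs match across groups.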
  17. Transparency
     • Provide explanations of the main elements of AI systems.
     • Be able to explain how decisions are made.
  18. Titanic Sank
     • 2:20 am, April 15, 1912.
     • "Deeply regret advise you Titanic sank this morning fifteenth after collision iceberg resulting serious loss life further particulars later" (Bruce Ismay, Director of the White Star Line)
  19. Titanic Dataset for Survival Prediction
     • Why did this female child not survive?
  20. Attributions to Survival Chances
  21. Individual Explanation
     • This female passenger did not survive. Why? (See the sketch below.)
     • The model's interpretation: she had five siblings, was in the lowest passenger class, and was age 10.
     • Understandable explanation: socioeconomic status is a driver of survival chances.
     • The model helps identify bias in survival chances.
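     A minimal sketch of how such an individual explanation could be produced with LIME (named on the Conclusions slide); X_train, X_test, clf, and the Titanic feature encoding are assumptions, not artifacts from the talk:

         from lime.lime_tabular import LimeTabularExplainer

         # X_train: NumPy array of encoded Titanic features; clf: a fitted classifier.
         explainer = LimeTabularExplainer(
             X_train,
             feature_names=["pclass", "sex", "age", "sibsp", "parch", "fare"],
             class_names=["died", "survived"],
             mode="classification",
         )

         # Explain one passenger, e.g. a 10-year-old girl in 3rd class with 5 siblings.
         exp = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=4)
         print(exp.as_list())  # per-feature contributions for/against survival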
  22. Conclusions
     • The definition of fairness depends on perspective.
     • A model cannot conform to more than a few fairness metrics at the same time.
     • Use AI to discover potential bias and to help with understandable explanations.
     • Human judgement is needed to ensure decision making is fair and reliable.
     • Early technical progress is underway, but much more is needed.
     • Data augmentation techniques help create synthetic data to enhance minority data representation (see the sketch below).
     • Model interpretation state of the art: Local Interpretable Model-Agnostic Explanations (LIME).
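     One common way to act on the data augmentation point, offered as an assumption since the talk does not name a specific tool, is SMOTE from the imbalanced-learn package, which synthesizes new minority-class examples:

         from collections import Counter
         from imblearn.over_sampling import SMOTE

         # X, y: training features and labels with an underrepresented minority class.
         print("before:", Counter(y))
         X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
         print("after: ", Counter(y_resampled))  # classes are now balanced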
  23. Future Work and Directions
     • Decision justification at the individual level.
     • Data augmentation techniques for various cases and data types.
     • A model training framework that is non-discriminatory and not subject to data imbalance.
  24. Thank You