
Fairness and Privacy in ML-Systems (INNOQ-Technology Day)


The public is beginning to recognize the effects of ML-based decision-making. This is not the only reason why it is important to consider non-functional characteristics such as fairness or data protection. How can we ensure that ML-based decisions are made "fairly" and without algorithmic bias? At the same time, the testing of ML-based software is an open field without established best practices. What can we do to meet these challenges? And what exactly makes ML testing in running systems so complicated?



December 10, 2020


  1. 09.12.2020

    Technology Day: Fairness and Privacy in ML Systems. Isabel Bär
  2. Isabel Bär Working student at INNOQ Deutschland GmbH

  3. Ethical AI on the Rise Are ML Models Inherently Neutral?

  4. On the Dangers of Stochastic Parrots: Can Language Models Be

    Too Big? Timnit Gebru, former AI ethics researcher at Google
  5. Gender Shades (2018) Racial and Gender Bias in Commercial Face

    Recognition (but would the ethical risk be minimized if these systems worked accurately?) [Buolamwini+18] http://gendershades.org/overview.html
  6. Diversity in Faces (IBM) From Fairness to Privacy Concerns [Hao+19]

    https://www.ibm.com/blogs/research/2019/01/diversity-in- faces/
  7. Real World Impact of Algorithmic Bias Face Scans at Airports

    („Biometric Exit“) Face Recognition in Law Enforcement [Center on Privacy & Technology] https://www.perpetuallineup.org
  8. Real World Impact of Algorithmic Bias Potential Harms http://gendershades.org/overview.html

  9. Deployment Gap https://algorithmia.com/state-of-ml

    87% of data science projects never get deployed [VentureBeat+19]
  10. Nature of ML-Systems

    • Decision logic obtained via a data-driven training process • Changes over time • Designed to give answers to questions whose answers are unknown • Multilateral dependencies between the three components https://ml-ops.org
  11. ML-Hierarchy of Needs https://ml-ops.org

  12. ML-Testing-Agenda [Zhang+20]

  13. Dimensions of ML-Testing

    How to test? (testing workflow: ML-Workflow) What to test? (ML-Properties) Why to test? (Ethical AI)
  14. The ML-Workflow

    Online testing is needed because offline testing… • is not representative of future data • cannot cover live performance • cannot test live data • is excluded from business metrics [Zhang+20]
  15. Model Governance • Post-Deployment: Ongoing evaluation of the system •

    Raise alert in case of deviation from metrics https://ml-ops.org
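    Such a governance check can be sketched minimally (this is an illustration, not from the talk; the function name and tolerance value are hypothetical): compare a live metric against its deployment-time baseline and alert on deviation.

    ```python
    # Hypothetical sketch of a post-deployment governance check:
    # raise an alert when a live metric drifts from its deployment baseline.

    def check_metric(name, live_value, baseline, tolerance=0.05):
        """Return an alert message if |live - baseline| exceeds the
        tolerance, otherwise None (no deviation worth flagging)."""
        deviation = abs(live_value - baseline)
        if deviation > tolerance:
            return f"ALERT: {name} deviates by {deviation:.3f} from baseline"
        return None

    # Example: accuracy drifted from 0.92 at deployment to 0.84 in production.
    print(check_metric("accuracy", live_value=0.84, baseline=0.92))
    ```

    In a real system the baseline and tolerance would come from the governance policy, and the alert would feed a monitoring pipeline rather than stdout.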
  16. Dimensions of ML-Testing

    How to test? (testing workflow: ML-Workflow) What to test? (ML-Properties) Why to test? (Ethical AI)
  17. ML-Properties

    Functional: • Overfitting and correctness Non-Functional: • Fairness • Privacy • Robustness • Interpretability
  18. Challenges of ML-Properties

    Related to knowledge: • Context- and domain-based definition of properties • Limited knowledge • Conflicts between different properties (fairness vs. accuracy) • Lack of metrics Related to methods: • Runtime monitoring • Setting thresholds [Horkoff+19]
  19. Fairness: Protected Attributes • Gender • Religion • Colour •

    Citizenship • Age • Genetic Information
  20. How to Define Fairness? Different Approaches • Fairness Through Unawareness

    (exclusion of protected attributes) • Individual Fairness (similar predictions on similar individuals) • Group Fairness (groups selected based on sensitive attributes have the same probability of output) [Zhang+20] • Counterfactual Fairness (output does not change when the protected attribute is flipped) [Kusner+17]
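    To make the group-fairness notion concrete, here is a minimal sketch (illustrative only, not from the talk or from [Zhang+20]): measure the gap in positive-prediction rates between groups defined by a protected attribute.

    ```python
    # Illustrative sketch: group fairness measured as the gap in
    # positive-prediction rates between protected groups.

    def positive_rate(predictions, groups, group):
        """Fraction of positive (1) predictions within one group."""
        selected = [p for p, g in zip(predictions, groups) if g == group]
        return sum(selected) / len(selected)

    def demographic_parity_gap(predictions, groups):
        """Largest difference in positive rates across all groups."""
        rates = {g: positive_rate(predictions, groups, g) for g in set(groups)}
        return max(rates.values()) - min(rates.values())

    # Toy example: group A is selected at rate 0.75, group B at 0.25.
    preds  = [1, 0, 1, 1, 0, 0, 1, 0]
    groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
    print(demographic_parity_gap(preds, groups))
    ```

    A gap of zero corresponds to perfect group fairness under this definition; how large a gap is acceptable is a policy question, not a technical one.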
  21. How to Define Fairness? Pitfalls • Metrics and criteria are

    not broadly applicable • No consideration of differences between user subsets (Simpson's paradox, Berkeley admissions) • No inclusion of explanatory factors • Lack of scalability [Tramer+17]
  22. Sources of Unfairness ML-Models learn what humans teach… • Data

    labels are biased • Limited features • Smaller sample size of minority groups • Features are proxies of excluded sensitive attributes [Zhang+20]
  23. Risks of Language Models Large data sets and inscrutable models

    (referring to the then-unpublished Gebru paper) • Large amounts of data needed to feed data-hungry language models: risk of sampling abusive language in the training data • Shifts in language might not be taken into account (dismissal of emerging cultural norms, e.g. Black-Lives-Matter or Covid vocabulary) • Possible failure to capture language and norms of countries with a smaller online presence (AI language might be shaped by leading countries) • Challenge of detecting and tracing embedded bias in large data sets [Hao+20]
  24. Practical Approaches: Fairness „Unwarranted Associations Framework“ Tramèr et al.

    introduced "FairTest" as a tool for checking whether ML-based decision-making violates fairness requirements • Is there an automated approach for fairness bug detection? The use case is based on the paper "FairTest: Discovering Unwarranted Associations in Data-Driven Applications" [Tramer+17]
  25. Introducing a New Kind of Bug Fairness Bug Defined "as

    any statistically significant association, in a semantically meaningful user subpopulation, between a protected attribute and an algorithmic output, where the association has no accompanying explanatory factor“ [Tramer+17]
  26. Goals 1. Discovery of Association Bugs With Limited Prior Knowledge

    • Google Photos [Guynn+15] • Prior anticipation of bugs might be difficult 2. Testing for Suspected Bugs • Staples' Pricing Scheme [Valentino-Devries+12] • Urgency for methods testing the impact of a suspected bug across different user subpopulations 3. Error Profiling of an ML-Model Over User Populations • Healthcare Predictions • ML-models can provide different levels of benefit for users (classification disparities) [Tramer+17]
  27. Methodology 1. Data Collection and Pre-Processing

    • Output O: output to be tested for associations • Protected attribute S: feature on which to look for associations with O • Contextual attribute X: features used to split the user population into subpopulations • Explanatory attribute E: user features on which differentiation is accepted [Tramer+17]
  28. Methodology 2. Integrating Explanatory Factors

    • Associations might result from utility requirements • Identification of confounders which might explain bugs • Explanatory factors are expressed through explanatory attributes E on which the statistical association between S and O can be conditioned [Tramer+17]
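    The conditioning idea can be sketched as follows (an illustration, not FairTest's actual implementation): measure the S-O association globally and again within each stratum of E; if the within-stratum gaps vanish, E explains the global association.

    ```python
    # Illustrative sketch: conditioning an S-O association on an
    # explanatory attribute E by testing within each stratum of E.
    from collections import defaultdict

    def stratify(records, key):
        """Group records by the value of the given attribute."""
        strata = defaultdict(list)
        for r in records:
            strata[r[key]].append(r)
        return strata

    def rate_gap(records):
        """Gap in positive-output rates between the values of S."""
        rates = {s: sum(r["O"] for r in rs) / len(rs)
                 for s, rs in stratify(records, "S").items()}
        return max(rates.values()) - min(rates.values())

    # Toy data: globally S appears associated with O, but within each
    # stratum of E the rates are equal, so E explains the association.
    data = [
        {"S": 0, "E": "x", "O": 1}, {"S": 0, "E": "x", "O": 1},
        {"S": 1, "E": "x", "O": 1}, {"S": 0, "E": "y", "O": 0},
        {"S": 1, "E": "y", "O": 0}, {"S": 1, "E": "y", "O": 0},
    ]
    print("global gap:", rate_gap(data))
    for e, stratum in stratify(data, "E").items():
        print(f"gap within E={e}:", rate_gap(stratum))
    ```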
  29. Methodology 3. Selecting Metrics • Pearson Correlation measures the strength

    of linear associations between O and S (Testing) • Regression: estimation of a label‘s association with S based on regression coefficient for each label (Discovery) [Tramer+17]
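    The first metric can be computed in plain Python (a sketch, not FairTest's internals): the Pearson correlation between the protected attribute S and the output O.

    ```python
    # Sketch of the Pearson correlation metric between a protected
    # attribute S and an output O (illustrative, not FairTest code).
    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Toy example: the output perfectly follows the protected attribute,
    # giving a maximally strong linear association.
    print(pearson([0, 0, 1, 1], [0, 0, 1, 1]))
    ```

    Values near zero indicate no linear association; values near ±1 indicate the output is (inversely) tied to the protected attribute.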
  30. Methodology 4. Statistical Testing Across Subpopulations

    • Testing for associations across the whole user population is not enough • Split the global population into subsets by assigning users based on the values of contextual features X, to maintain comparability • Association-guided tree construction for splitting user subsets into smaller subsets with increasingly strong associations (applying statistical techniques to maintain validity) 5. Adaptive Debugging • Running multiple sequential tests • Adding detected confounders (explanatory attributes) [Tramer+17]
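    Why global testing is not enough can be shown with a toy sketch (a simplification; FairTest's association-guided tree construction is far more elaborate): an association invisible in the whole population appears once users are split on a contextual attribute X, a Simpson's-paradox-style situation.

    ```python
    # Simplified sketch of testing across subpopulations: split on a
    # contextual attribute X and measure the S-O association per subset.

    def assoc(records):
        """Positive-output rate gap between the values of S."""
        rates = {}
        for s in {r["S"] for r in records}:
            sub = [r for r in records if r["S"] == s]
            rates[s] = sum(r["O"] for r in sub) / len(sub)
        return max(rates.values()) - min(rates.values())

    def split_and_test(records):
        """Association per subpopulation, split on contextual attribute X."""
        return {x: assoc([r for r in records if r["X"] == x])
                for x in {r["X"] for r in records}}

    # Toy data: no gap globally, but a maximal gap inside each subset
    # (the two subsets' biases point in opposite directions and cancel).
    data = [
        {"X": "urban", "S": 0, "O": 1}, {"X": "urban", "S": 1, "O": 0},
        {"X": "urban", "S": 0, "O": 1}, {"X": "urban", "S": 1, "O": 0},
        {"X": "rural", "S": 0, "O": 0}, {"X": "rural", "S": 1, "O": 1},
        {"X": "rural", "S": 0, "O": 0}, {"X": "rural", "S": 1, "O": 1},
    ]
    print("global:", assoc(data))
    print("per subset:", split_and_test(data))
    ```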
  31. The Framework ML-Workflow [Tramer+17]

  32. Association Bug Report [Tramer+17]

  33. AI and the Environment BERT (Google's language model) Language model

    with neural architecture search [Strubell+19] [Hao+20]
  34. Estimated CO2 Emissions from Language Model Training [Strubell+19]

  35. Evolution of Required Costs for Training Language Models [Strubell+19] [Hao+20]

  36. Approaches Prioritization of More Efficient Algorithms

    • Replace brute-force grid search for hyperparameter tuning with a Bayesian framework • Limited interoperability with popular deep learning frameworks Narrowing the Gap Between Industry and Academia • Shared access to centralized computation resources [Strubell+19]
  37. Ethical AI on the Rise Are ML Models Inherently Neutral?

  38. Sources

    [Breck+17] Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017, December). The ML test score: A rubric for ML production readiness and technical debt reduction. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1123-1132). IEEE.
    [Buolamwini+18] Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77-91).
    [Center on Privacy & Technology] Center on Privacy & Technology. Airport Face Scans, https://www.airportfacescans.com, accessed 2020-12-06.
    [Center on Privacy & Technology] Center on Privacy & Technology. The Perpetual Line Up, https://www.perpetuallineup.org, accessed 2020-12-06.
    [Damiani+20] Damiani, E., & Ardagna, C. A. (2020, January). Certified Machine-Learning Models. In International Conference on Current Trends in Theory and Practice of Informatics (pp. 3-15). Springer, Cham.
    [Hao+19] Hao, Karen, MIT Technology Review, "IBM's photo-scraping scandal shows what a weird bubble AI researchers live in", https://www.technologyreview.com/2019/03/15/136593/ibms-photo-scraping-scandal-shows-what-a-weird-bubble-ai-researchers-live-in, March 2019, accessed 2020-12-06.
    [Hao+20] Hao, Karen, MIT Technology Review, "We read the paper that forced Timnit Gebru out of Google", https://www.technologyreview.com/2020/12/04/1013294/google-ai-ethics-research-paper-forced-out-timnit-gebru, December 2020, accessed 2020-12-06.
  39. Sources

    [Horkoff+19] Horkoff, J. (2019, September). Non-functional requirements for machine learning: Challenges and new directions. In 2019 IEEE 27th International Requirements Engineering Conference (RE) (pp. 386-391). IEEE.
    [Guynn+15] Guynn, J., "Google Photos labeled black people 'gorillas'", http://www.usatoday.com/story/tech/2015/07/01/google-apologizes-after-photos-identify-black-people-as-gorillas/29567465/, July 2015, accessed 2020-12-06.
    [Kusner+17] Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In Advances in Neural Information Processing Systems (pp. 4066-4076).
    [Strubell+19] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
    [Tramer+17] Tramèr, F., Atlidakis, V., Geambasu, R., Hsu, D., Hubaux, J. P., Humbert, M., ... & Lin, H. (2017, April). FairTest: Discovering unwarranted associations in data-driven applications. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P) (pp. 401-416). IEEE.
    [Valentino-Devries+12] Valentino-Devries, J., Singer-Vine, J., & Soltani, A., "Websites vary prices, deals based on users' information", http://www.wsj.com/articles/SB10001424127887323777204578189391813881534, December 2012, accessed 2020-12-06.
    [VentureBeat+19] https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/, accessed 2020-12-06.
    [Zhang+20] Zhang, J. M., Harman, M., Ma, L., & Liu, Y. (2020). Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering.
  40. Thanks! Questions? Isabel Bär Twitter: @isabel_baer https://ml-ops.org

    innoQ Deutschland GmbH / innoQ Schweiz GmbH www.innoq.com