Fairness and Privacy in ML-Systems (INNOQ-Technology Day)

09.12.2020
Technology Day
Fairness and Privacy in ML Systems
Isabel Bär

Isabel Bär, working student at INNOQ Deutschland GmbH

Ethical AI on the Rise Are ML Models Inherently Neutral?

On the Dangers of Stochastic Parrots: Can Language Models Be
Too Big? Timnit Gebru Former AI Ethics Researcher at Google

Gender Shades (2018)
Racial and Gender Bias in Commercial Face Recognition
(But would the ethical risk be minimized if these systems worked accurately?)
[Buolamwini+18] http://gendershades.org/overview.html

Diversity in Faces (IBM)
From Fairness to Privacy Concerns [Hao+19]
https://www.ibm.com/blogs/research/2019/01/diversity-in-faces/

Real World Impact of Algorithmic Bias
• Face scans at airports ("Biometric Exit")
• Face recognition in law enforcement
[Center on Privacy&Technology] https://www.perpetuallineup.org

Real World Impact of Algorithmic Bias Potential Harms http://gendershades.org/overview.html

Deployment Gap
…87% of data science projects never get deployed [VentureBeat+19]
https://algorithmia.com/state-of-ml?utm_medium=website&utm_source=interactive-page&utm_campaign=IC-1912-2020-State-of-ML&_hsenc=p2ANqtz-_WbXKYLnpgf4zi4OZTNYmNgCRPIFFEqmW-Cqi2Px_T1K2wkIJvDt7KdCxB5vXAPmGirLi7ukZTykxeUh9vmHdn7dRF9g&_hsmi=81660946

Nature of ML Systems
• Decision logic obtained via data-driven training process
• Changes over time
• Designed to give answers to questions to which answers are unknown
• Multilateral dependencies between the three components
https://ml-ops.org

ML-Hierarchy of Needs https://ml-ops.org

ML-Testing-Agenda [Zhang+20]

Dimensions of ML-Testing
• How to test? Testing workflow
• Why to test? ML-Properties
[Diagram labels: ML-Workflow, ML-Properties, Ethical AI]

The ML-Workflow
Online testing is needed because offline testing…
• Is not representative of future data
• Cannot cover live performance
• Cannot test live data
• Is excluded from business metrics
ML-Workflow [Zhang+20]

Model Governance
• Post-deployment: ongoing evaluation of the system
• Raise an alert in case of deviation from metrics
https://ml-ops.org
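
The alerting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MetricCheck` class, the metric names, the baselines, and the tolerances are all hypothetical and do not come from ml-ops.org or any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    """Hypothetical description of one monitored metric."""
    name: str
    baseline: float   # value measured before deployment (assumed)
    tolerance: float  # largest absolute deviation accepted (assumed)


def find_deviations(checks, live_values):
    """Return one alert message per metric that left its tolerance band."""
    alerts = []
    for check in checks:
        live = live_values[check.name]
        deviation = abs(live - check.baseline)
        if deviation > check.tolerance:
            alerts.append(
                f"ALERT {check.name}: live={live:.3f}, "
                f"baseline={check.baseline:.3f}, deviation={deviation:.3f}"
            )
    return alerts


# Illustrative numbers only.
checks = [
    MetricCheck("accuracy", baseline=0.91, tolerance=0.03),
    MetricCheck("positive_rate_gap", baseline=0.02, tolerance=0.05),
]
live_values = {"accuracy": 0.85, "positive_rate_gap": 0.04}

alerts = find_deviations(checks, live_values)
for message in alerts:
    print(message)
```

In this toy setup only the accuracy check fires. A fairness metric such as a positive-rate gap can be monitored in the same way as a performance metric.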

ML-Properties
Functional
• Overfitting and correctness
Non-functional
• Fairness
• Privacy
• Robustness
• Interpretability

Challenges of ML-Properties
Related to knowledge
• Context- and domain-based definition of properties
• Limited knowledge
• Conflicts between different properties (fairness vs. accuracy)
• Lack of metrics
Related to methods
• Runtime monitoring
• Setting thresholds
[Horkoff+19]

Fairness: Protected Attributes
• Gender
• Religion
• Colour
• Citizenship
• Age
• Genetic information

How to Define Fairness? Different Approaches
• Fairness through unawareness (exclusion of protected attributes)
• Individual fairness (similar predictions for similar individuals)
• Group fairness (groups selected based on sensitive attributes have the same probability of output) [Zhang+20]
• Counterfactual fairness (output does not change when the protected attribute is turned into the opposite) [Kusner+17]
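
Two of these definitions can be made concrete with a small Python sketch. It rests on assumptions: `toy_model` is a deliberately biased, hypothetical decision rule, and the records are invented. The attribute-flip check is only a rough proxy for counterfactual fairness, since the definition in [Kusner+17] relies on a causal model rather than on swapping a single field.

```python
def demographic_parity_gap(predictions, groups):
    """Group fairness check: largest difference in positive-prediction
    rate between any two groups, plus the per-group rates."""
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)
    rates = {g: sum(p) / len(p) for g, p in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


def flip_test(model, record, attribute, alternative):
    """True if the prediction stays the same when only the protected
    attribute is replaced by an alternative value."""
    flipped = {**record, attribute: alternative}
    return model(record) == model(flipped)


def toy_model(record):
    # Hypothetical, deliberately unfair rule: the threshold depends on gender.
    threshold = 40 if record["gender"] == "m" else 60
    return int(record["income"] >= threshold)


# Invented example records.
records = [
    {"gender": "m", "income": 45},
    {"gender": "m", "income": 70},
    {"gender": "f", "income": 45},
    {"gender": "f", "income": 70},
]
predictions = [toy_model(r) for r in records]
gap, rates = demographic_parity_gap(predictions, [r["gender"] for r in records])
unchanged = flip_test(toy_model, records[0], "gender", "f")

print(rates, gap, unchanged)
```

A gap of zero would mean that both groups receive positive predictions at the same rate. The flip test returning False for the first record shows that the toy rule reacts directly to the protected attribute.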

How to Define Fairness? Pitfalls
• Metrics and criteria are not broadly applicable
• No consideration of differences between user subsets (Simpson's paradox, Berkeley admissions)
• No inclusion of explanatory factors
• Lack of scalability
[Tramer+17]
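
The Simpson's paradox pitfall can be reproduced with a handful of numbers. The figures below are invented for illustration and are not the actual Berkeley admissions data. They are chosen so that women have the higher admission rate in each department but the lower rate overall, because most women apply to the more selective department.

```python
# Hypothetical (applied, admitted) counts per department and gender.
applications = {
    "A": {"men": (80, 48), "women": (20, 14)},
    "B": {"men": (20, 4), "women": (80, 24)},
}


def rate(applied, admitted):
    """Admission rate for one group."""
    return admitted / applied


# Rates within each department.
per_department = {
    dept: {gender: rate(*counts) for gender, counts in groups.items()}
    for dept, groups in applications.items()
}

# Rates after pooling both departments.
overall = {}
for gender in ("men", "women"):
    applied = sum(applications[d][gender][0] for d in applications)
    admitted = sum(applications[d][gender][1] for d in applications)
    overall[gender] = rate(applied, admitted)

print(per_department)
print(overall)
```

A fairness metric computed only on the pooled population would point in the opposite direction from the same metric computed per department, which is why differences between user subsets need to be considered.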

Sources of Unfairness
ML models learn what humans teach…
• Data labels are biased
• Limited features
• Smaller sample size of minority groups
• Features are proxies of excluded sensitive attributes
[Zhang+20]

Risks of Language Models
Large data sets and inscrutable models (referring to the as yet unpublished Gebru paper)
• Large amounts of data are needed to feed data-hungry language models: risk of sampling abusive language in the training data
• Shifts in language might not be taken into account (disregard of emerging cultural norms, e.g. Black Lives Matter or Covid vocabulary)
• Possible failure to capture the language and norms of countries with a smaller online presence (AI language might be shaped by leading countries)
• Challenge of detecting and tracing embedded bias in large data sets
[Hao+20]

Practical Approaches: Fairness
Framework
• "Unwarranted Associations Framework"
• Tramer et al. introduced "FairTest" as a tool for checking whether ML-based decision-making violates fairness requirements
• Is there an automated approach for fairness bug detection?
The use case is based on the paper "FairTest: Discovering Unwarranted Associations in Data-Driven Applications" [Tramer+17]

Introducing a New Kind of Bug: Fairness Bug
Defined "as any statistically significant association, in a semantically meaningful user subpopulation, between a protected attribute and an algorithmic output, where the association has no accompanying explanatory factor" [Tramer+17]

Goals
1. Discovery of association bugs with limited prior knowledge
• Google Photos [Guynn+15]
• Prior anticipation of bugs might be difficult
2. Testing for suspected bugs
• Staples' pricing scheme [Valentino-Devries+12]
• Urgent need for methods that test the impact of a suspected bug across different user subpopulations
3. Error profiling of an ML model over user populations
• Healthcare predictions
• ML models can provide different levels of benefit for users (classification disparities)
[Tramer+17]

Methodology
1. Data Collection and Pre-Processing
• Output O: output to be tested for associations
• Protected attribute S: feature on which to look for associations with O
• Contextual attribute X: features used to split the user population into subpopulations
• Explanatory attribute E: user features on which differentiation is accepted
[Tramer+17]

Methodology
2. Integrating Explanatory Factors
• Associations might result from utility requirements
• Identification of confounders which might explain bugs
• Explanatory factors are expressed through explanatory attributes E on which the statistical association between S and O can be conditioned
[Tramer+17]
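
Conditioning on an explanatory attribute can be illustrated with a minimal Python sketch. This is not FairTest's API: the helper names, the group labels "a" and "b", the attribute values, and the data are all hypothetical, and a plain difference in positive-output rates stands in for the association metrics used in [Tramer+17].

```python
from collections import defaultdict


def make_rows(s, e, n, positives):
    """Build n hypothetical users; the first `positives` of them get O=1."""
    return [{"S": s, "E": e, "O": int(i < positives)} for i in range(n)]


def rate_gap(rows):
    """Positive-output rate of group S='a' minus that of group S='b'."""
    totals = defaultdict(lambda: [0, 0])  # S value -> [count, positives]
    for row in rows:
        totals[row["S"]][0] += 1
        totals[row["S"]][1] += row["O"]
    rates = {s: pos / n for s, (n, pos) in totals.items()}
    return rates["a"] - rates["b"]


def gap_conditioned_on(rows, attribute):
    """Compute the S-O gap separately within each value of `attribute`."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[attribute]].append(row)
    return {value: rate_gap(subset) for value, subset in strata.items()}


# Invented data: group 'a' is mostly senior, group 'b' mostly junior.
rows = (
    make_rows("a", "senior", 8, 6) + make_rows("a", "junior", 4, 1)
    + make_rows("b", "senior", 4, 3) + make_rows("b", "junior", 8, 2)
)

overall_gap = rate_gap(rows)
conditioned = gap_conditioned_on(rows, "E")
print(overall_gap, conditioned)
```

In this toy data the overall gap between the two groups disappears once the rows are split by E, so the association would count as explained rather than as a fairness bug.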

Methodology
3. Selecting Metrics
• Pearson correlation measures the strength of linear associations between O and S (Testing)
• Regression: estimation of a label's association with S based on the regression coefficient for each label (Discovery)
[Tramer+17]
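
As a reminder of what the Pearson metric computes, here is a plain-Python sketch. The data are invented: S is a binary protected attribute and O a numeric output such as an offered price. A real analysis would more likely call a statistics library, and FairTest's own implementation may differ.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: protected attribute S and algorithmic output O.
S = [0, 0, 0, 0, 1, 1, 1, 1]
O = [10, 12, 11, 13, 14, 16, 15, 17]

r = pearson(S, O)
print(round(r, 2))
```

A value close to 1 or -1 indicates a strong linear association between S and O, and a value near 0 indicates none. The coefficient says nothing about whether an explanatory factor accounts for the association.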

Methodology
4. Statistical Testing Across Subpopulations
• Testing for associations in whole user groups is not enough
• Form subsets of the global population by assigning users according to the values of contextual features X, to maintain comparability
• Association-guided tree construction for splitting user subsets into smaller subsets with increasingly strong associations (while applying statistical techniques to maintain validity)
5. Adaptive Debugging
• Running multiple sequential tests
• Adding detected confounders (explanatory attributes)
[Tramer+17]
[Tramer+17]
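
The reason for testing subpopulations can be shown with a much simplified sketch. It uses a single flat split on one contextual attribute instead of the association-guided tree from [Tramer+17], and it omits significance testing entirely. The data, the context values "urban" and "rural", and the `min_size` guard are assumptions made for illustration.

```python
from collections import defaultdict


def rate_gap(rows):
    """Positive-output rate for S=1 minus the rate for S=0.
    Each row is a tuple (S, X, O)."""
    counts = {0: [0, 0], 1: [0, 0]}  # S value -> [count, positives]
    for s, _, o in rows:
        counts[s][0] += 1
        counts[s][1] += o
    return counts[1][1] / counts[1][0] - counts[0][1] / counts[0][0]


def gaps_by_context(rows, min_size=4):
    """Split users by contextual attribute X and measure the S-O gap
    inside every subpopulation that has at least `min_size` users."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[1]].append(row)
    return {
        context: rate_gap(subset)
        for context, subset in subsets.items()
        if len(subset) >= min_size
    }


# Invented rows (S, X, O): no gap among urban users, a large one among rural users.
rows = (
    [(0, "urban", o) for o in (1, 1, 0, 0)]
    + [(1, "urban", o) for o in (1, 1, 0, 0)]
    + [(0, "rural", o) for o in (1, 1, 1, 0)]
    + [(1, "rural", o) for o in (0, 0, 0, 1)]
)

global_gap = rate_gap(rows)
context_gaps = gaps_by_context(rows)
print(global_gap, context_gaps)
```

The association in the rural subpopulation is twice as strong as the global one, and it is absent among urban users. A test on the whole population alone would understate the effect for the affected users.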

The Framework ML-Workflow [Tramer+17]

Association Bug Report [Tramer+17]

AI and the Environment
• BERT (Google's language model)
• Language model with neural architecture search
[Strubell+19] [Hao+20]

Estimated CO2 Emissions from Language Model Training [Strubell+19]

Evolution of Required Costs for Training Language Models [Strubell+19] [Hao+20]

Approaches
Prioritization of more efficient algorithms
• Replace brute-force grid search for hyperparameter tuning with a Bayesian framework
• Limited interoperability with popular deep learning frameworks
Narrowing the gap between industry and academia
• Shared access to centralized computation resources
[Strubell+19]

Ethical AI on the Rise Are ML Models Inherently Neutral?
NO

Sources
[Breck+17] Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017, December). The ML test score: A rubric for ML production readiness and technical debt reduction. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1123-1132). IEEE.
[Buolamwini+18] Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77-91).
[Center on Privacy & Technology] Center on Privacy & Technology. Airport Face Scans, https://www.airportfacescans.com, accessed 2020/12/6
[Center on Privacy & Technology] Center on Privacy & Technology. The Perpetual Line Up, https://www.perpetuallineup.org, accessed 2020/12/6
[Damiani+20] Damiani, E., & Ardagna, C. A. (2020, January). Certified Machine-Learning Models. In International Conference on Current Trends in Theory and Practice of Informatics (pp. 3-15). Springer, Cham.
[Hao+19] Hao, Karen, MIT Technology Review, "IBM's photo scraping scandal shows what a weird bubble AI researchers live in", https://www.technologyreview.com/2019/03/15/136593/ibms-photo-scraping-scandal-shows-what-a-weird-bubble-ai-researchers-live-in, March 2019, accessed 2020/12/6
[Hao+20] Hao, Karen, MIT Technology Review, "We read the paper that forced Timnit Gebru out of Google", https://www.technologyreview.com/2020/12/04/1013294/google-ai-ethics-research-paper-forced-out-timnit-gebru, December 2020, accessed 2020/12/6
[Horkoff+19] Horkoff, J. (2019, September). Non-functional requirements for machine learning: Challenges and new directions. In 2019 IEEE 27th International Requirements Engineering Conference (RE) (pp. 386-391). IEEE.
[Guynn+15] J. Guynn, "Google photos labeled black people 'gorillas'", http://www.usatoday.com/story/tech/2015/07/01/google-apologizes-after-photos-identify-black-people-as-gorillas/29567465/, July 2015, accessed 2020/12/6
[Kusner+17] Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In Advances in neural information processing systems (pp. 4066-4076).
[Strubell+19] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
[Tramer+17] Tramer, F., Atlidakis, V., Geambasu, R., Hsu, D., Hubaux, J. P., Humbert, M., ... & Lin, H. (2017, April). FairTest: Discovering unwarranted associations in data-driven applications. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P) (pp. 401-416). IEEE.
[Valentino-Devries+12] J. Valentino-Devries, J. Singer-Vine, and A. Soltani, "Websites vary prices, deals based on users' information", http://www.wsj.com/articles/SB10001424127887323777204578189391813881534, Dec 2012, accessed 2020/12/6
[VentureBeat+19] https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/, accessed 2020/12/6
[Zhang+20] Zhang, J. M., Harman, M., Ma, L., & Liu, Y. (2020). Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering.

Thank you! Questions?
Isabel Bär
Twitter: @isabel_baer
https://ml-ops.org
innoQ Deutschland GmbH
Krischerstr. 100, 40789 Monheim am Rhein, Germany, +49 2173 3366-0
Ohlauer Str. 43, 10999 Berlin, Germany, +49 2173 3366-0
Ludwigstr. 180E, 63067 Offenbach, Germany, +49 2173 3366-0
Kreuzstr. 16, 80331 München, Germany, +49 2173 3366-0
Hermannstrasse 13, 20095 Hamburg, Germany, +49 2173 3366-0
innoQ Schweiz GmbH
Gewerbestr. 11, CH-6330 Cham, Switzerland, +41 41 743 0116
www.innoq.com
