Slide 1

Slide 1 text

09.12.2020 Technology Day Fairness und Privacy in ML-Systemen Isabel Bär

Slide 2

Slide 2 text

Isabel Bär Working Student at INNOQ Deutschland GmbH

Slide 3

Slide 3 text

Ethical AI on the Rise Are ML Models Inherently Neutral?

Slide 4

Slide 4 text

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Timnit Gebru Former AI Ethics Researcher at Google

Slide 5

Slide 5 text

Gender Shades (2018) Racial and Gender Bias in Commercial Face Recognition (but would the ethical risk be minimized if these systems worked accurately?) [Buolamwini+18] http://gendershades.org/overview.html

Slide 6

Slide 6 text

Diversity in Faces (IBM) From Fairness to Privacy Concerns [Hao+19] https://www.ibm.com/blogs/research/2019/01/diversity-in-faces/

Slide 7

Slide 7 text

Real World Impact of Algorithmic Bias Face Scans at Airports („Biometric Exit“) Face Recognition in Law Enforcement [Center on Privacy & Technology] https://www.perpetuallineup.org

Slide 8

Slide 8 text

Real World Impact of Algorithmic Bias Potential Harms http://gendershades.org/overview.html

Slide 9

Slide 9 text

Deployment Gap https://algorithmia.com/state-of-ml …87% of data science projects never get deployed [VentureBeat+19]

Slide 10

Slide 10 text

Nature of ML-Systems • Decision logic obtained via data-driven training process • Changes over time • Designed to give answers to questions to which the answers are unknown • Multilateral dependencies between the three components https://ml-ops.org

Slide 11

Slide 11 text

ML-Hierarchy of Needs https://ml-ops.org

Slide 12

Slide 12 text

ML-Testing-Agenda [Zhang+20]

Slide 13

Slide 13 text

Dimensions of ML-Testing How to test? Testing workflow Why to test? ML-Properties ML-Workflow ML-Properties Ethical AI

Slide 14

Slide 14 text

The ML-Workflow Online Testing is needed because Offline Testing… Is not representative of future data Cannot cover live performance Cannot test live data Is excluded from business metrics ML-Workflow [Zhang+20]

Slide 15

Slide 15 text

Model Governance • Post-Deployment: Ongoing evaluation of the system • Raise alert in case of deviation from metrics https://ml-ops.org
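As a rough sketch of the alerting idea (the metric names, baseline values, and tolerance below are assumptions made for this example, not prescribed by https://ml-ops.org), live metrics could be compared against values recorded at deployment:

```python
def check_metric_drift(current: dict, baseline: dict, tolerance: float = 0.05) -> list:
    """Return the names of metrics whose live values deviate from the
    baseline recorded at deployment by more than the given tolerance."""
    alerts = []
    for name, base_value in baseline.items():
        if name in current and abs(current[name] - base_value) > tolerance:
            alerts.append(name)
    return alerts

# Hypothetical metrics collected from production traffic.
baseline = {"accuracy": 0.91, "positive_rate_female": 0.42}
current = {"accuracy": 0.84, "positive_rate_female": 0.41}
for metric in check_metric_drift(current, baseline):
    print(f"ALERT: {metric} deviates from its deployment baseline")
```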

Slide 16

Slide 16 text

Dimensions of ML-Testing How to test? Testing workflow Why to test? ML-Properties ML-Workflow ML-Properties Ethical AI

Slide 17

Slide 17 text

ML-Properties Functional • Overfitting and Correctness Non-Functional • Fairness • Privacy • Robustness • Interpretability

Slide 18

Slide 18 text

Challenges of ML-Properties Related to Knowledge • Context- and domain-based definition of properties • Limited knowledge • Conflicts between different properties (fairness vs. accuracy) • Lack of metrics Related to Methods • Runtime monitoring • Setting thresholds [Horkoff+19]

Slide 19

Slide 19 text

Fairness: Protected Attributes • Gender • Religion • Colour • Citizenship • Age • Genetic Information

Slide 20

Slide 20 text

How to Define Fairness? Different Approaches • Fairness Through Unawareness (exclusion of protected attributes) • Individual Fairness (similar predictions for similar individuals) • Group Fairness (groups selected based on sensitive attributes have the same probability of a favourable outcome) [Zhang+20] • Counterfactual Fairness (the output does not change when the protected attribute is flipped to its opposite value) [Kusner+17]
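As a rough illustration of the group-fairness notion, the sketch below computes a demographic-parity gap (the difference in positive-prediction rates between groups); the column names and toy data are assumptions made for the example, not taken from [Zhang+20] or [Kusner+17].

```python
import pandas as pd

def demographic_parity_gap(df: pd.DataFrame, protected_col: str, prediction_col: str) -> float:
    """Difference in positive-prediction rates between the groups defined
    by a protected attribute (a simple group-fairness measure)."""
    rates = df.groupby(protected_col)[prediction_col].mean()
    return float(rates.max() - rates.min())

# Hypothetical data with an invented protected attribute and model output.
df = pd.DataFrame({
    "gender":    ["f", "m", "f", "m", "f", "m"],
    "predicted": [1, 1, 0, 1, 0, 1],
})
print(demographic_parity_gap(df, "gender", "predicted"))  # 0.0 would mean equal rates
```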

Slide 21

Slide 21 text

How to Define Fairness? Pitfalls • Metrics and criteria are not broadly applicable • No consideration of differences between user subsets (Simpson's Paradox, Berkeley admissions) • No inclusion of explanatory factors • Lack of scalability [Tramer+17]

Slide 22

Slide 22 text

Sources of Unfairness ML-Models learn what humans teach… • Data labels are biased • Limited features • Smaller sample size of minority groups • Features are proxies of excluded sensitive attributes [Zhang+20]

Slide 23

Slide 23 text

Risks of Language Models Large data sets and inscrutable models (referring to the then-unpublished Gebru paper) • Large amounts of data needed to feed data-hungry language models: risk of sampling abusive language into the training data • Shifts in language might not be taken into account (dismissal of emerging cultural norms, e.g. Black Lives Matter or Covid vocabulary) • Possible failure to capture the language and norms of countries with a smaller online presence (AI language might be shaped by leading countries) • Challenge of detecting and tracing embedded bias in large data sets [Hao+20]

Slide 24

Slide 24 text

Practical Approaches: Fairness Framework • „Unwarranted Associations Framework“ • Tramer et al. introduced “FairTest“ as a tool for checking whether ML-based decision-making violates fairness requirements • Is there an automated approach to fairness bug detection? The use case is based on the paper “FairTest: Discovering Unwarranted Associations in Data-Driven Applications” [Tramer+17]

Slide 25

Slide 25 text

Introducing a New Kind of Bug Fairness Bug Defined “as any statistically significant association, in a semantically meaningful user subpopulation, between a protected attribute and an algorithmic output, where the association has no accompanying explanatory factor” [Tramer+17]

Slide 26

Slide 26 text

Goals 1. Discovery of Association Bugs With Limited Prior Knowledge • Google Photos [Guynn+15] • Prior anticipation of bugs might be difficult 2. Testing for Suspected Bugs • Staples’ Pricing Scheme [Valentino-Devries+12] • Urgency for methods testing the impact of a suspected bug across different user subpopulations 3. Error Profiling of an ML-Model Over User Populations • Healthcare Predictions • ML-models can provide different levels of benefit for users (classification disparities) [Tramer+17]

Slide 27

Slide 27 text

Methodology 1. Data Collection and Pre-Processing • Output O: output to be tested for associations • Protected attribute S: feature on which to look for associations with O • Contextual attribute X: features used to split the user population into subpopulations • Explanatory attribute E: user features on which differentiation is accepted [Tramer+17]
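To make the four attribute roles concrete, here is a small configuration sketch; the dataclass and all column names are hypothetical illustrations, not part of the FairTest API described in [Tramer+17].

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AssociationTestConfig:
    """Roles that dataset columns play in an unwarranted-association test."""
    output: str                     # O: algorithmic output under test
    protected: List[str]            # S: attributes checked for associations with O
    contextual: List[str]           # X: features used to form subpopulations
    explanatory: List[str] = field(default_factory=list)  # E: accepted differentiators

# Hypothetical pricing example in the spirit of the Staples case.
config = AssociationTestConfig(
    output="offered_price",
    protected=["income_level"],
    contextual=["state", "city"],
    explanatory=["distance_to_competitor"],
)
```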

Slide 28

Slide 28 text

Methodology 2. Integrating Explanatory Factors • Associations might result from utility requirements • Identification of confounders which might explain bugs • Explanatory factors are expressed through explanatory attributes E on which the statistical association between S and O can be conditioned [Tramer+17]

Slide 29

Slide 29 text

Methodology 3. Selecting Metrics • Pearson Correlation measures the strength of the linear association between O and S (Testing) • Regression: estimation of each label's association with S based on its regression coefficient (Discovery) [Tramer+17]
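A minimal sketch of the Pearson-correlation metric used as an association test between S and O; the arrays below are invented toy data, and scipy is used here only as one convenient way to obtain the correlation and its p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical encoded data: S is a binary protected attribute,
# O is the model output tested for an association with S.
s = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
o = np.array([0.2, 0.1, 0.8, 0.7, 0.3, 0.9, 0.6, 0.2, 0.7, 0.4])

# Strength of the linear association plus a significance estimate.
r, p_value = stats.pearsonr(s, o)
print(f"correlation={r:.2f}, p={p_value:.3f}")
```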

Slide 30

Slide 30 text

Methodology 4. Statistical Testing Across Subpopulations • Testing for associations in the whole user population is not enough • Subsets of the global population are formed by assigning users based on the values of the contextual features X, to maintain comparability • Association-guided tree construction splits user subsets into smaller subsets with increasingly strong associations (applying statistical techniques to maintain validity) 5. Adaptive Debugging • Running multiple sequential tests • Adding detected confounders (explanatory attributes) [Tramer+17]
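The following simplified sketch runs a separate association test per contextual subgroup; it deliberately omits FairTest's association-guided tree construction and the statistical corrections for multiple testing, and all column names and data are illustrative assumptions.

```python
import pandas as pd
from scipy import stats

def test_subpopulations(df, protected, output, contextual, alpha=0.01):
    """Test the association between a protected attribute and an output
    separately in each subpopulation defined by a contextual attribute."""
    findings = []
    for value, group in df.groupby(contextual):
        if group[protected].nunique() < 2:
            continue  # no contrast within this subset, nothing to test
        r, p = stats.pearsonr(group[protected], group[output])
        if p < alpha:
            findings.append((value, round(r, 2), p))
    return findings

# Hypothetical usage: look for associations within each location separately.
df = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "gender":   [0, 1, 0, 1, 0, 1, 0, 1],
    "decision": [0, 1, 0, 1, 1, 1, 0, 1],
})
print(test_subpopulations(df, "gender", "decision", "location"))
```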

Slide 31

Slide 31 text

The Framework ML-Workflow [Tramer+17]

Slide 32

Slide 32 text

Association Bug Report [Tramer+17]

Slide 33

Slide 33 text

AI and the Environment BERT (Google's language model) Language model with neural architecture search [Strubell+19] [Hao+20]

Slide 34

Slide 34 text

Estimated CO2 Emissions from Language Model Training [Strubell+19]

Slide 35

Slide 35 text

Evolution of Required Costs for Training Language Models [Strubell+19] [Hao+20]

Slide 36

Slide 36 text

Approaches Prioritization of More Efficient Algorithms • Replace brute-force grid search for hyperparameter tuning with a Bayesian framework (see the sketch below) • Limited interoperability with popular deep learning frameworks Narrowing the Gap Between Industry and Academia • Shared access to centralized computation resources [Strubell+19]
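As a hedged illustration of the first point, this sketch shows what swapping exhaustive grid search for a sequential, model-based ("Bayesian-style") search can look like; Optuna is used as one possible library, and the objective function and search space are placeholder assumptions rather than anything from [Strubell+19].

```python
import optuna

def objective(trial):
    """Placeholder objective: in practice this would train a model with the
    sampled hyperparameters and return a validation score."""
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    layers = trial.suggest_int("num_layers", 1, 6)
    # Dummy score standing in for validation accuracy.
    return -(lr - 0.01) ** 2 - 0.001 * layers

# The sampler proposes promising trials sequentially instead of
# evaluating every grid point exhaustively.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```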

Slide 37

Slide 37 text

Ethical AI on the Rise Are ML Models Inherently Neutral? NO

Slide 38

Slide 38 text

Sources
[Breck+17] Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017, December). The ML test score: A rubric for ML production readiness and technical debt reduction. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 1123-1132). IEEE.
[Buolamwini+18] Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77-91).
[Center on Privacy & Technology] Center on Privacy & Technology. Airport Face Scans, https://www.airportfacescans.com, accessed 06/12/2020.
[Center on Privacy & Technology] Center on Privacy & Technology. The Perpetual Line Up, https://www.perpetuallineup.org, accessed 06/12/2020.
[Damiani+20] Damiani, E., & Ardagna, C. A. (2020, January). Certified Machine-Learning Models. In International Conference on Current Trends in Theory and Practice of Informatics (pp. 3-15). Springer, Cham.
[Hao+19] Hao, Karen, MIT Technology Review, "IBM's photo-scraping scandal shows what a weird bubble AI researchers live in", https://www.technologyreview.com/2019/03/15/136593/ibms-photo-scraping-scandal-shows-what-a-weird-bubble-ai-researchers-live-in, March 2019, accessed 2020/12/6.
[Hao+20] Hao, Karen, MIT Technology Review, "We read the paper that forced Timnit Gebru out of Google", https://www.technologyreview.com/2020/12/04/1013294/google-ai-ethics-research-paper-forced-out-timnit-gebru, December 2020, accessed 2020/12/6.

Slide 39

Slide 39 text

Sources
[Horkoff+19] Horkoff, J. (2019, September). Non-functional requirements for machine learning: Challenges and new directions. In 2019 IEEE 27th International Requirements Engineering Conference (RE) (pp. 386-391). IEEE.
[Guynn+15] Guynn, J. "Google Photos labeled black people 'gorillas'", http://www.usatoday.com/story/tech/2015/07/01/google-apologizes-after-photos-identify-black-people-as-gorillas/29567465/, July 2015, accessed 2020/12/6.
[Kusner+17] Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In Advances in Neural Information Processing Systems (pp. 4066-4076).
[Strubell+19] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
[Tramer+17] Tramer, F., Atlidakis, V., Geambasu, R., Hsu, D., Hubaux, J. P., Humbert, M., ... & Lin, H. (2017, April). FairTest: Discovering unwarranted associations in data-driven applications. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P) (pp. 401-416). IEEE.
[Valentino-Devries+12] Valentino-Devries, J., Singer-Vine, J., & Soltani, A. "Websites vary prices, deals based on users' information", http://www.wsj.com/articles/SB10001424127887323777204578189391813881534, December 2012, accessed 2020/12/6.
[VentureBeat+19] https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/, accessed 2020/12/6.
[Zhang+20] Zhang, J. M., Harman, M., Ma, L., & Liu, Y. (2020). Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering.

Slide 40

Slide 40 text

Thank you! Questions? Isabel Bär Twitter: @isabel_baer https://ml-ops.org Krischerstr. 100 40789 Monheim am Rhein Germany +49 2173 3366-0 Ohlauer Str. 43 10999 Berlin Germany +49 2173 3366-0 Ludwigstr. 180E 63067 Offenbach Germany +49 2173 3366-0 Kreuzstr. 16 80331 München Germany +49 2173 3366-0 Hermannstrasse 13 20095 Hamburg Germany +49 2173 3366-0 Gewerbestr. 11 CH-6330 Cham Switzerland +41 41 743 0116 innoQ Deutschland GmbH innoQ Schweiz GmbH www.innoq.com