
Show Your Work - Using Data Science to Peek inside the Black Box

Delivered at DDD Perth 2021 with Jia Keatnuxsuo. I have provided the speaker notes as well as the slides, because you need them to follow this one.

It's an introduction to Data Science Explainer Models, Black Boxes and Bias in Machine Learning. There are a lot of recommended Learning Paths at the end.

Michelle Sandford

August 14, 2021

Transcript

  1. 1

  2. We acknowledge the traditional custodians of this land, the Whadjuk

    people of the Nyoongar nation. “We wish to acknowledge the traditional custodians of the land we are meeting on, the Whadjuk people. We wish to acknowledge and respect their continuing culture and the contribution they make to the life of this city and this region.” 3
  3. 101 @jkeatnuxsuo #DDDAzure @Codess_aus J: As machine learning becomes increasingly

    integral to decisions that affect health, safety, economic wellbeing, and other aspects of people's lives, it's important to be able to understand how models make predictions, and to be able to explain the rationale for machine learning based decisions. M: This is not an advanced session; it's for anyone interested in exploring how we use data science to make predictions, and it explores what questions we can ask to validate how fair the machine learning model was in making its decisions. J: We're going to explain the difference between global and local feature importance, we'll use an explainer to interpret a model, and we'll visualize model explanations. 4
  4. What…? @jkeatnuxsuo #DDDAzure @Codess_aus M: There are a couple of

    things I always loop back to in developer conference talks… Quantum physics and the meaning of life. And no need to worry, I’m going to talk about both today. Because it occurred to me, as I was re-writing this session for the 42nd time, that if there was ever a tool to uncover the meaning of life, Machine Learning would be it. Humanity could use Data Science to take a peek into the inexplicable mysteries of the universe and find meaning in the unknown algorithms that govern the world we live in. 5
  5. … Please wait @jkeatnuxsuo #DDDAzure @Codess_aus M: In fact, I’m

    not the first one to think of this. Douglas Adams, in his epic The Hitchhiker's Guide to the Galaxy, wrote about a super-computer designed to answer the ultimate question. Problem was, it told them it would take 7.5 million years to calculate the answer. They did not have scalable cloud computing at very affordable prices back in the day ;-) 6
  6. 42 Yes, but what is the question !!???? @jkeatnuxsuo #DDDAzure

    @Codess_aus M: When it finally spat out the answer, the programmers were pretty annoyed, because it made no sense without the context of the question. So they asked it for that. It said it could not explain why the answer was 42. It built another computer to calculate the question and said it would be 10 million years before it had it. 7
  7. 42 How many Mice will run the world? 6 x

    9? What is ten factorial divided by the number of seconds in a day? What is 101010 in binary? What is the 3rd primary pseudoperfect number? @jkeatnuxsuo #DDDAzure @Codess_aus M: The original machine learning model was a black box. It then built an explainer model to try and calculate the reasons why the answer was 42. In this case, the explainer model was probably a PFI Explainer Model – it built a mimic and then tried all the permutations of different questions to see which fit. 8
  8. Because it’s the right thing to do. Because it’s the

    logical thing to do. Because it's the fair thing to do. Because I want to do it. @jkeatnuxsuo #DDDAzure @Codess_aus M: This is where it gets super meta, because when I started writing this session I was thinking about how humans are the ultimate black box technology. We think they have reasons for doing what they do; if we ask them, they can even explain those reasons to you. But are they true? Are they correct? Do they follow a logic-based rule set and spit out a consistent response every time, or do they have a bit of chaos theory thrown in? Is there randomness in the pattern that makes them unpredictable? 9
  9. I need to eat more carbs @jkeatnuxsuo #DDDAzure @Codess_aus J:

    Humans have a mental model of their environment that is updated when something unexpected happens. This update is performed by finding an explanation for the unexpected event. For example, Michelle feels unexpectedly sick and asks, "Why do I feel so sick?". She learns that she gets sick every time she eats noodles from that place on the corner. She updates her mental model and decides that the noodle place caused the sickness and should therefore be avoided. M: But… those noodles are delicious… 10
  10. Do it exactly like this…. @jkeatnuxsuo #DDDAzure @Codess_aus J: When

    Black Box machine learning models are used in research, scientific findings remain completely hidden if the model only gives predictions without explanations. But as machine learning becomes increasingly important to decisions that affect so many parts of people's lives, it's important to be able to understand how models make predictions; and to be able to explain the rationale for machine learning based decisions. 11
  11. Here’s all the info, make it work @jkeatnuxsuo #DDDAzure @Codess_aus

    M: Machine learning is a paradigm shift from "normal programming" where all instructions must be explicitly given to the computer to "indirect programming" that takes place through providing data. 12
  12. @jkeatnuxsuo #DDDAzure @Codess_aus M: When I was in University, I

    did Philosophy – and we learnt about Schrödinger's Cat. Do you all know about Schrödinger's Cat? The idea is that if you have a closed box, with a cat inside, there is no way to know if the cat is dead or alive. Quantum physics tells us there is a third state – that the cat is both dead and alive until the box is opened, and then it actualises as one of the two states. But we aren't gonna get quantum here, I don't have the credit on my Azure subscription for that. 13
  13. 0 1 @jkeatnuxsuo #DDDAzure @Codess_aus M: So, let’s treat this

    as a binary classification: the cat is either dead or the cat is alive. And we could feed all the data we have into a machine learning model to try and figure out which. • Is the box airtight? • Is the cat anxious by nature? • Is anything in the box with the cat – poison, something sharp, etc.? • How long has the cat been in the box? • How big is the cat? • How old is the cat? • Is the cat sick? 14
  14. 0 1 0 1 @jkeatnuxsuo #DDDAzure @Codess_aus M: All of

    this data is what we call features. If we had a couple of hundred boxes, all with a cat in them, we would have a reasonable sample size to predict whether, when we open a box, the cat will be dead or alive. Now, of course, you are thinking that if we knew those details about the specific box we had in front of us, then we could make a better prediction ourselves. Dead, it would definitely be dead. But, despite our original plan involving procuring hundreds of cats and hundreds of boxes… J: I told you that wouldn’t work. M: The point is data science isn’t about small pieces of data, it’s about taking lots of data, lots and lots of data, and making useful predictions or judgements from it. The bigger the sample size, the better the data. Probably. 15
  15. 600 petaflops 6.8M CPU cores 50K GPUs 165K nodes @jkeatnuxsuo

    #DDDAzure @Codess_aus J: During COVID there was lots of data, and more than 600 petaflops of computing power to process it. And the fake news media liked to spread the rumour that the vaccines could not possibly have been tested to the same level as medicines were in the past. Truth is, they are tested in many more iterations and variations today than could ever have been possible in the past. In the past, you were limited by computer hardware. The COVID-19 High Performance Computing Consortium brings together the Federal government, industry, and academic leaders to provide access to the world’s most powerful high-performance computing resources in support of COVID-19 research. How could the vaccines possibly not be tested to the same level?! Moreover, Duke University tested designs for a multi-splitting device for ventilators using over 500,000 compute hours over one weekend. In the past it would have taken years to validate that amount of data for a trial. 16
  16. X Credit Score Salary @jkeatnuxsuo #DDDAzure @Codess_aus J: But…by default,

    machine learning models pick up biases from the training data, the same as humans do. This can turn your machine learning models into racists that discriminate against underrepresented groups. Interpretability is a useful debugging tool for detecting bias in machine learning models. It might happen that the machine learning model you have trained for automatic approval or rejection of credit applications discriminates against a minority that has been historically disenfranchised. Your main goal is to grant loans only to people who will eventually repay them. The incompleteness of the problem formulation in this case lies in the fact that you not only want to minimize loan defaults, but are also obliged not to discriminate on the basis of certain demographics. 17
  17. $ @jkeatnuxsuo #DDDAzure @Codess_aus M: To put it simply –

    you want to grant loans only to people who do not need loans, because they already have the money to pay back the loans i.e. Rich People. But you aren’t allowed to discriminate against people who don’t have that much money i.e. Poor People. 18
  18. Lies, more lies and Statistics @jkeatnuxsuo #DDDAzure @Codess_aus J: Data

    can be manipulated to support any conclusion. Such manipulation can sometimes happen unintentionally. As humans, we all have bias, and it's often difficult to consciously know when you are introducing bias in data. Guaranteeing fairness in AI and machine learning remains a complex sociotechnical challenge. Meaning that it cannot be addressed from either purely social or technical perspectives. M: But what do we mean by unfairness? Well…"Unfairness" encompasses negative impacts, or "harms", for a group of people, such as those defined in terms of race, gender, age, or disability status. 19
  19. Resilient Passionate Empowered Bubbly Brave @jkeatnuxsuo #DDDAzure @Codess_aus M: Here

    are a couple of examples of fairness-related harms: * Allocation, where one gender or ethnicity, for example, is favored over another. An example would be an experimental hiring tool developed by a large corporation to screen candidates. The tool systematically discriminated against one gender because the model was trained to prefer words associated with another, penalizing candidates whose resumes contained words linked to the disfavoured group. 20
  20. Place hands beneath sensor to dispense soap No soap for

    you! @jkeatnuxsuo #DDDAzure @Codess_aus J: Quality of service. If you train the data for one specific scenario but reality is much more complex, it leads to a poor performing service. A notorious example was an automated hand soap dispenser that could not sense people with dark skin. M: It’s really important that you don’t just use white men in your dataset. There are other people in the world: women and children are generally smaller, and there is a range of skin colours, accents and abilities. Although it’s super-easy to make the first test on your internal team – if you aren’t seeing a lot of diversity amongst them, the dataset will not be truly representative, and this might lead to bad results, sometimes even deaths. 21
  21. Elementary Lucifer Sherlock House The Mentalist Love Island Bones @jkeatnuxsuo

    #DDDAzure @Codess_aus M: So you see, bias in many datasets might come from unbalanced data and lack of retraining on those features. M: Machine learning models can only be debugged and audited when they can be interpreted. Even in low risk environments, such as movie recommendations, the ability to interpret is valuable in the research and development phase as well as after deployment. Later, when a model is used in a product, things can go wrong. An interpretation for an erroneous prediction helps to understand the cause of the error. It delivers a direction for how to fix the system. J: There are a lot of domains that would benefit from understandable models – health, finance, law – and there are even more that demand this interpretability. Being able to audit the model for these critical domains is very important. People need to believe computers are fair. Fairer than the humans that designed them. M: Understanding the most important features of a model gives us insights into its inner workings and gives directions for improving its performance and removing bias. Why don’t you show us how to create a model, Jia, and then we’ll talk about features after that? 22
  22. Demo: Create a model @jkeatnuxsuo #DDDAzure @Codess_aus In this demo,

    I’m creating a simple binary classification model using LogisticRegression on a student hiring dataset. Imagine your boss is telling you to create an ML system to screen student candidates based on historical hiring data, which includes their GitHub contribution score, marks, number of hackathons and number of volunteering activities. First, we need to install the required packages, including interpret-community and raiwidgets. Then we load the dataset using the pandas library, separate the features and labels, and split the data into train and test sets. It’s 70/30, using the train_test_split function from the sklearn library. You can see in the output that there are 4 features and 1 label column at the end: 1 is hired and 0 is not-hired. Next, we create a classification model which will decide whether a student candidate will be a good fit or not. We use LogisticRegression from sklearn and see how the model performs by accumulating different metrics such as accuracy, AUC, precision, recall and F-score. Honestly, this is not the best performing model, as you can see the accuracy is 60%. I'm still learning how to create a model and play around with hyperparameters, but we will push on, as we're focusing on understanding how the model makes its decisions. 23
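A minimal sketch of the kind of code this demo describes, assuming a hypothetical student-hiring.csv with columns github_score, marks, hackathons, volunteering and a hired label (the file and column names are assumptions, not the original notebook):

```python
# Illustrative sketch of the demo, not the original notebook code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, precision_recall_fscore_support

# Hypothetical dataset: one row per student candidate.
data = pd.read_csv('student-hiring.csv')
features = ['github_score', 'marks', 'hackathons', 'volunteering']
X, y = data[features].values, data['hired'].values  # 1 = hired, 0 = not-hired

# 70/30 train/test split, as described in the demo.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Simple binary classifier.
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)

# Evaluate with accuracy, AUC, precision, recall and F-score.
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
print('Precision:', precision, 'Recall:', recall, 'F-score:', fscore)
```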
  23. Born? Degree: Master of IT + Master of Commerce +

    Bachelor of Business Administration + International Business Management Favourite Book? Honours and Awards: Energy Hack - 1st Prize, Transport Hack 2019 - BEST YOUNG TEAM, WiTWA Tech [+] 20 Awards, NASA Space Apps Challenge 2018 - Global Nominee, The Winner of UWA Student Innovation Challenge Family? IQ? Hobbies: Triathlete Languages: English, Thai, French, Spanish Codes: Python, JavaScript, HTML, CSS Boards: ACS Branch Executive Committee Weight? Pronouns: She/Her Name: 0.98 Role: 0.67 Employer: 0.54 Name: Jiaranai Keatnuxsuo Role: Associate Cloud Solution Architect Employer: Microsoft @jkeatnuxsuo #DDDAzure @Codess_aus M: So Jia, we didn’t really introduce ourselves and here we are, well into our session. J: That’s true. M: We are about to talk about Features: Global Features, Local Features and Feature Importance. J: I see you’ve put a Data-Set for me up on the slide. M: Yes, and there are some entries that are missing. J: The ones in white would be the features normally quantified as most important in influencing prediction in the dataset. We call those the Global Features. M: Why don’t you tell us the data for those features? J: My name is Jia Keatnuxsuo, I am an Associate Cloud Solution Architect and I work at Microsoft. M: Why do you think those features had priority in the dataset? 24
  24. J: That’s what people want to know when they meet

    you, your name, what you do and who you do it for. People judge you first on those things before delving into other data. 24
  25. GitHub: codess-aus Twitter: codess_aus Dogs: Leo the Lion, Snickers Role:

    Developer Engagement PM for Data Science and Data Engineering Degrees: MSC Comp Sci + BA Philosophy Hons Employer: Microsoft Languages: English, French Pronouns: She/Her Age: GenX Weight: Over Mentors: She Codes, Muses Code JS, GPN, ABCN Codes: Python, JavaScript Speaker: 0.67 Boards: 0.54 Tag: 0.98 Speaker: Tedx Perth, DDDPerth Boards: ACS Chair, FutureNow Director Tag: Michelle @ Microsoft @jkeatnuxsuo #DDDAzure @Codess_aus J: What about you? In your case, those things are not the most important features. M: Exactly, I always said community advocate even when I was working as a Service Delivery Manager – because although SDM was the job title I had, it didn’t describe who I was. J: There is data missing from your dataset too. M: Yes, but it’s not significant for anything I want to do with my life, and if those features took on a higher weight, it would definitely reflect bias in the dataset. J: So, when different features have a different priority in a particular instance from what is globally perceived to be important, we call that Local Feature Importance. M: You got it. 25
  26. GitHub Portfolio Qualifications Recommendations Experience Awards Previous tenure @jkeatnuxsuo #DDDAzure

    @Codess_aus M: So - global feature importance quantifies the relative importance of each feature in the test dataset as a whole. It provides a general comparison of the extent to which each feature in the dataset influences prediction. J: For example, for a binary classification model like the one I used in the demo, which predicts whether a candidate will be hired, an explainer might use a sufficiently representative test dataset to produce global feature importance values. It could then show that the model was trained on features such as qualifications, experience, previous tenure, and skills to predict a label of 1 for candidates that are likely to be good hires, and 0 for candidates that have a significant risk of failure in role (and therefore shouldn't be approved). 26
  27. Awards GitHub Portfolio Recommendations Volunteering Previous Tenure Qualifications @jkeatnuxsuo #DDDAzure

    @Codess_aus M: When we hired Jia though, a completely different set of features weighed higher in that process. This was because Jia was hired under the Graduate Scheme rather than the Professional pathway. We didn’t expect her to be able to show a long tenure in previous roles, or have experience working at Microsoft Partners or a multitude of recommendations. In the Grad Hire route – the feature with the highest importance is her degree. She needs to be a Graduate, or she cannot qualify for this hiring route. Her Awards, Volunteering Experience and GitHub Portfolio might also be heavily weighted features. This is what we call Local Feature Importance. In the overall dataset these things do not have the highest weighting, but in a run looking at grad hires – these features are given more weight. Or I should say, more specifically in Jia’s case, these were the features that held more weight. 27
  28. @jkeatnuxsuo #DDDAzure @Codess_aus J: So - You can apply the

    interpretability classes and methods to understand the model’s global behavior or specific predictions. The former is called global explanation and the latter is called local explanation. M: The methods can also be categorized based on whether the method is model agnostic or model specific. Some methods target certain types of models. For example, SHAP’s tree explainer only applies to tree-based models. Other methods treat the model as a black box, such as the mimic explainer or SHAP’s kernel explainer. Why don’t you show them, Jia? 28
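As a rough illustration of that distinction (not part of the original demo), a model-specific explainer such as SHAP's TreeExplainer only accepts tree-based models, while the model-agnostic KernelExplainer treats any prediction function as a black box. The tree_model below is a hypothetical stand-in, and X_train, y_train, X_test and model are assumed to come from the earlier hiring-model sketch:

```python
# Illustration only: model-specific vs model-agnostic explainers in SHAP.
import shap
from sklearn.ensemble import RandomForestClassifier

# Model-specific: TreeExplainer exploits the internals of tree-based models.
tree_model = RandomForestClassifier().fit(X_train, y_train)
tree_explainer = shap.TreeExplainer(tree_model)
tree_shap_values = tree_explainer.shap_values(X_test)

# Model-agnostic: KernelExplainer only needs a prediction function and a
# background sample, so it also works for the LogisticRegression model above.
kernel_explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X_train, 50))
kernel_shap_values = kernel_explainer.shap_values(X_test[:5])
```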
  29. Demo: Create an Explainer & generate feature importance @jkeatnuxsuo #DDDAzure

    @Codess_aus It's easy to create an explainer for the model. We use the interpretability library we installed earlier. The explainer will calculate feature importance for us, which enables us to quantify the relative influence each feature has in predicting whether a student candidate will be a good fit for the role or not. We'll use a Tabular Explainer, which is a "black box" explainer, and the cool thing is that TabularExplainer looks at the type of the prediction model and decides the appropriate explainer for us. You can see the TabularExplainer calls LinearExplainer. Once we initiate an explainer, we want to explain the model by evaluating the overall feature importance, or what we also call the global feature importance. This means it looks at the whole dataset. We call the explain_global() method on the explainer we created, then use get_feature_importance_dict() to get the importance values. The feature importance is ranked, with the most important feature listed first. So we have an overall view with global importance, but what about explaining individual observations? This is where we need local feature importance. It measures the influence of each feature value for a specific individual prediction; in other words, it looks at each 29
  30. student candidate. To get local feature importance, we call the

    explain_local() method and specify the subset of cases we want to explain – the first candidate in this example. Then, use the get_ranked_local_names() and get_ranked_local_values() methods to get the feature names and importance values. Let's look at the output. Since this is a binary classification model, there are only two possible classes: hired and not-hired. Each feature's support for one class results in a correspondingly negative level of support for the other. For this candidate, let's say Jia, the overall support for class 0 (not-hired) is -0.95, and the support for class 1 (hired) is correspondingly 0.95; so support for class 1 is higher than for class 0, and Jia is a good fit for the role. The most important feature for a prediction of class 1 is the number of hackathons she participated in, followed by her GitHub score – these are quite different from the global importance. There could be multiple reasons why local importance for an individual prediction varies from global importance for the overall dataset; for example, Jia might have lower marks than average, but she is a good fit because the company cares about her hands-on experience with hackathons and contributions to GitHub projects. 29
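A hedged sketch of the explainer steps just described, using TabularExplainer from the interpret-community library and reusing model, X_train, X_test and features from the earlier sketch (the arguments are assumptions, not a copy of the demo notebook):

```python
# Sketch of the explainer demo; exact arguments are assumptions.
from interpret.ext.blackbox import TabularExplainer

# TabularExplainer inspects the model type and picks an underlying explainer
# for us (for LogisticRegression it reports using a LinearExplainer).
explainer = TabularExplainer(model, X_train,
                             features=features,
                             classes=['not-hired', 'hired'])

# Global feature importance: influence of each feature over the whole test set,
# ranked with the most important feature first.
global_explanation = explainer.explain_global(X_test)
print(global_explanation.get_feature_importance_dict())

# Local feature importance: influence of each feature on one specific
# prediction - here the first candidate in the test set.
local_explanation = explainer.explain_local(X_test[0:1])
print(local_explanation.get_ranked_local_names())
print(local_explanation.get_ranked_local_values())
```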
  31. OK, me too When that happens, I do this… @jkeatnuxsuo

    #DDDAzure @Codess_aus M: So earlier I was talking about the computer in The Hitchhiker's Guide to the Galaxy, and how it created a mimic model with a PFI explainer to interpret what the question was. This technique can be compared to a child learning to annoy their parents. The child starts with no preconception of what makes their parent angry or not, but they can test their parent by picking a random set of actions over the course of a week and noting how their parent responds to those actions. While a parent may exhibit a non-binary response to each of these, let’s pretend that the child’s actions are either bad or good (two classes). After the first week, the child has learned a bit about what bothers their parents and makes an educated guess as to what else would bother them. The next week, the child dials down the actions which were successful and takes the actions that weren’t successful a step further. The child repeats this each week, noting their parents’ responses and adjusting their understanding of what will bother their parents, until they know exactly what annoys them and what doesn’t. The augmented data is labeled by the black-box model and used to train a better 30
  32. substitute model. Just like the child, the substitute model gets

    a more precise understanding of where the black-box model’s decision boundary is. After a few iterations of this, the substitute model shares almost the exact same decision boundaries as the black-box model. It helps if we can visualize the results though, why don’t you show that Jia? 30
  33. Demo: Visualise the Explainer’s results @jkeatnuxsuo #DDDAzure @Codess_aus Reading the

    results from the cell can be difficult. There is also a simple way to visualise the explainer, using Responsible-AI-Widgets. This package provides a collection of model and data exploration and assessment user interfaces that enable better understanding of AI/ML systems. We call ExplanationDashboard on our previously defined global explainer and prediction model, and we will see a dashboard pop up in the cell, or we can follow the generated link on my compute instance, which is hosted on Azure, to see a fuller dashboard. Going to the third tab, “Aggregate Feature Importance”, will give you the interactive visualisation of global feature importance. Going to the last tab, “Individual Feature Importance and What-If”, will give you the visualisation of local feature importance, and you can click on an individual datapoint’s explanation. You can play around with these visualisations however you like. The dashboard gives you so much flexibility, which helps enhance your understanding of the model’s interpretability. 31
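The dashboard call described above might look roughly like this, reusing global_explanation, model, X_test and y_test from the earlier sketches (the arguments are assumptions based on the raiwidgets documentation, not the original notebook):

```python
# Sketch: visualise the global explanation with Responsible-AI-Widgets.
from raiwidgets import ExplanationDashboard

# Renders an interactive dashboard in the notebook cell; on an Azure ML
# compute instance it also prints a link to a fuller standalone view.
ExplanationDashboard(global_explanation, model, dataset=X_test, true_y=y_test)
```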
  34. Oopsie @jkeatnuxsuo #DDDAzure @Codess_aus M: It has become quite common

    these days to hear people refer to modern machine learning systems as “black boxes”. But in many cases they are not; it just seems like that because they are, like people… complicated. When we ask someone why they did something, we’re operating on a certain set of assumptions. We are typically assuming that they had some good reason for acting as they did, and we are basically asking for the reasoning process they used to make the decision. However, when asking about why something went wrong, we are instead asking for a kind of root cause analysis of the failure. For example, after a bike accident, we might want an explanation of what caused the accident. Was Jia distracted? Did another rider cause her to swerve? Was she drunk? Rather than a process of reasoning, we are asking, more or less, for the critical stimulus that caused a particular reaction outside of normal behaviour. With Machine Learning Models, you can choose to elevate certain features above others – Salary and existing Debt in a Loan Approval Tool, for example. Or certain features might rise to prominence through training. I’ve been training the Azure Percept Lego car to avoid cones, but I haven’t trained it to avoid Lego people – so it would definitely just mow them down… unless it mistook them for cones. So why did 32
  35. it mow down the people? What people? What are people?

    The purpose of this session was to show you that despite the fact that one, at least in part, created the other – we are not so different. Machine Learning Models can be interpreted and there are as many models and ways to gain insight as we might need. Possibly more than we have to tackle humanity as individual decision makers. J: And more than this session can cover today. 32
  36. If you would like to follow the same Learning Path,

    we studied to understand how to explain machine learning models with Azure Machine Learning and: • Interpret global and local feature importance. • Use an explainer to interpret a model. • Create model explanations in a training experiment. • Visualize model explanations. Then it’s here: https://docs.microsoft.com/en-us/learn/modules/explain-machine-learning-models-with-azure-machine-learning/1-introduction We also loved: Interpretable Machine Learning Book by Christoph Molnar: https://christophm.github.io/interpretable-ml-book/ @jkeatnuxsuo #DDDAzure @Codess_aus 34
  37. https://aka.ms/microsoft.source I have an Azure Cloud Voucher and a Quantum

    Cat Sticker – which I will not be giving out at the stand. Exclusive in-session benefits if you join the community and come up afterwards to show me and grab your reward. 43