Your model is bias, but so is your data. The case for ethics in data science.

Machine learning is increasingly used to make decisions for us as we rely more and more on applications and other technology in our daily lives. Yet what happens when the data we collect has bias? What does this do to our models? What can we do as technologists to challenge this? In this talk we review three case studies where ethical concerns are raised, wrapping up with some steps to help you begin to build an ethical data practice at your organization.

Lorena Mesa

February 10, 2018

Transcript

1. Your model is bias, but so is your data. The case for ethics in data science. Lorena Mesa, PyCon Colombia 2018, @loooorenanicole, http://bit.ly/2EfbmsY
2. “Were it not for the Internet, Barack Obama would not be president. Were it not for the Internet, Barack Obama would not have been the nominee,” said Arianna Huffington, editor in chief of The Huffington Post. https://bits.blogs.nytimes.com/2008/11/07/how-obamas-internet-campaign-changed-politics/
   How?
   ▪ Use of social media (e.g. YouTube)
   ▪ GOTV drives informed by data science
   ▪ Customized “_______ for Obama” interest groups
3. How I’ll approach today’s talk
   What is data science? How does data science impact our lives?
   ▪ Case Study #1: Cyberbullying in social media
   ▪ Case Study #2: Reporting sexual harassment on TripAdvisor
   ▪ Case Study #3: Racial discrimination in Airbnb
   Starting your own data ethics practices
4. Your model is bias, but so is your data. The case for ethics in data science machine learning.
5. Machine Learning is a subfield of computer science [that] stud[ies] pattern recognition and computational learning [in] artificial intelligence. [It] explores the construction and study of algorithms that can learn from and make predictions on data.
6. Mitchell’s Definition of Machine Learning
   A computer program is said to learn from experience (E) with respect to some task (T) and some performance measure (P), if its performance on T, as measured by P, improves with experience E. Ch. 1, Machine Learning, Tom Mitchell
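One way to make the E/T/P framing concrete is a minimal sketch (assuming scikit-learn and synthetic data; not from the talk): performance P on task T should improve as experience E, the number of labeled examples, grows.

    # Minimal sketch of Mitchell's E/T/P framing, assuming scikit-learn.
    # T = a binary classification task, E = labeled training examples,
    # P = cross-validated accuracy, which should improve as E grows.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    train_sizes, _, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=[0.1, 0.3, 0.6, 1.0], cv=5, scoring="accuracy")

    for n, scores in zip(train_sizes, test_scores):
        print(f"experience E = {n:4d} examples -> performance P = {scores.mean():.3f}")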
  7. "We have been in touch with Ms. McGowan's team," Twitter

    said in a tweet on Thursday. "We want to explain that her account was temporarily locked because one of her Tweets included a private phone number, which violates of our Terms of Service." Source: CNN
8. “[Cyberbullying is] . . . the use of information and communication technologies to support deliberate, repeated, and hostile behaviour by an individual or group, that is intended to harm others.” Belsey, B. Cyberbullying.ca. http://www.cyberbullying.ca
9. What is the question we want to answer?
   Identifying the offender
   ▪ Is the responding party a cyberbully (e.g. a “Troll”)?
   Identifying instances of cyberbullying
   ▪ What is the likelihood a conversation is cyberbullying?
   ▪ Is the conversation aggressive (e.g. “flaming”)? Is it not?
   ▪ At what level is a conversation deemed cyberbullying?
10. How has cyberbullying been deterred in the past?
    Historically done via:
    ▪ Content moderation by the product owner
    ▪ Content moderation via user feedback (e.g. user reports)
    Typically these approaches require moderators to manually review comments. This is gravely inefficient and doesn’t scale well due to the need for human input.
11. Text Feature Extraction and Optimization
    Text normalization (reduce # of features)
    ▪ Removal of special characters
    ▪ Punctuation and stop word removal
    ▪ Stemming vs. lemmatization (see the sketch below)
    Feature optimization
    ▪ Choice of text normalization
    ▪ N-grams (use of one, two, or three words as a feature)
    ▪ Term frequency / document frequency tuning
    ▪ Limiting maximum features
    ▪ Dimensionality reduction (SVD)
    Tweaking
    ▪ Stemming, stop word removal
    ▪ N-grams: unigram and bigram
    ▪ Frequency tuning: helped
    ▪ Limiting max features: helped
    ▪ SVD: did not help
    Bengfort et al., unpublished work
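As a hedged illustration of the stemming vs. lemmatization choice above (assuming NLTK; this is not the authors' code):

    # Minimal sketch of two text normalization choices, assuming NLTK.
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)   # the lemmatizer needs the WordNet corpus
    nltk.download("omw-1.4", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["bullying", "studies", "better"]:
        print(word,
              "-> stem:", stemmer.stem(word),
              "| lemma:", lemmatizer.lemmatize(word, pos="v"))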
12. How that may look in Python with scikit-learn:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    model = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english',
                                 max_df=0.20,
                                 ngram_range=(1, 2),
                                 max_features=50000)),
        ('clf', MultinomialNB()),
    ])

    http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    TfidfVectorizer:
    ▪ ngram_range controls the size of your n-grams
    ▪ max_df ignores terms with more than this document frequency
    ▪ stop_words removes words with no intrinsic value (e.g. “the”)
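A hedged usage sketch of the pipeline above, with hypothetical toy posts standing in for a real labeled corpus:

    # Hypothetical toy posts (1 = cyberbullying, 0 = innocent); real training data
    # would be a much larger labeled corpus. Reuses the `model` pipeline above.
    train_texts = [
        "nobody likes you loser",
        "go away you worthless idiot",
        "everyone hates your stupid face",
        "great seeing you at the meetup",
        "congrats on the new job",
        "thanks for sharing the slides",
    ]
    train_labels = [1, 1, 1, 0, 0, 0]

    model.fit(train_texts, train_labels)                         # learn tf-idf vocabulary + NB statistics
    print(model.predict(["you are a worthless loser"]))          # 0/1 label for an unseen post
    print(model.predict_proba(["you are a worthless loser"]))    # per-class probabilities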
13. No one human or computer can sift through all social media and online communication. When that happens, what do we risk?
14. Interpreting results
    What is the cost of failing to identify cyberbullying? Do we care more about False Positives or False Negatives?
    FP: an innocent post labeled as cyberbullying
    FN: a cyberbullying post mislabeled as innocent
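A minimal sketch of inspecting those error types, assuming scikit-learn; the labels below are hypothetical:

    from sklearn.metrics import confusion_matrix, classification_report

    # Hypothetical ground truth vs. model output (1 = cyberbullying, 0 = innocent).
    y_true = [1, 1, 0, 0, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"false positives (innocent flagged): {fp}")
    print(f"false negatives (bullying missed):  {fn}")

    # Precision/recall make the FP-vs-FN trade-off explicit per class.
    print(classification_report(y_true, y_pred, target_names=["innocent", "cyberbullying"]))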
15. Classifiers
    ▪ NB (Naive Bayes): Probabilistic. Avoids overfitting (making assumptions beforehand about the likely distribution of the answer), but the independence assumption is a simplistic model of the world.
    ▪ LR (Logistic Regression): Models the relationship between variables, iteratively refined using a measure of error in the predictions made by the model (regularization, penalty term). Gives linear class boundaries.
    ▪ DT (Decision Tree): Graphical model of rules that partitions the data until a decision is reached at one of the leaf nodes. Complexity is related to the amount of data and the partitioning method. Prone to overfitting; minor variations in data cause big changes in tree structure; highly biased to the training set (Random Forest to your rescue).
    ▪ RF (Random Forest): Constructs a forest of decision trees. At each step of an iteration (classification process), it picks a random subset of features to try, eventually settling on the subsets of features that perform best in a tree classifier.
    ▪ MLP (Multi-Layer Perceptron): Can learn a non-linear function approximator. Between the input and output layers there can be one or more non-linear (hidden) layers. Requires tuning a number of hyperparameters; sensitive to feature scaling.
    ▪ SVM (Support Vector Machine): Attempts to maximize the distance between classes; works in high-dimensional space. Uses a kernel to transpose data into a higher-dimensional space. Linear kernels are commonly used for text classification due to the large number of features involved.
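A hedged comparison sketch (not the talk's experiment) that cross-validates several of these classifiers on scikit-learn's built-in digits dataset as a stand-in for a vectorized corpus:

    # Hedged sketch: same data, several classifiers, cross-validated accuracy.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)  # non-negative features, so MultinomialNB is valid

    models = {
        "NB":  MultinomialNB(),
        "LR":  LogisticRegression(max_iter=5000),
        "DT":  DecisionTreeClassifier(random_state=0),
        "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
        "MLP": MLPClassifier(max_iter=2000, random_state=0),
        "SVM": LinearSVC(),  # linear kernel, as is common for text
    }

    for name, clf in models.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")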
16. More than 1,000 new words, senses, and subentries have been added to the Oxford English Dictionary in our latest update, including worstest, fungivorous, and corporation pop. Oxford English Dictionary September 2017 Update, http://public.oed.com/the-oed-today/recent-updates-to-the-oed/
17. In their place was a message from TripAdvisor that cited various reasons for the deletions: They were “determined to be inappropriate by the TripAdvisor community,” or removed by staff because they were “off-topic” or contained language or subject matter that was not “family friendly.” The Milwaukee Journal Sentinel asked TripAdvisor to see the posts that were removed. The company refused. https://www.jsonline.com/story/news/investigations/2017/11/01/tripadvisor-removed-warnings-rapes-and-injuries-mexico-resorts-tourists-say/817172001/
18. “Our new email communications will clearly articulate the phrase or sentences that are in violation of our policy, inviting the reviewer to make edits and resubmit their review,” TripAdvisor reports. “These badges will remain on TripAdvisor for up to three months. However, if the issues persist we may extend the duration of the badge,” he said. “These badges are intended to be informative, not punitive.” https://www.nytimes.com/2017/11/08/travel/tripadvisor-sex-assault-discrimination-warnings.html
19. Quirtina Crittenden, unable to book a room on Airbnb, changed her name to “Tina” and her photo to a cityscape, which permitted her to bypass those difficulties.
    Edelman et al.: “It is not clear a priori how online markets will affect discrimination. To the extent that online markets can be more anonymous than in-person transactions, there may actually be less room for discrimination.“
20. 16%: “In an experiment on Airbnb, we find that applications from guests with distinctively African-American names are 16% less likely to be accepted relative to identical guests with distinctly White names.” Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment, Edelman et al.
21. “Algorithms do not automatically eliminate bias. Suppose a university, with admission and rejection records dating back for decades and faced with growing numbers of applicants, decides to use a machine learning algorithm that, using the historical records, identifies candidates who are more likely to be admitted. Historical biases in the training data will be learned by the algorithm, and past discrimination will lead to future discrimination.” Cynthia Dwork, Algorithms and Bias (2015), http://nyti.ms/1Qyfqre
22. “The math-powered applications powering the data economy were based on choices made by fallible human beings. Some of these choices were no doubt made with the best intentions. Nevertheless, many of these models encoded human prejudice, misunderstanding, and bias into the software systems that increasingly managed our lives. Like gods, these mathematical models were opaque, their workings invisible to all but the highest priests in their domain: mathematicians and computer scientists. Their verdicts, even when wrong or harmful, were beyond dispute or appeal. And they tended to punish the poor and the oppressed in our society, while making the rich richer.” Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
23. Questions to consider in development
    ▪ Collection: What is being collected and/or created?
    ▪ Standards and methodologies: How is data collected?
    ▪ Ethics: Do we have restrictions?
    ▪ Data sharing and access: How is it shared? With whom?
    ▪ Long term maintenance: Now that we have it, what next?
24. Starting a data management policy at your organization
    ▪ Defining a data ethics curriculum for your organization
    ▪ Deleting data: How can we build mechanisms into our software to delete data? (See the sketch below.)
    ▪ Education about repurposing data
    ▪ Requiring consent to collect and use data in one way
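One entirely hypothetical reading of the “deleting data” point is a scheduled retention job; the database, table, and column names below are illustrative, not from the talk:

    # Hypothetical sketch: purge rows older than a retention window.
    # The database, table, and column names are illustrative only.
    import sqlite3
    from datetime import datetime, timedelta

    RETENTION_DAYS = 90  # assumed policy; a real value comes from your governance process

    def purge_expired_events(db_path="app.db"):
        cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
        with sqlite3.connect(db_path) as conn:
            deleted = conn.execute(
                "DELETE FROM user_events WHERE created_at < ?",
                (cutoff.isoformat(),),
            ).rowcount
        return deleted  # number of rows removed, useful for an audit log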
25. 17% of companies have a data map and use it to track the flow of data between systems, according to a 2015 Nymity Privacy Management Program Benchmarking and Accountability Report.
26. “Not many of us like thinking about death — especially our own. But making plans for what happens after you’re gone is really important for the people you leave behind. So today, we’re launching a new feature that makes it easy to tell Google what you want done with your digital assets when you die or can no longer use your account.” Google Data Liberation Project, http://dataliberation.blogspot.com/
27. Continue the conversation by learning more
    ▪ [TALK] Liz Rush, Write/Speak/Code 2016, “Challenging & Democratizing Algorithm Development”
    ▪ [ONLINE TRAINING] O’Reilly, “Data Ethics: Designing for Fairness in the Age of Algorithms”
    ▪ [BOOK] Cathy O’Neil, “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy”
    ▪ [PAPER] Dr. Shannon Vallor, Santa Clara University, “Introduction to Software Ethics”
    ▪ [TALK] Marius Watz, Papers We Love 2016, “Abuse of an Algorithm Comes as No Surprise”
28. Bloomberg, BrightHive, and Data for Democracy are championing the “Community Principles on Ethical Data Sharing” (CPEDS) to develop guidelines for data collection and sharing. You can join the effort on Slack or GitHub.