Your model is bias, but so is your data. The case for ethics in data science.

Machine learning is increasingly used to make decisions for us as we rely more and more on applications and other technology in our daily lives. Yet what happens when the data we collect has bias? What does this do to our models? What can we do as technologists to challenge this? In this talk we review three case studies where ethical concerns are raised, wrapping up with some steps to help you begin to build an ethical data practice at your organization.

Lorena Mesa

February 10, 2018

Transcript

1. Your model is bias, but so is your data. The case for ethics in data science. Lorena Mesa, PyCon Colombia 2018, @loooorenanicole, http://bit.ly/2EfbmsY
2. “Were it not for the Internet, Barack Obama would not be president. Were it not for the Internet, Barack Obama would not have been the nominee,” said Arianna Huffington, editor in chief of The Huffington Post. https://bits.blogs.nytimes.com/2008/11/07/how-obamas-internet-campaign-changed-politics/
   How?
   ▪ Use of social media (e.g. YouTube)
   ▪ GOTV drives informed by data science
   ▪ Customized “_______ for Obama” interest groups
3. How I’ll approach today’s talk
   What is data science? How does data science impact our lives?
   ▪ Case Study #1: Cyberbullying in social media
   ▪ Case Study #2: Reporting sexual harassment on TripAdvisor
   ▪ Case Study #3: Racial discrimination in Airbnb
   Starting your own data ethics practices
4. Your model is bias, but so is your data. The case for ethics in data science machine learning.
5. Machine Learning is a subfield of computer science [that] stud[ies] pattern recognition and computational learning [in] artificial intelligence. [It] explores the construction and study of algorithms that can learn from and make predictions on data.
6. Mitchell’s Definition of Machine Learning
   A computer program is said to learn from experience (E) with respect to some task (T) and some performance measure (P), if its performance on T, as measured by P, improves with experience E. Ch. 1, Machine Learning, Tom Mitchell
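One way to make the E/T/P framing concrete is a minimal sketch (assuming scikit-learn and synthetic data; not from the talk): performance P on task T should improve as experience E, the number of labeled examples, grows.

    # Minimal sketch of Mitchell's E/T/P framing, assuming scikit-learn.
    # T = a binary classification task, E = labeled training examples,
    # P = cross-validated accuracy, which should improve as E grows.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    train_sizes, _, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=[0.1, 0.3, 0.6, 1.0], cv=5, scoring="accuracy")

    for n, scores in zip(train_sizes, test_scores):
        print(f"experience E = {n:4d} examples -> performance P = {scores.mean():.3f}")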
  7. "We have been in touch with Ms. McGowan's team," Twitter

    said in a tweet on Thursday. "We want to explain that her account was temporarily locked because one of her Tweets included a private phone number, which violates of our Terms of Service." Source: CNN
8. “[Cyberbullying is] . . . the use of information and communication technologies to support deliberate, repeated, and hostile behaviour by an individual or group, that is intended to harm others.” Belsey, B. Cyberbullying.ca. http://www.cyberbullying.ca
9. What is the question we want to answer?
   Identifying the offender
   ▪ Is the responding party a cyberbully (e.g. a “Troll”)?
   Identifying instances of cyberbullying
   ▪ What is the likelihood a conversation is cyberbullying?
   ▪ Is the conversation aggressive (e.g. “flaming”)? Is it not?
   ▪ At what level is a conversation deemed cyberbullying?
10. How has cyberbullying been deterred in the past?
    Historically done via:
    ▪ Content moderation by the product owner
    ▪ Content moderation via user feedback (e.g. user reports)
    Typically these approaches require moderators to manually review comments. This is gravely inefficient and doesn’t scale well due to the need for human input.
11. Text Feature Extraction and Optimization
    Text normalization (reduce # of features)
    ▪ Removal of special characters
    ▪ Punctuation and stop word removal
    ▪ Stemming vs. lemmatization (see the sketch below)
    Feature optimization
    ▪ Choice of text normalization
    ▪ N-grams (use of one, two, or three words as a feature)
    ▪ Term frequency / document frequency tuning
    ▪ Limiting maximum features
    ▪ Dimensionality reduction (SVD)
    Tweaking
    ▪ Stemming, stop word removal
    ▪ N-grams: unigram and bigram
    ▪ Frequency tuning: helped
    ▪ Limiting max features: helped
    ▪ SVD: did not help
    Bengfort et al., unpublished work
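As a hedged illustration of the stemming vs. lemmatization choice above (assuming NLTK; this is not the authors' code):

    # Minimal sketch of two text normalization choices, assuming NLTK.
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)   # the lemmatizer needs the WordNet corpus
    nltk.download("omw-1.4", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["bullying", "studies", "better"]:
        print(word,
              "-> stem:", stemmer.stem(word),
              "| lemma:", lemmatizer.lemmatize(word, pos="v"))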
12. How that may look in Python with scikit-learn:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    model = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english',
                                 max_df=0.20,
                                 ngram_range=(1, 2),
                                 max_features=50000)),
        ('clf', MultinomialNB()),
    ])

    http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    TfidfVectorizer:
    ▪ ngram_range controls the size of your n-grams
    ▪ max_df ignores terms with more than this document frequency
    ▪ stop_words removes words with no intrinsic value (e.g. “the”)
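A hedged usage sketch of the pipeline above, with hypothetical toy posts standing in for a real labeled corpus:

    # Hypothetical toy posts (1 = cyberbullying, 0 = innocent); real training data
    # would be a much larger labeled corpus. Reuses the `model` pipeline above.
    train_texts = [
        "nobody likes you loser",
        "go away you worthless idiot",
        "everyone hates your stupid face",
        "great seeing you at the meetup",
        "congrats on the new job",
        "thanks for sharing the slides",
    ]
    train_labels = [1, 1, 1, 0, 0, 0]

    model.fit(train_texts, train_labels)                         # learn tf-idf vocabulary + NB statistics
    print(model.predict(["you are a worthless loser"]))          # 0/1 label for an unseen post
    print(model.predict_proba(["you are a worthless loser"]))    # per-class probabilities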
13. No one human or computer can sift through all social media and online communication. When that happens, what do we risk?
14. Interpreting results
    What is the cost of failing to identify cyberbullying? Do we care more about False Positives or False Negatives?
    FP: an innocent post labeled as cyberbullying
    FN: a cyberbullying post mislabeled as innocent
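A minimal sketch of inspecting those error types, assuming scikit-learn; the labels below are hypothetical:

    from sklearn.metrics import confusion_matrix, classification_report

    # Hypothetical ground truth vs. model output (1 = cyberbullying, 0 = innocent).
    y_true = [1, 1, 0, 0, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"false positives (innocent flagged): {fp}")
    print(f"false negatives (bullying missed):  {fn}")

    # Precision/recall make the FP-vs-FN trade-off explicit per class.
    print(classification_report(y_true, y_pred, target_names=["innocent", "cyberbullying"]))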
15. Classifiers
    ▪ NB (Naive Bayes): Probabilistic. Avoids overfitting (making assumptions beforehand about the likely distribution of the answer), but the independence assumption is a simplistic model of the world.
    ▪ LR (Logistic Regression): Models the relationship between variables, iteratively refined using a measure of error in the predictions made by the model (regularization, penalty term). Gives linear class boundaries.
    ▪ DT (Decision Tree): Graphical model of rules that partitions the data until a decision is reached at one of the leaf nodes. Complexity is related to the amount of data and the partitioning method. Prone to overfitting; minor variations in data cause big changes in tree structure; highly biased to the training set (Random Forest to your rescue).
    ▪ RF (Random Forest): Constructs a forest of decision trees. At each step of an iteration (classification process), it picks a random subset of features to try, eventually settling on the subsets of features that perform best in a tree classifier.
    ▪ MLP (Multi-Layer Perceptron): Can learn a non-linear function approximator. Between the input and output layers there can be one or more non-linear (hidden) layers. Requires tuning a number of hyperparameters; sensitive to feature scaling.
    ▪ SVM (Support Vector Machine): Attempts to maximize the distance between classes; works in high-dimensional space. Uses a kernel to transpose data into a higher-dimensional space. Linear kernels are commonly used for text classification due to the large number of features involved.
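A hedged comparison sketch (not the talk's experiment) that cross-validates several of these classifiers on scikit-learn's built-in digits dataset as a stand-in for a vectorized corpus:

    # Hedged sketch: same data, several classifiers, cross-validated accuracy.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)  # non-negative features, so MultinomialNB is valid

    models = {
        "NB":  MultinomialNB(),
        "LR":  LogisticRegression(max_iter=5000),
        "DT":  DecisionTreeClassifier(random_state=0),
        "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
        "MLP": MLPClassifier(max_iter=2000, random_state=0),
        "SVM": LinearSVC(),  # linear kernel, as is common for text
    }

    for name, clf in models.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")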
16. More than 1,000 new words, senses, and subentries have been added to the Oxford English Dictionary in our latest update, including worstest, fungivorous, and corporation pop. Oxford English Dictionary September 2017 Update, http://public.oed.com/the-oed-today/recent-updates-to-the-oed/
17. In their place was a message from TripAdvisor that cited various reasons for the deletions: They were “determined to be inappropriate by the TripAdvisor community,” or removed by staff because they were “off-topic” or contained language or subject matter that was not “family friendly.” The Milwaukee Journal Sentinel asked TripAdvisor to see the posts that were removed. The company refused. https://www.jsonline.com/story/news/investigations/2017/11/01/tripadvisor-removed-warnings-rapes-and-injuries-mexico-resorts-tourists-say/817172001/
18. “Our new email communications will clearly articulate the phrase or sentences that are in violation of our policy, inviting the reviewer to make edits and resubmit their review,” TripAdvisor reports. “These badges will remain on TripAdvisor for up to three months. However, if the issues persist we may extend the duration of the badge,” he said. “These badges are intended to be informative, not punitive.” https://www.nytimes.com/2017/11/08/travel/tripadvisor-sex-assault-discrimination-warnings.html
19. Quirtina Crittenden, unable to book a room on Airbnb, changed her name to “Tina” and her photo to a cityscape, which permitted her to bypass those difficulties.
    Edelman et al.: “It is not clear a priori how online markets will affect discrimination. To the extent that online markets can be more anonymous than in-person transactions, there may actually be less room for discrimination.“
20. 16%: “In an experiment on Airbnb, we find that applications from guests with distinctively African-American names are 16% less likely to be accepted relative to identical guests with distinctly White names.” Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment, Edelman et al.
21. “Algorithms do not automatically eliminate bias. Suppose a university, with admission and rejection records dating back for decades and faced with growing numbers of applicants, decides to use a machine learning algorithm that, using the historical records, identifies candidates who are more likely to be admitted. Historical biases in the training data will be learned by the algorithm, and past discrimination will lead to future discrimination.” Cynthia Dwork, Algorithms and Bias (2015), http://nyti.ms/1Qyfqre
22. “The math-powered applications powering the data economy were based on choices made by fallible human beings. Some of these choices were no doubt made with the best intentions. Nevertheless, many of these models encoded human prejudice, misunderstanding, and bias into the software systems that increasingly managed our lives. Like gods, these mathematical models were opaque, their workings invisible to all but the highest priests in their domain: mathematicians and computer scientists. Their verdicts, even when wrong or harmful, were beyond dispute or appeal. And they tended to punish the poor and the oppressed in our society, while making the rich richer.” Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
23. Questions to consider in development
    ▪ Collection: What is being collected and/or created?
    ▪ Standards and methodologies: How is data collected?
    ▪ Ethics: Do we have restrictions?
    ▪ Data sharing and access: How is it shared? With whom?
    ▪ Long term maintenance: Now that we have it, what next?
24. Starting a data management policy at your organization
    ▪ Defining a data ethics curriculum for your organization
    ▪ Deleting data: How can we build mechanisms into our software to delete data? (See the sketch below.)
    ▪ Education about repurposing data
    ▪ Requiring consent to collect and use data in one way
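One entirely hypothetical reading of the “deleting data” point is a scheduled retention job; the database, table, and column names below are illustrative, not from the talk:

    # Hypothetical sketch: purge rows older than a retention window.
    # The database, table, and column names are illustrative only.
    import sqlite3
    from datetime import datetime, timedelta

    RETENTION_DAYS = 90  # assumed policy; a real value comes from your governance process

    def purge_expired_events(db_path="app.db"):
        cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
        with sqlite3.connect(db_path) as conn:
            deleted = conn.execute(
                "DELETE FROM user_events WHERE created_at < ?",
                (cutoff.isoformat(),),
            ).rowcount
        return deleted  # number of rows removed, useful for an audit log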
25. 17% of companies have a data map and use it to track the flow of data between systems, according to a 2015 Nymity Privacy Management Program Benchmarking and Accountability Report.
26. “Not many of us like thinking about death — especially our own. But making plans for what happens after you’re gone is really important for the people you leave behind. So today, we’re launching a new feature that makes it easy to tell Google what you want done with your digital assets when you die or can no longer use your account.” Google Data Liberation Project, http://dataliberation.blogspot.com/
27. Continue the conversation by learning more
    ▪ [TALK] Liz Rush, Write/Speak/Code 2016, “Challenging & Democratizing Algorithm Development”
    ▪ [ONLINE TRAINING] O’Reilly, “Data Ethics: Designing for Fairness in the Age of Algorithms”
    ▪ [BOOK] Cathy O’Neil, “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy”
    ▪ [PAPER] Dr. Shannon Vallor, Santa Clara University, “Introduction to Software Ethics”
    ▪ [TALK] Marius Watz, Papers We Love 2016, “Abuse of an Algorithm Comes as No Surprise”
28. Bloomberg, BrightHive, and Data for Democracy are championing the “Community Principles on Ethical Data Sharing” (CPEDS) to develop guidelines for data collection and sharing. You can join the effort on Slack or GitHub.