Data ethics: What is this thing called "Big Data"?

Lecture for LIS 640 "[Big] Data Ethics."

Dorothea Salo

June 17, 2019

Transcript

  1. WHAT IS THIS THING CALLED “BIG DATA”? (it ain’t love)

  2. TECHNICAL CHARACTERISTICS

    • “Data that I can’t handle on a typical end-user computer.”
      • Volume, velocity, variety: lots of data, accumulating very, very fast (think Twitter, or Google’s web crawlers), that’s hard to analyze because it’s not always built in computer-analysis-friendly ways
    • Storage capacity/cost, analysis speed, and analysis techniques for unstructured data have all improved markedly of late.
    • This puts collection and analysis of much “bigger” data within reach of many more people and organizations.
  3. INNOCUOUS BIG DATA

    • Astronomy, physics, meteorology/climatology, chemical pharmacology, many kinds of research modeling (“if this happens, what then?”)…
    • … so it’s not quite as simple as “outlaw Big Data altogether!”
      • Consequentialists would say “you’re foreclosing on an awful lot of good outcomes there, champ; maybe reconsider?”
      • Deontologists would shake their heads at how I am restricting the autonomy of many researchers for no obvious reason.
      • Virtue ethicists would say “do you really want to be the person who cuts off a whole lot of harmless, useful, or even awesome research because you think a totally different group of people abuses the techniques?” And no, I don’t want to be That Person.
  4. NOT INNOCUOUS AT ALL, OFTEN: BIG DATA ABOUT PEOPLE

    • This comes with a lot of ethical challenges that Big Data about (say) galaxies doesn’t.
    • So for purposes of this class, when I say “Big Data” from here on, assume I mean “… about people.”
      • If I mean something else, I’ll try to remember to say so!
    • Surveillance capitalism (Shoshana Zuboff): An economic regime substantially predicated on the collection, sale, analysis, and use of Big Data about people
  5. WHO COLLECTS BIG DATA?

    • Governments
      • who have put at least some of theirs online, e.g. property-tax databases
    • Commercial entities of many types
      • including “data brokers,” whose entire business it is to collect and sell Big Data
      • and all the industries busily putting in Big Data collection mechanisms so they can sell data to data brokers!
    • Technology-rich organizations: social media, advertisers, trackers, news and other media, schools and colleges/universities, device manufacturers, ISPs and telecoms…
    • It’s almost easier to answer “who doesn’t?” these days.
      • I used to be able to say “libraries!” reflexively, but…
  6. WHAT HAPPENS TO BIG DATA ONCE IT IS COLLECTED?

    • It is stored for an uncertain amount of time that often turns out to be “indefinitely.”
    • It is combined with Big Data from other sources (“data lake,” “data warehouse”).
    • It is analyzed, often for otherwise-inscrutable patterns, and used to make judgments or predictions about people, or manipulate their behavior.
      • When you hear “machine learning,” “artificial intelligence,” “neural nets,” or the vague “algorithms,” it’s usually this. “Predictive analytics” is obviously this.
      • Judgments, predictions, and attempted manipulation may be collective, isolated to particular demographic or affinity groups, or individual.
    • It is shared, swapped, sold. Repeatedly. Or parceled out in bankruptcy.
    • It leaks accidentally (or is leaked deliberately).
    • It is stolen, sometimes then abused in ways that cause harm.
      • Financial harm, social/reputational harm, harassment harm, even physical harm.
  7. BUT THIS ISN’T REALLY NEW, IS IT?

    • Not entirely new, no. Marketing-by-demographics, loyalty programs; heck, the US census is enshrined in the Constitution!
    • However. The Big Data era comes with some key differences:
      • De-silo-ization. If you belonged to Sears’s loyalty program in 1984, you would have no reason to believe your data went outside Sears’s own data silos. This is emphatically no longer true; Big Data from a wide variety of sources is routinely combined and recombined. Much of it is publicly available (one way or another, legally or not).
      • Following from the above, context collapse: inability to keep data from one social context (e.g. social media) from (ab)use in other contexts (e.g. workplaces)
      • New and sometimes very intimate sources of data, from website trackers to smartphone APIs to home gizmos to video cameras
      • Weakening of consent mechanisms: you can refuse a loyalty card, but you mostly can’t say no to becoming Big Data
      • A huge and growing, largely unregulated economy of Big Data brokers, traders, grabbers, analyzers, snake oilists
  8. SO WHAT GETS COLLECTED?

    • Any detectable action, really. If a tracker can notice you’ve done it, into the data warehouse it goes, tied (or tie-able somehow) to your identity.
    • This includes:
      • Location (and through that, realspace activities), purchase history, travel history, click/view history, contacts/friends/coworkers (“social graph”), entertainment habits and tastes, health indicators (physical and mental) and health-related activities, education-related activities…
    • And can be analyzed to (theoretically, at least) approximate e.g.:
      • Age, mood, occupation, state of health (physical and mental), hirability, religion, politics, relationship status, pregnancy status, level of education, socioeconomic status (including creditworthiness), manipulability (commercial or political), educational or work performance…
      • You can’t know (and sometimes have little to no control over) what your tells are. Big Data analysis can and does find patterns you don’t consciously know about!
      • You also can’t know that the analyses are correct about you. Often they’re not… and you usually can’t correct the data or the conclusions, either.
  9. WHAT’S DETECTABLE, AND HOW?

    • Online: web trackers, advertising tech, routine server logs, cookies, click histories, search histories, “browser fingerprinting”
    • Offline: video surveillance, device identification and (increasingly precise) geolocation, interactions with offices (health, government, education, etc.) that leave records in computers
    • Both: purchase histories, loyalty programs
    • Mobile: geolocation, API data leakage, sensor abuse
    • Internet of Things (including cars): speech, interaction, behavior patterns, driving, data-gathering from other devices on the local network
    • There’s more. Lots more. But just to give you the flavor.
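
To make “browser fingerprinting” a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the attribute names, the values, and the `fingerprint` helper are hypothetical, and real trackers gather far more signals than this. The idea it shows is only that ordinary-looking settings, combined and hashed, can act as an identifier that survives clearing cookies.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser/device attributes into a short, stable ID.

    Hypothetical helper: sorting the keys makes the same attributes
    always produce the same string, and therefore the same hash.
    """
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Made-up attribute values; no cookie or login is involved anywhere.
visit_monday = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "timezone": "America/Chicago",
    "fonts": ["Arial", "Courier", "Georgia"],
}
visit_friday = dict(visit_monday)  # same device, cookies cleared in between
other_device = dict(visit_monday, screen="1366x768")

same_device_recognized = fingerprint(visit_monday) == fingerprint(visit_friday)
devices_told_apart = fingerprint(visit_monday) != fingerprint(other_device)
```

In this toy version the two visits from the same device get the same ID, and a device that differs in a single attribute gets a different one. No single attribute identifies anyone; the combination does the identifying.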
  10. THANKS! This presentation copyright 2019 by Dorothea Salo. It is available under a Creative Commons Attribution 4.0 International license.
  11. WHAT IS PRIVACY, AND WHAT IS ITS WORTH? Dorothea Salo

  12. WHY IS THE LATTER QUESTION NECESSARY FOR BIG DATA ETHICS?

    • Whatever definition or conceptualization of privacy is your favorite (there’s lots), Big Data kicks it over and stomps it into the ground.
      • If you can think of an exception to this rule (a definition of privacy that doesn’t imply that Big Data imperils privacy), I’d love to hear about it.
    • We can’t do a useful analysis of the ethics of Big Data, then, unless we have at least some idea of the worth of privacy and the consequences (individual, group, and collective) of losing it.
      • Speaking consequentialist-ly: there’s no way to do cost-benefit analyses here unless we have a sense of the costs! Without that, we land right at the “own the good, ignore the bad” ethics smell.
  13. MANY WAYS TO THINK ABOUT PRIVACY!

    • And in fact, privacy-ethics scholarship/research/praxis comes from A LOT of fields:
      • LIS, of course
      • Law
      • Journalism (i.e. the ethics of how you treat your sources)
      • Psychology
      • Education
      • Health
      • Sociology, history of science, science and technology studies
      • Philosophy, ethics, religious studies
    • A false claim to watch out for: “Privacy is a New Idea (so we don’t have to take it seriously).”
      • Not justifiable on the evidence. Plenty of religious and legal writing from antiquity worldwide valuing privacy (for at least some people… “not everyone was thought to need or deserve privacy” is true).
  14. WHAT IS PRIVACY, REALLY?

    • Argh. ARGH. AAAAAAAAAAAARGH. So many definitions in play I can’t even.
      • Partly an artifact of the variety of fields working on/with privacy.
    • Useful questions (answers may and do shift over time):
      • Privacy of what?
      • Privacy from whom?
      • Privacy how? (The LIS concept of intellectual freedom meaning non-observation of information use, for example, is leagues away from the US’s search-and-seizure laws, yet both invoke privacy.)
      • Interruptible when, how, and by whom? (In the US, law enforcement can legally abridge suspects’ privacy rights, usually though not always when a judge agrees it’s necessary.)
  15. DEFINITIONS OF PRIVACY VIS-A-VIS BIG DATA THAT DON’T WORK REAL GREAT

    • Privacy as a property right in one’s own data
      • Does not address ethical problem of economically-disadvantaged people forced to sell “their” data; “privacy for the rich only” fails beneficence and justice forever. (The Zuck’s actions reveal that “privacy for the rich only” is his attitude, and you know what I think about agreeing with The Zuck.)
      • Does not address ethical problems around Big Data being sold/shared all over the place, once alienated from its originator
      • Does not address ethical problems of bias in and oppression through data
    • Privacy as a right to control data
      • There’s already not enough time in the universe for anyone to read the million Terms of Service agreements we’ve supposedly consented to. This conceptualization is just flat-out impractical. (“Time and awareness enough to worry about this” is also not universal, and not equitably distributed.)
      • That said, some limited forms of data control, e.g. “right to be forgotten,” are out there in the real world and worth studying for costs/benefits.
  16. CONTEXTUAL INTEGRITY: A PRIVACY FORMULATION I LIKE

      • (while admitting it has flaws)
      • Helen Nissenbaum et al.
    • Privacy is a right to “appropriate flow” of one’s data, considering the context.
      • For example, I don’t mind if my doctor shares relevant parts of my health record with my pharmacist. I mind a lot if doctor or pharmacist tells that very same information to my employer. Not appropriate!
    • The trick, of course, is figuring out what’s appropriate in context.
      • Contextual integrity scholars say “ask people about the norms.” This is fine as far as it goes, but… norms don’t always square with ethics, nor are they always perfectly informed. Should we follow norms codified in ignorance? Or solely codified by the privileged?
      • Nor does everybody have the same norms in the same context, for that matter! (Athletes, for example, may worry less than I do about health info escaping health-care contexts.) Whose norms govern?
    • Still: listening to people is rarely bad! Considering context is rarely bad!
  17. CONTEXTUAL INTEGRITY AND BIG DATA

    • One of the big things about Big Data is that it escapes its original context so easily and so often.
    • This makes contextual-integrity failures practically a certainty.
    • How do we handle that? … We haven’t addressed it, so far.
      • Law hasn’t. Policy hasn’t. Norms haven’t.
    • This is a problem.
  18. A THING WE KNOW BY NOW

    • People behave differently when they know they are observed.
      • This often means less “deviant” behavior: less theft, for example. The trick here, though, is that “deviance” is neither a universal nor a stable concept. Compare Singapore to the Netherlands. This phenomenon is often used to justify surveillance. While important, it is not sufficient alone, because…
      • It also means less expression (artistic, political), less risk-taking, less novel/innovative behavior, less behavior that may be socially disapproved even though innocuous (e.g. reading something considered déclassé), and a more intellectually and politically homogeneous society overall
      • Sufficient loss of privacy society-wide slows rebellion against (and therefore reform of) oppression.
    • So, ethically…
      • Loss of privacy (conceptualized as “freedom from worrisome and/or unwanted observation”) is a direct, serious challenge to the deontological concepts of autonomy, rights, and even justice and fairness.
      • Consequentialist-ly, the losses of societal innovation and progress against oppression are significant.
  19. NOT TO MENTION, OF COURSE…

    • Being watched is stressful, especially when the party watching is adversarial in some way.
    • If you’re watched intensely enough and/or long enough, especially by someone(s) you don’t trust, it becomes actually traumatic (as in, PTSD-causing traumatic).
    • Loss of privacy is a huge aspect of and contributor to domestic violence, online harassment, bullying, and other interpersonal cruelties.
      • Dox(x)ing: Tying an online pseudonym to the person whose pseudonym it is, and/or researching and releasing personal information about them (address, workplace, etc.). Usually intended to spur offline harassment or even violence.
    • On the nation-state level, privacy loss is deployed as a tool of oppression.
      • Not uncommonly on the way to more and worse oppression
    • Beneficence, what beneficence?
  20. PRIVACY, ERROR, IDENTITY

    • Educators observe that privacy is often necessary for people to feel okay with making mistakes, and making mistakes is often crucial to learning.
    • Sociologists and criminologists note privacy as vital to allowing people to recover from mistakes and rejoin society productively after bad behavior.
      • How long, for example, should a social-media gaffe follow someone? (I’m not at all pretending this is an easy question, of course.)
      • Another iteration: “Ban the Box” advocacy (banning questions about criminal convictions on job applications)
    • Psychologists and sociologists observe that privacy allows people, especially but not exclusively young people, to experiment with identity: who they are and who they could (or want to) become.
      • Privacy is especially needed for marginalized/oppressed identity facets.
  21. ETHICS SMELL: “I HAVE NOTHING TO HIDE! SO YOU MUST NOT NEED PRIVACY EITHER!”

    • Missing the “from whom?” piece
      • Often conceptualized as “from law enforcement” or “from social media.” Privacy is much more complicated than that, or context collapse wouldn’t be so much of a problem.
    • Also a very narrow sense of context
      • My rejoinder is often “Oh? Cool. Hand over your wallet and unlocked phone, please, and the keys to your vehicle and dwelling.”
    • Also ignores benefits of privacy: even when not strictly necessary, it is often useful and beneficial.
    • Also SO PRIVILEGED I CAN’T EVEN… er, falsely universalized
      • You won’t be persecuted for your gender identity, religion (or lack thereof), race/ethnicity, sexual preferences, entertainment choices? Lucky you. Many of us are not in that place, and we matter too.
  23. SURVEILLANCE Dorothea Salo

  24. WHAT IS SURVEILLANCE?

    • Luckily, this one’s a lot easier than defining privacy. The etymology even helps! It’s Latin via French.
      • SUPER > sur “over”
      • VIGILARE > veiller “to watch”
      • Trivia (impress your friends at parties): the word appears to have crossed into English via its use during the Terror in France
    • Some notes
      • Surveillance is systematic observation that creates and/or results from power over the surveilled. A child staring at you on the street is not surveillance (unless the kid is part of the Baker Street Irregulars…)
      • In a Big Data context, surveillance is not just being watched, it’s also being recorded and having those recordings stored and analyzed. Those latter two bits are important: they create ethical questions that just watching may not.
      • “Watch” is incomplete. Many other forms of detection besides the purely visual are in play here, a lot of them unprecedented.
  25. A REMINDER

    • “It’s creepy!” is not an ethics argument in and of itself.
      • It can provide evidence for an ethics argument, perhaps “so many people think X is creepy that doing X fails beneficence forever.”
      • Exploring what sets off the feeling can hint at ethics concerns too.
    • So feel those feels, note those feels, see if those feels are shared… but in ethics, you have to argue beyond your own feels.
      • It’s arguable (and recent ethics work by/for marginalized populations often makes this point) that standard-issue Western ethics is way too emotionless!
      • “Rationality divorced from emotion” is not really a thing anyway (the neuroscientists tell us), no matter how many arguments start there.
      • Still, I think consciously noticing but refusing to rely solely on our own emotions is good practice for perspective-taking, so I’ll (somewhat reluctantly) stand with Western ethics on this point for now.
  26. IF YOU GATHER FROM THIS THAT I FIND SURVEILLANCE INTENSELY CREEPY YOU WOULD BE RIGHT. I WILL TRY TO SUPPRESS THAT, THOUGH.
  27. SURVEILLANCE AS BIG DATA SOURCE

    • Cameras and sensors and drones, oh my!
      • And phones. And cars. And wifi and Bluetooth tracking in realspace. And metadata from our phone and Internet use. And so on.
      • So many new ways to reidentify people! Device/browser fingerprinting, all sorts of biometrics (including facial recognition)…
    • Do I really need to say any more about this? I don’t think I do.
    • Well, maybe one thing:
      • People opposed to government surveillance need to worry about “surveillance capitalism” and vice versa.
      • In the present Big Data environment, anything corporations know about people, governments also know. (The reverse is slightly less true, but still true enough to be worrisome. Consider private contractors who surveil on behalf of governments! What else happens with that data?)
  28. WHAT IS SURVEILLANCE-DERIVED BIG DATA USED FOR?

    • So far I’ve been making a case that in and of itself, surveillance (as a source of privacy loss) causes harms that may not be ethically justifiable.
      • Mostly a deontological case: surveillance breaks moral precepts such as autonomy, dignity, and (often) transparency.
    • Consequentialists would insist that can’t be the whole story, though; we also have to ask about how surveillance is used and what the consequences of those uses are.
      • After all, are we not going to watch over very young children, or seniors who have dementia? Of course we are. It prevents greater harms to them. So at least sometimes, surveillance is ethically justifiable.
    • … We’re absolutely going to look at that, just not this module. I just want to call attention to the distinction, for now.
  29. SURVEILLANCE AND “SAFETY”

    • This is where “but nobody wants surveillance!” arguments fall down. Because many people DO, and it’s not always just power reproducing itself, either (though that is certainly a thing).
    • Peer-to-peer surveillance
      • Have you seen commercials for car dashcams or house cams on off-brand TV channels? I sure have. They scare me.
      • How about video doorbells? Door cams? Drones (Amazon) that report stuff to law enforcement?
      • … How about video surveillance in K-12 schools, to prevent mass shootings? Social-media surveillance, same reason? Both happening.
      • All in the name of “safety,” though some of it seems to be enforcement of perceived norms, often at the expense of the already-oppressed.
      • (Orwell saw this clearly, likely from experience. See the character “Tom Parsons” in Nineteen Eighty-Four. Also, Spanish literature’s “el qué dirán.”)
  30. INTERNALIZED SURVEILLANCE, SURVEILLANCE CREEP

    • A person who is surveilled for a long time and knows it’s happening develops an internal watcher to keep from getting in trouble, one that is often much more effective than the actual surveillance regime.
      • This is one source of the often-observed risk-aversion effect in surveilled individuals and populations. People adopt even stricter rules for themselves than required (or, sometimes, intended).
    • Surveillance often starts small and grows: “surveillance creep.”
      • Expands to more contexts, more people, more types of surveillance
      • Cynically: surveillance often deployed first against the hated and/or powerless, then expanded to cover more people… until it covers everyone. (This process blatantly violates fairness, consistency, and justice precepts!)
  31. SURVEILLANCE AND TIME

    • This is one place the indefinite storage and secondary uses associated with Big Data surveillance come into play.
    • Even when initial data collection doesn’t trip our contextual-integrity or creepy-meters… subsequent use/sharing/sale may, and we have very little to no control over that (as I keep saying).
      • “Privacy is a time-shifted risk,” said Frederike Kaltheuner, explaining why Hong Kong residents stopped using transit cards (which leave Big Data trails) to keep from being associated with mass protests. https://gizmodo.com/what-hong-kongs-protestors-can-teach-us-about-the-futur-1835715794
    • A competent, complete consequentialist analysis needs to consider delayed and third-party harm also.
  33. PAST/PRESENT PRIVACY/ETHICS SAFEGUARDS AND WHY THEY DON’T WORK FOR BIG DATA
  34. IDENTIFIERS AND PII

    • Identifier: a piece of information that singles us out from the crowd
    • Direct identifier: an identifier where identifying is its basic function!
      • Often this is a number: SSNs, UW ID number, passport number, etc.
      • Not always, though: email address, name (imperfect, but sufficient to identify many), username/handle
      • A device identifier (such as your phone’s MAC address) may serve well enough to identify the device owner.
    • Indirect identifier: a set of attributes/characteristics that by themselves are not identifying… but ARE identifying when combined.
      • Famous study by Harvard’s Dr. Latanya Sweeney: combination of gender (binary assumed), birth date, zip code identifies 87% of Americans.
    • Personally Identifiable Information (PII): Usually, a specific set of direct identifiers with legal strictures on their handling intended to protect privacy
      • The exact set varies by context, and may include an indirect identifier or two.
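
A toy illustration of the indirect-identifier idea may help. The sketch below, in Python, uses a handful of made-up records and a hypothetical `unique_fraction` helper; it is not Sweeney’s method or data, just a small demonstration that attributes which single out nobody on their own can single out most people once combined.

```python
from collections import Counter

# Hypothetical toy records: none of these fields is a direct identifier.
records = [
    {"gender": "F", "birth_date": "1980-02-14", "zip": "53703"},
    {"gender": "F", "birth_date": "1980-02-14", "zip": "53703"},
    {"gender": "M", "birth_date": "1975-07-01", "zip": "53703"},
    {"gender": "F", "birth_date": "1991-11-30", "zip": "53706"},
    {"gender": "M", "birth_date": "1975-07-01", "zip": "53711"},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    def combo(row):
        return tuple(row[f] for f in fields)

    counts = Counter(combo(row) for row in rows)
    singled_out = sum(1 for row in rows if counts[combo(row)] == 1)
    return singled_out / len(rows)


by_gender = unique_fraction(records, ["gender"])    # 0 of 5 rows are unique
by_zip = unique_fraction(records, ["zip"])          # 2 of 5 rows are unique
by_all = unique_fraction(records, ["gender", "birth_date", "zip"])  # 3 of 5
```

With this made-up data, gender alone singles out no one, zip code alone singles out two of five people, and the three fields together single out three of five. Real populations are far larger, but the same counting is what sits behind statements like “this combination identifies N% of people.”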
  35. DEIDENTIFICATION, REIDENTIFICATION, “ANONYMIZATION,” AND BIG DATA

    • Deidentification: Removal of direct identifiers or PII from a dataset
    • Reidentification: Identifying individuals represented in a deidentified dataset
    • Anonymization: Ensuring that no one represented in a dataset can be identified from it
      • DEIDENTIFICATION IS NOT ANONYMIZATION. It’s not nearly enough to ensure reidentification is not possible! Please correct people on this point.
    • Big Data is a big reason deidentification doesn’t work any more.
      • One reidentification method is comparing datasets with overlapping data to see which people can be identified by the combination. (Famous example: Narayanan and Shmatikov reidentified people in a deidentified Netflix dataset by comparing movie ratings and their dates against public IMDb reviews.)
      • Big Data means a lot more extensive and public datasets to compare against!
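
The dataset-comparison method can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the Narayanan and Shmatikov algorithm: the data is invented, the `link` function and its `min_overlap` parameter are hypothetical, and matching is exact, where real attacks tolerate fuzzy dates and score matches statistically.

```python
# "Deidentified" release: names removed, but each pseudonym keeps its events.
deidentified = {
    "user_0417": {("Movie A", "2005-03-01"), ("Movie B", "2005-03-04"),
                  ("Movie C", "2005-04-10")},
    "user_0932": {("Movie A", "2005-03-02"), ("Movie D", "2005-05-20")},
}

# Public dataset: the same kind of events, posted under identifiable accounts.
public_reviews = {
    "alice_public": {("Movie A", "2005-03-01"), ("Movie C", "2005-04-10")},
    "bob_public": {("Movie D", "2005-05-20"), ("Movie E", "2005-06-01")},
}


def link(deid, public, min_overlap=2):
    """Map each public account to the pseudonym it most overlaps with.

    A match is reported only if the best overlap reaches `min_overlap`
    shared events and strictly beats the runner-up (no ties).
    """
    matches = {}
    for account, events in public.items():
        scored = sorted(
            ((len(events & deid_events), pseudonym)
             for pseudonym, deid_events in deid.items()),
            reverse=True,
        )
        best_score, best_pseudonym = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0
        if best_score >= min_overlap and best_score > runner_up:
            matches[account] = best_pseudonym
    return matches


reidentified = link(deidentified, public_reviews)
```

Here `alice_public` shares two events with `user_0417` and none with anyone else, so that pseudonym is linked to an identifiable account, and everything else in the “deidentified” record comes along with it. `bob_public` shares only one event, below the threshold, so no match is claimed. The more public data there is to compare against, the fewer people stay in that second category.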
  36. BIG DATA ETHICS SMELL: “WE PROTECT PII!”

    • Do they need to collect the data, including the PII, in the first place? The best protection for the data is not having it!
    • Are they ONLY protecting PII? What happens to the rest of the data? Pay no attention to the data broker behind the curtain!
    • If you’re reidentifiable from indirect identifiers anyway, what good does protecting PII do?
    • This is an Empty Platitude. When you hear it, dig deeper.
  37. “NOTICE AND CONSENT” AND BIG DATA

    • In the US, “contract law” presumes that adults are always capable of and responsible for looking after their own interests.
      • This is a Just So Story that completely ignores who has the power in the negotiation, of course!
    • This is the bedrock for “notice and consent” as a legal basis for Big Data collection and use.
      • Colloquially, “if I tell you and make you click ‘yes,’ I’m ethically off the hook, no matter what I told you or how forced your click was.”
    • It’s a dubiously ethical stance at best.
      • Is consent with no real alternative actually consent? How about coerced consent, or tricked consent (“dark [design] patterns”)?
      • Does notice have to be readable and comprehensible?
      • Who has time to read and understand every privacy policy everywhere?!
  38. INSTITUTIONAL REVIEW BOARDS AND BIG DATA

    • IRBs: reform measure for human-subjects research after several 20th-century ethical horrors in biomedical research
      • “Let’s withhold treatment from black men who have syphilis and lie to them about it! This seems fine.” (Tuskegee Institute)
      • “Let’s use Puerto Rican women to guinea-pig new birth control! We can totally lie to them and test irreversible and/or health-damaging stuff, it’ll be fine.” (Margaret Sanger, Clarence “Procter and” Gamble, others)
    • “Common Rule” intended to ensure:
      • Beneficence, Justice, Respect for persons, Privacy for research participants, Right to withdraw, Return of results, Informed consent
      • … many of which should be familiar concepts from our first module.
    • So for Big Data research, “privacy” and “consent” are already problematic for reasons I talked about. (I would argue “respect” is also at issue, often.)
  39. SO HOW ARE IRBS HANDLING THAT?

    • Not real well, honestly.
      • Most IRBs only consider harms of data collection… not data retention, sharing, reuse, or sale.
      • Most IRBs only consider harm to study subjects… not to anyone outside the study, much less to society itself.
      • Most IRBs are, shall we say, non-technical. (Many do their best!) They don’t always understand new harms associated with Big Data.
    • And (as I mentioned briefly with respect to the “gayface” study) too many researchers assume that if a study passed (or is exempt from) IRB review, that means it’s ethical. Nope.
      • Remember our ethics smell “Nobody told me I shouldn’t”? This is that.