campus* *sends article*

One is IP authentication, since Dr. Owl is in their office. Dr. Owl clicks on the article link. The Content Vendor notices that Dr. Owl’s computer’s IP address belongs to campus, so The Content Vendor sends the article back to Dr. Owl’s browser.
noted. We’ll let these through.

Behind the scenes, of course, there’s some work that The Library and The Content Vendor have to do, maintaining lists of IP addresses that belong to campus. And I’m kind of oversimplifying here, but just so you have the basic idea.
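To make the IP check concrete, here is a minimal sketch of what The Content Vendor's side might look like. The address ranges are placeholders taken from the reserved documentation blocks, and the function names are invented for illustration; a real vendor's list and logic will differ.

```python
import ipaddress

# Hypothetical campus ranges (reserved documentation blocks, not real campus data).
CAMPUS_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_campus_ip(address: str) -> bool:
    """Return True if the address falls inside any listed campus range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in CAMPUS_NETWORKS)


def handle_request(address: str) -> str:
    """Decide what to do with an article request, based only on the IP address."""
    if is_campus_ip(address):
        return "sends article"
    return "not recognized as campus"
```

Notice that the only input is the IP address. That is why the maintained list of ranges matters so much, and why a request from off campus gets nowhere with this method.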
ce computer has a “dynamic” (changing) IP address, fairly private.
• If Dr. Owl’s office computer has a “static” (persistent) IP address… The Content Vendor can build up a reading-tracking dossier on Dr. Owl.
• If Dr. Owl’s computer’s network name identifies Dr. Owl (I have seen this!), or if Dr. Owl’s reading habits are fairly uncommon, or if The Content Vendor uses web bugs/trackers… The Content Vendor can reidentify Dr. Owl and associate their reading-behavior dossier with them.
• The Content Vendor may sell that reidentified dossier to… The Data Broker!

This kind of authentication may protect Dr. Owl’s privacy… or not. If Dr. Owl’s office computer gets a dynamic IP address, that protects Dr. Owl to some extent. If the IP address is static, though, it’s trivial for The Content Vendor to track Dr. Owl’s reading just using that IP address as an identifier. If Dr. Owl has unusual reading habits, they may be reidentifiable from those. More likely, though, The Content Vendor just reidentifies Dr. Owl with standard web bugs. And then The Content Vendor has a new revenue stream: selling reading-behavior dossiers on library patrons to The Data Broker!
I get it?

But there’s another user story we have to account for. Dr. Owl is working from home, and suddenly they need an article. Since they don’t live on campus, The Content Vendor can’t recognize their computer’s IP address as belonging to campus. So how does The Library get Dr. Owl that article?
patron?” 4. logs in 5. “yes, Dr. Owl, legit” 6. “send article.” 7. sees proxy server, sends article 8. sends article 3. “log in, please?”

What The Library does is stand up what’s called a proxy server, and here’s a diagram of more or less how that works. Dr. Owl clicks a proxied link to the article. The Library’s proxy server doesn’t know who clicked, so it asks Campus IT whether the clicker is a legitimate patron. Campus IT asks Dr. Owl to log in, which they do, so Campus IT reports back to The Library’s proxy server that it’s Dr. Owl and they’re legit. The Library’s proxy server then requests the article from The Content Vendor, who sees that it’s the proxy server and sends it over. And finally, The Library’s proxy server sends the article to Dr. Owl.
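The proxy flow can be sketched as a toy simulation. Everything here is an assumption for illustration: the class names, the proxy address, and the plain-text password check stand in for whatever Campus IT and the proxy software actually do. The point is to show what ends up in The Content Vendor's log.

```python
from dataclasses import dataclass, field

PROXY_IP = "192.0.2.10"  # hypothetical address of The Library's proxy server


@dataclass
class ContentVendor:
    allowed_ips: set
    access_log: list = field(default_factory=list)

    def fetch(self, source_ip: str, article_id: str) -> str:
        # The vendor records only what it can see: the requesting IP and the article.
        self.access_log.append((source_ip, article_id))
        if source_ip not in self.allowed_ips:
            raise PermissionError("unrecognized IP address")
        return f"article {article_id}"


@dataclass
class CampusIT:
    credentials: dict  # username -> password; plain text only because this is a toy

    def login(self, username: str, password: str) -> bool:
        return self.credentials.get(username) == password


@dataclass
class LibraryProxy:
    campus_it: CampusIT
    vendor: ContentVendor

    def get_article(self, username: str, password: str, article_id: str) -> str:
        # Ask Campus IT whether the clicker is a legitimate patron.
        if not self.campus_it.login(username, password):
            raise PermissionError("not a legitimate patron")
        # The request to the vendor comes from the proxy's address, not the patron's.
        return self.vendor.fetch(PROXY_IP, article_id)
```

After a successful `get_article` call, the vendor's `access_log` holds the proxy's address and the article ID, and nothing that names the patron.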
leaves campus.
• If Dr. Owl starts mass-downloading for their Important Text-Mining Project, justifiably annoying The Content Vendor, The Library can handle it campus-internally.
• The Content Vendor doesn’t see Dr. Owl’s IP address, only the proxy server’s.
• This foils The Content Vendor’s web bugs/trackers.
• Dr. Owl’s reading gets mixed in with the rest of campus.
• This foils behavioral tracking by The Content Vendor.

But there are privacy advantages to it. The big one is that Dr. Owl’s identity never leaves campus, so The Content Vendor doesn’t automatically know it. If Dr. Owl starts downloading a whole database to text-mine it, The Content Vendor tells The Library and The Library quietly tells Dr. Owl to knock it off. No lawsuits! Second, The Content Vendor never sees Dr. Owl’s IP address, so web bugs and trackers can’t nab it. And third, with lots of people on campus going through The Library’s proxy server to get articles, it’s way harder to track Dr. Owl’s reading because it’s all mixed in with everybody else’s reading.
and authorization (“given who you are, what are you allowed to see/do?”)
• There are actually several ways to set up an SSO-friendly authentication system, just to be clear.
• Common, though far from universal, in higher ed
• Rely on a communication language called SAML, and (often) a software setup called Shibboleth
• Often implemented through “federations” such as InCommon and OpenAthens.

With that for background, let’s talk about single sign-on systems. These both authenticate you (make you prove who you are) and authorize you to actually see and do things, such as download articles to read. They’re common, though far from universal, in higher ed. Some jargon, just so you’ve heard it: these systems rely on an XML-based communication language called SAML and often a software rig called Shibboleth. They may be implemented through federations such as InCommon or OpenAthens. If you hear any of these weird words, JUST THINK SINGLE SIGN-ON; you mostly won’t need the details.
design an SSO system to replace proxy servers.
• NISO released it over significant protest, mostly about privacy and security.
• SeamlessAccess: Intended RA21 successor, same goals, more participants
• Learned some things from the RA21 protests!

So a couple of years ago, NISO and the S-T-M Association designed a single sign-on system, RA21, that they wanted everybody to use instead of proxy servers. Long story short, it didn’t go anywhere. BIG DISCLAIMER: I myself critiqued RA21 heavily because of privacy and security concerns; I was not a fan. The successor effort, which has more organizations participating, is called SeamlessAccess and has the same goals as RA21.
logs in 6. sends article 3. “log in, please?” 5. sends Dr. Owl’s metadata

Here’s a simplified sketch of how single sign-on works in our user story. Dr. Owl clicks on an ordinary unproxied link to an article. The Content Vendor figures out that the clicker is from campus, so it asks Campus IT who the clicker is. Campus IT has Dr. Owl log in, as before, and then sends metadata about Dr. Owl to The Content Vendor, who decides that Dr. Owl is legit and sends them the article they want. One important detail here: a lot of the communication I’ve drawn for you here actually happens right inside Dr. Owl’s browser. We’ll learn more about this from John Mark Ockerbloom!
AND CONFIDENTIALITY WITH RESPECT TO… RESOURCES CONSULTED, BORROWED, ACQUIRED OR TRANSMITTED.” (American Library Association Code of Ethics, Article 3)

Quick reminder that we have ethics codes that tell us we’re supposed to protect the privacy and confidentiality of what people read, including from our vendors.
TO KNOW WHAT READS. IS NOT SUPPOSED TO KNOW WHAT READS.

The Content Vendor is not supposed to know what Dr. Owl is reading. Law enforcement at any level is definitely not supposed to know what Dr. Owl is reading. And data brokers, whose whole business is selling behavior data hither and yon and especially to governments and law enforcement, are emphatically not supposed to know what Dr. Owl is reading.
name, email, campus ID, etc.
• Persistent pseudonymous identifiers (PPIs)
• 24601, No. 6, 007, 8675309, 7 of 9, Eleven…
• You know who (some of) those are, right? Yeah. Given a PPI, The Content Vendor can likely figure out who Dr. Owl is too. Behavioral tracking and web bugs are all it takes!
• SeamlessAccess calls PPIs “anonymous.” This is flat wrong.
• Non-PII/PPI metadata: “entitlement metadata”
• “This person is tenured faculty in the Ornithology department with courtesy appointments in Philosophy, Religion, and Folklore.”
• Given enough of this, and/or combining it with web-bug and behavioral data, The Content Vendor can absolutely reidentify Dr. Owl.

First let’s ask what patron metadata there even is. Well, there’s standard P-I-I, like name and email address and campus identifiers. Then there’s what I’m calling persistent pseudonymous identifiers, like a number that’s consistently used over time to identify a given patron without using their actual name. So like, 2-4-6-0-1, Number Six, the immortal double-oh seven, and so on. You know who at least some of those numbers are, right? Yeah, so this isn’t really privacy protecting. Given a persistent pseudonymous identifier, The Content Vendor can easily use behavioral tracking and web bugs to tie that identifier to Dr. Owl. SeamlessAccess calls these identifiers “anonymous” and I am really angry about that because words mean things and PPIs are not anonymous. Deidentified, yeah. Anonymous, absolutely not. Anyway, so. Third kind of metadata is often called entitlement metadata, and it’s not about your identity, it’s a list of campus groups you belong to. So for Dr. Owl that might look like, they’re faculty, they’re tenured, they’re in the Ornithology department, and they’ve got courtesy appointments elsewhere. If Dr. Owl is the only tenured faculty member with this specific collection of appointments, well, they’ve basically been immediately reidentified. They’re not anonymous to The Content Vendor at all.
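To see how entitlement metadata alone can single somebody out, here is a small sketch that counts how many patrons share each exact combination of entitlements. The directory is entirely made up and the field names are assumptions. The counting approach is my framing, in the spirit of k-anonymity, not something SeamlessAccess specifies.

```python
from collections import Counter

# Entirely fictional campus directory; field names are illustrative assumptions.
PATRONS = [
    {"role": "faculty", "tenured": True, "dept": "Ornithology",
     "courtesy": ("Folklore", "Philosophy", "Religion")},  # Dr. Owl
    {"role": "faculty", "tenured": True, "dept": "Ornithology", "courtesy": ()},
    {"role": "faculty", "tenured": True, "dept": "Ornithology", "courtesy": ()},
    {"role": "student", "tenured": False, "dept": "Ornithology", "courtesy": ()},
    {"role": "student", "tenured": False, "dept": "Ornithology", "courtesy": ()},
]


def entitlement_key(patron: dict) -> tuple:
    """The combination of entitlements a vendor would receive for this patron."""
    return (patron["role"], patron["tenured"], patron["dept"], patron["courtesy"])


def crowd_sizes(patrons: list) -> list:
    """For each patron, count how many patrons share their exact entitlements."""
    counts = Counter(entitlement_key(p) for p in patrons)
    return [counts[entitlement_key(p)] for p in patrons]
```

A crowd size of 1 means the entitlement set points at exactly one person. In this toy directory that person is Dr. Owl, before any web bugs or behavioral tracking even enter the picture.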
scenarios:
• Authentication Only: “yes or no, is this person affiliated with campus?”
• (the very misnamed) Anonymous Authorization: Non-PII entitlement metadata. How much metadata? How reidentifiable would it be? Who knows. SeamlessAccess doesn’t fully specify.
• Pseudonymous Authorization: PPI. Be seeing you, No. 6.

But just because metadata exists doesn’t mean The Content Vendor gets to see it. So what metadata does The Content Vendor get? Under SeamlessAccess, there are three possible scenarios. One I really like, and that’s the Authentication Only scenario, where the only thing The Content Vendor finds out is whether somebody actually is from campus. Then there’s the very misnamed Anonymous Authorization scenario, where what gets communicated is entitlement metadata. How much? How reidentifiable would it be? Who knows, SeamlessAccess doesn’t say. And finally there’s Pseudonymous Authorization, which sends a persistent pseudonymous identifier. Be seeing you, Number Six.
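One way to picture the three scenarios is as a filter on the patron record that Campus IT holds. This is a sketch under assumptions: the record fields, the scenario keys, and the choice to send an affiliation flag alongside the other data are all invented for illustration and are not taken from the SeamlessAccess documents.

```python
# Hypothetical patron record as Campus IT might hold it; all values are made up.
PATRON_RECORD = {
    "name": "Dr. Owl",
    "email": "owl@campus.example",
    "campus_id": "C-0001",
    "pseudonymous_id": "24601",
    "affiliated": True,
    "entitlements": ["faculty", "tenured", "dept:Ornithology"],
}


def released_metadata(record: dict, scenario: str) -> dict:
    """Return only the fields The Content Vendor would see under each scenario."""
    if scenario == "authentication_only":
        # Yes or no: is this person affiliated with campus?
        return {"affiliated": record["affiliated"]}
    if scenario == "anonymous_authorization":
        # Non-PII entitlement metadata; how much gets sent is up to the setup.
        return {
            "affiliated": record["affiliated"],
            "entitlements": list(record["entitlements"]),
        }
    if scenario == "pseudonymous_authorization":
        # A persistent pseudonymous identifier (PPI).
        return {
            "affiliated": record["affiliated"],
            "pseudonymous_id": record["pseudonymous_id"],
        }
    raise ValueError(f"unknown scenario: {scenario}")
```

Name, email, and campus ID never appear in any of the three outputs here, but the second and third outputs still carry data that can be tied back to one person over time.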
up SSO to give us all the user metadata you got. *yawn* Sure, whatever, here you go.

Either of the second two scenarios could go sour real fast. The first horror scenario is where The Content Vendor, in setting up single sign-on, asks Campus IT to send over all available user metadata, P-I-I and P-P-I and entitlements and all, and Campus IT just… does that. Where’s The Library in this? Well, remember, The Library doesn’t run or control the campus single sign-on system. To keep Campus IT from being lazy, they’ll have to make a nuisance of themselves, and even then Campus IT may brush them off.
hate to say it, but a lot of Campus IT folks wouldn’t even think twice. They’re not librarians. They’re cool with surveillance. They do it themselves and help other people on campus do it too! This slide here is from a campus chief information security officer, and was part of an explanation to a SeamlessAccess-friendly crowd of how a system can use behavioral tracking, among other things, to nail so-called fraudulent downloaders.
OWL STOP! CFAA! Federal prosecution!

Which leads me to horror scenario two, a repeat of the awful Aaron Swartz saga. The Content Vendor yells “fraudulent downloading!” and calls the feds to prosecute Dr. Owl under the Computer Fraud and Abuse Act. The Library can’t handle this internally even if it wants to, because The Content Vendor has either been told or figured out for itself exactly who Dr. Owl is.
hot fully-identified patron-behavior data? Sure.

And I mentioned my third horror scenario already, but making it formal: The Content Vendor straight-up sells identified patron-behavior data to The Data Broker, comprehensively trashing patron privacy.
vendors ARE data brokers.
• Others have been caught partnering/sharing data with the likes of ICE.
• Most of them have a ton of web bugs on their sites that send data (directly or in-) to data brokers.
• No, they don’t take the web bugs off just because a patron is authenticated.
• I don’t trust most content vendors as far as I could throw them. LIS research so far agrees with me.

But surely The Content Vendor knows better than to sell patron data to The Data Broker! Y’all, some of our content vendors ARE DATA BROKERS. Others have partnered with the likes of immigration enforcement. Most are covered in web bugs that send data to data brokers. Hi, content vendors listening in! I need you to know that I DO NOT TRUST YOU with patron data, and I do not think any of us should, and all the LIS research that’s been done on this explains why not!
Me, “Physical-Equivalent Privacy.” Serials Librarian. OA postprint; check GScholar.
• Forthcoming: McLean and Stregger, “Sounding the Alarm: scholarly information and global information companies in 2021.” Partnership.
• Forthcoming: Lisa Hinchliffe, model license language (Mellon-funded project; thank you, Mellon!)
• And watch SPARC. They’re aware and concerned.

And speaking of research, some stuff to read: everything by Sarah Lamdan, my own Physical-Equivalent Privacy piece, a forthcoming piece in Partnership that looks amazing, forthcoming model license language from Lisa Hinchliffe, and keep an eye on SPARC, okay?