
Single sign-on, the library, and patron privacy

Dorothea Salo
November 01, 2021


Presented for the Digital Libraries Forum conference panel, "Protecting readers' right to privacy in single-sign-on access," 1 November 2021





    School University of Wisconsin-Madison Hi, y’all, I’m Dorothea Salo and some of you know me already, so let’s just get right to talking about single sign-on, the library, and patron privacy.
  2. OUR MAIN CHARACTERS The Library Dr. Owl, a Campus Researcher

    Campus IT The Content Vendor Here’s our cast of main characters: The Library, The Content Vendor, Campus IT, and campus researcher Dr. Owl, whose pronouns are they/them.
  3. I need an article! And here’s our user story: Dr.

    Owl in their office needs a paywalled journal article that’s available through The Library. The Library can make that happen in a couple-three ways.
  4. IP AUTHENTICATION *clicks* Yay! Article! *notices IP address belongs to

    campus* *sends article* One is IP authentication, since Dr. Owl is in their office. Dr. Owl clicks on the article link. The Content Vendor notices that Dr. Owl’s computer’s IP address belongs to campus, so The Content Vendor sends the article back to Dr. Owl’s browser.
  5. (BEHIND THE SCENES) Here are campus’s IP address blocks! Okay,

    noted. We’ll let these through. Behind the scenes, of course, there’s some work that The Library and The Content Vendor have to do, maintaining lists of IP addresses that belong to campus. And I’m kind of oversimplifying here, but just so you have the basic idea.
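To make that behind-the-scenes check a little more concrete, here is a minimal Python sketch of how a vendor-side system might test whether a request comes from a campus address block. The block list and the function name `is_campus_address` are hypothetical, and the addresses are reserved documentation ranges, not any real campus's. It is a simplification in the same spirit as the diagram.

```python
import ipaddress

# Hypothetical campus address blocks (reserved documentation ranges,
# not real campus data). The Library would supply the real list.
CAMPUS_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_campus_address(client_ip: str) -> bool:
    """Return True if client_ip falls inside any registered campus block."""
    address = ipaddress.ip_address(client_ip)
    return any(address in block for block in CAMPUS_BLOCKS)


print(is_campus_address("192.0.2.45"))   # True: inside the first block
print(is_campus_address("203.0.113.9"))  # False: not a campus address
```

Note that the check itself only needs a yes-or-no answer; nothing about it requires knowing who is sitting at the computer.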
  6. MAYBE PRIVATE, MAYBE… NOT. • If Dr. Owl’s office

    computer has a “dynamic” (changing) IP address, fairly private. • If Dr. Owl’s office computer has a “static” (persistent) IP address… The Content Vendor can build up a reading-tracking dossier on Dr. Owl. • If Dr. Owl’s computer’s network name identifies Dr. Owl (I have seen this!), or if Dr. Owl’s reading habits are fairly uncommon, or if The Content Vendor uses web bugs/trackers… The Content Vendor can reidentify Dr. Owl and associate their reading-behavior dossier with them. • The Content Vendor may sell that reidentified dossier to… The Data Broker! This kind of authentication may protect Dr. Owl’s privacy… or not. If Dr. Owl’s office computer gets a dynamic IP address, that protects Dr. Owl to some extent. If the IP address is static, though, it’s trivial for The Content Vendor to track Dr. Owl’s reading just using that IP address as an identifier. If Dr. Owl has unusual reading habits, they may be reidentifiable from those. More likely, though, The Content Vendor just reidentifies Dr. Owl with standard web bugs. And then The Content Vendor has a new revenue stream: selling reading-behavior dossiers on library patrons to The Data Broker!
  7. I need an article! I’m at home, though! How can

    I get it? But there’s another user story we have to account for. Dr. Owl is working from home, and suddenly they need an article. Since they don’t live on campus, The Content Vendor can’t recognize their computer’s IP address as belonging to campus. So how does The Library get Dr. Owl that article?
  8. YE OLDE PROXY SERVER 1. clicks proxied link 2. “legit

    patron?” 4. logs in 5. “yes, Dr. Owl, legit” 6. “send article.” 7. sees proxy server, sends article 8. sends article 3. “log in, please?” What The Library does is stand up what’s called a proxy server, and here’s a diagram of more or less how that works. Dr. Owl clicks a proxied link to the article. The Library’s proxy server doesn’t know who clicked, so it asks Campus IT whether the clicker is a legitimate patron. Campus IT asks Dr. Owl to log in, which they do, so Campus IT reports back to The Library’s proxy server that it’s Dr. Owl and they’re legit. The Library’s proxy server then requests the article from The Content Vendor, who sees that it’s the proxy server and sends it over. And finally, The Library’s proxy server sends the article to Dr. Owl.
  9. BUT IT HAS PRIVACY ADVANTAGES. • Dr. Owl’s identity never

    leaves campus. • If Dr. Owl starts mass-downloading for their Important Text-Mining Project, justifiably annoying The Content Vendor, The Library can handle it campus-internally. • The Content Vendor doesn’t see Dr. Owl’s IP address, only the proxy server’s. • This foils The Content Vendor’s web bugs/trackers. • Dr. Owl’s reading gets mixed in with the rest of campus. • This foils behavioral tracking by The Content Vendor. But there are privacy advantages to it. The big one is that Dr. Owl’s identity never leaves campus, so The Content Vendor doesn’t automatically know it. If Dr. Owl starts downloading a whole database to text-mine it, The Content Vendor tells The Library and The Library quietly tells Dr. Owl to knock it off. No lawsuits! Second, The Content Vendor never sees Dr. Owl’s IP address, so web bugs and trackers can’t nab it. And third, with lots of people on campus going through The Library’s proxy server to get articles, it’s way harder to track Dr. Owl’s reading because it’s all mixed in with everybody else’s reading.
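Here is a rough Python sketch of that "mixed in with everybody else" point, assuming the vendor has nothing but source IP addresses to go on. The proxy address, the reader addresses, and the article names are all made up for illustration, and `vendor_view` is a hypothetical helper. It deliberately leaves out cookies, web bugs, and anything else a real vendor site might use.

```python
from collections import defaultdict

PROXY_IP = "192.0.2.10"  # hypothetical address of The Library's proxy server

# Hypothetical article requests: (reader's own IP address, article requested).
requests = [
    ("198.51.100.7", "owl-migration"),
    ("198.51.100.8", "medieval-law"),
    ("198.51.100.7", "owl-folklore"),
    ("198.51.100.9", "polymer-chem"),
]


def vendor_view(requests, via_proxy):
    """Group requested articles by the source IP the vendor would see."""
    seen = defaultdict(list)
    for reader_ip, article in requests:
        # Through the proxy, every request appears to come from one address.
        source_ip = PROXY_IP if via_proxy else reader_ip
        seen[source_ip].append(article)
    return dict(seen)


direct = vendor_view(requests, via_proxy=False)
proxied = vendor_view(requests, via_proxy=True)

# Direct access with static addresses: one tidy reading list per reader.
print(direct)
# Proxied access: a single bucket holding everybody's reading.
print(proxied)
```

In the direct case the per-address lists are exactly the reading-tracking dossiers described earlier; in the proxied case there is only one list and no address-level way to split it back out by reader.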
  10. SINGLE SIGN-ON (SSO) SYSTEMS • For authentication (“who are you?”)

    and authorization (“given who you are, what are you allowed to see/do?”) • There are actually several ways to set up an SSO-friendly authentication system, just to be clear. • Common, though far from universal, in higher ed • Rely on a communication language called SAML, and (often) a software setup called Shibboleth • Often implemented through “federations” such as InCommon and OpenAthens. With that for background, let’s talk about single sign-on systems. These both authenticate you (make you prove who you are) and authorize you to actually see and do things, such as download articles to read. They’re common, though far from universal, in higher ed. Some jargon, just so you’ve heard it: these systems rely on an XML-based communication language called SAML and often a software rig called Shibboleth. They may be implemented through federations such as InCommon or OpenAthens. If you hear any of these weird words, JUST THINK SINGLE SIGN-ON; you mostly won’t need the details.
  11. RA21, SEAMLESSACCESS • RA21: NISO and STM Association attempt to

    design an SSO system to replace proxy servers. • NISO released it over significant protest, mostly about privacy and security. • SeamlessAccess: Intended RA21 successor, same goals, more participants • Learned some things from the RA21 protests! So a couple of years ago, NISO and the S-T-M Association designed a single sign-on system that they wanted everybody to use instead of proxy servers. Long story short, it didn’t go anywhere. BIG DISCLAIMER: I myself critiqued RA21 heavily because of privacy and security concerns; I was not a fan. The successor effort, which has more organizations participating, is called SeamlessAccess and has the same goals as RA21.
  12. HOW SSO WORKS 1. clicks link 2. “who’s this?” 4.

    logs in 6. sends article 3. “log in, please?” 5. sends Dr. Owl’s metadata Here’s a simplified sketch of how single sign-on works in our user story. Dr. Owl clicks on an ordinary unproxied link to an article. The Content Vendor figures out that the clicker is from campus, so it asks Campus IT who the clicker is. Campus IT has Dr. Owl log in, as before, and then sends metadata about Dr. Owl to The Content Vendor, who decides that Dr. Owl is legit and sends them the article they want. One important detail here: a lot of the communication I’ve drawn for you here actually happens right inside Dr. Owl’s browser. We’ll learn more about this from John Mark Ockerbloom!

    RIGHT? … right? This is less kludgy, less fragile, and the library doesn’t even have to do anything once there’s a content license! Yay, right? … right?


    AND CONFIDENTIALITY WITH RESPECT TO… RESOURCES CONSULTED, BORROWED, ACQUIRED OR TRANSMITTED.” (American Library Association Code of Ethics, Article 3) Quick reminder that we have ethics codes that tell us we’re supposed to protect the privacy and confidentiality of what people read, including from our vendors.

    TO KNOW WHAT READS. IS NOT SUPPOSED TO KNOW WHAT READS. The Content Vendor is not supposed to know what Dr. Owl is reading. Law enforcement at any level is definitely not supposed to know what Dr. Owl is reading. And data brokers, whose whole business is selling behavior data hither and yon and especially to governments and law enforcement, are emphatically not supposed to know what Dr. Owl is reading.
  17. WHAT PATRON METADATA IS THERE? • PII-style identifiers:

    name, email, campus ID, etc. • Persistent pseudonymous identifiers (PPIs) • 24601, No. 6, 007, 8675309, 7 of 9, Eleven… • You know who (some of) those are, right? Yeah. Given a PPI, The Content Vendor can likely figure out who Dr. Owl is too. Behavioral tracking and web bugs are all it takes! • SeamlessAccess calls PPIs “anonymous.” This is flat wrong. • Non-PII/PPI metadata: “entitlement metadata” • “This person is tenured faculty in the Ornithology department with courtesy appointments in Philosophy, Religion, and Folklore.” • Given enough of this, and/or combining it with web-bug and behavioral data, The Content Vendor can absolutely reidentify Dr. Owl. First, let’s ask: what patron metadata is even there? Well, there’s standard P-I-I, like name and email address and campus identifiers. Then there’s what I’m calling persistent pseudonymous identifiers, like a number that’s consistently used over time to identify a given patron without using their actual name. So like, 2-4-6-0-1, Number Six, the immortal double-oh seven, and so on. You know who at least some of those numbers are, right? Yeah, so this isn’t really privacy protecting. Given a persistent pseudonymous identifier, The Content Vendor can easily use behavioral tracking and web bugs to tie that identifier to Dr. Owl. SeamlessAccess calls these identifiers “anonymous” and I am really angry about that because words mean things and PPIs are not anonymous. Deidentified, yeah. Anonymous, absolutely not. Anyway, so. Third kind of metadata is often called entitlement metadata, and it’s not about your identity, it’s a list of campus groups you belong to. So for Dr. Owl that might look like, they’re faculty, they’re tenured, they’re in the Ornithology department, and they’ve got courtesy appointments elsewhere. If Dr. Owl is the only tenured faculty member with this specific collection of appointments, well, they’ve basically been immediately reidentified.
They’re not anonymous to The Content Vendor at all.
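The "only person with this combination" problem can be checked mechanically. Below is a minimal Python sketch, assuming a tiny made-up campus directory that holds entitlement metadata only, with no names at all. The records and the helper `anonymity_set_sizes` are hypothetical; the idea is just to count how many people share each exact combination of attributes.

```python
from collections import Counter

# Hypothetical directory: (role, tenure status, department, courtesy appointments).
# No names or ID numbers appear anywhere in this data.
directory = [
    ("faculty", "tenured", "Ornithology", ("Folklore", "Philosophy", "Religion")),
    ("faculty", "tenured", "Ornithology", ()),
    ("faculty", "untenured", "Ornithology", ()),
    ("student", "n/a", "Ornithology", ()),
    ("student", "n/a", "Ornithology", ()),
]


def anonymity_set_sizes(records):
    """Count how many people share each exact combination of attributes."""
    return Counter(records)


sizes = anonymity_set_sizes(directory)

# Any combination held by exactly one person points straight at that person.
unique_combinations = [record for record, count in sizes.items() if count == 1]

for record in unique_combinations:
    print("Reidentifiable from entitlements alone:", record)
```

In this toy directory the two students hide behind each other, but each faculty record is one of a kind, including the one with Dr. Owl's courtesy appointments. A real campus is bigger, but the more attributes get sent, the more one-of-a-kind combinations there are.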
  18. WHAT METADATA GETS SENT TO ? • Under SeamlessAccess, three

    scenarios: • Authentication Only: “yes or no, is this person affiliated with campus?” • (the very misnamed) Anonymous Authorization: Non-PII entitlement metadata. How much metadata? How reidentifiable would it be? Who knows. SeamlessAccess doesn’t fully specify. • Pseudonymous Authorization: PPI. Be seeing you, No. 6. But just because metadata exists doesn’t mean The Content Vendor gets to see it. So what metadata does The Content Vendor get? Under SeamlessAccess, there are three possible scenarios. One I really like, and that’s the Authentication Only scenario, where the only thing The Content Vendor finds out is whether somebody actually is from campus. Then there’s the very misnamed Anonymous Authorization scenario, where what gets communicated is entitlement metadata. How much? How reidentifiable would it be? Who knows, SeamlessAccess doesn’t say. And finally there’s Pseudonymous Authorization, which sends a persistent pseudonymous identifier. Be seeing you, Number Six.
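As a way of picturing the three scenarios, here is a small Python sketch of a release filter sitting between the campus's full patron record and what goes to the vendor. Everything in it is an assumption made for illustration: the field names, the sample record, and the function `released_metadata` are hypothetical, and they are not SAML attribute names or anything SeamlessAccess specifies. It follows the descriptions above as given, nothing more.

```python
from enum import Enum


class Scenario(Enum):
    AUTHENTICATION_ONLY = "authentication only"
    ANONYMOUS_AUTHORIZATION = "anonymous authorization"
    PSEUDONYMOUS_AUTHORIZATION = "pseudonymous authorization"


# Hypothetical full record held on campus (field names are invented).
patron = {
    "name": "Dr. Owl",
    "email": "owl@campus.example",
    "ppi": "24601",
    "affiliated": True,
    "entitlements": ["faculty", "tenured", "Ornithology"],
}


def released_metadata(patron, scenario):
    """Return only the fields the chosen scenario would send to the vendor."""
    released = {"affiliated": patron["affiliated"]}
    if scenario is Scenario.ANONYMOUS_AUTHORIZATION:
        released["entitlements"] = list(patron["entitlements"])
    elif scenario is Scenario.PSEUDONYMOUS_AUTHORIZATION:
        released["ppi"] = patron["ppi"]
    return released


for scenario in Scenario:
    print(scenario.value, "->", released_metadata(patron, scenario))
```

The point of writing it as a filter is that the default is to send nothing beyond the yes-or-no answer, and every extra field has to be added on purpose.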

    up SSO to give us all the user metadata you got. *yawn* Sure, whatever, here you go. Either of the second two scenarios could go sour real fast. The first horror scenario is where The Content Vendor, in setting up single sign-on, asks Campus IT to send over all available user metadata, P-I-I and P-P-I and entitlements and all, and Campus IT just… does that. Where’s The Library in this? Well, remember, The Library doesn’t run or control the campus single-sign-on system. To keep Campus IT from being lazy, they’ll have to make a nuisance of themselves, and even then Campus IT may brush them off.

    hate to say it, but a lot of Campus IT folks wouldn’t even think twice. They’re not librarians. They’re cool with surveillance. They do it themselves and help other people on campus do it too! This slide here is from a campus chief information security officer, and was part of an explanation to a SeamlessAccess-friendly crowd of how a system can use behavioral tracking, among other things, to nail so-called fraudulent downloaders.

    OWL STOP! CFAA! Federal prosecution! Which leads me to horror scenario two, a repeat of the awful Aaron Swartz saga. The Content Vendor yells “fraudulent downloading!” and calls the feds to prosecute Dr. Owl under the Computer Fraud and Abuse Act. The Library can’t handle this internally even if it wants to, because The Content Vendor has either been told or figured out for itself exactly who Dr. Owl is.
  22. RIGHT THERE IN THE MANUAL This scenario is written RIGHT

    INTO the RA21 goals, which were adopted wholesale by SeamlessAccess. “End-to-end traceability… for detecting fraud.”

    hot fully-identified patron-behavior data? Sure. And I mentioned my third horror scenario already, but making it formal: The Content Vendor straight-up sells identified patron-behavior data to The Data Broker, comprehensively trashing patron privacy.

    • By no means. • Uh-uh. • Not one single solitary word. But surely SeamlessAccess says this isn’t okay! Y’all, it does not say that. Not no way, not no how, not nowhere.
  25. WOULD SELL PATRONS TO ? • Some of our content

    vendors ARE data brokers. • Others have been caught partnering/sharing data with the likes of ICE. • Most of them have a ton of web bugs on their sites that send data (directly or in-) to data brokers. • No, they don’t take the web bugs off just because a patron is authenticated. • I don’t trust most content vendors as far as I could throw them. LIS research so far agrees with me. But surely The Content Vendor knows better than to sell patron data to The Data Broker! Y’all, some of our content vendors ARE DATA BROKERS. Others have partnered with the likes of immigration enforcement. Most are covered in web bugs that send data to data brokers. Hi, content vendors listening in! I need you to know that I DO NOT TRUST YOU with patron data, and I do not think any of us should, and all the LIS research that’s been done on this explains why not!
  26. TO READ! • Everything Sarah Lamdan has ever written. •

    Me, “Physical-Equivalent Privacy.” Serials Librarian. OA postprint; check GScholar. • Forthcoming: McLean and Stregger, “Sounding the Alarm: scholarly information and global information companies in 2021.” Partnership. • Forthcoming: Lisa Hinchliffe, model license language (Mellon-funded project; thank you, Mellon!) • And watch SPARC. They’re aware and concerned. And speaking of research, some stuff to read: everything by Sarah Lamdan, my own Physical-Equivalent Privacy piece, a forthcoming piece in Partnership that looks amazing, forthcoming model license language from Lisa Hinchliffe, and keep an eye on SPARC, okay?
  27. THANKS! This presentation is copyright 2021 by Dorothea Salo. It

    is available under a Creative Commons Attribution 4.0 International license. All icons from OpenClipArt. Support OpenClipArt! And that’s what I got, so I’ll turn it over to our next presenter!