Presented for the Digital Libraries Forum conference panel, "Protecting readers' right to privacy in single-sign-on access," 1 November 2021
AND PATRON PRIVACY
University of Wisconsin-Madison
Hi, y’all, I’m Dorothea Salo and some of you know me already, so let’s just get right to talking about single sign-on,
the library, and patron privacy.
OUR MAIN CHARACTERS
Dr. Owl, a Campus Researcher
The Content Vendor
Here’s our cast of main characters: The Library, The Content Vendor, Campus IT, and campus researcher Dr. Owl,
whose pronouns are they/them.
I need an article!
And here’s our user story: Dr. Owl in their office needs a paywalled journal article that’s available through The Library.
The Library can make that happen in a couple-three ways.
*notices IP address
belongs to campus*
One is IP authentication, since Dr. Owl is in their office. Dr. Owl clicks on the article link. The Content Vendor notices
that Dr. Owl’s computer’s IP address belongs to campus, so The Content Vendor sends the article back to Dr. Owl’s
(BEHIND THE SCENES)
Here are campus’s
IP address blocks!
We’ll let these
Behind the scenes, of course, there’s some work that The Library and The Content Vendor have to do, maintaining
lists of IP addresses that belong to campus. And I’m kind of oversimplifying here, but just so you have the basic idea.
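As a rough sketch of that behind-the-scenes check, here is how a vendor might test an incoming address against a campus list. The address blocks are reserved documentation ranges standing in for a real campus, and the function name `is_campus_address` is made up for illustration.

```python
import ipaddress

# Hypothetical campus address blocks (reserved documentation ranges, not a real campus).
CAMPUS_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_campus_address(ip: str) -> bool:
    """Return True if the requesting IP falls inside any listed campus block."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in CAMPUS_BLOCKS)

print(is_campus_address("192.0.2.17"))   # an address inside a campus block
print(is_campus_address("203.0.113.5"))  # an address outside every campus block
```

Note that the check needs nothing but the IP address: no name, no login.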
MAYBE PRIVATE, MAYBE… NOT.
• If Dr. Owl’s office computer has a “dynamic”
(changing) IP address, fairly private.
• If Dr. Owl’s office computer has a “static” (persistent)
IP address… The Content Vendor can build up a
reading-tracking dossier on Dr. Owl.
• If Dr. Owl’s computer’s network name identifies Dr. Owl (I have seen this!), or if
Dr. Owl’s reading habits are fairly uncommon, or if The Content Vendor uses
web bugs/trackers… The Content Vendor can reidentify Dr. Owl and
associate their reading-behavior dossier with them.
• The Content Vendor may sell that reidentified dossier to…
The Data Broker!
This kind of authentication may protect Dr. Owl’s privacy… or not. If Dr. Owl’s office computer gets a dynamic IP
address, that protects Dr. Owl to some extent. If the IP address is static, though, it’s trivial for The Content Vendor to
track Dr. Owl’s reading just using that IP address as an identifier.
If Dr. Owl has unusual reading habits, they may be reidentifiable from those. More likely, though, The Content Vendor
reidentifies Dr. Owl with standard web bugs. And then The Content Vendor has a new revenue stream: selling
reading-behavior dossiers on library patrons to The Data Broker!
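To make "trivial" concrete, here is a minimal sketch of the static-IP case, assuming the vendor keeps an ordinary access log. The log entries, article titles, and the `build_dossiers` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical vendor-side access log: (client IP address, article requested).
access_log = [
    ("192.0.2.17", "Owl pellet analysis"),
    ("192.0.2.88", "Intro to SAML"),
    ("192.0.2.17", "Nocturnal raptor migration"),
    ("192.0.2.17", "Folklore of the barn owl"),
]

def build_dossiers(log):
    """Group every requested article under the IP address that asked for it."""
    dossiers = defaultdict(list)
    for ip, article in log:
        dossiers[ip].append(article)
    return dict(dossiers)

dossiers = build_dossiers(access_log)
for ip, articles in dossiers.items():
    print(ip, articles)
```

A static address makes each key in `dossiers` one person's reading history. A dynamic address scatters that history across several keys, which is the partial protection described above.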
I need an article!
I’m at home, though!
How can I get it?
But there’s another user story we have to account for. Dr. Owl is working from home, and suddenly they need an
article. Since they don’t live on campus, The Content Vendor can’t recognize their computer’s IP address as belonging
to campus. So how does The Library get Dr. Owl that article?
YE OLDE PROXY SERVER
2. “legit patron?”
3. “log in, please?”
4. logs in
5. “yes, Dr. Owl, legit”
6. “send article.”
7. sees proxy server, sends article
What The Library does is stand up what’s called a proxy server, and here’s a diagram of more or less how that works.
Dr. Owl clicks a proxied link to the article. The Library’s proxy server doesn’t know who clicked, so it asks Campus IT
whether the clicker is a legitimate patron. Campus IT asks Dr. Owl to log in, which they do, so Campus IT reports
back to The Library’s proxy server that it’s Dr. Owl and they’re legit.
The Library’s proxy server then requests the article from The Content Vendor, who sees that it’s the proxy server and
sends it over. And finally, The Library’s proxy server sends the article to Dr. Owl.
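The same round trip can be sketched in a few lines of code. This is a toy model, not how any real proxy product is built: the account table, the proxy address, and every function name are hypothetical stand-ins for the characters in the diagram.

```python
# Hypothetical stand-ins for the characters in the diagram.
PROXY_IP = "192.0.2.1"                    # the proxy server's campus address
CAMPUS_ACCOUNTS = {"drowl": "hoot-hoot"}  # Campus IT's login records

vendor_log = []  # everything The Content Vendor gets to record

def campus_it_login(username, password):
    """Campus IT checks the login and answers yes or no."""
    return CAMPUS_ACCOUNTS.get(username) == password

def content_vendor(request_ip, article):
    """The vendor sees only the requesting IP address, never the patron."""
    vendor_log.append((request_ip, article))
    if request_ip != PROXY_IP:
        return None
    return f"full text of {article}"

def library_proxy(username, password, article):
    """Confirm the patron with Campus IT, then fetch the article on their behalf."""
    if not campus_it_login(username, password):
        return None
    return content_vendor(PROXY_IP, article)

result = library_proxy("drowl", "hoot-hoot", "Owl pellet analysis")
print(result)
print(vendor_log)
```

The thing to look at is `vendor_log`: it holds the proxy's address and the article title, and nothing about who logged in.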
LET’S FACE IT:
THIS IS AN UGLY, FRAGILE KLUDGE.
NOBODY LOVES THIS.
BUT IT HAS PRIVACY ADVANTAGES.
• Dr. Owl’s identity never leaves campus.
• If Dr. Owl starts mass-downloading for their Important Text-Mining Project,
justifiably annoying The Content Vendor, The Library can handle it
• The Content Vendor doesn’t see Dr. Owl’s IP
address, only the proxy server’s.
• This foils The Content Vendor’s web bugs/trackers.
• Dr. Owl’s reading gets mixed in with the rest of
• This foils behavioral tracking by The Content Vendor.
But there are privacy advantages to it. The big one is that Dr. Owl’s identity never leaves campus, so The Content
Vendor doesn’t automatically know it. If Dr. Owl starts downloading a whole database to text-mine it, The Content
Vendor tells The Library and The Library quietly tells Dr. Owl to knock it off. No lawsuits.
Second, The Content Vendor never sees Dr. Owl’s IP address, so web bugs and trackers can’t nab it. And third, with
lots of people on campus going through The Library’s proxy server to get articles, it’s way harder to track Dr. Owl’s
reading because it’s all mixed in with everybody else’s reading.
SINGLE SIGN-ON (SSO) SYSTEMS
• For authentication (“who are you?”) and
authorization (“given who you are, what are you
allowed to see/do?”)
• There are actually several ways to set up a SSO-friendly authentication
system, just to be clear.
• Common, though far from universal, in higher ed
• Rely on a communication language called SAML,
and (often) a software setup called Shibboleth
• Often implemented through “federations” such
as InCommon and OpenAthens.
With that for background, let’s talk about single sign-on systems. These both authenticate you — make you prove
who you are — and authorize you to actually see and do things, such as download articles to read. They’re common,
though far from universal, in higher ed.
Some jargon, just so you’ve heard it: these systems rely on an XML-based communication language called SAML and
often a software rig called Shibboleth. They may be implemented through federations such as InCommon or
OpenAthens. If you hear any of these weird words, JUST THINK SINGLE SIGN-ON; you mostly won’t need the details.
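For the curious, here is roughly what "XML-based" means in practice. The fragment below is made up and heavily trimmed (real SAML assertions are signed, much longer, and carry more required attributes); only the namespace URI is the real SAML 2.0 one. The `read_attributes` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# The SAML 2.0 assertion namespace.
NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# A made-up, heavily trimmed fragment for illustration only.
fragment = """
<saml:AttributeStatement xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml:Attribute FriendlyName="eduPersonScopedAffiliation">
    <saml:AttributeValue>faculty@campus.example</saml:AttributeValue>
  </saml:Attribute>
</saml:AttributeStatement>
"""

def read_attributes(xml_text):
    """Collect each attribute's values, keyed by its friendly name."""
    root = ET.fromstring(xml_text)
    attrs = {}
    for attr in root.findall("saml:Attribute", NS):
        values = [v.text for v in attr.findall("saml:AttributeValue", NS)]
        attrs[attr.get("FriendlyName")] = values
    return attrs

attributes = read_attributes(fragment)
print(attributes)
```

Whatever the campus puts in those attribute elements is what the receiving side learns about the patron.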
• RA21: NISO and STM Association attempt to
design a SSO system to replace proxy servers.
• NISO released it over significant protest, mostly
about privacy and security.
• SeamlessAccess: Intended RA21 successor,
same goals, more participants
• Learned some things from the RA21 protests!
So a couple of years ago, NISO and the S-T-M Association designed a single sign-on system that they wanted
everybody to use instead of proxy servers. Long story short, it didn’t go anywhere. BIG DISCLAIMER: I myself critiqued
RA21 heavily because of privacy and security concerns. I was not a fan.
The successor effort, which has more organizations participating, is called SeamlessAccess and has the same goals as
RA21.
HOW SSO WORKS
1. clicks link
3. “log in, please?”
4. logs in
6. sends article
Here’s a simplified sketch of how single sign-on works in our user story. Dr. Owl clicks on an ordinary unproxied link
to an article. The Content Vendor figures out that the clicker is from campus, so it asks Campus IT who the clicker is.
Campus IT has Dr. Owl log in, as before, and then sends metadata about Dr. Owl to The Content Vendor, who decides
that Dr. Owl is legit and sends them the article they want.
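Here is the same kind of toy model for the single sign-on flow. Every name, the account table, and the released metadata are hypothetical, and the browser redirects are collapsed into one function call.

```python
CAMPUS_ACCOUNTS = {"drowl": "hoot-hoot"}  # Campus IT's login records (hypothetical)

# Hypothetical metadata Campus IT is configured to release after a login.
RELEASED_METADATA = {"drowl": {"affiliation": "faculty", "pseudonym": "24601"}}

vendor_log = []  # everything The Content Vendor gets to record

def campus_it_login(username, password):
    """Campus IT checks the login, then hands back metadata about the patron."""
    if CAMPUS_ACCOUNTS.get(username) != password:
        return None
    return RELEASED_METADATA[username]

def content_vendor(article, metadata):
    """The vendor decides on access using metadata it received about the patron."""
    vendor_log.append((metadata, article))
    return f"full text of {article}"

def sso_request(article, username, password):
    """The browser's round trip: vendor, then the Campus IT login, then back to the vendor."""
    metadata = campus_it_login(username, password)  # the password stays at Campus IT
    if metadata is None:
        return None
    return content_vendor(article, metadata)

result = sso_request("Owl pellet analysis", "drowl", "hoot-hoot")
print(vendor_log)
```

Compare `vendor_log` here with the proxy version: the vendor's record now holds patron metadata next to the article title.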
One important detail here: a lot of the communication I’ve drawn for you here actually happens right inside Dr. Owl’s
browser. We’ll learn more about this from John Mark Ockerbloom!
LESS EFFORT FROM THE LIBRARY!
This is less kludgy, less fragile, and the library doesn’t even have to do anything once there’s a content license! Yay,
I SURE HOPE
SOME OF YOU ARE YELLING
“WHY IS PATRON DATA GOING
TO THE CONTENT VENDOR?!”
“WE PROTECT EACH LIBRARY USER’S RIGHT
TO PRIVACY AND CONFIDENTIALITY
WITH RESPECT TO… RESOURCES
CONSULTED, BORROWED, ACQUIRED OR TRANSMITTED.”
—American Library Association
Code of Ethics, Article 3
Quick reminder that we have ethics codes that tell us we’re supposed to protect the privacy and confidentiality of
what people read, including from our vendors.
IS NOT SUPPOSED TO KNOW
IS NOT SUPPOSED TO KNOW
IS NOT SUPPOSED TO KNOW
The Content Vendor is not supposed to know what Dr. Owl is reading. Law enforcement at any level is definitely not
supposed to know what Dr. Owl is reading.
And data brokers, whose whole business is selling behavior data hither and yon and especially to governments and
law enforcement, are emphatically not supposed to know what Dr. Owl is reading.
BUT LET’S NOT PANIC QUITE YET.
WHAT PATRON METADATA IS THERE?
• PII-style identifiers: name, email, campus ID, etc.
• Persistent pseudonymous identifiers (PPIs)
• 24601, No. 6, 007, 8675309, 7 of 9, Eleven…
• You know who (some of) those are, right? Yeah. Given a PPI, The Content
Vendor can likely figure out who Dr. Owl is too. Behavioral tracking and
web bugs are all it takes!
• SeamlessAccess calls PPIs “anonymous.” This is flat wrong.
• Non-PII/PPI metadata: “entitlement metadata”
• “This person is tenured faculty in the Ornithology department with
courtesy appointments in Philosophy, Religion, and Folklore.”
• Given enough of this, and/or combining it with web-bug and behavioral
data, The Content Vendor can absolutely reidentify Dr. Owl.
First let’s ask: what patron metadata even is there? Well, there’s standard P-I-I, like name and email address and
campus identifiers. Then there’s what I’m calling persistent pseudonymous identifiers, like a number that’s
consistently used over time to identify a given patron without using their actual name. So like, 2-4-6-0-1, Number
Six, the immortal double-oh seven, and so on.
You know who at least some of those numbers are, right? Yeah, so this isn’t really privacy protecting. Given a
persistent pseudonymous identifier, The Content Vendor can easily use behavioral tracking and web bugs to tie that
identifier to Dr. Owl. SeamlessAccess calls these identifiers “anonymous” and I am really angry about that because
words mean things and PPIs are not anonymous. Deidentified, yeah. Anonymous, absolutely not.
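A minimal sketch of why a persistent pseudonym is not anonymity, assuming a single tracker hit that sees the pseudonym and an email address in the same session. The records, the email address, and the `reidentify` helper are all invented.

```python
# Hypothetical vendor-side reading records, keyed by persistent pseudonymous identifier.
reading_by_ppi = {
    "24601": ["Owl pellet analysis", "Folklore of the barn owl"],
    "8675309": ["Intro to SAML"],
}

# One hypothetical web-bug hit that saw a pseudonym and an email in the same session.
tracker_hits = [{"ppi": "24601", "email": "drowl@campus.example"}]

def reidentify(reading, hits):
    """Attach a real-world identity to every reading record whose pseudonym a tracker saw."""
    named = {}
    for hit in hits:
        if hit["ppi"] in reading:
            named[hit["email"]] = reading[hit["ppi"]]
    return named

named_dossiers = reidentify(reading_by_ppi, tracker_hits)
print(named_dossiers)
```

Because the pseudonym never changes, one link is enough to put a name on the whole history, past and future.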
Anyway, so. Third kind of metadata is often called entitlement metadata, and it’s not about your identity, it’s a list of
campus groups you belong to. So for Dr. Owl that might look like, they’re faculty, they’re tenured, they’re in the
Ornithology department, and they’ve got courtesy appointments elsewhere.
If Dr. Owl is the only tenured faculty member with this specific collection of appointments, well, they’ve basically
been immediately reidentified. They’re not anonymous to The Content Vendor at all.
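One way to see the problem is to count how many people share each exact combination of groups. The directory below is made up, and `uniquely_identified` is a hypothetical helper, not anything SeamlessAccess or a campus provides.

```python
from collections import Counter

# Hypothetical campus directory; every name and group here is made up.
directory = [
    {"name": "Dr. Owl", "groups": {"faculty", "tenured", "ornithology", "philosophy"}},
    {"name": "Dr. Finch", "groups": {"faculty", "tenured", "ornithology"}},
    {"name": "Dr. Wren", "groups": {"faculty", "tenured", "ornithology"}},
    {"name": "Robin", "groups": {"student", "ornithology"}},
    {"name": "Jay", "groups": {"student", "ornithology"}},
]

def uniquely_identified(people):
    """Return the names of people whose exact group combination nobody else shares."""
    counts = Counter(frozenset(p["groups"]) for p in people)
    return [p["name"] for p in people if counts[frozenset(p["groups"])] == 1]

print(uniquely_identified(directory))
```

In this toy directory the courtesy appointment is what singles Dr. Owl out: everyone else has at least one colleague with the same combination.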
WHAT METADATA GETS SENT
• Under SeamlessAccess, three scenarios:
• Authentication Only: “yes or no, is this person affiliated with
• (the very misnamed) Anonymous Authorization: Non-PII
entitlement metadata. How much metadata? How reidentifiable
would it be? Who knows. SeamlessAccess doesn’t fully specify.
• Pseudonymous Authorization: PPI. Be seeing you, No. 6.
But just because metadata exists doesn’t mean The Content Vendor gets to see it. So what metadata does The
Content Vendor get?
Under SeamlessAccess, there are three possible scenarios. One I really like, and that’s the Authentication Only
scenario, where the only thing The Content Vendor finds out is whether somebody actually is from campus.
Then there’s the very misnamed Anonymous Authorization scenario, where what gets communicated is entitlement
metadata. How much? How reidentifiable would it be? Who knows, SeamlessAccess doesn’t say.
Finally there’s Pseudonymous Authorization, which sends a persistent pseudonymous identifier. Be seeing you,
Number Six.
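The three scenarios can be sketched as a filter over what Campus IT knows. This follows the talk's description only. The field names, the sample record, and the `release` function are assumptions, not the SeamlessAccess specification.

```python
# Hypothetical full record Campus IT holds for one patron.
FULL_RECORD = {
    "name": "Dr. Owl",
    "email": "drowl@campus.example",
    "ppi": "24601",
    "entitlements": ["faculty", "tenured", "ornithology"],
}

# Which fields each scenario lets through, as the talk describes them (an assumption).
SCENARIO_FIELDS = {
    "authentication_only": [],
    "anonymous_authorization": ["entitlements"],
    "pseudonymous_authorization": ["ppi"],
}

def release(record, scenario):
    """Return only what The Content Vendor would see under the given scenario."""
    released = {"affiliated": True}  # every scenario confirms campus affiliation
    for field in SCENARIO_FIELDS[scenario]:
        released[field] = record[field]
    return released

for scenario in SCENARIO_FIELDS:
    print(scenario, release(FULL_RECORD, scenario))
```

Only the first scenario sends nothing that can accumulate into a profile. The other two release exactly the metadata discussed above as reidentifiable.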
HORROR SCENARIO 1:
CAMPUS IT GETS LAZY
Okay, so, set up SSO
to give us all the user metadata
Sure, whatever, here you go.
Either of the second two scenarios could go sour real fast. The first horror scenario is where The Content Vendor, in
setting up single sign-on, asks Campus IT to send over all available user metadata, P-I-I and P-P-I and entitlements
and all, and Campus IT just… does that.
Where’s The Library in this? Well, remember, The Library doesn’t run or control the campus single-sign-on system.
To keep Campus IT from being lazy, they’ll have to make a nuisance of themselves, and even then Campus IT may
brush them off.
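What The Library would be pushing Campus IT toward is something like an allow-list check on each vendor's request. This is a hypothetical sketch: the attribute names, the `ALLOWED` set, and `review_request` are invented, and real attribute-release configuration looks different.

```python
# Hypothetical allow-list The Library has negotiated with Campus IT.
ALLOWED = {"affiliated"}

def review_request(requested, allowed=ALLOWED):
    """Split a vendor's attribute request into what to release and what to refuse."""
    requested = set(requested)
    return {
        "release": sorted(requested & allowed),
        "refuse": sorted(requested - allowed),
    }

# The vendor asks for everything; the allow-list decides what actually goes out.
decision = review_request(["affiliated", "name", "email", "ppi", "entitlements"])
print(decision)
```

The lazy version of this function returns everything in `requested`, which is the horror scenario.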
SOME CAMPUS IT DIGS SURVEILLANCE!
CISO COREY ROACH:
And I hate to say it, but a lot of Campus IT folks wouldn’t even think twice. They’re not librarians. They’re cool with
surveillance. They do it themselves and help other people on campus do it too!
This slide here is from a campus chief information security officer, and was part of an explanation to a
SeamlessAccess-friendly crowd how a system can use behavioral tracking, among other things, to nail so-called
HORROR SCENARIO 2: SWARTZ REDUX
MAKE DR. OWL STOP!
Which leads me to horror scenario two, a repeat of the awful Aaron Swartz saga. The Content Vendor yells “fraudulent
downloading!” and calls the feds to prosecute Dr. Owl under the Computer Fraud and Abuse Act. The Library can’t
handle this internally even if it wants to, because The Content Vendor has either been told or figured out for itself
exactly who Dr. Owl is.
RIGHT THERE IN THE MANUAL
This scenario is written RIGHT INTO the RA21 goals, which were adopted wholesale by SeamlessAccess. “End-to-end
traceability… for detecting fraud.”
HORROR SCENARIO 3:
PATRON PRIVACY TRASHED
Want some hot hot
And I mentioned my third horror scenario already, but making it formal: The Content Vendor straight-up sells
reidentified patron-behavior data to The Data Broker, comprehensively trashing patron privacy.
DOES ANYTHING IN
SEAMLESSACCESS STOP THIS?
• By no means.
• Not one single solitary word.
But surely SeamlessAccess says this isn’t okay?
Y’all, it does not say that. Not no way, not no how, not nowhere.
PATRONS TO ?
• Some of our content vendors ARE data brokers.
• Others have been caught partnering/sharing data
with the likes of ICE.
• Most of them have a ton of web bugs on their sites
that send data (directly or in-) to data brokers.
• No, they don’t take the web bugs off just because a patron is authenticated.
• I don’t trust most content vendors as far as I could
throw them. LIS research so far agrees with me.
But surely The Content Vendor knows better than to sell patron data to The Data Broker?
Y’all, some of our content vendors ARE DATA BROKERS. Others have partnered with the likes of immigration
enforcement. Most are covered in web bugs that send data to data brokers.
Hi, content vendors listening in! I need you to know that I DO NOT TRUST YOU with patron data, and I do not think
any of us should, and all the LIS research that’s been done on this explains why not!
• Everything Sarah Lamdan has ever written.
• Me, “Physical-Equivalent Privacy.” Serials Librarian.
OA postprint; check GScholar.
• Forthcoming: McLean and Stregger, “Sounding the
Alarm: scholarly information and global information
companies in 2021.” Partnership.
• Forthcoming: Lisa Hinchliffe, model license language
(Mellon-funded project; thank you, Mellon!)
• And watch SPARC. They’re aware and concerned.
And speaking of research, some stuff to read: everything by Sarah Lamdan, my own Physical-Equivalent Privacy piece,
a forthcoming piece in Partnership that looks amazing, forthcoming model license language from Lisa Hinchliffe, and
keep an eye on SPARC, okay?
This presentation is copyright 2021 by Dorothea Salo.
It is available under a Creative Commons
Attribution 4.0 International license.
All icons from OpenClipArt.
And that’s what I got, so I’ll turn it over to our next presenter!