
Single sign-on, the library, and patron privacy

Dorothea Salo
November 01, 2021

Presented for the Digital Libraries Forum conference panel, "Protecting readers' right to privacy in single-sign-on access," 1 November 2021


Transcript

  1. SINGLE SIGN-ON,


    THE LIBRARY,


    AND PATRON PRIVACY
    Dorothea Salo


    Information School


    University of Wisconsin-Madison
    Hi, y’all, I’m Dorothea Salo and some of you know me already, so let’s just get right to talking about single sign-on,
    the library, and patron privacy.


  2. OUR MAIN CHARACTERS
    The Library
    Dr. Owl, a Campus Researcher
    Campus IT
    The Content Vendor
    Here’s our cast of main characters: The Library, The Content Vendor, Campus IT, and campus researcher Dr. Owl,
    whose pronouns are they/them.


  3. I need an article!
    And here’s our user story: Dr. Owl in their office needs a paywalled journal article that’s available through The Library.

    The Library can make that happen in a couple-three ways.


  4. IP AUTHENTICATION
    *clicks*
    Yay! Article!
    *notices IP address
    belongs to campus*


    *sends article*
    One is IP authentication, since Dr. Owl is in their office. Dr. Owl clicks on the article link. The Content Vendor notices
    that Dr. Owl’s computer’s IP address belongs to campus, so The Content Vendor sends the article back to Dr. Owl’s
    browser.


  5. (BEHIND THE SCENES)
    Here are campus’s


    IP address blocks!
    Okay, noted.


    We’ll let these
    through.
    Behind the scenes, of course, there’s some work that The Library and The Content Vendor have to do, maintaining
    lists of IP addresses that belong to campus. And I’m kind of oversimplifying here, but just so you have the basic idea.
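
    The check The Content Vendor runs can be sketched roughly like this. This is a minimal illustration, not any vendor's actual code: the address blocks are reserved documentation ranges standing in for campus, and the function names are made up for the example.

```python
import ipaddress

# Hypothetical campus blocks (reserved documentation ranges, not a real campus).
CAMPUS_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_campus_address(client_ip: str) -> bool:
    """Return True if the client's IP falls inside any registered campus block."""
    address = ipaddress.ip_address(client_ip)
    return any(address in block for block in CAMPUS_BLOCKS)


def handle_request(client_ip: str) -> str:
    """Decide what the vendor sends back, based on the IP address alone."""
    return "article" if is_campus_address(client_ip) else "paywall"
```

    Notice that the only thing the decision depends on is the IP address, which is why keeping the block list current is the behind-the-scenes work here.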


  6. MAYBE PRIVATE, MAYBE… NOT.
    • If Dr. Owl’s office computer has a “dynamic”
    (changing) IP address, fairly private.


    • If Dr. Owl’s office computer has a “static” (persistent)
    IP address… The Content Vendor can build up a
    reading-tracking dossier on Dr. Owl.


    • If Dr. Owl’s computer’s network name identifies Dr. Owl (I have seen this!), or if
    Dr. Owl’s reading habits are fairly uncommon, or if The Content Vendor uses
    web bugs/trackers… The Content Vendor can reidentify Dr. Owl and
    associate their reading-behavior dossier with them.


    • The Content Vendor may sell that reidentified dossier to…
    The Data Broker!
    This kind of authentication may protect Dr. Owl’s privacy… or not. If Dr. Owl’s office computer gets a dynamic IP
    address, that protects Dr. Owl to some extent. If the IP address is static, though, it’s trivial for The Content Vendor to
    track Dr. Owl’s reading just using that IP address as an identifier.

    If Dr. Owl has unusual reading habits, they may be reidentifiable from those. More likely, though, The Content Vendor
    just reidentifies Dr. Owl with standard web bugs. And then The Content Vendor has a new revenue stream: selling
    reading-behavior dossiers on library patrons to The Data Broker!
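
    To see why a static IP address is enough, here is a minimal sketch of dossier-building, assuming a made-up vendor access log. The log entries, article titles, and function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical vendor access log: (client IP, article requested).
access_log = [
    ("192.0.2.17", "Owl pellet analysis"),
    ("192.0.2.44", "Intro to cataloging"),
    ("192.0.2.17", "Nocturnal raptor migration"),
    ("192.0.2.17", "Folklore of the barn owl"),
]


def build_dossiers(log):
    """Group requested articles by client IP address."""
    dossiers = defaultdict(list)
    for client_ip, article in log:
        dossiers[client_ip].append(article)
    return dict(dossiers)


dossiers = build_dossiers(access_log)
```

    With a static address, everything under one key is one person's reading. With a dynamic address, the same reading would be scattered across several keys.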


  7. I need an article!
    I’m at home, though!
    How can I get it?
    But there’s another user story we have to account for. Dr. Owl is working from home, and suddenly they need an
    article. Since they don’t live on campus, The Content Vendor can’t recognize their computer’s IP address as belonging
    to campus. So how does The Library get Dr. Owl that article?


  8. YE OLDE PROXY SERVER
    1. clicks proxied link
    2. “legit patron?”
    3. “log in, please?”
    4. logs in
    5. “yes, Dr. Owl, legit”
    6. “send article.”
    7. sees proxy server, sends article
    8. sends article
    What The Library does is stand up what’s called a proxy server, and here’s a diagram of more or less how that works.

    Dr. Owl clicks a proxied link to the article. The Library’s proxy server doesn’t know who clicked, so it asks Campus IT
    whether the clicker is a legitimate patron. Campus IT asks Dr. Owl to log in, which they do, so Campus IT reports
    back to The Library’s proxy server that it’s Dr. Owl and they’re legit.


    The Library’s proxy server then requests the article from The Content Vendor, who sees that it’s the proxy server and
    sends it over. And finally, The Library’s proxy server sends the article to Dr. Owl.
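
    The proxied flow above can be sketched as follows. This is a toy model under stated assumptions, not how any real proxy product works: the proxy address, the credentials, and every function name are hypothetical, and the login check stands in for whatever Campus IT actually runs.

```python
PROXY_IP = "192.0.2.10"  # hypothetical address of The Library's proxy server


def campus_it_login(username: str, password: str) -> bool:
    """Stand-in for Campus IT's login check (hypothetical credentials)."""
    return (username, password) == ("dr.owl", "hoot")


def vendor_fetch(requesting_ip: str, article_id: str, vendor_log: list) -> str:
    """The vendor sees only the requesting IP and the article, and logs both."""
    vendor_log.append((requesting_ip, article_id))
    if requesting_ip != PROXY_IP:
        return "paywall"
    return f"full text of {article_id}"


def proxy_request(username: str, password: str, article_id: str, vendor_log: list) -> str:
    """The proxy authenticates on campus, then fetches on the patron's behalf."""
    if not campus_it_login(username, password):
        return "login failed"
    # The patron's identity stops here; only the proxy talks to the vendor.
    return vendor_fetch(PROXY_IP, article_id, vendor_log)
```

    In this sketch the vendor's log only ever contains the proxy's address, whoever logged in.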


  9. LET’S FACE IT:


    THIS IS AN UGLY, FRAGILE KLUDGE.


    NOBODY LOVES THIS.


  10. BUT IT HAS PRIVACY ADVANTAGES.
    • Dr. Owl’s identity never leaves campus.


    • If Dr. Owl starts mass-downloading for their Important Text-Mining Project,
    justifiably annoying The Content Vendor, The Library can handle it
    campus-internally.


    • The Content Vendor doesn’t see Dr. Owl’s IP
    address, only the proxy server’s.


    • This foils The Content Vendor’s web bugs/trackers.


    • Dr. Owl’s reading gets mixed in with the rest of
    campus.


    • This foils behavioral tracking by The Content Vendor.
    But there are privacy advantages to it. The big one is that Dr. Owl’s identity never leaves campus, so The Content
    Vendor doesn’t automatically know it. If Dr. Owl starts downloading a whole database to text-mine it, The Content
    Vendor tells The Library and The Library quietly tells Dr. Owl to knock it off. No lawsuits!

    Second, the content vendor never sees Dr. Owl’s IP address, so web bugs and trackers can’t nab it. And third, with
    lots of people on campus going through The Library’s proxy server to get articles, it’s way harder to track Dr. Owl’s
    reading because it’s all mixed in with everybody else’s reading.


  11. SINGLE SIGN-ON (SSO) SYSTEMS
    • For authentication (“who are you?”) and
    authorization (“given who you are, what are you
    allowed to see/do?”)


    • There are actually several ways to set up an SSO-friendly authentication
    system, just to be clear.


    • Common, though far from universal, in higher ed
    • Rely on a communication language called SAML,
    and (often) a software setup called Shibboleth


    • Often implemented through “federations” such
    as InCommon and OpenAthens.
    With that for background, let’s talk about single sign-on systems. These both authenticate you — make you prove
    who you are — and authorize you to actually see and do things, such as download articles to read. They’re common,
    though far from universal, in higher ed.

    Some jargon, just so you’ve heard it: these systems rely on an XML-based communication language called SAML and
    often a software rig called Shibboleth. They may be implemented through federations such as InCommon or
    OpenAthens. If you hear any of these weird words, JUST THINK SINGLE SIGN-ON; you mostly won’t need the details.


  12. RA21, SEAMLESSACCESS
    • RA21: NISO and STM Association attempt to
    design an SSO system to replace proxy servers.


    • NISO released it over significant protest, mostly
    about privacy and security.


    • SeamlessAccess: Intended RA21 successor,
    same goals, more participants


    • Learned some things from the RA21 protests!
    So a couple of years ago, NISO and the S-T-M Association designed a single sign-on system that they wanted
    everybody to use instead of proxy servers. Long story short, it didn’t go anywhere. BIG DISCLAIMER: I myself critiqued
    RA21 heavily because of privacy and security concerns; I was not a fan.

    The successor effort, which has more organizations participating, is called SeamlessAccess and has the same goals as
    RA21.


  13. HOW SSO WORKS
    1. clicks link
    2. “who’s this?”
    3. “log in, please?”
    4. logs in
    5. sends Dr. Owl’s metadata
    6. sends article
    Here’s a simplified sketch of how single sign-on works in our user story. Dr. Owl clicks on an ordinary unproxied link
    to an article. The Content Vendor figures out that the clicker is from campus, so it asks Campus IT who the clicker is.
    Campus IT has Dr. Owl log in, as before, and then sends metadata about Dr. Owl to The Content Vendor, who decides
    that Dr. Owl is legit and sends them the article they want.

    One important detail here: a lot of the communication I’ve drawn for you here actually happens right inside Dr. Owl’s
    browser. We’ll learn more about this from John Mark Ockerbloom!


  14. LESS KLUDGY!


    LESS FRAGILE!


    LESS EFFORT FROM THE LIBRARY!


    YAY, RIGHT?


    … right?
    This is less kludgy, less fragile, and the library doesn’t even have to do anything once there’s a content license! Yay,
    right?

    … right?


  15. I SURE HOPE


    SOME OF YOU ARE YELLING


    “WHY IS PATRON DATA GOING
    TO THE CONTENT VENDOR?!”
    RIGHT NOW.
    (read slide)


  16. Quick reminder:

    “WE PROTECT EACH LIBRARY USER’S RIGHT
    TO PRIVACY AND CONFIDENTIALITY


    WITH RESPECT TO… RESOURCES


    CONSULTED, BORROWED, ACQUIRED
    OR TRANSMITTED.”


    —American Library Association

    Code of Ethics, Article 3
    Quick reminder that we have ethics codes that tell us we’re supposed to protect the privacy and confidentiality of
    what people read, including from our vendors.


  17. THE CONTENT VENDOR IS NOT SUPPOSED TO KNOW
    WHAT DR. OWL READS.


    LAW ENFORCEMENT IS NOT SUPPOSED TO KNOW
    WHAT DR. OWL READS.


    THE DATA BROKER IS NOT SUPPOSED TO KNOW
    WHAT DR. OWL READS.
    The Content Vendor is not supposed to know what Dr. Owl is reading. Law enforcement at any level is definitely not
    supposed to know what Dr. Owl is reading.


    And data brokers, whose whole business is selling behavior data hither and yon and especially to governments and
    law enforcement, are emphatically not supposed to know what Dr. Owl is reading.


  18. BUT LET’S NOT PANIC QUITE YET.


  19. WHAT PATRON METADATA IS THERE?
    • PII-style identifiers: name, email, campus ID, etc.


    • Persistent pseudonymous identifiers (PPIs)


    • 24601, No. 6, 007, 8675309, 7 of 9, Eleven…


    • You know who (some of) those are, right? Yeah. Given a PPI, The Content
    Vendor can likely figure out who Dr. Owl is too. Behavioral tracking and
    web bugs are all it takes!


    • SeamlessAccess calls PPIs “anonymous.” This is flat wrong.


    • Non-PII/PPI metadata: “entitlement metadata”


    • “This person is tenured faculty in the Ornithology department with
    courtesy appointments in Philosophy, Religion, and Folklore.”


    • Given enough of this, and/or combining it with web-bug and behavioral
    data, The Content Vendor can absolutely reidentify Dr. Owl.
    First, let’s ask: what patron metadata is even there? Well, there’s standard P-I-I, like name and email address and
    campus identifiers. Then there’s what I’m calling persistent pseudonymous identifiers, like a number that’s
    consistently used over time to identify a given patron without using their actual name. So like, 2-4-6-0-1, Number
    Six, the immortal double-oh seven, and so on.

    You know who at least some of those numbers are, right? Yeah, so this isn’t really privacy protecting. Given a
    persistent pseudonymous identifier, The Content Vendor can easily use behavioral tracking and web bugs to tie that
    identifier to Dr. Owl. SeamlessAccess calls these identifiers “anonymous” and I am really angry about that because
    words mean things and PPIs are not anonymous. Deidentified, yeah. Anonymous, absolutely not.
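
    Here is a minimal sketch of why a persistent pseudonym still links everything together. The keyed-hash derivation below is an assumption made for illustration, not a claim about how SeamlessAccess or any campus system actually generates these identifiers; the secret, the campus IDs, and the vendor name are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by Campus IT; never shared with the vendor.
CAMPUS_SECRET = b"not-a-real-secret"


def persistent_pseudonym(campus_id: str, vendor: str) -> str:
    """Derive the same opaque identifier for a patron on every visit to a vendor."""
    message = f"{vendor}:{campus_id}".encode()
    return hmac.new(CAMPUS_SECRET, message, hashlib.sha256).hexdigest()[:12]


# What the vendor sees across three separate sessions.
visits = [
    (persistent_pseudonym("owl001", "vendor.example"), "Owl pellet analysis"),
    (persistent_pseudonym("wren002", "vendor.example"), "Intro to cataloging"),
    (persistent_pseudonym("owl001", "vendor.example"), "Folklore of the barn owl"),
]


def reading_history(vendor_visits, pseudonym):
    """Everything one pseudonym has read, linked without ever seeing a name."""
    return [article for ppi, article in vendor_visits if ppi == pseudonym]


owl_ppi = persistent_pseudonym("owl001", "vendor.example")
```

    The vendor never sees the campus ID, but the history for that one pseudonym is complete. One web bug or one distinctive reading pattern later, the whole history has a name attached.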

    Anyway, so. Third kind of metadata is often called entitlement metadata, and it’s not about your identity, it’s a list of
    campus groups you belong to. So for Dr. Owl that might look like, they’re faculty, they’re tenured, they’re in the
    Ornithology department, and they’ve got courtesy appointments elsewhere.


    If Dr. Owl is the only tenured faculty member with this specific collection of appointments, well, they’ve basically
    been immediately reidentified. They’re not anonymous to The Content Vendor at all.
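
    The reidentification risk can be sketched by counting how many people match a set of released attributes. The directory below is invented for illustration, and the attribute labels are hypothetical rather than any real entitlement vocabulary.

```python
# Hypothetical campus directory: each patron's entitlement attributes.
directory = {
    "Dr. Owl": {"faculty", "tenured", "dept:ornithology", "courtesy:folklore"},
    "Dr. Wren": {"faculty", "tenured", "dept:ornithology"},
    "Dr. Lark": {"faculty", "dept:philosophy"},
    "A. Finch": {"student", "dept:ornithology"},
}


def matching_patrons(released_attributes: set) -> list:
    """List every patron whose attributes include all the released ones."""
    return sorted(
        name for name, attrs in directory.items() if released_attributes <= attrs
    )
```

    Releasing just `{"faculty"}` matches three people in this toy directory. Releasing the full set of appointments matches exactly one, and at that point the metadata is a name in all but spelling.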


  20. WHAT METADATA GETS SENT


    TO THE CONTENT VENDOR?
    • Under SeamlessAccess, three scenarios:


    • Authentication Only: “yes or no, is this person affiliated with
    campus?”


    • (the very misnamed) Anonymous Authorization: Non-PII
    entitlement metadata. How much metadata? How reidentifiable
    would it be? Who knows. SeamlessAccess doesn’t fully specify.


    • Pseudonymous Authorization: PPI. Be seeing you, No. 6.
    But just because metadata exists doesn’t mean The Content Vendor gets to see it. So what metadata does The
    Content Vendor get?

    Under SeamlessAccess, there are three possible scenarios. One I really like, and that’s the Authentication Only
    scenario, where the only thing The Content Vendor finds out is whether somebody actually is from campus.

    Then there’s the very misnamed Anonymous Authorization scenario, where what gets communicated is entitlement
    metadata. How much? How reidentifiable would it be? Who knows, SeamlessAccess doesn’t say.

    And finally there’s Pseudonymous Authorization, which sends a persistent pseudonymous identifier. Be seeing you,
    Number Six.
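
    The three scenarios can be sketched as a release filter sitting between Campus IT and The Content Vendor. This is a rough illustration under my own assumptions: the field names, the policy table, and the patron record are hypothetical, and the mapping follows the descriptions in this talk rather than the SeamlessAccess documents.

```python
# Hypothetical record Campus IT holds about one patron.
patron = {
    "name": "Dr. Owl",
    "email": "owl@campus.example",
    "ppi": "a1b2c3d4",
    "entitlements": ["faculty", "tenured", "dept:ornithology"],
}

# Which fields each scenario releases; the field names are illustrative only.
RELEASE_POLICY = {
    "authentication_only": [],
    "anonymous_authorization": ["entitlements"],
    "pseudonymous_authorization": ["ppi"],
}


def release(record: dict, scenario: str) -> dict:
    """Return what the vendor receives: an affiliation flag plus allowed fields."""
    allowed = RELEASE_POLICY[scenario]
    released = {"affiliated": True}
    released.update({field: record[field] for field in allowed})
    return released
```

    Under `authentication_only`, the vendor gets a yes and nothing else. Everything past that depends on how long the allowed list is allowed to grow.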


  21. HORROR SCENARIO 1:


    CAMPUS IT GETS LAZY
    Okay, so, set up SSO


    to give us all the user metadata


    you got.
    *yawn*


    Sure, whatever, here you go.
    Either of the second two scenarios could go sour real fast. The first horror scenario is where The Content Vendor, in
    setting up single sign-on, asks Campus IT to send over all available user metadata, P-I-I and P-P-I and entitlements
    and all, and Campus IT just… does that.

    Where’s The Library in this? Well, remember, The Library doesn’t run or control the campus single sign-on system.
    To keep Campus IT from being lazy, they’ll have to make a nuisance of themselves, and even then Campus IT may
    brush them off.


  22. SOME CAMPUS IT DIGS SURVEILLANCE!
    CISO COREY ROACH:
    And I hate to say it, but a lot of Campus IT folks wouldn’t even think twice. They’re not librarians. They’re cool with
    surveillance. They do it themselves and help other people on campus do it too!


    This slide here is from a campus chief information security officer, and was part of an explanation to a
    SeamlessAccess-friendly crowd of how a system can use behavioral tracking, among other things, to nail so-called
    fraudulent downloaders.


  23. HORROR SCENARIO 2: SWARTZ REDUX
    Yikes! Help!
    FRAUD!


    MAKE DR. OWL STOP!
    CFAA!


    Federal prosecution!
    Which leads me to horror scenario two, a repeat of the awful Aaron Swartz saga. The Content Vendor yells “fraudulent
    downloading!” and calls the feds to prosecute Dr. Owl under the Computer Fraud and Abuse Act. The Library can’t
    handle this internally even if it wants to, because The Content Vendor has either been told or figured out for itself
    exactly who Dr. Owl is.


  24. RIGHT THERE IN THE MANUAL
    This scenario is written RIGHT INTO the RA21 goals, which were adopted wholesale by SeamlessAccess. “End-to-end
    traceability… for detecting fraud.”


  25. HORROR SCENARIO 3:


    PATRON PRIVACY TRASHED
    Psst!


    Want some hot hot


    fully-identified


    patron-behavior data?
    Sure.
    And I mentioned my third horror scenario already, but making it formal: The Content Vendor straight-up sells
    identified patron-behavior data to The Data Broker, comprehensively trashing patron privacy.


  26. DOES ANYTHING IN
    SEAMLESSACCESS STOP THIS?
    • Nope.


    • Nah.


    • By no means.


    • Uh-uh.


    • Not one single solitary word.
    But surely SeamlessAccess says this isn’t okay!

    Y’all, it does not say that. Not no way, not no how, not nowhere.


  27. WOULD THE CONTENT VENDOR SELL


    PATRONS TO THE DATA BROKER?
    • Some of our content vendors ARE data brokers.
    • Others have been caught partnering/sharing data
    with the likes of ICE.


    • Most of them have a ton of web bugs on their sites
    that send data (directly or indirectly) to data brokers.


    • No, they don’t take the web bugs off just because a patron is authenticated.
    • I don’t trust most content vendors as far as I could
    throw them. LIS research so far agrees with me.
    But surely The Content Vendor knows better than to sell patron data to The Data Broker!

    Y’all, some of our content vendors ARE DATA BROKERS. Others have partnered with the likes of immigration
    enforcement. Most are covered in web bugs that send data to data brokers.

    Hi, content vendors listening in! I need you to know that I DO NOT TRUST YOU with patron data, and I do not think
    any of us should, and all the LIS research that’s been done on this explains why not!


  28. TO READ!
    • Everything Sarah Lamdan has ever written.


    • Me, “Physical-Equivalent Privacy.” Serials Librarian.
    OA postprint; check GScholar.


    • Forthcoming: McLean and Stregger, “Sounding the
    Alarm: scholarly information and global information
    companies in 2021.” Partnership.


    • Forthcoming: Lisa Hinchliffe, model license language
    (Mellon-funded project; thank you, Mellon!)


    • And watch SPARC. They’re aware and concerned.
    And speaking of research, some stuff to read: everything by Sarah Lamdan, my own Physical-Equivalent Privacy piece,
    a forthcoming piece in Partnership that looks amazing, forthcoming model license language from Lisa Hinchliffe, and
    keep an eye on SPARC, okay?


  29. THANKS!
    This presentation is copyright 2021 by Dorothea Salo.


    It is available under a Creative Commons


    Attribution 4.0 International license.


    All icons from OpenClipArt.


    Support OpenClipArt!
    And that’s what I got, so I’ll turn it over to our next presenter!
