For LIS 601. Layperson audience assumed.
What’s machine learning (ML)?
Artificial intelligence (AI)?
Or a Large Language Model (LLM)?
First: AGI is not a thing.
• AGI is “artificial general intelligence”: a machine that thinks like a human being.
• Not a thing.
• NOT A THING.
• Lots of people use “AI” to mean AGI. They are:
• confused, or
• playing a shell game (to worry people about something that isn’t happening so
they won’t pay attention to the bad AI/ML/LLM-related stuff that IS happening), or
• all of the above.
• AGI: NOT. A. THING.
How do we teach computers to understand the world?
• This is the fundamental problem AI/ML/LLMs are trying to solve.
• It’s such a big, complex problem that the most advanced
research right now is only nibbling at the edges of it.
• It may be unsolvable. (Note: this is heresy to some!)
• But the attempt to solve it has created some interesting,
useful technology… and some dangerous technology.
• So let’s talk about that.
“What is this a photo of?” (part of “machine vision”) is an example of an
understanding-the-world problem that AI/ML folks are trying to solve.
“Tasks,” Randall Munroe, CC-BY-NC
Grand Attempt 1:
• Break down the world, everything in it, and how everything
interacts with everything else into tiny computer-digestible
pieces. Feed those to a computer. Win!
• If you know anything about neuroscience or linguistics, you
are LAUGHING right now.
• We don’t even fully understand how HUMANS learn and process all this.
• This seriously limits our ability to teach it… especially to a computer!
(Remember: Computers Are Not Smart.)
• That said… there are some relationships (linguistic ones, for example) that we
understand well enough to formulate in a reasonably computer-friendly way. And it does help.
Grand Attempt 2:
• Very very VERY impressionistically: “throw a crapton of data
at a computer and let it figure things out.”
• This approach fuels a lot of things in our information
environment that we rarely think about: recommender
engines, personalization of (news and social) media, etc.
• One thing it’s important to know is that for ML to be useful,
its designers have to decide up-front what their goal for it is,
also known as what the model is optimizing for. (Toy sketch after this list.)
• This choice can have a lot of corrosive repercussions.
• For example, most social-media sites optimize their what-to-show-users algorithms
for “engagement.” In practice, this means optimizing for anger and invidious social
comparison and division. This has been obviously Not Great for society at large.
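If it helps to see that concretely: here is a toy sketch, entirely invented by me (posts, scores, and all; this is nobody’s real feed algorithm), of how the optimize-for choice changes what users see. Python, since that’s the lingua franca of ML.

```python
# Toy "feed": every post and score here is invented for illustration.
posts = [
    {"title": "City council passes budget",     "accuracy": 0.9, "engagement": 0.2},
    {"title": "You won't BELIEVE what they did", "accuracy": 0.3, "engagement": 0.9},
    {"title": "Local team wins big",             "accuracy": 0.8, "engagement": 0.5},
]

def feed(posts, optimize_for):
    # The whole "algorithm" is just: sort by whatever the designers
    # decided the model should optimize for.
    return sorted(posts, key=lambda post: post[optimize_for], reverse=True)

for post in feed(posts, "engagement"):  # the choice most social media makes
    print(post["title"])
```

The rage-bait headline comes out on top. Swap in "accuracy" and it sinks to the bottom. Same data, same code; only the goal changed.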
But let’s start with the innocuous: spam classification
• A CLASSIC ML problem.
• Spam grew really quickly in the 1990s, eventually reaching upwards of 90% of all email sent.
• “Blackhole lists” of spamming servers only helped a little.
• It just wasn’t hard to set up a new email server or relay.
• Enter ML!
• Separate a bunch of email into spam and not-spam (“ham”). Feed both sets of
email into an ML algorithm (usually Bayesian, for spam).
• When new email comes in, ask the algorithm “is this more like the spam or the
ham?” When its answer is wrong, correct it (Bayesian algos can learn over time).
• Eventually it gets enough practice to get pretty good at this!
• … Until the spammers start figuring out how to game it (“poisoning”), of course.
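Here’s roughly what that looks like in code: a minimal sketch of a Bayesian spam classifier using Python’s scikit-learn library. The six “emails” are invented; a real filter trains on many thousands.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny made-up training set: email already separated into ham and spam.
ham = ["Lunch tomorrow?", "Here are the meeting notes", "Thanks for the report"]
spam = ["WIN A FREE PRIZE NOW", "Cheap meds no prescription", "You are a winner, claim now"]

vectorizer = CountVectorizer()                 # turn each message into word counts
X = vectorizer.fit_transform(ham + spam)
y = ["ham"] * len(ham) + ["spam"] * len(spam)

model = MultinomialNB()                        # the classic Bayesian spam classifier
model.fit(X, y)

# New email arrives: is it more like the spam pile or the ham pile?
new_message = vectorizer.transform(["Claim your free prize"])
print(model.predict(new_message))              # -> ['spam']
```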
That’s more or less how
supervised ML works.
• Figure out your question: what you want your ML model to tell
apart, whether it’s spam/ham or birds in photos.
• Prediction, here, is just another classification problem. “Graduate or dropout?”
• Get a bunch of relevant data, ideally representative of what the
question looks like in the Real World™.
• “Ideally.” An awful lot of ML models fall down out of the gate because the choice of initial
data wasn’t representative, didn’t include everything, didn’t consider biases, or or or…
• Set some of the data (chosen randomly) aside for later.
• Training the model: Classify the rest of the data, the way you
want the computer to. Feed these classifications into the ML algorithm.
• Testing the model: Ask the computer to classify the data you set
aside. If it does this well, the model… seems… pretty okay.
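In code, that whole workflow fits in a few lines. A minimal sketch using scikit-learn’s built-in handwritten-digits dataset as the labeled data (my stand-in; any labeled dataset would do):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)   # images of digits, labeled 0-9

# Set some of the data (chosen randomly) aside for later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Training: feed the classified data to the model.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Testing: ask it to classify the data we set aside.
print("held-out accuracy:", model.score(X_test, y_test))
```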
Today: phishing and ML
• Bayesian classifiers to date haven’t been able to deal with
malicious email (there are several kinds of it).
• I’ve been seeing other ML approaches tried. I haven’t seen
one succeed, as yet.
• Why hasn’t it worked?
• Reason 1: unlike spam, malicious email often deliberately imitates regular email.
Reason 2: what malicious email wants people to do (click on links, pay invoices,
send a reply…) is also what plenty of regular email wants people to do!
• So, basically, there may not be enough differences between malicious and regular
email for an ML model to pick up on!
• I tell you this so that you don’t overvalue ML approaches.
There are definitely problems ML can’t solve.
• You don’t have to classify data up-front, though!
• You can turn ML algorithms loose on a pile of data to find
whatever patterns they can (often “similarity clusters”). This
is unsupervised ML. (There’s a small clustering sketch after this list.)
• Large Language Models (LLMs) are unsupervised ML, because there’s just no way
to fruitfully supervise a model that big. Too many angles to sort on! Plus
producing language is only glancingly a sorting problem.
• Instead, LLM creators train their models on a ton of text (without asking any of the
owners of copyrighted texts first), then pay pennies to developing-world people to
“fine-tune” the model; that is, to clean up the worst-looking messes (including
hate and bias) afterwards.
• (If you think I think this is unethical… gold star, you are learning how I think.)
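As promised, a tiny clustering sketch. The data is synthetic (two piles of random points), and k-means is just one clustering algorithm among many; the point is that nobody labels anything up front.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled "piles" of 2-D points; we never say which point is which.
pile = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pile)
# The algorithm found the two similarity clusters on its own.
print(model.labels_[:5], model.labels_[-5:])   # e.g. [1 1 1 1 1] [0 0 0 0 0]
```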
Decisions that aren’t
binary (or n-ary)
• Not everything in life is reducible to a finite set of options.
• This doesn’t stop people trying. “Facial emotion classifiers” have been an epic fail.
Emotionality and its expression (which are two different things) aren’t that simple.
• Take human language (please). We can say things in infinite
combinations! And still (mostly) understand one
another, which is pretty miraculous really!
• Can we get a computer to understand us? Talk with us?
• Well, it depends on what we mean by “understand” exactly… what are some of the problems?
The field: Natural Language Processing (NLP)
• Similar problems: accurately figuring out what somebody meant/means to say.
• Helpful: a lot of things we routinely say are patterned.
• “Hi, how are you?” “Fine, and you?” “Fine.”
• Autocorrect and type-ahead on mobile are frequently
helpful. (I am a bad typist on mobile.)
• I have type-ahead in my Outlook (email) now. It’s… occasionally useful.
• But we all have stories of autocorrect messing it up, right?
Yeah. It doesn’t understand. It can only make educated
(trained, actually) guesses.
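For the curious, the crudest possible version of type-ahead: a bigram model that counts which word followed which in its training text. The training text here is tiny and invented; real systems train on far more, but the principle (counting, not understanding) is the same.

```python
from collections import Counter, defaultdict

# Invented "training text" of routine, patterned conversation.
text = "hi how are you . fine and you . fine thanks . how are you doing".split()

follows = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    follows[prev][nxt] += 1            # count what follows what

def suggest(word):
    # An educated (trained, actually) guess: the most common next word.
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(suggest("how"))    # -> 'are'  (it followed "how" every time)
print(suggest("fine"))   # -> 'and'  (ties go to whichever was seen first)
```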
• Speech-to-text is a notable but still limited success (see: YouTube and Zoom captions)
• It works pretty well, IF:
• you speak a popular enough language that the transcription software has actually
been trained on it (GIANT equity issue, obviously)
• the training set includes appropriate representation of language dialects, as well
as speakers for whom the language isn’t their first language
• the audio is good enough
• you’re using a fairly beefy server-strength computer (this limitation will likely go
away someday, but hasn’t yet)
• you’re not all THAT concerned about exactness. (Don’t use this in court!)
Text generators before ChatGPT
• They did pretty well on routine
writing (e.g. basic sports reporting) and interaction (e.g.
some tech support).
• In a conversation, they sometimes fooled people who
weren’t expecting them.
• This dates back to the 1960s and ELIZA, though. We’re trusting, not to say gullible.
• They were easy to throw off-course or game, because they
didn’t understand what they were saying.
• Often got trained on Internet speech… which is (to say the
least) problematic along several different axes.
One more time: LLMs don’t
understand the world.
Ernie Davis and Gary Marcus, to language model GPT-3:
“You poured yourself a glass of cranberry juice, but then absentmindedly,
you poured about a teaspoon of grape juice into it.
It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything.
You are very thirsty. So you …”
“drink it. You are now dead.”
Folks tried to train GPT-3
as a crisis counselor.
Let’s just say “it didn’t go well”
and leave it at that.
ML isn’t my strong suit,
but I’ll answer what I can.
Chatbots, search engines,
and the death
of the useful Web
(and yes, I’m catastrophizing just a bit)
AI’s value proposition:
replacing human labor
• Y’all who are using ChatGPT to slide by in your coursework
and professional work need to ask yourselves why anybody
hires you if AI can do big chunks of your job, cheaper.
• I’m gonna be blunt about this too: YOU ALSO NEED TO CHECK YOUR ETHICS.
Altman is slime incarnate and ChatGPT is an ethics morass.
• Plus, we all know what happens to people whose labor is replaced under a
winner-take-all capitalist system, right? Yeah. We do. Are you willing to do this to
writers, artists, actors? CHECK. YOUR. ETHICS.
• This worry showed up in early 21st-c. librarianship as worry
over web search engines. There’s some… loopy… stuff from
the time in LISTA, if you look.
• And search engines did change (e.g.) reference work! But they didn’t destroy it.
• And I’m starting to think a Return to the Librarians is imminent. I’ll explain.
One more time:
ChatGPT cannot and does not
understand the world.
• It does not “understand” anything. It can’t.
• Read Dr. Emily Bender on “stochastic parrots.”
• It is not trained on exclusively correct or true text.
• Like, how would you ever construct a training set to be exclusively correct and true? It’s
just not possible. “As close as we can get” is a job for human judgment (and even for us
this isn’t easy), which computers don’t have.
• (Don’t come at me with “peer review!” or “editors!” either. I have the lecture on peer
review suckage from old!LIS658 and I will happily deploy it.)
• It is not trying to produce a correct answer; it cannot evaluate
correctness. It aims at a PLAUSIBLE-SOUNDING answer.
• Plausible-but-wrong is totally a thing! So is plausible-but-made-up.
• A lot of my reference librarian friends have found
themselves looking up citations to works that don’t exist
lately, in response to questions from patrons.
• Invariably, the citations came from ChatGPT and a gullible patron…
• … who didn’t understand that ChatGPT doesn’t know what
a citation is for (only what it looks like and where it goes)
and is perfectly content to make 💩 up.
• Not a few of those gullible patrons have had Ph.Ds. This isn’t about intelligence;
it’s about gullibility and (wilful?) cluelessness.
• I expect you to know better and do better. And to prefer accuracy to LLM glibness.
ChatGPT and the web
• Web advertising isn’t sold or priced according to the value
or accuracy of the content it’s next to. The only important
thing is whether it can attract eyeballs and clicks.
• One minor exception: sometimes “brand safety” can get certain advertisers to pull
ads from the worst of the worst. Plenty of advertisers don’t care, however.
• There’s also a CRAPTON of straight-up fraud in web advertising. I don’t have time
to get into this, but if you’re curious, I can definitely point you to explainers.
• ChatGPT can generate tons of plausible-sounding sludge.
Accurate? Helpful? Who CARES. If people see it, it sells ads.
• We can expect the web to become a sea of low-value, low-
accuracy LLM-generated sludge. It’s already happening.
Search, SEO, and sludge
• Search-engine optimization (SEO) used to be about making
a good, useful, accessible website.
• Now it’s about trying to surface from the sludge… which
means a lot of sketchy web-writing practices, these days.
• Reading about modern SEO both weirds me out and makes me furious.
• It doesn’t MATTER whether a site is good for people! It has to be good for Google!
• Google used to guard its index pretty strongly. Low-quality
websites, black-hat SEO, deceptive design: expect your site
to be demoted in Google results or even kicked out.
• Google is not guarding its index much today. Nor is the
Common Crawl (which Google competitors often use).
Why not guard the index?
• Reason 1: ChatGPT and its ilk absolutely can create sludge
that they can’t reliably detect. So automatedly de-sludging
a search-engine index is hard and may be impossible.
• This is why universities are turning off AI-detection tools like Turnitin’s. They just don’t work reliably.
• A number of instructors (not in the iSchool!) have embarrassed themselves and
harmed students by taking AI-detection tools way too seriously.
• Reason 2: Google’s the biggest advertiser on the entire
web. Sludge websites make Google ad money! Google has
a CONFLICT OF INTEREST, and its search engine is losing.
• Regression: in software development, a change that damages or removes useful functionality.
• Search engines used to have lots of useful advanced-search
features: date restrictions, phrase searching, requiring
certain words, etc. etc.
• These features are disappearing fast, or ceasing to work.
• It’s viciously hard to do highly precise searching on the open web these days. It
hasn’t always been this way!
• “People don’t use them!” What, I’m not people? Librarians aren’t people? Gah.
• Even software developers can be gullible about AI/ML/LLMs — they think these
replace user-side precision-enhancing advanced-search features.
• (It’s not just Google. This is TOTALLY also happening to library databases. Argh.)
So we have a sea of sludge,
and our tools for finding anything
in it are getting worse.
But that’s the polluted information environment
you’ll be graduating into.
Pay attention in LIS 602, ’k?
Coping strategies, and where they fail
• Limit searches to sites that are still mostly human.
• This is the Reddit/StackExchange strategy. It’s a perfectly sensible strategy!
• Problem: several such sites, Reddit particularly, are rapidly en💩ifying (to borrow
a Cory Doctorow coinage) and shedding their human users.
• Others, e.g. StackExchange and even Wikipedia, are giving in to the LLM craze.
• Similarly: rely on curated information aggregations
• … like library databases! This is opportunity knocking, folks!
• Stop searching the open web. It ain’t reliable.
• This makes me sad and angry. I love the web and this is an awful way for it to die.
• But I’d be lying if I said I don’t see opportunity in it for librarianship.
Must the web die?
Can it be desludged?
• No (we too contribute good stuff to the web!) and possibly.
• Altman needs to make money at some point. He’s currently
running ChatGPT as a hype engine and loss leader.
• How long can he keep doing that?
• Especially as LLMs keep embarrassing themselves in public?
• Will the use payments he will inevitably demand make it uneconomic to sell ads
against web sludge? We can hope!
• Data-privacy law (especially in the EU) offers some hope.
• In the meantime, whatever your definition of “digital
literacy” is, I think “coping with sludge” has to be part of it.
• Future K-12 and academic librarians, heads-up. This is something you’ll have to teach.
Another reason to expect desludging
• General-purpose LLMs are trained on the open web, on the
assumption that the web is written by human beings.
• As LLM sludge proliferates on the web, LLM builders will
have to train LLMs on… output from prior LLMs.
• Because, again, ML can’t reliably tell the difference between human writing and LLM output.
• Early days of research, but so far, it appears that LLMs
trained on LLM-generated sludge get very bad very fast. (Toy demo after this list.)
• So even LLM sludgemeisters will have incentive to:
• develop better tools for separating human writing from LLM sludge
• not let themselves be used for the further ensludgification of the web
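Here’s a toy demonstration of that degradation, with a simple number distribution standing in for an LLM (my analogy, not how the actual research was run): fit a model to data, generate output, keep only the most “typical” output, the way samplers tend to, then train the next generation on that. Variety drains away fast.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=500)          # generation 0: "human-written" data

for generation in range(6):
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: variety (std) = {sigma:.2f}")
    samples = rng.normal(mu, sigma, size=500)  # the model's output
    # Like a sampler favoring safe, typical text, keep only the most
    # typical output... and that's the next generation's training data.
    data = samples[np.abs(samples - mu) < 1.5 * sigma]
```

Every round throws away the tails, so each “model” is narrower and blander than the last. That’s the shape of the problem, even if real LLM collapse is messier.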
These thoughts are very of-the-
moment and preliminary.
I don’t know
if even I believe
everything I’m saying here.
So go ahead
and argue with me.