Taking Arms against a Sea of Chatbot Sludge

TAKING ARMS AGAINST A SEA OF CHATBOT SLUDGE
Dorothea Salo
University of Wisconsin-Madison iSchool [email protected]

SO, HOW’S [search-engine logo] WORKING OUT FOR YOU LATELY?

RESULTS MAYBE NOT SO GREAT?

TAKING YOU TO WEBSITES FULL OF SOUND (well, HTML) AND FURY SIGNIFYING NOTHING?

WITH LOTS AND LOTS OF ADS?

I really do recommend browser adblockers. uBlock Origin and Privacy Badger are good. (Firefox Focus on mobile.) It’s not just that ads are annoying. Increasingly, they’re a security risk.

NO JOY FROM EITHER?

IT’S NOT YOU. IT’S REALLY HAPPENING. AND THE CULPRITS ARE
WEB ADVERTISING AND CHATBOT SLUDGE.

ABOUT WEB ADVERTISING
• A deeply selfish, harmful, and corrupt industry riddled with fraud
• The source of (and rationalization for) incalculable amounts of online, even offline, surveillance
• Also the source of much of the money that keeps non-commerce websites available at all
  • Horrible dilemma, isn’t it?

WEB ADVERTISING PLACEMENT
• Not (or only minimally) determined by the website on which the ads are placed.
  • That website may be able to set some standards (e.g. no porn, no gambling, no politics) or refuse specific ads, but that’s it.
• Instead, “real-time bidding”
  • You visit website.
  • Website figures out who you are and what it already knows about you, and
  • sends that (plus new info about you) to an ad network asking “what ad should I show this chump?”
• So ad placement is indifferent to the quality of the site they’re supporting. “Can this ad get you to click?” is the only question.
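The real-time bidding flow described on this slide can be sketched in a few lines of Python. This is a toy under loud assumptions, not how any actual ad network works: the class names, the `site_quality` field, the click probabilities, and the "highest bid wins" rule are all hypothetical and chosen only to illustrate the slide's point that the quality of the site never enters the decision.

```python
from dataclasses import dataclass


@dataclass
class BidRequest:
    """What the website sends to the ad network (all fields hypothetical)."""
    user_profile: dict   # what is already known about the visitor
    site_url: str
    site_quality: float  # 0.0 = sludge, 1.0 = excellent; carried along, never used


@dataclass
class Bidder:
    name: str
    interest: str           # topic this advertiser targets
    value_per_click: float  # what one click is worth to the advertiser

    def bid(self, request: BidRequest) -> float:
        # Bid = guessed click probability x value of a click.
        # Note what is absent: request.site_quality plays no part.
        interested = self.interest in request.user_profile["interests"]
        p_click = 0.10 if interested else 0.01  # made-up probabilities
        return p_click * self.value_per_click


def run_auction(request: BidRequest, bidders: list) -> Bidder:
    """Return the highest-bidding advertiser for this page view."""
    return max(bidders, key=lambda b: b.bid(request))


bidders = [Bidder("shoe-ads", "running", 2.00), Bidder("bank-ads", "finance", 5.00)]
profile = {"interests": ["running", "cooking"]}

on_newspaper = BidRequest(profile, "https://newspaper.example/story", site_quality=0.9)
on_sludge = BidRequest(profile, "https://sludge.example/page", site_quality=0.1)

winner_newspaper = run_auction(on_newspaper, bidders)
winner_sludge = run_auction(on_sludge, bidders)
print(winner_newspaper.name, winner_sludge.name)
```

In this sketch the same advertiser wins with the same bid on both sites, because the only inputs are the visitor's profile and the value of a click.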

IF AD NETWORKS THINK YOU’LL CLICK ON AN AD ON
A 💩 SITE, THE AD GOES ON THE 💩 SITE. ESPECIALLY IF IT’S CHEAPER.

WONDERING WHY THERE’S SO MUCH LESS JOURNALISM ONLINE THESE DAYS?
THAT’S WHY. AD NETWORKS DON’T CARE ABOUT THE SOCIETAL IMPORTANCE OF JOURNALISTS AND JOURNALISM.

HOW DO YOU BUILD A 💩 WEBSITE TO SELL AD
PLACEMENTS AGAINST REAL FAST AND REAL CHEAP? (YOU READ THE TITLE OF THIS TALK. YOU KNOW THE ANSWER.)

HOW DO YOU BUILD A 💩 WEBSITE TO SELL AD
PLACEMENTS AGAINST REAL FAST AND REAL CHEAP? YOU USE A CHATBOT TO GENERATE SLUDGE. OF COURSE YOU DO.

REDACTED REDACTED

STOPPING HERE FOR QUESTIONS. WHAT DIDN’T I EXPLAIN CLEARLY? OR
POSSIBLY AT ALL?

GENERATIVE AI
• Subsets: large language model based chatbots, programming-code generators, image generators, sound and voice generators, video generators
• Amass a massive amount of data. More than that. No, more than that! EVEN MORE THAN THAT.
  • How? Mostly by hauling it in off the open/accessible web. Copyright, what’s that? Creator wishes, eh, who cares. Surveillance images/video? Sure, why not.
  • Private or confidential images (such as medical images)? Hey, if it’s on the open web nobody can object, right? CSAM? Uh-oh, better at least clean THAT up.
• Feed all this data into a black-box creator. Use the resulting black box to make stuff up in response to prompts.
  • Of course it’s not quite this simple, but this gives you the flavor.

A CHATBOT IS A MAKE-STUFF-UP MACHINE. THAT IS WHAT IT
IS. THAT IS WHAT IT DOES.

THIS IS ONE REASON NEVER, EVER, EVER TO USE A
CHATBOT AS A SEARCH ENGINE. IT IS NOT A SEARCH ENGINE. IT’S A MAKE-STUFF-UP MACHINE.

A CHATBOT IS ALSO A…
Laurens, “Honey Badger.” CC-BY-ND. https://www.flickr.com/photos/47456200@N04/4362649552

CHATBOTS ARE HONEY BADGERS!
• They just don’t care. (Can’t. Can’t care.)
  • (The honey-badger thing is an old meme. You can… probably… look it up.)
• They produce responses to search-like prompts the same way they do to any other prompt.
  • They send the prompt text to their tokenizer, and pop the resulting tokens into statistics engines that compute a statistically-likely answer, one token at a time.
• “Statistically likely” is not the same as “accurate,” “factual,” or even “reasonable.” It’s definitely NOT THE SAME AS “SAFE.”
• That’s how we get pizza rock glue, recipes that ask for poisonous ingredients, and so on. Chatbot don’t care!
  • More seriously, chatbot cannot be held accountable. Nor can chatbot companies.
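To make "a statistically-likely answer, one token at a time" concrete, here is a deliberately tiny sketch. It is an assumption-laden stand-in: real chatbots use neural networks over subword tokens, not the word-pair counts used here, and the corpus, `tokenize`, `train_bigrams`, and `complete` are all invented for illustration. The only point it shows is that the most frequent continuation wins, whether or not it is true.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Stand-in tokenizer: lowercase words. Real tokenizers emit subword pieces.
    return text.lower().split()


def train_bigrams(corpus):
    """Count, for every token, which tokens followed it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def complete(prompt, counts, max_new_tokens=5):
    """Extend the prompt one token at a time with the likeliest next token."""
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this token in training
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)


# Hypothetical training text: the false claim simply appears more often.
corpus = [
    "the moon is made of cheese",
    "the moon is made of cheese",
    "the moon is made of cheese",
    "the moon is made of rock",
]

counts = train_bigrams(corpus)
answer = complete("The moon is made", counts)
print(answer)  # the moon is made of cheese
```

Nothing in `complete` checks facts. It has no place where a fact could be checked; it only asks what usually comes next.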

WEB CREATORS DON’T LOVE AI. SO THEY’RE BLOCKING IT.

THAT CAN MEAN BLOCKING SEARCH ENGINE CRAWLERS TOO. REMEMBER HOW
GOOGLE’S GETTING INTO AI? YEAH.
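One common way to block crawlers is a robots.txt file; the slides do not name a mechanism, so treating robots.txt as the one in play is an assumption. The sketch below uses Python's standard `urllib.robotparser` to compare a selective rule set with a blanket one. `GPTBot` and `Googlebot` are real crawler user-agent names, while the example.org URL and both rule sets are made up. Note that robots.txt is a request that crawlers may ignore.

```python
from urllib.robotparser import RobotFileParser

# Selective blocking: turn away one AI crawler, let everyone else in.
SELECTIVE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Blanket blocking: turn away every crawler, search engines included.
BLANKET = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url="https://example.org/essay"):
    """Ask whether the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


results = {
    (label, agent): allowed(rules, agent)
    for label, rules in (("selective", SELECTIVE), ("blanket", BLANKET))
    for agent in ("GPTBot", "Googlebot")
}
for key, value in results.items():
    print(key, value)
```

Under the blanket rules the search crawler is shut out along with the AI crawler. A site owner may also choose that deliberately when the search company and the AI company are the same company.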

SO THAT REDUCES HOW MUCH GOOD STUFF GETS INTO SEARCH
ENGINES.

SO THAT REDUCES HOW MUCH GOOD STUFF GETS INTO CHATBOT
MODELS.

RESULT: THE WEB IS DROWNING IN SLUDGE

SO SEARCH ENGINES ARE DROWNING TOO. IT’S HURTING THE NOT-MADE-UP
WEB.

https://catandgirl.com/4000-of-my-closest-friends/

PEOPLE USING GENERATIVE AI TO LIE
• Scams and grifting (beyond cheating ad networks)
• Deepfakes, including sexually-themed ones
• Genuinely fake “fake news”
• Academic and professional cheating
  • Including research and scholarly publication fraud!
  • At base, the lie is “I thought about and created this.”
  • No, you didn’t, and yes, it matters.
• Yes, the chatbot companies know about this and don’t care.
• But it’s also up to us not to use these things just to be liars and grifters and cheats, okay?

SEARCH, SEO, AND SLUDGE
• Search-engine optimization (SEO) used to be about making a good, useful, accessible website. (I used to teach it proudly!)
• Now it’s about trying to surface from the sludge… which means a lot of sketchy web-writing practices, these days.
  • Reading about modern SEO both weirds me out and makes me furious.
  • It doesn’t MATTER whether a site is good for people! It has to be good for Google!
• Google used to guard its index pretty strongly.
  • Low-quality websites, slimy unethical SEO, deceptive design?
  • You could expect your site to sink in Google results or even be kicked out.
• Google is not guarding its index much today. Nor is the Common Crawl (which Google competitors often use).

WHY NOT GUARD THE INDEX?
• Reason 1: Chatbots absolutely can create sludge that they can’t reliably detect. So automatedly de-sludging a search-engine index is hard and may be impossible.
  • This is why universities are turning off AI-detection tools like Turnitin’s. They just don’t work.
  • A number of instructors (not in our iSchool!) have embarrassed themselves and (worse!) harmed students by taking AI-detection tools way too seriously.
• Reason 2: Google’s the biggest advertiser on the entire web. Sludge websites make Google ad money!
  • Google is a web advertising and user-surveillance company that happens to run a search engine.
  • Remember what I said about the web advertising industry? Yeah.

GOOGLE HAS MASSIVE CONFLICTS OF INTEREST. ITS SEARCH ENGINE BUSINESS
CONFLICTS WITH ITS WEB-AD BUSINESSES… AND ITS BROWSER/MOBILE BUSINESS… AND ITS SURVEILLANCE BUSINESSES… AND ITS NEBULOUS AI BUSINESS… AND SEARCH AND BROWSER ARE LOSING.

If your browser is Google Chrome… … now’s a good time to start auditioning alternatives. I suggest LibreWolf.

SEARCH-ENGINE REGRESSIONS
  • Regression: In software development, a change that damages functionality, or removes useful functionality
• Search engines used to have lots of useful advanced-search features: date restrictions, phrase searching, requiring certain words, etc. etc.
• These features are disappearing fast, or ceasing to work.
  • It’s viciously hard to do highly precise searching on the open web these days. It hasn’t always been this way!
• Why?
  • “People don’t use them!” What, I’m not people? Librarians aren’t people? Gah.
  • Even software developers can be gullible about AI/ML/LLMs; they think these replace user-side precision-enhancing advanced-search features.

SO WE HAVE A SEA OF SLUDGE, AND OUR TOOLS
FOR FINDING NON-SLUDGE IN IT ARE GETTING WORSE. AAAAAAAAAAAWESOME. WHAT NOW?

HERE WE ARE: TAKING ARMS AGAINST A SEA OF CHATBOT
SLUDGE

ANTI-SLUDGE STRATEGIES, AND WHERE THEY FAIL
• Limit searches to sites that are still mostly human.
  • This is the Reddit/StackExchange strategy. It’s a perfectly sensible strategy!
  • Problem: as AI bros exploit the open web for training data, a lot of human interaction is migrating off the open web to more private/confidential spaces.
  • Problem: Google bought access to Reddit, denying it to others.
• Similarly: rely on curated information aggregations
  • … like library-provided databases! This is opportunity knocking, folks!
• Stop searching the open web entirely. Neither the open web nor its search engines is reliable.
  • This makes me sad and angry. I love the web. This is an awful way for it to die.
  • But I’d be lying if I said I don’t see opportunity in it for librarianship.
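The first strategy on this slide, limiting a search to sites that are still mostly human, is usually done with the `site:` operator plus a quoted phrase. The helper below is a hypothetical sketch that only builds the query string. The function name, the search terms, and the site list are invented, and operator support varies by search engine (and, as noted earlier in this talk, is getting less dependable).

```python
def build_query(terms, sites=(), exact_phrase=None):
    """Build a search query restricted to chosen sites (hypothetical helper)."""
    parts = list(terms)
    if exact_phrase:
        # Quotation marks ask the engine for this exact phrase.
        parts.append(f'"{exact_phrase}"')
    if sites:
        # site:example.com asks the engine for results from that site only.
        site_filter = " OR ".join(f"site:{site}" for site in sites)
        parts.append(f"({site_filter})" if len(sites) > 1 else site_filter)
    return " ".join(parts)


query = build_query(
    ["sourdough", "starter"],
    sites=["reddit.com", "stackexchange.com"],
    exact_phrase="hooch on top",
)
print(query)
```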

LET’S CLEAN OUR OWN HOUSE FIRST

WE DO NOT […] WITH OUR PATRONS’ QUESTIONS. NOT EVER.

LET’S REMAIN UNDAZZLED BY HYPE.

LET’S EXPLAIN THE COSTS, ALOUD.

LET’S SHOW HOW TO DO BETTER.
• School and academic librarianship: “information literacy”
• ONE BOOK: Verified, Caulfield & Wineburg
• Okay, two books: AI Snake Oil, Narayanan & Kapoor
• Fine, FINE, three books (though this one isn’t out until next year): The AI Con, Bender & Hanna

MUST THE WEB DIE? CAN IT BE DESLUDGED? IS THERE HOPE?
• No (we too contribute good stuff to the web!), possibly, and yes.
• Altman (as one example) needs to make money at some point. He’s currently running ChatGPT as a hype engine and loss leader.
  • How long can he keep doing that?
  • Especially as LLMs keep embarrassing themselves in public?
  • Will the use payments he will inevitably demand make it uneconomic to sell ads against web sludge? We can hope!
• Data-privacy law (especially in the EU) offers some hope. So does copyright law.
• In the meantime, whatever our definition of “digital literacy” is, I think “coping with sludge” has to be part of it.

ANOTHER REASON TO EXPECT DESLUDGIFICATION
• General-purpose LLMs are trained on the open web, on the assumption that the web is written by human beings.
• As LLM sludge proliferates on the web, LLM builders will have to train LLMs on… output from prior LLMs.
  • Because, again, machine learning can’t reliably tell the difference between human writing and LLM sludge!
• Early days of research, but so far, it appears that LLMs trained on LLM-generated sludge get very bad very fast.
  • Phrase to search on: “model collapse”
• So even LLM sludgemeisters will have incentive to:
  • develop better tools for separating human writing from LLM sludge
  • not let themselves be used for the further ensludgification of the web
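A toy simulation gives some intuition for why training on prior model output degrades a model. It is not the method used in the model-collapse research: the "model" here is just a table of word frequencies, and the vocabulary, corpus size, and seed are arbitrary assumptions. Each generation is trained only on text sampled from the previous generation, so any rare word that happens not to be sampled is gone for good.

```python
import random
from collections import Counter


def train(corpus):
    """'Train' a toy model: just the word frequencies seen in the corpus."""
    return Counter(corpus)


def generate(model, n_words, rng):
    """Sample text from the model in proportion to its learned frequencies."""
    words = list(model.keys())
    weights = list(model.values())
    return rng.choices(words, weights=weights, k=n_words)


def simulate(generations=10, corpus_size=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical "human" text: a few common words and a long tail of rare ones.
    vocab = [f"word{i}" for i in range(50)]
    human_weights = [1 / (i + 1) for i in range(50)]
    corpus = rng.choices(vocab, weights=human_weights, k=corpus_size)

    vocab_sizes = []
    for _ in range(generations):
        model = train(corpus)
        vocab_sizes.append(len(model))
        # The next generation sees only this model's output, never human text.
        corpus = generate(model, corpus_size, rng)
    return vocab_sizes


vocab_sizes = simulate()
print(vocab_sizes)  # distinct words known to each successive generation
```

In this sketch the count of distinct words can stay level or shrink from one generation to the next but can never grow, since a model cannot emit a word it never saw. The loss is concentrated in the rare words of the long tail.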

THANK YOU!
Dorothea Salo, University of Wisconsin-Madison iSchool [email protected]
MA/LIS Graduate Coordinator: Tanya Hendricks Cobb [email protected]
MS/Info Graduate Coordinator: Jenny Greiber jg r [email protected]
