Slide 1

Slide 1 text

TAKING ARMS AGAINST A SEA OF CHATBOT SLUDGE Dorothea Salo University of Wisconsin-Madison iSchool [email protected]

Slide 2

Slide 2 text

SO, HOW’S WORKING OUT FOR YOU LATELY?

Slide 3

Slide 3 text

RESULTS MAYBE NOT SO GREAT?

Slide 4

Slide 4 text

TAKING YOU TO WEBSITES FULL OF SOUND (well, HTML) AND FURY SIGNIFYING NOTHING?

Slide 5

Slide 5 text

WITH LOTS AND LOTS OF ADS?

Slide 6

Slide 6 text

I really do recommend browser adblockers. uBlock Origin and Privacy Badger are good. (Firefox Focus on mobile.) It’s not just that ads are annoying. Increasingly, they’re a security risk.

Slide 7

Slide 7 text

NO JOY FROM EITHER?

Slide 8

Slide 8 text

IT’S NOT YOU. IT’S REALLY HAPPENING. AND THE CULPRITS ARE WEB ADVERTISING AND CHATBOT SLUDGE.

Slide 9

Slide 9 text

ABOUT WEB ADVERTISING
• A deeply selfish, harmful, and corrupt industry riddled with fraud
• The source of (and rationalization for) incalculable amounts of online, even offline, surveillance
• Also the source of much of the money that keeps non-commerce websites available at all
  • Horrible dilemma, isn’t it?

Slide 10

Slide 10 text

WEB ADVERTISING PLACEMENT
• Not (or only minimally) determined by the website on which the ads are placed.
  • That website may be able to set some standards (e.g. no porn, no gambling, no politics) or refuse specific ads, but that’s it.
• Instead, “real-time bidding”
  • You visit website.
  • Website figures out who you are and what it already knows about you, and
  • sends that (plus new info about you) to an ad network asking “what ad should I show this chump?”
• So ad placement is indifferent to the quality of the site they’re supporting. “Can this ad get you to click?” is the only question.
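The bidding flow on this slide can be sketched in a few lines of Python. This is a toy illustration, not how any real ad exchange is implemented: the `Bidder` class, the profile fields, the click-probability guesses, the site names, and `run_auction` are all invented for the example. What it tries to show is that the quality of the site is never an input to the decision.

```python
from dataclasses import dataclass


@dataclass
class Bidder:
    """A hypothetical advertiser bidding through an ad network."""
    name: str
    value_per_click: float   # what one click is worth to this advertiser
    interest_match: dict     # hypothetical: interest -> guessed click probability

    def bid(self, profile: dict) -> float:
        # Bid = value of a click * best guess that this visitor will click.
        p_click = max(
            (self.interest_match.get(i, 0.0) for i in profile["interests"]),
            default=0.0,
        )
        return self.value_per_click * p_click


def run_auction(profile: dict, site: dict, bidders: list) -> str:
    """Return the name of the highest bidder.

    `site` is accepted but never consulted: only the visitor's
    profile drives the bids, so site quality cannot matter.
    """
    return max(bidders, key=lambda b: b.bid(profile)).name


# Made-up visitor, sites, and advertisers.
profile = {"interests": ["gardening", "sneakers"]}
good_site = {"url": "https://news.example/", "quality": "careful journalism"}
sludge_site = {"url": "https://sludge.example/", "quality": "chatbot sludge"}
bidders = [
    Bidder("ShoeCo", 2.0, {"sneakers": 0.05}),
    Bidder("SeedCo", 1.0, {"gardening": 0.08}),
]

winner_on_good_site = run_auction(profile, good_site, bidders)
winner_on_sludge_site = run_auction(profile, sludge_site, bidders)
print(winner_on_good_site, winner_on_sludge_site)
```

The same ad wins on both sites. In this sketch nothing rewards the good site for being good, which is the slide's point.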

Slide 11

Slide 11 text

IF AD NETWORKS THINK YOU’LL CLICK ON AN AD ON A 💩 SITE, THE AD GOES ON THE 💩 SITE. ESPECIALLY IF IT’S CHEAPER.

Slide 12

Slide 12 text

WONDERING WHY THERE’S SO MUCH LESS JOURNALISM ONLINE THESE DAYS? THAT’S WHY. AD NETWORKS DON’T CARE ABOUT THE SOCIETAL IMPORTANCE OF JOURNALISTS AND JOURNALISM.

Slide 13

Slide 13 text

SO.

Slide 14

Slide 14 text

HOW DO YOU BUILD A 💩 WEBSITE TO SELL AD PLACEMENTS AGAINST REAL FAST AND REAL CHEAP? (YOU READ THE TITLE OF THIS TALK. YOU KNOW THE ANSWER.)

Slide 15

Slide 15 text

HOW DO YOU BUILD A 💩 WEBSITE TO SELL AD PLACEMENTS AGAINST REAL FAST AND REAL CHEAP? YOU USE A CHATBOT TO GENERATE SLUDGE. OF COURSE YOU DO.

Slide 16

Slide 16 text

REDACTED REDACTED

Slide 17

Slide 17 text

STOPPING HERE FOR QUESTIONS. WHAT DIDN’T I EXPLAIN CLEARLY? OR POSSIBLY AT ALL?

Slide 18

Slide 18 text

GENERATIVE AI
• Subsets: large language model based chatbots, programming-code generators, image generators, sound and voice generators, video generators
• Amass a massive amount of data. More than that. No, more than that! EVEN MORE THAN THAT.
  • How? Mostly by hauling it in off the open/accessible web. Copyright, what’s that? Creator wishes, eh, who cares. Surveillance images/video? Sure, why not.
  • Private or confidential images (such as medical images)? Hey, if it’s on the open web nobody can object, right? CSAM? Uh-oh, better at least clean THAT up.
• Feed all this data into a black-box creator. Use the resulting black box to make stuff up in response to prompts.
  • Of course it’s not quite this simple, but this gives you the flavor.

Slide 19

Slide 19 text

A CHATBOT IS A MAKE-STUFF-UP MACHINE. THAT IS WHAT IT IS. THAT IS WHAT IT DOES.

Slide 20

Slide 20 text

THIS IS ONE REASON NEVER, EVER, EVER TO USE A CHATBOT AS A SEARCH ENGINE. IT IS NOT A SEARCH ENGINE. IT’S A MAKE-STUFF-UP MACHINE.

Slide 21

Slide 21 text

A CHATBOT IS ALSO A… Laurens, “Honey Badger.” CC-BY-ND. https://www.flickr.com/photos/47456200@N04/4362649552

Slide 22

Slide 22 text

CHATBOTS ARE HONEY BADGERS!
• They just don’t care. (Can’t. Can’t care.)
  • (The honey-badger thing is an old meme. You can… probably… look it up.)
• They produce responses to search-like prompts the same way they do to any other prompt.
  • They send the prompt text to their tokenizer, and pop the resulting tokens into statistics engines that compute a statistically-likely answer, one token at a time.
• “Statistically likely” is not the same as “accurate,” “factual,” or even “reasonable.” It’s definitely NOT THE SAME AS “SAFE.”
• That’s how we get pizza rock glue, recipes that ask for poisonous ingredients, and so on. Chatbot don’t care!
  • More seriously, chatbot cannot be held accountable. Nor can chatbot companies.
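Here is a toy version of “one token at a time,” as a sketch only. Real chatbots use subword tokenizers and enormous neural networks, and they usually sample among likely tokens rather than always taking the top one. In this example the “tokenizer” is `str.split`, the “statistics engine” is a table of word-pair counts, and the three-sentence corpus is made up. All names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up training text. Its most common "fact" is wrong,
# and the model has no way to know that.
CORPUS = [
    "cheese sticks to pizza with glue",
    "cheese sticks to pizza with glue",
    "cheese sticks to pizza with sauce",
]


def train(corpus):
    """Count which token follows which: a tiny stand-in for a language model."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()  # stand-in for a real tokenizer
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, prompt, max_tokens=10):
    """Extend the prompt by repeatedly appending the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # nothing ever followed this token in the corpus
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)


model = train(CORPUS)
answer = generate(model, "cheese sticks")
print(answer)
```

The output ends in “glue” because “glue” followed “with” more often than “sauce” did. Nothing in the loop checks whether the sentence is true, sensible, or safe.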

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

WEB CREATORS DON’T LOVE AI. SO THEY’RE BLOCKING IT.

Slide 25

Slide 25 text

THAT CAN MEAN BLOCKING SEARCH ENGINE CRAWLERS TOO. REMEMBER HOW GOOGLE’S GETTING INTO AI? YEAH.
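One common way for a site to ask crawlers to stay away is a `robots.txt` file. The slides don’t say which mechanism creators use, so treat this as an illustrative sketch: the rules and the site URL below are invented, and honoring `robots.txt` is voluntary on the crawler’s part. `GPTBot` (OpenAI) and `CCBot` (Common Crawl) are published crawler names, and `Googlebot` is Google’s search crawler. Python’s standard `urllib.robotparser` shows what a well-behaved crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site that wants no part of AI
# training and, to be safe, shuts out Google's crawler as well.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

PAGE = "https://site.example/essay.html"  # made-up URL
allowed = {
    agent: parser.can_fetch(agent, PAGE)
    for agent in ("GPTBot", "CCBot", "Googlebot", "SomeOtherBot")
}
print(allowed)
```

With these rules a compliant `Googlebot` is turned away along with the AI crawlers, so the site drops out of Google search too.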

Slide 26

Slide 26 text

SO THAT REDUCES HOW MUCH GOOD STUFF GETS INTO SEARCH ENGINES.

Slide 27

Slide 27 text

SO THAT REDUCES HOW MUCH GOOD STUFF GETS INTO CHATBOT MODELS.

Slide 28

Slide 28 text

RESULT: THE WEB IS DROWNING IN SLUDGE

Slide 29

Slide 29 text

SO SEARCH ENGINES ARE DROWNING TOO. IT’S HURTING THE NOT-MADE-UP WEB.

Slide 30

Slide 30 text

https://catandgirl.com/4000-of-my-closest-friends/

Slide 31

Slide 31 text

PEOPLE USING GENERATIVE AI TO LIE
• Scams and grifting (beyond cheating ad networks)
• Deepfakes, including sexually-themed ones
• Genuinely fake “fake news”
• Academic and professional cheating
  • Including research and scholarly publication fraud!
  • At base, the lie is “I thought about and created this.”
  • No, you didn’t, and yes, it matters.
• Yes, the chatbot companies know about this and don’t care.
• But it’s also up to us not to use these things just to be liars and grifters and cheats, okay?

Slide 32

Slide 32 text

SEARCH, SEO, AND SLUDGE
• Search-engine optimization (SEO) used to be about making a good, useful, accessible website. (I used to teach it proudly!)
• Now it’s about trying to surface from the sludge… which means a lot of sketchy web-writing practices, these days.
  • Reading about modern SEO both weirds me out and makes me furious.
  • It doesn’t MATTER whether a site is good for people! It has to be good for Google!
• Google used to guard its index pretty strongly.
  • Low-quality websites, slimy unethical SEO, deceptive design?
  • You could expect your site to sink in Google results or even be kicked out.
• Google is not guarding its index much today. Nor is the Common Crawl (which Google competitors often use).

Slide 33

Slide 33 text

WHY NOT GUARD THE INDEX?
• Reason 1: Chatbots absolutely can create sludge that they can’t reliably detect. So automatedly de-sludging a search-engine index is hard and may be impossible.
  • This is why universities are turning off AI-detection tools like Turnitin’s. They just don’t work.
  • A number of instructors (not in our iSchool!) have embarrassed themselves and (worse!) harmed students by taking AI-detection tools way too seriously.
• Reason 2: Google’s the biggest advertiser on the entire web. Sludge websites make Google ad money!
  • Google is a web advertising and user-surveillance company that happens to run a search engine.
  • Remember what I said about the web advertising industry? Yeah.

Slide 34

Slide 34 text

GOOGLE HAS MASSIVE CONFLICTS OF INTEREST. ITS SEARCH ENGINE BUSINESS CONFLICTS WITH ITS WEB-AD BUSINESSES… AND ITS BROWSER/MOBILE BUSINESS… AND ITS SURVEILLANCE BUSINESSES… AND ITS NEBULOUS AI BUSINESS… AND SEARCH AND BROWSER ARE LOSING.

Slide 35

Slide 35 text

If your browser is Google Chrome… now’s a good time to start auditioning alternatives. I suggest LibreWolf.

Slide 36

Slide 36 text

SEARCH-ENGINE REGRESSIONS
  • Regression: In software development, a change that damages functionality, or removes useful functionality
• Search engines used to have lots of useful advanced-search features: date restrictions, phrase searching, requiring certain words, etc. etc.
• These features are disappearing fast, or ceasing to work.
  • It’s viciously hard to do highly precise searching on the open web these days. It hasn’t always been this way!
• Why?
  • “People don’t use them!” What, I’m not people? Librarians aren’t people? Gah.
  • Even software developers can be gullible about AI/ML/LLMs: they think these replace user-side precision-enhancing advanced-search features.

Slide 37

Slide 37 text

SO WE HAVE A SEA OF SLUDGE, AND OUR TOOLS FOR FINDING NON-SLUDGE IN IT ARE GETTING WORSE. AAAAAAAAAAAWESOME. WHAT NOW?

Slide 38

Slide 38 text

HERE WE ARE: TAKING ARMS AGAINST A SEA OF CHATBOT SLUDGE

Slide 39

Slide 39 text

ANTI-SLUDGE STRATEGIES, AND WHERE THEY FAIL
• Limit searches to sites that are still mostly human.
  • This is the Reddit/StackExchange strategy. It’s a perfectly sensible strategy!
  • Problem: as AI bros exploit the open web for training data, a lot of human interaction is migrating off the open web to more private/confidential spaces.
  • Problem: Google bought access to Reddit, denying it to others.
• Similarly: rely on curated information aggregations
  • … like library-provided databases! This is opportunity knocking, folks!
• Stop searching the open web entirely. Neither the open web nor its search engines is reliable.
  • This makes me sad and angry. I love the web. This is an awful way for it to die.
  • But I’d be lying if I said I don’t see opportunity in it for librarianship.

Slide 40

Slide 40 text

LET’S CLEAN OUR OWN HOUSE FIRST

Slide 41

Slide 41 text

WE DO NOT […] WITH OUR PATRONS’ QUESTIONS. NOT EVER.

Slide 42

Slide 42 text

LET’S REMAIN UNDAZZLED BY HYPE.

Slide 43

Slide 43 text

LET’S EXPLAIN THE COSTS, ALOUD.

Slide 44

Slide 44 text

LET’S SHOW HOW TO DO BETTER.
• School and academic librarianship: “information literacy”
• ONE BOOK: Verified, Caulfield & Wineburg
• Okay, two books: AI Snake Oil, Narayanan & Kapoor
• Fine, FINE, three books (though this one isn’t out until next year): The AI Con, Bender & Hanna

Slide 45

Slide 45 text

MUST THE WEB DIE? CAN IT BE DESLUDGED? IS THERE HOPE?
• No (we too contribute good stuff to the web!), possibly, and yes.
• Altman (as one example) needs to make money at some point. He’s currently running ChatGPT as a hype engine and loss leader.
  • How long can he keep doing that?
  • Especially as LLMs keep embarrassing themselves in public?
  • Will the use payments he will inevitably demand make it uneconomic to sell ads against web sludge? We can hope!
• Data-privacy law (especially in the EU) offers some hope. So does copyright law.
• In the meantime, whatever our definition of “digital literacy” is, I think “coping with sludge” has to be part of it.

Slide 46

Slide 46 text

ANOTHER REASON TO EXPECT DESLUDGIFICATION
• General-purpose LLMs are trained on the open web, on the assumption that the web is written by human beings.
• As LLM sludge proliferates on the web, LLM builders will have to train LLMs on… output from prior LLMs.
  • Because, again, machine learning can’t reliably tell the difference between human writing and LLM sludge!
• Early days of research, but so far, it appears that LLMs trained on LLM-generated sludge get very bad very fast.
  • Phrase to search on: “model collapse”
• So even LLM sludgemeisters will have incentive to:
  • develop better tools for separating human writing from LLM sludge
  • not let themselves be used for the further ensludgification of the web
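A deliberately crude sketch of why training on your own output loses information. This is not the method used in the model-collapse research, just a toy under stated assumptions: the “model” is a table of word frequencies, “training” is counting words in a sample, and each generation is trained only on text sampled from the generation before it. The vocabulary, sample sizes, and function names are all invented.

```python
import random
from collections import Counter


def train(samples):
    """'Train' a model: estimate word probabilities from observed samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def generate(model, n, rng):
    """'Generate' n words by sampling according to the model's probabilities."""
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n)


rng = random.Random(0)  # fixed seed so runs are repeatable

# Generation 0 learns from made-up "human" text:
# 50 distinct words, a few very common, many rare.
human_text = [f"word{i}" for i in range(50) for _ in range(1 + 100 // (i + 1))]
model = train(human_text)

vocab_sizes = [len(model)]
for generation in range(20):
    synthetic = generate(model, 200, rng)  # the previous model's output...
    model = train(synthetic)               # ...is the next model's training data
    vocab_sizes.append(len(model))

print(vocab_sizes)
```

In this toy, a word that fails to show up in one generation’s output has probability zero in every later generation, so the vocabulary can only shrink or stay the same. The rare words are the ones most at risk. Real LLM training is far more complicated, but the one-way loss of rare material is the flavor of the problem.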

Slide 47

Slide 47 text

THANK YOU! Dorothea Salo University of Wisconsin-Madison iSchool [email protected] MA/LIS Graduate Coordinator: Tanya Hendricks Cobb [email protected] MS/Info Graduate Coordinator: Jenny Greiber jg r [email protected]