riddled with fraud
• The source of (and rationalization for) incalculable amounts of online, even offline, surveillance
• Also the source of much of the money that keeps non-commercial websites available at all
• Horrible dilemma, isn't it?
website on which the ads are placed.
  • That website may be able to set some standards (e.g. no porn, no gambling, no politics) or refuse specific ads, but that's it.
• Instead, "real-time bidding"
  • You visit website.
  • Website figures out who you are and what it already knows about you, and
  • sends that (plus new info about you) to an ad network asking "what ad should I show this chump?"
• So ad placement is indifferent to the quality of the site they're supporting. "Can this ad get you to click?" is the only question.
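The bidding step above can be sketched roughly like this. This is a toy illustration only; the function, field names, and numbers are invented, not any real ad-exchange API:

```python
# Toy sketch of real-time bidding (all names and numbers are hypothetical).
# The exchange scores each bid only by expected revenue: bid price times
# predicted click-through rate. The quality of the hosting site never
# enters the calculation.

def choose_ad(bids):
    """Pick the bid with the highest expected revenue per impression."""
    return max(bids, key=lambda bid: bid["price"] * bid["predicted_ctr"])

bids = [
    {"ad": "reputable-bookstore", "price": 1.50, "predicted_ctr": 0.01},
    {"ad": "sketchy-supplement", "price": 2.00, "predicted_ctr": 0.05},
]

winner = choose_ad(bids)
print(winner["ad"])  # the ad most likely to get clicked wins, period
```

Note what is missing: there is no term anywhere for "is this a good site to support?"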
image generators, sound and voice generators, video generators
• Amass a massive amount of data. More than that. No, more than that! EVEN MORE THAN THAT.
  • How? Mostly by hauling it in off the open/accessible web. Copyright, what's that? Creator wishes, eh, who cares. Surveillance images/video? Sure, why not.
  • Private or confidential images (such as medical images)? Hey, if it's on the open web nobody can object, right? CSAM? Uh-oh, better at least clean THAT up.
• Feed all this data into a black-box creator. Use the resulting black box to make stuff up in response to prompts.
  • Of course it's not quite this simple, but this gives you the flavor.
care.)
  • (The honey-badger thing is an old meme. You can… probably… look it up.)
• They produce responses to search-like prompts the same way they do to any other prompt.
  • They send the prompt text to their tokenizer, and pop the resulting tokens into statistics engines that compute a statistically-likely answer, one token at a time.
• "Statistically likely" is not the same as "accurate," "factual," or even "reasonable." It's definitely NOT THE SAME AS "SAFE."
• That's how we get pizza rock glue, recipes that ask for poisonous ingredients, and so on. Chatbot don't care!
  • More seriously, chatbot cannot be held accountable. Nor can chatbot companies.
cheating ad networks)
• Deepfakes, including sexually-themed ones
• Genuinely fake "fake news"
• Academic and professional cheating
  • Including research and scholarly publication fraud!
  • At base, the lie is "I thought about and created this."
  • No, you didn't, and yes, it matters.
• Yes, the chatbot companies know about this and don't care.
• But it's also up to us not to use these things just to be liars and grifters and cheats, okay?
about making a good, useful, accessible website. (I used to teach it proudly!)
• Now it's about trying to surface from the sludge… which means a lot of sketchy web-writing practices, these days.
  • Reading about modern SEO both weirds me out and makes me furious.
  • It doesn't MATTER whether a site is good for people! It has to be good for Google!
• Google used to guard its index pretty strongly.
  • Low-quality websites, slimy unethical SEO, deceptive design?
  • You could expect your site to sink in Google results or even be kicked out.
• Google is not guarding its index much today. Nor is the Common Crawl (which Google competitors often use).
create sludge that they can't reliably detect. So automatically de-sludging a search-engine index is hard and may be impossible.
  • This is why universities are turning off AI-detection tools like Turnitin's. They just don't work.
  • A number of instructors (not in our iSchool!) have embarrassed themselves and (worse!) harmed students by taking AI-detection tools way too seriously.
• Reason 2: Google's the biggest advertiser on the entire web. Sludge websites make Google ad money!
  • Google is a web advertising and user-surveillance company that happens to run a search engine.
  • Remember what I said about the web advertising industry? Yeah.
CONFLICTS WITH ITS WEB-AD BUSINESSES… AND ITS BROWSER/MOBILE BUSINESS… AND ITS SURVEILLANCE BUSINESSES… AND ITS NEBULOUS AI BUSINESS… AND SEARCH AND BROWSER ARE LOSING.
are development, a change that damages functionality, or removes useful functionality
• Search engines used to have lots of useful advanced-search features: date restrictions, phrase searching, requiring certain words, etc. etc.
• These features are disappearing fast, or ceasing to work.
  • It's viciously hard to do highly precise searching on the open web these days. It hasn't always been this way!
• Why?
  • "People don't use them!" What, I'm not people? Librarians aren't people? Gah.
  • Even software developers can be gullible about AI/ML/LLMs; they think these replace user-side precision-enhancing advanced-search features.
that are still mostly human.
  • This is the Reddit/StackExchange strategy. It's a perfectly sensible strategy!
  • Problem: as AI bros exploit the open web for training data, a lot of human interaction is migrating off the open web to more private/confidential spaces.
  • Problem: Google bought access to Reddit, denying it to others.
• Similarly: rely on curated information aggregations
  • … like library-provided databases! This is opportunity knocking, folks!
• Stop searching the open web entirely. Neither the open web nor its search engines are reliable.
  • This makes me sad and angry. I love the web. This is an awful way for it to die.
  • But I'd be lying if I said I don't see opportunity in it for librarianship.
"information literacy"
• ONE BOOK: Verified, Caulfield & Wineburg
• Okay, two books: AI Snake Oil, Narayanan & Kapoor
• Fine, FINE, three books (though this one isn't out until next year): The AI Con, Bender & Hanna
HOPE?
• No (we too contribute good stuff to the web!), possibly, and yes.
• Altman (as one example) needs to make money at some point. He's currently running ChatGPT as a hype engine and loss leader.
  • How long can he keep doing that?
  • Especially as LLMs keep embarrassing themselves in public?
  • Will the use payments he will inevitably demand make it uneconomic to sell ads against web sludge? We can hope!
• Data-privacy law (especially in the EU) offers some hope. So does copyright law.
• In the meantime, whatever our definition of "digital literacy" is, I think "coping with sludge" has to be part of it.
the open web, on the assumption that the web is written by human beings.
• As LLM sludge proliferates on the web, LLM builders will have to train LLMs on… output from prior LLMs.
  • Because, again, machine learning can't reliably tell the difference between human writing and LLM sludge!
• Early days of research, but so far, it appears that LLMs trained on LLM-generated sludge get very bad very fast.
  • Phrase to search on: "model collapse"
• So even LLM sludgemeisters will have incentive to:
  • develop better tools for separating human writing from LLM sludge
  • not let themselves be used for the further ensludgification of the web
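The model-collapse intuition can be mimicked with a toy resampling experiment. Purely illustrative, and nothing like a real training run: each "generation" learns only from the previous generation's output, so anything rare that fails to be sampled is gone forever.

```python
import random

# Toy model-collapse sketch (illustrative only, not a real training run).
# Each generation "trains" on the previous generation's output by
# resampling from it with replacement. A token that isn't sampled
# vanishes permanently, so diversity can only shrink over generations.

def next_generation(corpus, rng):
    """Produce a same-sized corpus by resampling the previous one."""
    return [rng.choice(corpus) for _ in range(len(corpus))]

rng = random.Random(42)
corpus = list(range(1000))            # generation 0: 1000 distinct "tokens"
diversity = [len(set(corpus))]

for _ in range(10):
    corpus = next_generation(corpus, rng)
    diversity.append(len(set(corpus)))

print(diversity)  # monotonically non-increasing; drops fast at first
```

Real LLMs degrade in more complicated ways than a vocabulary count, but the one-way ratchet is the point: information lost in one generation is unavailable to every later one.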