Taking Arms against a Sea of Chatbot Sludge

For INFOCon 2024, November 13. The term of art is "chatbot slop," actually, but for me the word "sludge" is ever so much more evocative.

Dorothea Salo

November 13, 2024

Transcript

  1. TAKING YOU TO WEBSITES FULL OF SOUND (well, html) AND FURY SIGNIFYING NOTHING?
  2. I really do recommend browser adblockers. uBlock Origin and Privacy Badger are good. (Firefox Focus on mobile.) It’s not just that ads are annoying. Increasingly, they’re a security risk.
  3. ABOUT WEB ADVERTISING
    • A deeply selfish, harmful, and corrupt industry riddled with fraud
    • The source of (and rationalization for) incalculable amounts of online, even offline, surveillance
    • Also the source of much of the money that keeps non-commerce websites available at all
        • Horrible dilemma, isn’t it?
  4. WEB ADVERTISING PLACEMENT
    • Not (or only minimally) determined by the website on which the ads are placed.
        • That website may be able to set some standards (e.g. no porn, no gambling, no politics) or refuse specific ads, but that’s it.
    • Instead, “real-time bidding”
        • You visit website.
        • Website figures out who you are and what it already knows about you, and
        • sends that (plus new info about you) to an ad network asking “what ad should I show this chump?”
    • So ad placement is indifferent to the quality of the site they’re supporting. “Can this ad get you to click?” is the only question.
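
The bidding flow on this slide can be caricatured in a few lines of Python. This is a toy sketch, not how any real ad exchange works: the `BidRequest` and `Advertiser` classes, the click-rate formula, and the example sites are all invented for illustration. The one point it is meant to show is that the site's quality is available to the auction and never consulted.

```python
from dataclasses import dataclass


@dataclass
class BidRequest:
    site: str
    site_quality: float          # hypothetical score: 0.0 = sludge, 1.0 = careful journalism
    user_interests: frozenset    # what the site already "knows" about the visitor


@dataclass
class Advertiser:
    name: str
    targets: frozenset           # interests this advertiser wants to reach
    value_per_click: float       # what one click is worth to the advertiser


def predicted_click_rate(ad: Advertiser, request: BidRequest) -> float:
    """Made-up click model: more overlap with the user's interests, more clicks."""
    overlap = len(ad.targets & request.user_interests)
    return min(1.0, 0.01 + 0.02 * overlap)


def run_auction(request: BidRequest, advertisers: list) -> tuple:
    """Highest expected click value wins. Note: request.site_quality is never read."""
    bids = {
        ad.name: ad.value_per_click * predicted_click_rate(ad, request)
        for ad in advertisers
    }
    winner = max(bids, key=bids.get)
    return winner, bids[winner]


# Hypothetical advertisers, for illustration only.
ADVERTISERS = [
    Advertiser("ShoeCo", frozenset({"shoes"}), 2.0),
    Advertiser("TravelCo", frozenset({"travel", "hotels"}), 1.0),
    Advertiser("BankCo", frozenset({"loans"}), 5.0),
]
```

Run the same visitor through a careful news site and a sludge farm and the auction returns the same winner at the same price, because the only input that differs is the one nobody looks at.
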
  5. IF AD NETWORKS THINK YOU’LL CLICK ON AN AD ON A 💩 SITE, THE AD GOES ON THE 💩 SITE. ESPECIALLY IF IT’S CHEAPER.
  6. WONDERING WHY THERE’S SO MUCH LESS JOURNALISM ONLINE THESE DAYS? THAT’S WHY. AD NETWORKS DON’T CARE ABOUT THE SOCIETAL IMPORTANCE OF JOURNALISTS AND JOURNALISM.
  7. SO.

  8. HOW DO YOU BUILD A 💩 WEBSITE TO SELL AD PLACEMENTS AGAINST REAL FAST AND REAL CHEAP? (YOU READ THE TITLE OF THIS TALK. YOU KNOW THE ANSWER.)
  9. HOW DO YOU BUILD A 💩 WEBSITE TO SELL AD PLACEMENTS AGAINST REAL FAST AND REAL CHEAP? YOU USE A CHATBOT TO GENERATE SLUDGE. OF COURSE YOU DO.
  10. GENERATIVE AI
    • Subsets: large language model based chatbots, programming-code generators, image generators, sound and voice generators, video generators
    • Amass a massive amount of data. More than that. No, more than that! EVEN MORE THAN THAT.
        • How? Mostly by hauling it in off the open/accessible web. Copyright, what’s that? Creator wishes, eh, who cares. Surveillance images/video? Sure, why not.
        • Private or confidential images (such as medical images)? Hey, if it’s on the open web nobody can object, right? CSAM? Uh-oh, better at least clean THAT up.
    • Feed all this data into a black-box creator. Use the resulting black box to make stuff up in response to prompts.
        • Of course it’s not quite this simple, but this gives you the flavor.
  11. THIS IS ONE REASON NEVER, EVER, EVER TO USE A CHATBOT AS A SEARCH ENGINE. IT IS NOT A SEARCH ENGINE. IT’S A MAKE-STUFF-UP MACHINE.
  12. CHATBOTS ARE HONEY BADGERS!
    • They just don’t care. (Can’t. Can’t care.)
        • (The honey-badger thing is an old meme. You can… probably… look it up.)
    • They produce responses to search-like prompts the same way they do to any other prompt.
        • They send the prompt text to their tokenizer, and pop the resulting tokens into statistics engines that compute a statistically-likely answer, one token at a time.
    • “Statistically likely” is not the same as “accurate,” “factual,” or even “reasonable.” It’s definitely NOT THE SAME AS “SAFE.”
    • That’s how we get pizza rock glue, recipes that ask for poisonous ingredients, and so on. Chatbot don’t care!
        • More seriously, chatbot cannot be held accountable. Nor can chatbot companies.
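
To make "statistically likely, one token at a time" concrete, here is a deliberately tiny sketch. It is an assumption-laden stand-in, not how a real LLM works: the "tokenizer" just splits on whitespace, the "statistics engine" is a table of which word most often follows which, and the four-sentence corpus is invented so that a joke outnumbers the fact.

```python
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_bigram(tokens: list) -> dict:
    """Count, for every token, which tokens follow it and how often."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following


def generate(model: dict, prompt: str, max_tokens: int = 10) -> str:
    """Extend the prompt by repeatedly appending the most frequent next token."""
    output = tokenize(prompt)
    for _ in range(max_tokens):
        candidates = model.get(output[-1])
        if not candidates:
            break  # the model has never seen anything follow this token
        next_token = candidates.most_common(1)[0][0]
        output.append(next_token)
        if next_token == ".":
            break  # end of sentence
    return " ".join(output)


# Invented training text: the joke appears twice, the fact once.
CORPUS = (
    "the moon is made of rock . "
    "the cheese is made of milk . "
    "the moon is made of cheese . "
    "the moon is made of cheese ."
)
MODEL = train_bigram(tokenize(CORPUS))

print(generate(MODEL, "the"))  # the moon is made of cheese .
```

Nothing in `generate` checks whether the sentence is true. The joke wins simply because it is more common in the training text than the fact.
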
  13. PEOPLE USING GENERATIVE AI TO LIE
    • Scams and grifting (beyond cheating ad networks)
    • Deepfakes, including sexually-themed ones
    • Genuinely fake “fake news”
    • Academic and professional cheating
        • Including research and scholarly publication fraud!
        • At base, the lie is “I thought about and created this.”
        • No, you didn’t, and yes, it matters.
    • Yes, the chatbot companies know about this and don’t care.
    • But it’s also up to us not to use these things just to be liars and grifters and cheats, okay?
  14. SEARCH, SEO, AND SLUDGE
    • Search-engine optimization (SEO) used to be about making a good, useful, accessible website. (I used to teach it proudly!)
    • Now it’s about trying to surface from the sludge… which means a lot of sketchy web-writing practices, these days.
        • Reading about modern SEO both weirds me out and makes me furious.
        • It doesn’t MATTER whether a site is good for people! It has to be good for Google!
    • Google used to guard its index pretty strongly.
        • Low-quality websites, slimy unethical SEO, deceptive design?
        • You could expect your site to sink in Google results or even be kicked out.
    • Google is not guarding its index much today. Nor is the Common Crawl (which Google competitors often use).
  15. WHY NOT GUARD THE INDEX?
    • Reason 1: Chatbots absolutely can create sludge that they can’t reliably detect. So automatedly de-sludging a search-engine index is hard and may be impossible.
        • This is why universities are turning off AI-detection tools like Turnitin’s. They just don’t work.
        • A number of instructors (not in our iSchool!) have embarrassed themselves and (worse!) harmed students by taking AI-detection tools way too seriously.
    • Reason 2: Google’s the biggest advertiser on the entire web. Sludge websites make Google ad money!
        • Google is a web advertising and user-surveillance company that happens to run a search engine.
        • Remember what I said about the web advertising industry? Yeah.
  16. GOOGLE HAS MASSIVE CONFLICTS OF INTEREST. ITS SEARCH ENGINE BUSINESS CONFLICTS WITH ITS WEB-AD BUSINESSES… AND ITS BROWSER/MOBILE BUSINESS… AND ITS SURVEILLANCE BUSINESSES… AND ITS NEBULOUS AI BUSINESS… AND SEARCH AND BROWSER ARE LOSING.
  17. If your browser is Google Chrome… now’s a good time to start auditioning alternatives. I suggest LibreWolf.
  18. SEARCH-ENGINE REGRESSIONS
    • Regression: In software development, a change that damages functionality, or removes useful functionality
    • Search engines used to have lots of useful advanced-search features: date restrictions, phrase searching, requiring certain words, etc. etc.
    • These features are disappearing fast, or ceasing to work.
        • It’s viciously hard to do highly precise searching on the open web these days. It hasn’t always been this way!
    • Why?
        • “People don’t use them!” What, I’m not people? Librarians aren’t people? Gah.
        • Even software developers can be gullible about AI/ML/LLMs; they think these replace user-side precision-enhancing advanced-search features.
  19. SO WE HAVE A SEA OF SLUDGE, AND OUR TOOLS FOR FINDING NON-SLUDGE IN IT ARE GETTING WORSE. AAAAAAAAAAAWESOME. WHAT NOW?
  20. ANTI-SLUDGE STRATEGIES, AND WHERE THEY FAIL
    • Limit searches to sites that are still mostly human.
        • This is the Reddit/StackExchange strategy. It’s a perfectly sensible strategy!
        • Problem: as AI bros exploit the open web for training data, a lot of human interaction is migrating off the open web to more private/confidential spaces.
        • Problem: Google bought access to Reddit, denying it to others.
    • Similarly: rely on curated information aggregations
        • … like library-provided databases! This is opportunity knocking, folks!
    • Stop searching the open web entirely. Neither the open web nor its search engines is reliable.
        • This makes me sad and angry. I love the web. This is an awful way for it to die.
        • But I’d be lying if I said I don’t see opportunity in it for librarianship.
  21. LET’S SHOW HOW TO DO BETTER.
    • School and academic librarianship: “information literacy”
    • ONE BOOK: Verified, Caulfield & Wineburg
    • Okay, two books: AI Snake Oil, Narayanan & Kapoor
    • Fine, FINE, three books (though this one isn’t out until next year): The AI Con, Bender & Hanna
  22. MUST THE WEB DIE? CAN IT BE DESLUDGED? IS THERE HOPE?
    • No (we too contribute good stuff to the web!), possibly, and yes.
    • Altman (as one example) needs to make money at some point. He’s currently running ChatGPT as a hype engine and loss leader.
        • How long can he keep doing that?
        • Especially as LLMs keep embarrassing themselves in public?
        • Will the use payments he will inevitably demand make it uneconomic to sell ads against web sludge? We can hope!
    • Data-privacy law (especially in the EU) offers some hope. So does copyright law.
    • In the meantime, whatever our definition of “digital literacy” is, I think “coping with sludge” has to be part of it.
  23. ANOTHER REASON TO EXPECT DESLUDGIFICATION
    • General-purpose LLMs are trained on the open web, on the assumption that the web is written by human beings.
    • As LLM sludge proliferates on the web, LLM builders will have to train LLMs on… output from prior LLMs.
        • Because, again, machine learning can’t reliably tell the difference between human writing and LLM sludge!
    • Early days of research, but so far, it appears that LLMs trained on LLM-generated sludge get very bad very fast.
        • Phrase to search on: “model collapse”
    • So even LLM sludgemeisters will have incentive to:
        • develop better tools for separating human writing from LLM sludge
        • not let themselves be used for the further ensludgification of the web
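
One mechanism behind "model collapse" can be shown with a toy simulation. It is a caricature under loud assumptions: the "model" here is nothing but word frequencies, "training" is counting, and the starting corpus is invented. Each generation trains only on the previous generation's output. A word that fails to get sampled even once is gone for good, so the rare, distinctive material tends to disappear first.

```python
import random
from collections import Counter


def train(corpus: list) -> Counter:
    """'Train' a toy model: record how often each word appears."""
    return Counter(corpus)


def generate(model: Counter, n_words: int, rng: random.Random) -> list:
    """Produce output by sampling words in proportion to their trained frequency."""
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n_words)


def simulate(human_corpus: list, generations: int, rng: random.Random) -> list:
    """Train each generation on the previous generation's output; track vocabulary size."""
    corpus = list(human_corpus)
    vocab_sizes = [len(set(corpus))]
    for _ in range(generations):
        model = train(corpus)
        corpus = generate(model, len(corpus), rng)
        vocab_sizes.append(len(set(corpus)))
    return vocab_sizes


# Hypothetical "human" corpus: a few common words plus a tail of rare ones.
HUMAN = ["the"] * 50 + ["web"] * 30 + ["sludge"] * 10 + [f"rare{i}" for i in range(10)]
SIZES = simulate(HUMAN, generations=20, rng=random.Random(0))

print(SIZES)  # distinct words per generation; the count can only hold or shrink
```

The vocabulary can never grow back, because a model trained only on model output has no way to relearn a word it has stopped producing. Real model-collapse research looks at far richer failures than this, so treat the sketch as intuition, not evidence.
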
  24. THANK YOU!
    Dorothea Salo, University of Wisconsin-Madison iSchool [email protected]
    MA/LIS Graduate Coordinator: Tanya Hendricks Cobb [email protected]
    MS/Info Graduate Coordinator: Jenny Greiber jg r [email protected]