First: AGI is not a thing. • “Artificial general intelligence” — a machine that thinks like a human being.
• Not a thing.
• NOT A THING.
• Lots of people use “AI” to mean AGI. They are:
• hypesters,
• deluded, or
• playing a shell game (to worry people about something that isn’t happening so they won’t pay attention to the bad AI/ML/LLM-related stuff that IS happening).
Grand Attempt 1: • Break down the world, everything in it, and how everything interacts with everything else into tiny computer-digestible pieces. Feed those to a computer. Win!
• If you know anything about neuroscience or linguistics, you are LAUGHING right now.
• We don’t even fully understand how HUMANS learn and process all this.
• This seriously limits our ability to teach it… especially to a computer! (Remember: Computers Are Not Smart.)
• That said… there are some relationships (linguistic ones, for example) that we understand well enough to formulate in a reasonably computer-friendly way. And it does help.
machine learning • Very very VERY impressionistically: “throw a crapton of data at a computer and let it find patterns.”
• This approach fuels a lot of things in our information environment that we rarely think about: recommender engines, personalization of (news and social) media, etc.
• One thing it’s important to know: for ML to be useful, its designers have to decide up-front what their goal for it is, also known as what the model is optimizing for. (There’s a sketch of this in code below.)
• This choice can have a lot of corrosive repercussions.
• For example, most social-media sites optimize their what-to-show-users algorithms for “engagement.” In practice, this means optimizing for anger and invidious social comparison and division. This has been obviously Not Great for society at large.
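To make the “optimizing for” point concrete, here is a minimal Python sketch. All the names and numbers are hypothetical (this is nobody’s real ranking code); it only shows how the choice of objective shapes what a feed-ranking function surfaces.

```python
# Minimal sketch (hypothetical names/numbers): the objective an ML system
# optimizes for is a design choice its builders make up front.
from dataclasses import dataclass

@dataclass
class Post:
    text: str
    predicted_clicks: float    # the model's guess at engagement
    predicted_accuracy: float  # the model's guess at factual quality

def rank_for_engagement(posts):
    # Optimizing purely for engagement: accuracy never enters the score.
    return sorted(posts, key=lambda p: p.predicted_clicks, reverse=True)

def rank_for_quality(posts):
    # A different (equally hypothetical) objective that blends in accuracy.
    return sorted(posts,
                  key=lambda p: 0.5 * p.predicted_clicks + 0.5 * p.predicted_accuracy,
                  reverse=True)

posts = [
    Post("calm, accurate explainer", predicted_clicks=0.2, predicted_accuracy=0.9),
    Post("outrage bait", predicted_clicks=0.9, predicted_accuracy=0.1),
]
print([p.text for p in rank_for_engagement(posts)])  # outrage bait floats to the top
print([p.text for p in rank_for_quality(posts)])     # the explainer wins instead
```

Same data, same model outputs; only the objective changed. That up-front choice is where a lot of the corrosion gets in.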
But let’s start with the innocuous: spam classification • A CLASSIC ML problem.
• Spam email reached upwards of 90% of all email sent, and it got there really quickly, starting in the 1990s.
• “Blackhole lists” of spamming servers only helped a little.
• It just wasn’t hard to set up a new email server or relay.
• Enter ML!
• Separate a bunch of email into spam and not-spam (“ham”). Feed both sets of email into an ML algorithm (usually Bayesian, for spam). (A toy version of this is sketched below.)
• When new email comes in, ask the algorithm “is this more like the spam or the ham?” When its answer is wrong, correct it (Bayesian algos can learn over time).
• Eventually it gets enough practice to get pretty good at this!
• … Until the spammers start figuring out how to game it (“poisoning”), of course.
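Here is a minimal sketch of that spam/ham workflow. I’m using scikit-learn’s naive Bayes classifier (my choice of library), and the toy emails and labels are made up purely for illustration; real filters train on vastly more email and richer features.

```python
# Toy Bayesian spam filtering with scikit-learn (illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

spam = ["WIN a FREE prize now", "cheap meds no prescription"]
ham  = ["meeting moved to 3pm", "draft of the report attached"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(spam + ham)   # word counts as features
y = [1, 1, 0, 0]                           # 1 = spam, 0 = ham

model = MultinomialNB()
model.fit(X, y)

# "Is this new message more like the spam or the ham?"
new_email = ["free prize if you reply now"]
print(model.predict(vectorizer.transform(new_email)))   # likely [1], i.e. spam

# When the filter gets one wrong, feed the corrected label back in;
# Bayesian models can keep learning incrementally.
correction = vectorizer.transform(["quarterly prize committee meeting"])
model.partial_fit(correction, [0])
```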
That’s more or less how supervised ML works. • Figure out your question — what you want your ML model to tell apart, whether it’s spam/ham or birds in photos.
• Prediction, here, is just another classification problem. “Graduate or dropout?”
• Get a bunch of relevant data, ideally representative of what the question looks like in the Real World™.
• “Ideally.” An awful lot of ML models fall down out of the gate because the choice of initial data wasn’t representative, didn’t include everything, didn’t consider biases, or or or…
• Set some of the data (chosen randomly) aside for later.
• Training the model: Classify the rest of the data, the way you want the computer to. Feed these classifications into the ML algorithm.
• Testing the model: Ask the computer to classify the data you set aside. If it does this well, the model… seems… pretty okay.
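A minimal sketch of that whole supervised workflow, again with scikit-learn. The “graduate or dropout?” features and numbers here are invented for illustration, not real student data.

```python
# Supervised ML in miniature: label data, hold some out at random,
# train on the rest, then test on the held-out portion.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical "graduate or dropout?" features: [GPA, credits per term]
X = [[3.8, 15], [2.1, 9], [3.2, 12], [1.9, 6], [3.5, 14], [2.4, 8]]
y = [1, 0, 1, 0, 1, 0]   # 1 = graduated, 0 = dropped out

# Set some of the data (chosen randomly) aside for later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Training: the human-made labels go into the learning algorithm.
model = LogisticRegression()
model.fit(X_train, y_train)

# Testing: ask the model to classify the data it never saw.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Notice that everything hinges on the labels and on the starting data being representative; the code can’t fix a bad initial sample.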
Today: phishing and ML • Bayesian classifiers to date haven’t been able to deal with malicious email (there are several kinds of it).
• I’ve been seeing other ML approaches tried. I haven’t seen one succeed, as yet.
• Why hasn’t it worked?
• Reason 1: unlike spam, malicious email often deliberately imitates regular email.
• Reason 2: what malicious email wants people to do (click on links, pay invoices, send a reply…) is also what plenty of regular email wants people to do!
• So, basically, there may not be enough differences between malicious and regular email for an ML model to pick up on!
• I tell you this so that you don’t overvalue ML approaches. There are definitely problems ML can’t solve.
Unsupervised ML • You don’t have to classify data up-front, though!
• You can turn ML algorithms loose on a pile of data to find whatever patterns they can (often “similarity clusters”). This is unsupervised ML. (There’s a toy clustering example after this list.)
• Large Language Models (LLMs) are unsupervised ML, because there’s just no way to fruitfully supervise a model that big. Too many angles to sort on! Plus producing language is only glancingly a sorting problem.
• Instead, LLM creators train their models on a ton of text (without asking any of the owners of copyrighted texts first), then pay pennies to developing-world people to “fine-tune” the model — that is, clean up the worst-looking messes (including hate and bias) afterwards.
• (If you think I think this is unethical… gold star, you are learning how I think.)
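Back to the “similarity clusters” idea: here is a minimal sketch of unsupervised clustering with scikit-learn, on toy documents I invented for illustration. (LLMs are unsupervised ML at a wildly larger scale; this only shows the “no labels up front” idea.)

```python
# Unsupervised ML in miniature: no labels, just "find whatever patterns
# you can". Here, k-means clustering over TF-IDF word vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "budget report for the library board",
    "quarterly budget numbers attached",
    "kitten adoption event this weekend",
    "photos from the kitten adoption drive",
]

X = TfidfVectorizer().fit_transform(docs)

# Ask for two clusters; nobody tells the algorithm what they "mean".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0, 0, 1, 1]: budget emails vs. kitten emails
                # (the cluster numbers themselves are arbitrary)
```

The algorithm finds groupings; deciding what the groupings mean, and whether they’re any good, is still a human job.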
Decisions that aren’t binary (or n-ary) • Not everything in life is reducible to a finite set of options.
• This doesn’t stop people trying. “Facial emotion classifiers” have been epic fail. Emotionality and its expression (which are two different things) aren’t that simple.
• Take human language (please). We can say infinite things in infinite combinations! And still (mostly) understand one another, which is pretty miraculous really!
• Can we get a computer to understand us? Talk with us?
• Well, it depends on what we mean by “understand” exactly… what are some of the possibilities?
• Research field: Natural Language Processing (NLP)
Automated language generators before ChatGPT • They did pretty well on routine fill-in-the-blank-style writing (e.g. basic sports reporting) and interaction (e.g. some tech support).
• In a conversation, they sometimes fooled people who weren’t expecting them.
• This dates back to the 1970s and ELIZA, though. We’re trusting, not to say gullible.
• They were easy to throw off-course or game, because they didn’t understand what they were saying.
• Often got trained on Internet speech… which is (to say the least) problematic along several different axes.
replacing human labor • Y’all who are using ChatGPT to slide by in your coursework and professional work need to ask yourselves why anybody would hire you if AI can do big chunks of your job, cheaper.
• I’m gonna be blunt about this too: YOU ALSO NEED TO CHECK YOUR ETHICS. Altman is slime incarnate and ChatGPT is an ethics morass.
• Plus, we all know what happens to people whose labor is replaced under a winner-take-all capitalist system, right? Yeah. We do. Are you willing to do this to writers, artists, actors? CHECK. YOUR. ETHICS.
• This worry showed up in early 21st-c. librarianship as worry over web search engines. There’s some… loopy… stuff from the time in LISTA, if you look.
• And search engines did change (e.g.) reference work! But they didn’t destroy it.
• And I’m starting to think a Return to the Librarians is imminent. I’ll explain.
ChatGPT cannot and does not understand the world. • It does not “understand” anything. It can’t.
• Read Dr. Emily Bender on “stochastic parrots.”
• It is not trained on exclusively correct or true text.
• Like, how would you ever construct a training set to be exclusively correct and true? It’s just not possible. “As close as we can get” is a job for human judgment (and even for us this isn’t easy), which computers don’t have.
• (Don’t come at me with “peer review!” or “editors!” either. I have the lecture on peer review suckage from old!LIS658 and I will happily deploy it.)
• It is not trying to produce a correct answer; it cannot evaluate correctness. It aims at a PLAUSIBLE-SOUNDING answer.
• Plausible-but-wrong is totally a thing! So is plausible-but- completely-made-up!
Example: “hallucinated” citations • A lot of my reference-librarian friends have lately found themselves looking up citations to works that don’t exist, in response to questions from patrons.
• Invariably, the citations came from ChatGPT and a gullible patron…
• … who didn’t understand that ChatGPT doesn’t know what a citation is for (only what it looks like and where it goes) and is perfectly content to make 💩 up.
• Not a few of those gullible patrons have had Ph.D.s. This isn’t about intelligence; it’s about gullibility and (wilful?) cluelessness.
• I expect you to know better and do better. And to prefer accuracy to LLM glibness.
ChatGPT and the web • Web advertising isn’t sold or priced according to the value or accuracy of the content it’s next to. The only important thing is whether it can attract eyeballs and clicks.
• One minor exception: sometimes “brand safety” can get certain advertisers to pull ads from the worst of the worst. Plenty of advertisers don’t care, however.
• There’s also a CRAPTON of straight-up fraud in web advertising. I don’t have time to get into this, but if you’re curious, I can definitely point you to explainers.
• ChatGPT can generate tons of plausible-sounding sludge. Accurate? Helpful? Who CARES. If people see it, it sells ads.
• We can expect the web to become a sea of low-value, low- accuracy LLM-generated sludge. It’s already happening.
Search, SEO, and sludge • Search-engine optimization (SEO) used to be about making a good, useful, accessible website.
• Now it’s about trying to surface from the sludge… which means a lot of sketchy web-writing practices, these days.
• Reading about modern SEO both weirds me out and makes me furious.
• It doesn’t MATTER whether a site is good for people! It has to be good for Google!
• Google used to guard its index pretty strongly. Low-quality websites, black-hat SEO, deceptive design: expect your site to be demoted in Google results or even kicked out.
• Google is not guarding its index much today. Nor is the Common Crawl (which Google competitors often use).
Why not guard the index? • Reason 1: ChatGPT and its ilk absolutely can create sludge that they can’t reliably detect. So automatedly de-sludging a search-engine index is hard and may be impossible.
• This is why universities are turning off AI-detection tools like Turnitin’s. They just don’t work.
• A number of instructors (not in the iSchool!) have embarrassed themselves and harmed students by taking AI-detection tools way too seriously.
• Reason 2: Google’s the biggest advertiser on the entire web. Sludge websites make Google ad money! Google has a CONFLICT OF INTEREST, and its search engine is losing.
and where they fail • Limit searches to sites that are still mostly human.
• This is the Reddit/StackExchange strategy. It’s a perfectly sensible strategy!
• Problem: several such sites, Reddit particularly, are rapidly en💩ifying (to borrow a Cory Doctorow coinage) and shedding human users as a result.
• Others, e.g. StackExchange and even Wikipedia, are giving in to the LLM craze and ensludgifying.
• Similarly: rely on curated information aggregations
• … like library databases! This is opportunity knocking, folks!
• Stop searching the open web. It ain’t reliable.
• This makes me sad and angry. I love the web and this is an awful way for it to die.
• But I’d be lying if I said I don’t see opportunity in it for librarianship.
Another reason to expect desludgification • General-purpose LLMs are trained on the open web, on the assumption that the web is written by human beings.
• As LLM sludge proliferates on the web, LLM builders will have to train LLMs on… output from prior LLMs.
• Because, again, ML can’t reliably tell the difference between human writing and LLM sludge!
• Early days of research, but so far, it appears that LLMs trained on LLM-generated sludge get very bad very fast.
• So even LLM sludgemeisters will have incentive to:
• develop better tools for separating human writing from LLM sludge
• not let themselves be used for the further ensludgification of the web