in schools
  ◦ ChatGPT, Bard, Grammarly, etc.
• Questions are emerging:
  ◦ Is it plagiarism to use generative AI tools?
  ◦ Are there ways to detect this AI content?
  ◦ How can teachers combat this?
Rosenblatt, Kalhan. “ChatGPT banned from New York City public schools' devices and networks.” NBC News
• Universities are struggling to develop fair policies
• Teachers are turning to publicized AI detectors
  ◦ Turnitin, GPTZero, Winston AI, writer.com, etc.
• Questions are emerging:
  ◦ Can teachers tell when content is AI generated?
  ◦ Are these tools reliable?
Fowler, Geoffrey A., and Gerald Loeb. “We tested Turnitin's ChatGPT-detector for teachers. It got some wrong.” The Washington Post, 3 April 2023
Introduction & Context
detectors
• Regularly encountered reports of detectors misidentifying the source of a piece of writing, returning false negatives or false positives.
• Liang et al. found that detectors are biased against non-native English speakers because of the linguistic simplicity of their writing.
Alimardani, Armin, and Emma A. Jane. “We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling.” The Conversation, 19 February 2023
Liang et al. “GPT detectors are biased against non-native English writers,” Patterns, vol. 4, issue 7, 2023,
Most common type available
• Calculating the differences between a text’s characteristics
• Burstiness ⇒ consistency in style and tone through the text
• Perplexity ⇒ compare the word a language model predicts next with the word that actually appears in the text
• LLMs trained on AI and human text to be able to distinguish between the two.
• OpenAI Text Classifier taken down as of July 20th, 2023 due to “its low rate of accuracy.”
• Take advantage of hidden watermarks (or combinations of words) that exist in AI-generated text
  ◦ Use these to distinguish between human-written and generated content.
  ◦ Still experimental (Kirchenbauer et al.)
Krishna et al. “Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense,” Cornell University, 2023. https://doi.org/10.48550/arXiv.2303.13408
Kirchenbauer et al. “A watermark for large language models,” Cornell University, 2023. https://doi.org/10.48550/arXiv.2301.10226
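A toy example can make perplexity and burstiness more concrete. The sketch below is only an illustration and not how any named detector works. It assumes a Laplace-smoothed unigram model built from a small reference text (real detectors score text with a large language model), computes perplexity as the exponential of the average negative log-probability per word, and assumes burstiness can be measured as the standard deviation of per-sentence perplexity. All function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter
from statistics import pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_unigram_model(reference_text):
    """Return a function giving a Laplace-smoothed word probability.

    Assumption: a unigram model stands in for the language model
    a real detector would use.
    """
    counts = Counter(tokenize(reference_text))
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability for unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def perplexity(text, prob):
    """Exponential of the average negative log-probability per word."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    avg_neg_log = -sum(math.log(prob(w)) for w in words) / len(words)
    return math.exp(avg_neg_log)


def burstiness(text, prob):
    """Standard deviation of per-sentence perplexity (assumed definition)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        raise ValueError("text contains no sentences")
    return pstdev(perplexity(s, prob) for s in sentences)


if __name__ == "__main__":
    # Hypothetical reference text and samples, for illustration only.
    prob = build_unigram_model("the cat sat on the mat. the dog sat on the rug.")
    print(perplexity("the cat sat on the rug", prob))
    print(burstiness("The cat sat. Quantum llamas juggle.", prob))
```

Under these assumptions, text that the model finds predictable gets a low perplexity, and text whose sentences are all similarly predictable gets a low burstiness. Detectors of this type treat low values of both as a sign of generated text.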
identification
  ◦ Tools: ZeroGPT and writer.com
• Provided professors with a Google Form
  ◦ Each prompt had 5 responses
  ◦ 0-2 responses were AI
• Provided professors and detection tools with the same content
the following prompts to several classmates
  ◦ What is something on your bucket list?
  ◦ What are some sports in the Winter Olympics?
  ◦ What was the cause of World War II?
  ◦ Give me a basic explanation of how soccer is played.
  ◦ What is your favorite food?
  ◦ What are some uses of generative AI for students?
• Asked participants to identify the AI-generated options
• Each prompt is worth two points
  ◦ Fully correct answer (i.e. selecting all correct options, or selecting nothing when there is no correct option): +2 points
  ◦ Partially correct answer (i.e. selecting 1 correct option when there are 2 correct options, or selecting 1 correct and 1 incorrect option when there is 1 correct option): +1 point
  ◦ Incorrect answer (i.e. selecting anything when there is no correct option, or selecting only incorrect options): +0 points
Total points possible: 12
3-5 human responses
0-2 AI responses
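The scoring rubric can be expressed as a small function. This is a minimal sketch, not the script used in the study: the function names and the sample answers are hypothetical, and any combination the rubric does not mention (for example, selecting nothing when AI responses are present) is assumed to score 0.

```python
def score_prompt(selected, ai_options):
    """Score one prompt (0, 1, or 2 points) from two sets of option numbers.

    selected   -- options the participant marked as AI-generated
    ai_options -- options that really were AI-generated (0 to 2 of them)
    """
    selected, ai_options = set(selected), set(ai_options)
    if selected == ai_options:
        return 2  # fully correct, including "nothing selected, nothing AI"

    hits = selected & ai_options    # correctly identified AI responses
    misses = selected - ai_options  # human responses wrongly selected

    # Partial credit: found 1 of 2 AI responses and selected nothing wrong.
    if len(ai_options) == 2 and len(hits) == 1 and not misses:
        return 1
    # Partial credit: found the only AI response plus 1 wrong selection.
    if len(ai_options) == 1 and len(hits) == 1 and len(misses) == 1:
        return 1
    # Assumption: every other combination counts as incorrect.
    return 0


def score_form(answers, answer_key):
    """Total score across all prompts (two points per prompt)."""
    return sum(score_prompt(s, k) for s, k in zip(answers, answer_key))


if __name__ == "__main__":
    # Hypothetical answer key and responses for six prompts.
    key = [{2}, set(), {1, 4}, {3}, set(), {5}]
    answers = [{2}, {1}, {4}, {3, 5}, set(), {1}]
    print(score_form(answers, key), "out of", 2 * len(key))
```

With six prompts, the maximum returned by `score_form` is 12, matching the total above.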
form as the AI using its choices
  ◦ Any option flagged as over 60% AI was considered a selected option
• Same scoring criteria as before
• Issues:
  ◦ Too short for ZeroGPT
  ◦ Solution: I chose detectors that specifically highlighted AI-generated content. I tested them in different orders and the tool regularly flagged the same options.
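The 60% rule can be sketched as a simple threshold over a detector's per-option output. The percentages below are made-up values for illustration, the function name is hypothetical, and "over 60%" is read here as strictly greater than 60.

```python
def flagged_options(ai_percentages, threshold=60.0):
    """Return the option numbers a detector is treated as having selected.

    ai_percentages -- mapping of option number to the detector's
                      "percent AI" score for that option
    threshold      -- scores strictly above this count as selected
    """
    return {option for option, pct in ai_percentages.items() if pct > threshold}


if __name__ == "__main__":
    # Hypothetical detector scores for the five responses to one prompt.
    scores = {1: 12.0, 2: 85.5, 3: 60.0, 4: 3.0, 5: 61.0}
    print(sorted(flagged_options(scores)))  # [2, 5]
```

The resulting set of selected options can then be scored in the same way as a participant's answers.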
Reasoning: “Last one sounds like someone trying to make AI sound non-AI.”
Professor #1: 5/12 (41.7%), translates to: F
Interesting Reasoning: “The AI spoke in a very monotone tone… other articles showed more emotion using exclamation points and nicknames for positions”
Professor #3: 3/12 (25%), translates to: F
Interesting Reasoning: “Favorite and 'quite like' seem odd Q&A pairs. #5 is very long & over detailed”
passing score
• Differences
  ◦ Humans noticed more of the context clues (e.g. “I quite like tacos” did not seem to flow well as an answer)
  ◦ AI detectors reviewed only the writing itself, even when more context was needed
• Neither is trustworthy enough to use in a classroom
a 200-word paragraph summarizing and analyzing this article about the Chinese education ministry’s proposition to redesign school P.E classes. What was the motivation behind this proposal? What was the public response? Are any arguments made or presented by the article?”
paragraph - [first 100 words from human sample] - make a 200 word response to summarizing and analyzing this article about the Chinese education ministry’s proposition to redesign school P.E classes. What was the motivation behind this proposal? What was the public response? Are any arguments made or presented by the article? Use the paragraph excerpt above word-for-word as the first half of your response.
detectors is tricky.
• They used different scales to report their findings:
  ◦ Some used hard numbers, and others used qualitative descriptions.
• Can go into more detail during discussion, or visit the visual essay:
  ◦ https://ashi-kamra.github.io/AIContentDetectorAnalysis/
and obscurity that can’t be overlooked
• Do give a solid general idea about the usage of AI across a large group of samples
• Confident and accurate classification of any one writing sample is not possible.
background knowledge to safely and effectively utilize these tools, or the skills to identify AI themselves
  ◦ “Makes about as much sense in the long run as trying to ban calculators.” - Executives at Turnitin
• Must determine how to maintain meaningful learning.
• Using these technologies is a matter of enhancing learning and supplementing the existing process.
Alimardani, Armin, and Emma A. Jane. “We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling.” The Conversation, 19 February 2023
we as educators have to act with integrity and model that behavior.”
“Deceptive assessment using tools and technologies without students’ knowledge ahead of time is not modeling integrity.”
Wilhelm, Ian. “Nobody Wins in an Academic-Integrity Arms Race,” Chronicle of Higher Education, 12 June 2023.
Michelle Lee, Shannon Li, All Research Participants
Generative AI Experts: Ethan Mollick, Emon Shahrier
Snacks and Support: Jess Lipsey
Photography: Anabelle Lipsey
Hannah’s Proofreaders: Sandra Rose, Natalee Rose, Sasha Sherstnev