Slide 1

Slide 1 text

An Analysis of A.I. Content Detectors
Ashima Kamra, Wellesley College ‘26
Hannah E. Rose, MIT ‘25

Slide 2

Slide 2 text

Contents
1. Introduction
2. Research Question
3. Literature Review
4. Methodology
5. Findings
6. Implications

Slide 3

Slide 3 text

Introduction & Context
● Generative AI tools are becoming commonplace in schools
○ ChatGPT, Bard, Grammarly, etc.
● Questions are emerging:
○ Is it plagiarism to use generative AI tools?
○ Are there ways to detect this AI content?
○ How can teachers combat this?
Rosenblatt, Kalhan. “ChatGPT banned from New York City public schools' devices and networks.” NBC News

Slide 4

Slide 4 text

Introduction & Context
Increasing restrictions
● NYC public schools are banning generative AI
● Universities are struggling to develop fair policies
● Teachers are turning to publicized AI detectors
○ Turnitin, GPTZero, Winston AI, writer.com, etc.
● Questions are emerging:
○ Can teachers tell when content is AI generated?
○ Are these tools reliable?
Fowler, Geoffrey A., and Gerald Loeb. “We tested Turnitin's ChatGPT-detector for teachers. It got some wrong.” The Washington Post, 3 April 2023

Slide 5

Slide 5 text

How accurate are A.I. content detectors?

Slide 6

Slide 6 text

Literature Review
● Sought out existing research and sentiments on detectors
● Regularly encountered reports of detectors misidentifying a text’s source and returning false positives or negatives
● Liang et al. identified bias against non-native English speakers, attributed to the linguistic simplicity of their writing
Alimardani, Armin, and Emma A. Jane. “We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling.” The Conversation, 19 February 2023
Liang et al. “GPT detectors are biased against non-native English writers,” Patterns, vol. 4, issue 7, 2023.

Slide 7

Slide 7 text

Literature Review

Statistical Outlier Detectors
● Most common type available
● Calculate the differences between a text’s characteristics
● Burstiness ⇒ consistency in style and tone through the text
● Perplexity ⇒ compare the predicted word with another AI’s generated word

Text Classifiers
● LLMs trained on AI and human text to distinguish between the two
● OpenAI Text Classifier taken down as of July 20th due to “its low rate of accuracy”

Watermarking Algorithms
● Take advantage of hidden watermarks (combinations of words) embedded in AI-generated text
● Use these to distinguish between human-written and generated content
● Still experimental (Kirchenbauer et al.)

Krishna et al. “Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense,” arXiv, 2023. https://doi.org/10.48550/arXiv.2303.13408
Kirchenbauer et al. “A watermark for large language models,” arXiv, 2023. https://doi.org/10.48550/arXiv.2301.10226
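The perplexity and burstiness metrics above can be sketched in a few lines of Python. This is a toy illustration, not any detector's actual implementation: the per-token probabilities would come from a real language model, and the exact formulas vendors use are not public.

```python
import math
import statistics

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability.

    Text a language model finds predictable (high per-token
    probabilities) scores low; surprising text scores high.
    """
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def burstiness(sentence_perplexities):
    """One common reading of burstiness: how much perplexity varies
    from sentence to sentence. Uniformly 'flat' text (low variation)
    is treated as a hint of AI generation."""
    return statistics.pstdev(sentence_perplexities)

# Hypothetical model probabilities for two short texts:
predictable = [0.9, 0.8, 0.85, 0.9]   # reads like typical LLM output
surprising = [0.9, 0.05, 0.6, 0.1]    # unusual word choices

print(perplexity(predictable) < perplexity(surprising))  # True
```

A detector of this family would compare both numbers against thresholds calibrated on known human and AI writing; the thresholds, like the probabilities above, are assumptions here.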

Slide 11

Slide 11 text

Methodology - Hannah
● Testing human identification vs. AI detector identification
○ Tools: ZeroGPT and Writer.com
● Provided professors with a Google Form
○ Each prompt had 5 responses
○ 0-2 responses were AI-generated
● Provided professors and detection tools with the same content

Slide 12

Slide 12 text

Methodology
Gathering Inputs
● Sent a Google Form with the following prompts to several classmates:
○ What is something on your bucket list?
○ What are some sports in the Winter Olympics?
○ What was the cause of World War II?
○ Give me a basic explanation of how soccer is played.
○ What is your favorite food?
○ What are some uses of generative AI for students?

Slide 13

Slide 13 text

Methodology
Creating the Evaluation Form
● 5 choices per prompt: 3-5 human responses, 0-2 AI responses
● Asked participants to identify the AI-generated options
● Each prompt is worth two points:
○ Fully correct (selecting all correct options, or selecting nothing when there is no correct option): +2 points
○ Partially correct (selecting 1 correct option when there are 2, or selecting 1 correct and 1 incorrect when there is 1 correct option): +1 point
○ Incorrect (selecting only incorrect options, or selecting anything when there is no correct option): +0 points
● Total points possible: 12
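The rubric above reduces to a small set comparison. This sketch is my restatement of the scoring rules, not code from the study; the option labels are made up:

```python
def score_prompt(selected, correct):
    """Score one prompt on the 2-point rubric.

    selected: options the participant flagged as AI-generated
    correct:  options that actually were AI-generated
    """
    selected, correct = set(selected), set(correct)
    if selected == correct:   # all correct picks, or correctly picked nothing
        return 2
    if selected & correct:    # at least one correct pick, but not an exact match
        return 1
    return 0                  # no correct picks

# Six prompts, two points each: 12 points possible.
total = sum(score_prompt(s, c) for s, c in [
    (["B"], ["B"]),         # fully correct         -> 2
    (["B"], ["B", "D"]),    # partially correct     -> 1
    (["A"], []),            # no AI option existed  -> 0
    ([], []),               # correctly picked none -> 2
    (["A", "C"], ["C"]),    # 1 correct, 1 wrong    -> 1
    (["A"], ["B"]),         # wrong option          -> 0
])
print(total, "/ 12")  # 6 / 12
```

Treating "correctly selecting nothing" as a set equality with the empty set is the one design choice here; it makes the +2 case a single comparison.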

Slide 14

Slide 14 text

Methodology
● Choose the AI-generated options
● Share reasoning
● ChatGPT loves Europe!
Scan to take the quiz yourself! Scan to view AI vs. human inputs in a spreadsheet!

Slide 15

Slide 15 text

Methodology
AI Evaluation
● For simplicity, I filled out the form as the AI using its choices
○ Anything over 60% flagged as AI counted as a selected option
● Same scoring criteria as before
● Issues:
○ Responses were too short for ZeroGPT
○ Solution: I chose detectors that specifically highlight AI-generated content; I tested them in different orders and the tools regularly flagged the same options
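The 60% cutoff described above amounts to a simple threshold filter over the detector's per-option percentages. The option labels and scores below are hypothetical:

```python
def flag_as_selected(detector_scores, threshold=60.0):
    """Treat an option as 'selected by the AI' when the detector
    marks more than `threshold` percent of it as AI-generated."""
    return {option for option, pct in detector_scores.items() if pct > threshold}

# Hypothetical per-option percentages from one detector run:
scores = {"A": 85.2, "B": 12.0, "C": 61.5, "D": 60.0, "E": 3.4}
print(sorted(flag_as_selected(scores)))  # ['A', 'C']  (exactly 60.0 is not "over 60%")
```

The strict `>` comparison follows the slide's wording ("anything over 60%"); whether borderline 60.0% cases ever occurred in practice is not stated.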

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Human Findings
Professor #2: 9/12 (75%), translates to C
Interesting reasoning: “Last one sounds like someone trying to make AI sound non-AI.”
Professor #1: 5/12 (41.7%), translates to F
Interesting reasoning: “The AI spoke in a very monotone tone… other articles showed more emotion using exclamation points and nicknames for positions.”
Professor #3: 3/12 (25%), translates to F
Interesting reasoning: “Favorite and ‘quite like’ seem odd Q&A pairs. #5 is very long & over detailed.”

Slide 18

Slide 18 text

AI Findings
GPTZero: 3/12 (25%), translates to F
Interesting result: flagged every “Olympic Sport” input as AI
Writer.com: 1/12 (8.3%), translates to F
Interesting result: flagged almost every food answer
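The score-to-letter translation used in these findings can be reproduced with a standard 90/80/70/60 grade scale; the cutoffs are my assumption, since the slides only show the resulting letters:

```python
def to_percent_and_grade(points, total=12):
    """Convert a rubric score to a percentage and letter grade
    (assumed standard scale: A >= 90, B >= 80, C >= 70, D >= 60)."""
    pct = round(100 * points / total, 1)
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if pct >= cutoff:
            return pct, grade
    return pct, "F"

print(to_percent_and_grade(9))  # (75.0, 'C')  e.g. the best human score
print(to_percent_and_grade(3))  # (25.0, 'F')  e.g. GPTZero's score
```

Any scale with a C band covering 75% and an F band below 60% reproduces the reported letters, so the exact cutoffs don't affect the conclusions.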

Slide 19

Slide 19 text

Overall
● Humans were more accurate, but neither achieved a passing score
● Differences:
○ Humans noticed more of the context clues (e.g. “I quite like tacos” did not flow naturally as an answer)
○ The detectors reviewed only the writing itself, even when more context was needed
● Neither is trustworthy enough to use in a classroom

Slide 20

Slide 20 text

Methodology - Ashi
● Testing GPTZero, Originality.ai, and Winston AI
● 7 human responses, 7 mixed-source, 3 A.I. (ChatGPT, Claude, Bard)
● Prompt constraints:
○ 200 words
○ non-personal/sensitive topic
○ English prose

Slide 21

Slide 21 text

Methodology - Ashi
“Expand upon the following paragraph to generate a 200-word paragraph summarizing and analyzing this article about the Chinese education ministry’s proposition to redesign school P.E. classes. What was the motivation behind this proposal? What was the public response? Are any arguments made or presented by the article?”

Slide 22

Slide 22 text

Methodology - Ashi
To generate the mixed-source responses:
“Using this paragraph - [first 100 words from human sample] - make a 200-word response summarizing and analyzing this article about the Chinese education ministry’s proposition to redesign school P.E. classes. What was the motivation behind this proposal? What was the public response? Are any arguments made or presented by the article? Use the paragraph excerpt above word-for-word as the first half of your response.”

Slide 23

Slide 23 text

Findings - Ashi
● Directly comparing the 3 detectors is tricky
● They used different scales to report their findings, some using hard numbers and others qualitative descriptions
● More detail is available during discussion or in the visual essay:
○ https://ashi-kamra.github.io/AIContentDetectorAnalysis/

Slide 24

Slide 24 text

GPTZero Detector Results
[Chart: Number of Responses by Amount of Human Involvement; legend distinguishes less accurate vs. more accurate classifications]

Slide 25

Slide 25 text

Winston AI Detector Results
[Chart: Number of Responses by Amount of Human Involvement; legend distinguishes less accurate vs. more accurate classifications]

Slide 26

Slide 26 text

Originality.ai Detector Results
[Chart: Number of Responses by Amount of Human Involvement; legend distinguishes less accurate vs. more accurate classifications]

Slide 27

Slide 27 text

Findings - Ashi
● There is a level of inaccuracy and obscurity that can’t be overlooked
● The detectors do give a solid general idea of AI usage across a large group of samples
● Confident, accurate classification of any one writing sample is not possible

Slide 28

Slide 28 text

Implications
● Teachers likely won’t have the time, energy, or background knowledge to use these tools safely and effectively, or the skills to identify AI themselves
○ “Makes about as much sense in the long run as trying to ban calculators.” - Executives at Turnitin
● We must determine how to maintain meaningful learning
● Using these technologies is a matter of enhancing learning and supplementing the existing process
Alimardani, Armin, and Emma A. Jane. “We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling.” The Conversation, 19 February 2023

Slide 29

Slide 29 text

Implications
“If we expect students to act with integrity, then we as educators have to act with integrity and model that behavior.”
“Deceptive assessment using tools and technologies without students’ knowledge ahead of time is not modeling integrity.”
Wilhelm, Ian. “Nobody Wins in an Academic-Integrity Arms Race,” Chronicle of Higher Education, 12 June 2023.

Slide 30

Slide 30 text

Thank you!
Supervisor: Melissa Webster
Researchers: Liora Jones, Natasha Khoso, Michelle Lee, Shannon Li
All Research Participants
Generative AI Experts: Ethan Mollick, Emon Shahrier
Snacks and Support: Jess Lipsey
Photography: Anabelle Lipsey
Hannah’s Proofreaders: Sandra Rose, Natalee Rose, Sasha Sherstnev