The Imperfection of AI Detection Tools

[Image: Magnifying glass with focus on glass]

As AI-generated content continues to grow and develop, so too do the issues that come with it. For academic workers, one of the key challenges is distinguishing AI-generated content from human-generated content. As the use of AI tools becomes more widespread, educators are seeing a growing number of assignments and submissions that may include, or be entirely composed of, AI-generated content. This has led to increasing demand for AI content detectors within higher education.

AI content detectors are software tools that scan and analyze text to determine whether it was generated by an AI writing tool. Today, they play a prominent role in upholding academic integrity, providing fair assessments, and fostering genuine learning experiences by helping educators identify content that may have been generated by AI or borrowed without proper citation. AI content detectors look for patterns considered indicative of AI generation, including abnormally repetitive terms and phrases, nonsensical sentences or clauses, an overly uniform formal or informal tone, a suspicious lack of emotional nuance or personalization, and other technical markers; a toy example of one such surface check appears below.
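As an illustration of what one of these surface-level pattern checks might look like, the sketch below flags repetitive phrasing by counting repeated word trigrams. This is a simplified, hypothetical heuristic for exposition, not any vendor's actual algorithm.

```python
import re
from collections import Counter

def repeated_phrase_rate(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that occur more than once.

    A high rate suggests repetitive phrasing, one of the weak
    surface markers detectors associate with machine-generated text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(1 for g in ngrams if counts[g] > 1) / len(ngrams)

print(repeated_phrase_rate(
    "In conclusion, it is important to note that it is important to "
    "consider the implications. It is important to note the risks."
))  # repetitive boilerplate scores well above 0
```

No single marker like this is conclusive; commercial tools combine many such signals, which is part of why their verdicts are probabilistic rather than proof.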

Many AI-detection tools work by assessing perplexity, a measurement of the "unpredictability" of a sequence of language in a text. Lower perplexity is considered evidence of AI generation because, compared with human writers, AI tends to make the most "obvious" or most common language choices. Burstiness, the variation in sentence structure and length, is another factor: AI models tend to produce less varied sentence lengths and structures than typical human writing, so a particularly low burstiness score indicates that a text is likely AI-generated. Both signals can easily bias results against non-native English speakers, and both can be easily defeated through prompt engineering.
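To make these two signals concrete, here is a minimal sketch of how they might be computed, assuming the Hugging Face transformers library and the small GPT-2 model as a stand-in scorer; real detectors use their own models and undisclosed thresholds.

```python
# A minimal sketch of the two measurements. Assumes: pip install torch transformers.
import math
import re

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean negative log-likelihood per token) under GPT-2.

    Lower values mean the model found the text predictable,
    which detectors read as a sign of AI generation.
    """
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over the sequence.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return math.exp(loss.item())

def burstiness(text: str) -> float:
    """Standard deviation of sentence lengths, in words.

    Near-zero means every sentence is about the same length,
    a pattern associated with machine-generated prose.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return 0.0
    mean = sum(lengths) / len(lengths)
    return (sum((n - mean) ** 2 for n in lengths) / len(lengths)) ** 0.5
```

Low perplexity combined with low burstiness would push a heuristic like this toward an "AI-generated" verdict, which is exactly why formulaic but entirely human prose, including much writing by non-native English speakers, gets misclassified, and why a prompt asking the model for varied, informal sentences can slip past detection.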

Use with Caution

While the demand for AI detection tools is strong, these tools are deeply flawed and require careful consideration before use. Their unreliability has led major institutions to reject them outright. UCLA, for instance, declined to adopt Turnitin’s AI detection software, citing “concerns and unanswered questions” about accuracy and false positives—a decision mirrored by many UC campuses and institutions nationwide.

Even OpenAI, the company behind ChatGPT, shuttered its own AI detector due to poor performance. The tool correctly identified only 26% of AI-written text while falsely flagging 9% of human writing as AI-generated. The failures extend beyond statistics: AI detectors have incorrectly accused innocent students and even labeled the U.S. Constitution as 100% AI-written.

National organizations have taken notice. The MLA-CCCC Joint Task Force on Writing and AI urged educators to “focus on approaches to academic integrity that support students rather than punish them” and cautioned against detection tools, noting that “false accusations” may “disproportionately affect marginalized groups.”

Recent studies consistently demonstrate these tools’ inadequacy. They’re remarkably easy to fool—one study found that while detectors identified ChatGPT text with 74% accuracy, this plummeted to 42% when students made minor tweaks to the generated content. As AI models improve, distinguishing machine-generated text from well-written human prose will only become harder.

Testing by Times Higher Education confirmed this unreliability. Through simple prompt engineering—asking ChatGPT to write like a teenager—researchers reduced Turnitin’s detection rate from 100% to 0%. When they had ChatGPT “improve” genuinely human-written academic work to sound more scholarly, Turnitin failed to detect any AI involvement.

The bias problem is even more troubling. Stanford researchers discovered that while detectors were “near-perfect” with essays by U.S.-born eighth-graders, they misclassified over 61% of essays written by non-native English speakers as AI-generated. Shockingly, 97% of these TOEFL essays were flagged by at least one detector.

Ethical and Data Privacy Issues

Beyond accuracy issues lie serious ethical questions. When student work is uploaded to ChatGPT or commercial detection tools, what happens to that data? Do we need student permission? Are we violating FERPA protections?

Perhaps most ironically, current AI detection software relies on older AI models as its detection mechanism—raising the fundamental question of whether we should be using AI to catch AI.

Summary

With AI detection tools, we recommend you proceed with caution. A detector's report should never be your only evidence, and we suggest you address concerns about AI use with the student directly, as suggested by TLC's "Guidance for Addressing Suspected AI Misconduct."


Further Reading

Dalalah, D., & Dalalah, O. M. A. (2023). The false positives and false negatives of generative AI detection tools in education and academic research: The case of ChatGPT. The International Journal of Management Education, 21(2), Article 100822. https://doi.org/10.1016/j.ijme.2023.100822

Elkhatat, A. M., Elsaid, K., & Almeer, S. (2023). Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19, Article 17. https://doi.org/10.1007/s40979-023-00140-5

Guan, Q., & Han, Y. (2025). From AI to authorship: Exploring the use of LLM detection tools for calling on “originality” of students in academic environments. Innovations in Education and Teaching International, 62(5), 1514–1528. https://doi.org/10.1080/14703297.2025.2511062

Kim, H. N. (2025). Detecting the Undetectable: The Need for a New Paradigm for Academic Writing Evaluation in the AI Era – Addressing Inconsistencies in AI and Plagiarism Detection Tools. In C. Stephanidis, S. Ntoa, M. Antona, & G. Salvendy (Eds.), HCI International 2025 Posters (pp. 229–239). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-94153-5_22

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns (New York, N.Y.), 4(7), Article 100779. https://doi.org/10.1016/j.patter.2023.100779

Pratama, A. R. (2025). The accuracy-bias trade-offs in AI text detection tools and their impact on fairness in scholarly publication. PeerJ. Computer Science, 11, Article e2953. https://doi.org/10.7717/peerj-cs.2953

Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2024). Can AI-Generated Text be Reliably Detected? https://doi.org/10.48550/arXiv.2303.11156

Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola, O., Šigut, P., & Waddington, L. (2023). Testing of detection tools for AI-generated text. International Journal for Educational Integrity, 19(1), Article 26. https://doi.org/10.1007/s40979-023-00146-z


Main image used under the Creative Commons Attribution-Share Alike 3.0 Unported license.
Source: https://commons.wikimedia.org/wiki/File:Magnifying_glass_with_focus_on_glass.png