AI writing quickly became a powerful tool but also a tremendous threat to education. How accurate are AI detection tools, and how do they work? The PlagiarismCheck.org team has developed innovative AI detection software that reaches 97% accuracy. Our experts gave a behind-the-scenes look at these cutting-edge tools in an online webinar on 13 April 2023.

Head of Product Garrett Baklytsky and Language Analyst Natalie Weiner revealed the features and capabilities of modern AI content detectors for detecting and analyzing data.

The experts also answered questions from participants representing various educational organizations.

So, what findings and takeaways did the PlagiarismCheck team share about how to deal with AI in students’ texts? In particular, what are the gains and limitations of our three AI detection models: Fingerprint, Enhanced Fingerprint, and Perplexity?

Model 1: Fingerprint

Fingerprint built a profile trained on ChatGPT texts and a second profile for a generic ‘human’ writer, then compared each submitted text to both profiles. It determined whether the text more closely resembled ChatGPT or human writing and incorporated the verdict into the similarity report.
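The webinar did not reveal the exact features behind these profiles, but the idea can be sketched as follows. This is a minimal illustration assuming character n-gram frequency profiles compared by cosine similarity; the function names and the tiny stand-in corpora are hypothetical, not PlagiarismCheck’s actual implementation.

```python
from collections import Counter
import math

def ngram_profile(text: str, n: int = 3) -> Counter:
    """Character n-gram frequency profile (hypothetical feature choice)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(p: Counter, q: Counter) -> float:
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# In practice the profiles would be trained offline on large corpora;
# these one-line stand-ins just keep the sketch runnable.
chatgpt_profile = ngram_profile("stand-in corpus of ChatGPT-generated texts")
human_profile = ngram_profile("stand-in corpus of human-written texts")

def fingerprint_verdict(text: str) -> str:
    """Whole-text verdict: which profile does the submission resemble more?"""
    profile = ngram_profile(text)
    ai_score = cosine_similarity(profile, chatgpt_profile)
    human_score = cosine_similarity(profile, human_profile)
    return "Likely AI" if ai_score > human_score else "Likely human"
```

Note that a single whole-text comparison like this can only produce one verdict per document, which is exactly the whole-text limitation listed below.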

Fingerprint for AI: The outcomes

✅ Gains

• a quick production-ready solution
• c. 70% overall model accuracy

❌ Limitations

• a 3-page minimum per text
• only a whole-text verdict
• false positives (human text flagged as AI)

Model 2: Enhanced Fingerprint

Fingerprint 2.0: The outcomes

✅ Gains

• managed to reach over 90% accuracy
• could process shorter texts
• started showing a probability score

❌ Limitations

• still only a whole-text verdict
• required at least 2 pages of text for a reliable result

Model 3: Perplexity

What is perplexity? Predictability is characteristic of machine-generated text; only human text can really surprise you. Perplexity is a metric showing how well a language model predicts an unseen test set. A good language model assigns the highest probability to the actual sequence of words. Perplexity is the inverse probability of the test set, normalized by the number of words: the lower the perplexity, the better the model predicts the next word.

“I always order pizza with cheese and mushrooms”

– High probability, low perplexity.

“She said me this. After wall a chaining”

– Low probability, high perplexity.
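To make the metric concrete, here is a minimal sketch of the standard perplexity calculation, assuming we already have a language model’s per-word probabilities for a sentence. The probability values below are invented purely for illustration.

```python
import math

def perplexity(word_probs: list[float]) -> float:
    """Perplexity = inverse probability of the word sequence,
    normalized (geometric mean) by the number of words N:
        PP = P(w1..wN) ** (-1/N)
    Computed in log space to avoid underflow on long texts."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Invented per-word probabilities for the two example sentences above:
predictable = [0.4, 0.5, 0.6, 0.5, 0.7, 0.3, 0.6, 0.5]   # "I always order pizza..."
surprising = [0.05, 0.01, 0.02, 0.04, 0.01, 0.03, 0.02]  # "She said me this..."

print(perplexity(predictable))  # low perplexity  (~2)
print(perplexity(surprising))   # high perplexity (~46)
```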

Since human speech often has higher perplexity (lower probability) than machine-generated language, our perplexity model gave us exactly what we were looking for.

We mark suspicious sentences as ‘Likely AI’ or ‘Highly likely AI’ and provide a whole-text verdict in addition to a sentence-by-sentence report.
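The webinar did not disclose the model’s actual cut-offs, but conceptually the sentence-level labeling can be sketched like this, with the threshold values chosen purely for illustration:

```python
# Hypothetical thresholds: the real model's cut-offs were not disclosed.
LIKELY_AI_MAX_PPL = 20.0        # below this: suspiciously predictable
HIGHLY_LIKELY_AI_MAX_PPL = 8.0  # below this: very suspiciously predictable

def label_sentence(ppl: float) -> str:
    """Map a sentence's perplexity to a verdict label."""
    if ppl < HIGHLY_LIKELY_AI_MAX_PPL:
        return "Highly likely AI"
    if ppl < LIKELY_AI_MAX_PPL:
        return "Likely AI"
    return "Likely human"

def report(sentence_ppls: list[float]) -> tuple[list[str], str]:
    """Sentence-by-sentence labels plus a whole-text verdict."""
    labels = [label_sentence(p) for p in sentence_ppls]
    flagged = sum(label != "Likely human" for label in labels)
    verdict = "Likely AI" if flagged > len(labels) / 2 else "Likely human"
    return labels, verdict
```

Because each sentence is scored independently, this approach removes the minimum-length requirement that constrained the two Fingerprint models.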

Perplexity: The outcomes

✅ Gains

• achieved 97% accuracy on short test samples
• no longer need to set a minimum text length

❌ Limitations

• humans can write very AI-like sentences, and LLMs can generate very human-like text

The important caveats

Undoubtedly, students will eventually learn to bypass AI detectors. But our experts are constantly researching the topic and updating our solutions in response.

There are a number of LLMs out there, each with its own peculiarities. In addition to GPT-3.5, we are going to address content generated by other AI models (GPT-4, Bard, etc.).

AI tools are constantly developing, so combating them is a never-ending race.

Perhaps it’s time to embrace AI and use it to our benefit?