LLMs are getting better and better at generating meaningful, well-written text. Nowadays, they can even adjust their tone depending on the context. At the same time, AI detection tools are evolving too. They’re performing better at spotting AI-generated content. One reason might be that more AI-generated text is available, giving these tools larger and more diverse datasets to train on.
As you might expect, the rise of AI detection tools has been matched by the emergence of AI “humanizer” tools. While the two may seem to work against each other, at a fundamental level they rely on the same principles. In fact, improvements in AI detection often help humanizer tools improve as well. The logic is simple: if a pattern can be used to identify AI-generated text, that same insight can be used to rewrite the text so it no longer matches the pattern.
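To make that shared-principle idea concrete, here is a minimal, hypothetical sketch in Python of one statistical signal often discussed in this space: “burstiness”, the variation in sentence length. The function name and the example texts are my own illustrative assumptions, not the implementation of any real detector or humanizer; the point is only that the same metric serves both sides — a detector can flag suspiciously uniform sentence lengths, and a humanizer can raise its score by varying them.

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Ratio of sentence-length standard deviation to mean length.

    Uniform sentence lengths (often associated with model output) give
    a low score; varied lengths (typical of human prose) score higher.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Perfectly uniform sentence lengths score 0.0.
uniform = "The model works well. The data grows fast. The cost drops low."
# Mixing short and long sentences raises the score.
varied = ("It works. Over the past decade, models trained on web-scale "
          "corpora have grown enormously. Costs fell.")

print(burstiness(uniform) < burstiness(varied))  # expect True
```

A detector built on this signal would threshold the score; a humanizer would edit the text until the score crosses back over the same threshold — which is exactly the symmetry described above.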
The Test
In this post, I put this idea to the test, giving the AI detection tools a slight advantage: I used a paid AI detector that claims top performance and compared it against a free, online AI humanizer. For the experiment, I generated text with ChatGPT using the following prompt:
“Write a paragraph summarizing how large language models have evolved over time.”
Let’s take a look at the results:
ChatGPT-Generated Text
Large language models have evolved from relatively simple statistical language models and early neural networks into highly capable systems trained on vast amounts of data using deep learning. Initial approaches relied on n-grams and rule-based features, which were limited in capturing long-range context. The introduction of recurrent neural networks and later transformer architectures marked a major shift, enabling models to represent language more flexibly and at scale. As training datasets grew and computational resources expanded, models began to exhibit emergent abilities such as few-shot learning, reasoning across domains, and generating coherent long-form text. More recent developments focus not only on scale but also on alignment, efficiency, and integration with tools, reflecting a broader move from purely predictive models toward systems designed to be more reliable, adaptable, and useful in real-world applications.
Result → AI Detection: 100% AI Generated
Humanized Version of the Text
The development of large language models has come from the early ‘statistical language model’ based on simple statistical characteristics, to the more complex neural networks (NNs) of the early days of artificial intelligence (AI), and into the currently most advanced models that use deep learning to train large amounts of data. The earliest models used n-grams and defined rules for defining words based on limited data (such as only a few words) and therefore had serious limitations in how to define words over a long distance. The introduction of Recurrent Neural Networks (RNNs) and, ultimately, the transformer-based architectures has been a significant change for large-scale language models, allowing them to have the ability to define and use the language much more flexibly and on a much larger scale. As the datasets increased, as well as computer processing capacity, large language models began to stimulate new behaviors and used models that were capable of performing multi-domain tasks and generating coherent long-form text, as well as few-shot learning and reasoning. The ideas in newer model developments (and research) are focusing not only on the scale of the new models, but also on alignment, efficiency, and the ability to provide a seamless integration with other tools; thereby continuing the general efforts to move from predictive models to reliable, flexible, and useful systems that can operate in many real-world applications.
Result → AI Detection: 2% AI Generated
Observations
Clearly, the humanized version reads differently, with more varied sentence structures and vocabulary. And while AI detection tools perform well, AI humanizing tools can evidently keep pace with them, pushing the same content back under the detection threshold.
Closing Thought
I think the real question isn’t which tool performs better, but how practical, or valuable, it is to use one AI tool to decide whether another AI wrote a piece of text.
