AI Is Already Eating Its Own Output and Breaking Down

Oxford University Professor Michael Wooldridge has raised significant concerns about artificial intelligence systems consuming their own outputs, a phenomenon that leads to what researchers call “model collapse.”

In discussing recent research co-authored by his colleague Yarin Gal at Oxford, Wooldridge explained a troubling experiment involving large language models. “They took a large language model which had basically been trained on human text but then they trained another model on the outputs of the large language model and then they did the same thing again,” he said. “So they trained another model on the outputs of the second generation model.”

The results were alarming. “It led to something that they call model collapse. Basically within about five generations it’s just producing gibberish,” Wooldridge revealed. This finding exposes a fundamental flaw in AI-generated content: despite appearing convincingly human, it lacks essential qualities that genuine human text possesses.
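The paper's actual experiments used full-scale language models, but the generational loop it describes is simple to sketch. The toy example below, a rough illustration only and not the researchers' method, uses a word-level bigram model in place of an LLM: each generation is trained on text sampled from the previous one, starting from a hypothetical human-written corpus (the file name `corpus.txt` is an assumption). Because words that never get sampled vanish from the next generation's training data, the vocabulary of the generated text can only shrink, which mirrors the loss of rare content that drives model collapse.

```python
import random
from collections import defaultdict, Counter

# Toy illustration of recursive training ("model collapse").
# NOT the paper's setup, which used full large language models; this only
# shows how re-training on model output narrows the distribution.

def train_bigram(text):
    """Count word-to-next-word transitions in the training text."""
    words = text.split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def sample(model, length=5000, seed=0):
    """Generate text by picking each next word in proportion to its observed count."""
    rng = random.Random(seed)
    word = rng.choice(list(model.keys()))
    out = [word]
    for _ in range(length - 1):
        nxt = model.get(word)
        if not nxt:  # dead end: restart from a random known word
            word = rng.choice(list(model.keys()))
        else:
            choices, weights = zip(*nxt.items())
            word = rng.choices(choices, weights=weights, k=1)[0]
        out.append(word)
    return " ".join(out)

# Hypothetical human-written starting corpus.
data = open("corpus.txt").read()

for gen in range(5):  # five generations, echoing the finding quoted above
    model = train_bigram(data)
    data = sample(model, seed=gen)
    vocab = len(set(data.split()))
    print(f"generation {gen + 1}: vocabulary of generated text = {vocab} distinct words")
```

Run against any reasonably large text file, the printed vocabulary count falls generation over generation: the tails of the original distribution disappear first, long before the output becomes outright gibberish.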

“What this tells you is that for all that the text that the original large language model was producing looks like human text, it has some qualities that human text doesn’t,” Wooldridge explained. This discovery carries profound implications for the future of AI development and training.

The professor painted a concerning picture of the future information landscape. “In a hundred years’ time, probably there’s going to be vastly more AI generated content. Not just text, but text, music, audio, spoken word, video, and virtual reality. Vastly more AI generated content out there than human generated content,” he predicted.

The crux of the problem is that if AI-generated material proves unsuitable for training future AI systems, the technology industry faces a serious challenge. Some experts suggest we may be approaching the limits of available human-generated data scraped from the internet, which makes synthetic data increasingly important for AI development.

However, Wooldridge noted that creating realistic synthetic data presents significant difficulties, particularly in sensitive areas like health records where privacy concerns prevent the use of real human data.

As AI-generated content proliferates across the internet, the ability to distinguish between human and machine-created material becomes increasingly important for maintaining the quality and reliability of future AI systems.