The Structural Limits of Artificial Intelligence Revealed by Data Depletion and Model Collapse
Artificial intelligence (AI) has always seemed destined to become ever smarter. Each year, new models have emerged, their answers have grown more natural, and they have rapidly caught up with human capabilities. We have come to take the advancement of AI as a given.
However, recently, a different question has begun to surface within the AI industry and research community: What happens if AI has nothing left to learn? Can the progress of AI truly continue indefinitely?
The starting point for this question is 'training data.' AI does not experience the world on its own. It learns about the world through records left by humans: texts, images, videos, and audio. The intelligence of AI is not determined by computational power alone; it is shaped by the diversity and quality of the data it has learned from. Yet warnings have been growing that the supply of this essential learning material is approaching its limits.
The Internet 'Gold Mine' Is Running Dry: A Warning for 2026
Until now, large language models (LLMs) have grown by leveraging the vast amount of data publicly available on the internet. Web documents, news articles, books, and academic papers have served as textbooks for AI. However, much of the high-quality, publicly accessible data has already been collected.
Epoch AI, a research institute that studies trends in machine learning, recently warned in a report that the stock of high-quality human-generated text data could be exhausted sometime between 2026 and 2030. The data that remains is often either strictly restricted due to copyright issues or locked behind paywalls requiring significant cost to access.
As a result, it has become virtually impossible for AI companies to continue 'unauthorized mass collection' for training as they have in the past. Securing data has shifted from a matter of technological competition to a realm dominated by massive capital and legal battles. In fact, copyright lawsuits filed by major media outlets such as The New York Times and by authors against companies like OpenAI symbolically illustrate the 'data barrier' now facing the AI industry.
Lee Seongyeop, Professor at Korea University's Department of Intellectual Property Strategy, explained, "Large language models have already scraped most of the publicly available data on the web. Simply increasing the quantity of data now only leads to the inclusion of redundant or reprocessed low-quality text, so the marginal utility for improving intelligence has sharply diminished."
He added, "What is needed now is not just simple corpora, but data that is intricately labeled with complex logical structures and human value judgments. However, the costs of producing and verifying such data are increasing exponentially."
The Paradox of Synthetic Data: The Invisible Wall of 'Model Collapse'
As an alternative to the data shortage, the industry has turned its attention to 'synthetic data': training next-generation AI on text and images produced by AI itself. The idea is that if human records run short, AI can generate its own data and evolve independently. Recently, however, this approach has revealed a critical structural flaw known as 'model collapse.'
A joint research team from the University of Oxford, University of Cambridge, and University of Toronto demonstrated in a paper published in the journal Nature that models repeatedly trained on AI-generated data forget the original data distribution within just a few generations, and their output degenerates into incoherence. The researchers showed that because such models treat statistically rare cases (outliers) as mere errors and discard them, informational diversity is rapidly lost.
This is akin to photocopying a photo over and over until it becomes an unrecognizable blur: the same degradation, now occurring in the realm of intelligence. An AI that relies solely on synthetic data ultimately becomes trapped in an 'echo chamber,' endlessly reproducing its own biases.
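The mechanism the researchers describe can be illustrated with a toy simulation. The sketch below, which is not from the Nature paper itself, stands in a one-dimensional Gaussian for the data distribution: each "generation" samples from the previous model, trims the rarest 10% of samples (mimicking a model that dismisses outliers as errors), and refits. The function name and parameters are illustrative assumptions, not any published setup.

```python
import random
import statistics

def collapse_demo(generations=10, n=2000, keep=0.90, seed=0):
    """Toy model-collapse loop: each generation is a Gaussian fitted
    to trimmed samples drawn from the previous generation's model."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the 'human data' distribution
    sigmas = [sigma]
    for _ in range(generations):
        # Sample from the current model, then drop the extreme tails,
        # as a model would when it treats rare cases as noise.
        data = sorted(rng.gauss(mu, sigma) for _ in range(n))
        cut = int(n * (1 - keep) / 2)
        trimmed = data[cut:n - cut]
        # Refit: the next generation learns only from generated data.
        mu = statistics.fmean(trimmed)
        sigma = statistics.stdev(trimmed)
        sigmas.append(sigma)
    return sigmas

sigmas = collapse_demo()
print("sigma per generation:", [round(s, 3) for s in sigmas])
```

Running it shows the standard deviation shrinking generation after generation: the distribution's tails, i.e. its rare and diverse content, vanish even though each individual refit looks locally reasonable. That loss of variance is the statistical core of the "blurred photocopy" analogy above.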
Tech Giants Revise Their Strategies: Perspectives of Ilya Sutskever and Yann LeCun
This sense of crisis is clearly reflected in the statements of AI luminaries. Ilya Sutskever, co-founder and former Chief Scientist of OpenAI, pointed out in a recent keynote speech, "We have almost mined out the internet gold mine, and simply scaling up is no longer enough to reach the next level of intelligence." This signals a shift in the AI race from the number of GPUs to the possession of exclusive data that others do not have.
Yann LeCun, Chief AI Scientist at Meta, has also highlighted the fundamental limitations of text-based learning. In his books and academic lectures, he emphasizes, "Human children do not gain intelligence by reading trillions of words, but by interacting with the physical world in real time." He criticizes the current approach of relying solely on text data, arguing that it leads to a 'hallucination loop' disconnected from reality. He advocates for a shift in architecture toward 'world models' that can understand physical laws through video and sensory data, moving beyond text.
The Renewed Importance of 'Human Records' and New Questions
Ultimately, the stagnation of AI learning signals not a technological disaster but a shift in the paradigm of growth. AI has so far bulked up by absorbing vast quantities of human records; going forward, what determines its survival will be not sheer quantity but the quality of data, including the creative records that only humans generate.
Meticulously observed laboratory data, vivid field notes, and the complex moral and philosophical judgments that only humans can make are areas AI cannot synthesize on its own. This is why tech giants like Google and Microsoft are now investing astronomical sums not just in data collection, but in hiring expert groups to directly craft high-quality problem sets to teach AI.
The next stage for AI does not lie solely within machines. The answer still resides in the physical world where humans live, and in the primary data generated within it. This moment, when it seems that AI has nothing left to learn, is not a limit of technology but a time for reflection on what humans should cherish, record, and preserve. We are moving from an era of asking what AI can do, to an era of considering what kind of world we will leave behind as data.
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
![[Reading Science] Does AI Have Nothing Left to Learn?](https://cphoto.asiae.co.kr/listimglink/1/2025082921510412305_1756471864.jpg)

