[AI Data Shortage Crisis] "AI Growth May Halt in 2 Years"... Warning of Data Cliff

High-Quality Data Shortage for AI Training
Soaring Data Costs... Increasing Burden on Companies
Resorting to Overseas Data or Workarounds
"Third AI Winter Approaching" Concerns Raised

#Domestic AI company A purchased overseas data because of a shortage of Korean-language training data. Its translation service needs data mapping various languages into Korean, but not enough exists. In the end, the company bought data translated from Indonesian into Japanese and then converted it into Korean. The CEO of company A lamented, "Errors may creep in and linguistic nuances may be lost after multiple rounds of translation, but we had no choice."
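
What company A describes is, in effect, pivot (relay) translation: bridging a language pair that lacks direct data through intermediate languages. Below is a minimal sketch of the idea; the `translate` function is a hypothetical stand-in for whatever machine-translation API a company actually uses, not a tool named in this article.

```python
# Pivot (relay) translation: bridge a language pair with no direct data
# through intermediate languages. Each hop can add its own errors, which
# is the nuance loss company A's CEO describes.

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical machine-translation call; plug in a real MT backend."""
    raise NotImplementedError

def pivot_translate(text: str, hops: list[tuple[str, str]]) -> str:
    """Chain translations through a list of (source, target) hops."""
    for src, tgt in hops:
        text = translate(text, src, tgt)
    return text

# Company A's route: Indonesian -> Japanese -> Korean
# korean = pivot_translate(indonesian_text, [("id", "ja"), ("ja", "ko")])
```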


#AI startup B has spent about 40% of its investment funding this year on acquiring data. As data collection and processing costs skyrocketed, the company ended up paying more for data than for developer salaries or infrastructure. The CEO of B confided, "Some companies that cannot afford the costs resort to shortcuts, such as converting videos with murky copyright status into text (speech-to-text) for training."
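
The "shortcut" the CEO alludes to is a standard speech-to-text pipeline. Below is a minimal sketch using OpenAI's open-source Whisper model; the article does not name the tools companies actually use, and the file path is illustrative.

```python
# Minimal speech-to-text extraction with OpenAI's open-source Whisper model.
# pip install openai-whisper  (requires ffmpeg to decode video/audio files)
import whisper

# Smaller checkpoints ("base") are fast; larger ones ("medium", "large")
# trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe the audio track of a video file (path is illustrative).
result = model.transcribe("some_video.mp4")

# The transcript could then be cleaned and added to a training corpus;
# as the article notes, the copyright status of such data is often murky.
print(result["text"])
```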


Securing the data essential for AI training has become an emergency. Since the advent of large language models (LLMs), the supply of high-quality data that AI models require has been rapidly depleted. Some forecasts even warn bleakly that training data will run out within a few years, pushing AI into a period of stagnation.


"Data Cliff Within a Few Years"

According to AI research institute Epoch AI, training data is expected to start running out in 2026, two years from now, because the consumption of training data is growing faster than the supply of new data. Epoch AI further predicted, "Assuming models are overtrained, data depletion could arrive as early as next year." Overtraining means increasing the amount of training data rather than enlarging the model itself, a way to keep models lightweight and efficient.
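
One way to make "overtraining" concrete is the compute-optimal heuristic from DeepMind's Chinchilla work, roughly 20 training tokens per model parameter; training far beyond that ratio is overtraining in Epoch AI's sense. A back-of-the-envelope sketch, using publicly reported figures:

```python
# Rough check of "overtraining" against the Chinchilla heuristic of
# ~20 training tokens per model parameter (Hoffmann et al., 2022).

CHINCHILLA_TOKENS_PER_PARAM = 20

def overtraining_factor(params: float, tokens: float) -> float:
    """How far past the compute-optimal token budget a model was trained."""
    return tokens / (params * CHINCHILLA_TOKENS_PER_PARAM)

# Llama 3 8B: 8 billion parameters, reportedly trained on 15+ trillion tokens.
print(f"{overtraining_factor(8e9, 15e12):.0f}x")  # ~94x the optimal budget
```

A small model trained this way consumes far more data than its size suggests, which is why overtraining pulls the projected depletion date forward.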


AI has accelerated its development by expanding training across text, images, and video. OpenAI's GPT-3, released in 2020, was trained on about 300 billion tokens (the smallest units of text a model processes). GPT-4, released three years later, is estimated to have been trained on 12 trillion tokens. Meta's latest model, Llama 3, released this year, was trained on more than 15 trillion tokens. The amount of training data grew 50-fold in just four years.
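
The 50-fold figure follows directly from the cited token counts, and the implied compound growth rate is worth making explicit:

```python
# Growth of training-set sizes implied by the figures cited above.
gpt3_tokens = 300e9    # GPT-3, 2020
llama3_tokens = 15e12  # Llama 3, 2024

total_growth = llama3_tokens / gpt3_tokens  # 50x over four years
annual_growth = total_growth ** (1 / 4)     # compound annual rate
print(f"{total_growth:.0f}x total, ~{annual_growth:.1f}x per year")  # 50x, ~2.7x
```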


Obtaining data, on the other hand, is becoming ever more difficult. Until now, AI has mostly been trained by scraping information available on the internet, including books and papers; news articles, social media posts, and blog content also feed AI. Yet the amount of language data usable for AI training is growing by only about 7% a year.
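
Setting that 7% supply growth against training sets that have been growing roughly 2.7x a year shows why depletion forecasts land so close. The toy projection below is not Epoch AI's model; the starting stock and consumption figures are illustrative assumptions, and only the two growth rates come from the numbers above.

```python
# Toy projection of training-data demand vs. supply. Starting values are
# illustrative assumptions, NOT Epoch AI's estimates; only the growth
# rates (~2.7x/yr demand, ~7%/yr supply) come from the figures above.

supply = 300e12   # assumed stock of usable text tokens
demand = 15e12    # assumed tokens consumed by frontier training this year

year = 2024
while demand < supply and year < 2040:
    supply *= 1.07  # usable language data grows ~7% annually
    demand *= 2.7   # training-set sizes grow ~2.7x annually
    year += 1

print(f"Demand overtakes supply around {year}")  # ~2028 with these toy numbers
```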


Even when data exists, copyright issues must be resolved. Amid criticism of the unauthorized use of content for AI training, access to news and other material has been blocked, and the high-quality data needed to advance LLMs has dried up. High-quality data means data that covers diverse topics with rich expression and carries consistent information free of spelling and grammatical errors.
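
In practice, "high quality" is enforced with filtering heuristics applied to raw scraped text. The sketch below shows the general shape of such a rule-based filter; the specific thresholds are illustrative assumptions, not taken from any named pipeline.

```python
# Minimal rule-based quality filter of the kind used when building
# training corpora. Thresholds are illustrative, not from a named system.
import re

def looks_high_quality(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                        # too short to develop a topic
        return False
    if len(set(words)) / len(words) < 0.3:     # low lexical diversity
        return False
    junk = len(re.findall(r"[^\w\s.,!?'\"-]", doc)) / max(len(doc), 1)
    if junk > 0.05:                            # likely garbled or boilerplate
        return False
    return True

corpus = ["raw scraped document one ...", "raw scraped document two ..."]
clean = [doc for doc in corpus if looks_high_quality(doc)]
```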


However, less than 10% of the information available on the internet qualifies as high-quality data. As AI evolves into multimodal systems that recognize speech and generate images, more diverse data is needed but hard to obtain: voice and video data are scarce and difficult to use because of privacy issues. Choi Yura, a senior researcher at Infinic, a company specializing in AI training data, explained, "Unlike text data, which can be used once copyright issues are resolved, there is almost no unstructured data available for industrial use."


The shortage of Korean-language data is even more severe. Because the user population is small, the amount of data that can be secured is limited, and Korea has no open-data platform like Common Crawl, a service run by a U.S. nonprofit that collects publicly available online data, with permission, and makes it freely available. The data market is so undeveloped that there is not even a standard for pricing data. Lee Moon-ki, director of the data business at AI company Konan Technology, pointed out, "Even if you combine the data held by Korean companies like Naver and Kakao, it does not reach the trillion-won level," adding, "It is only about 6-7% of what big tech companies have."
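
For reference, Common Crawl exposes its archives through a public CDX index API, which is how researchers locate captured pages in a given crawl. A minimal query sketch follows; the crawl ID and target domain are examples, not anything cited in this article.

```python
# Query Common Crawl's public CDX index for captures of a URL.
# The crawl ID and target domain are examples.
import json
import requests

CRAWL = "CC-MAIN-2024-10"  # one of the periodically published crawls
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)

# The index returns one JSON record per line; each record points into
# the WARC archives that hold the raw page content.
for line in resp.text.splitlines()[:5]:
    record = json.loads(line)
    print(record["url"], record["filename"])
```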


"70% of Companies Face Data Shortage"

Domestic companies are complaining of data shortages. According to the '2023 AI Industry Survey' released by the Software Policy & Research Institute under the Ministry of Science and ICT, 70.8% of domestic AI companies said they face difficulties with data acquisition and quality. This was the most-cited problem after the shortage of AI talent, drawing a higher response rate than shortages of AI infrastructure such as computing equipment (53.2%).


Large companies are no exception; they reportedly purchase overseas data or turn to synthetic data to cover shortages. Naver, which trained its large-scale AI 'HyperCLOVA X' on news and blog posts, stopped training on news data last year because of copyright issues. Discussions with media companies over data use are ongoing, but no agreement has been reached. An industry insider said, "Large companies need more data because their models are larger," adding, "Because of the shortage, they sometimes buy English data, or use data with murky copyright status only for fine-tuning."


The difficulties are even greater for small and medium-sized enterprises and startups, which must pay not only to collect data but also to process it for AI training. Lee Dong-yoon, CEO of Ant Reality, which provides AI beauty solutions, said, "Facial data is hard to collect because of privacy issues, and little of it is publicly available," adding with concern, "Startups may hit a wall as early as the proof-of-concept (PoC) stage because of data shortages."


Some even forecast that AI could face a third winter brought on by data shortages. AI went through two periods of stagnation, in the 1970s and 1980s, owing to technical limitations. The emergence of generative AI such as ChatGPT opened a new golden age, but the field may now be approaching a data cliff. The British science magazine New Scientist analyzed, "As training data runs out, the pace of AI development is likely to slow."


Recently, the share price of U.S. semiconductor company Nvidia has fallen day after day, reviving once-dormant concerns about an 'AI bubble.' The Associated Press noted, "The AI craze is overheated, raising concerns about excessive expectations." Director Lee said, "Because of data shortages, AI development may fall short of market expectations," adding, "I worry that a third winter may come."


© The Asia Business Daily(www.asiae.co.kr). All rights reserved.

