[AI Data Shortage Crisis] 'If Data Is Insufficient, Make It Yourself?'... Synthetic Data Gaining Attention

Data Demand Outpaces Supply Amid AI Boom
Growing Interest in Artificially Created Synthetic Data
Critics Point to Performance Decline and Lack of Diversity

[Image source=Reuters Yonhap News]

As concerns grow that the supply of data needed to train artificial intelligence (AI) may run short, artificially generated synthetic data is gaining attention. The approach trains AI on fabricated data, but critics warn that it may degrade model performance.


According to the '2023 Data Industry Status Survey Report' released last month by the Korea Data Industry Promotion Agency, the domestic data industry market was estimated at 27.1513 trillion KRW last year, up 4.6% from the previous year. The market stood at about 15.5684 trillion KRW in 2018, so it grew by more than 11.5 trillion KRW in five years. The domestic market is expected to grow at an average annual rate of 12.6%, with its size projected to reach about 51.1413 trillion KRW by 2028. Global market research firm 360iResearch forecast that the worldwide market for training datasets used in AI model development will grow by more than 26% annually.


The interest in synthetic data appears to reflect concerns that data supply may not keep up with demand.


Synthetic data is artificial data created for AI training and is broadly divided into 'partially' and 'fully' synthetic data. Partially synthetic data is created by substituting synthetic information for parts of real data, which makes it useful for protecting sensitive information.


Fully synthetic data consists entirely of newly generated information. Although fictitious, it can preserve the statistical properties of real data, so analyses of it can yield conclusions similar to those drawn from actual data.
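The distinction between the two types can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records are invented, the function names `partially_synthesize` and `fully_synthesize` are hypothetical, and the fully synthetic step samples each column independently from a fitted normal distribution, which is a deliberate simplification that ignores correlations between columns. Real synthetic data tools are considerably more sophisticated.

```python
import random
import statistics

# Hypothetical "real" records; every value here is invented for illustration.
REAL_RECORDS = [
    {"name": "Customer A", "age": 34, "income": 52000},
    {"name": "Customer B", "age": 45, "income": 61000},
    {"name": "Customer C", "age": 29, "income": 48000},
    {"name": "Customer D", "age": 52, "income": 75000},
    {"name": "Customer E", "age": 38, "income": 58000},
    {"name": "Customer F", "age": 61, "income": 69000},
]


def partially_synthesize(records, rng):
    """Partially synthetic: replace only the sensitive field, keep real values."""
    result = []
    for record in records:
        masked = dict(record)  # copy so the real record is left untouched
        masked["name"] = f"synthetic-{rng.randrange(10**6):06d}"
        result.append(masked)
    return result


def fully_synthesize(records, count, rng):
    """Fully synthetic: draw brand-new records from fitted distributions.

    Assumption: each numeric column is modeled as an independent normal
    distribution, so only per-column mean and spread are preserved.
    """
    ages = [r["age"] for r in records]
    incomes = [r["income"] for r in records]
    age_mu, age_sd = statistics.mean(ages), statistics.stdev(ages)
    inc_mu, inc_sd = statistics.mean(incomes), statistics.stdev(incomes)
    return [
        {
            "name": f"synthetic-{i:06d}",
            "age": round(rng.gauss(age_mu, age_sd)),
            "income": round(rng.gauss(inc_mu, inc_sd)),
        }
        for i in range(count)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
partial = partially_synthesize(REAL_RECORDS, rng)
full = fully_synthesize(REAL_RECORDS, 1000, rng)

print("real mean age:     ", statistics.mean(r["age"] for r in REAL_RECORDS))
print("synthetic mean age:", statistics.mean(r["age"] for r in full))
```

In the partially synthetic output, ages and incomes are still the real values and only the names change. In the fully synthetic output, no record corresponds to a real person, yet the average age should land close to that of the real records.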


Proponents of synthetic data highlight the ability to generate as much data as needed. They emphasize that it can supply data in fields such as finance and healthcare, where personal information is highly sensitive. Global market research firm Gartner predicted that by 2030, synthetic data will account for a larger share of AI training data than real data. The use of synthetic data is already increasing in autonomous driving model development, for example, because actual traffic accident data is difficult to secure and synthetic data can also support 3D rendering.


Hwang Min-young, Vice President of SelectStar, a domestic AI data startup, said, "As data that can be collected by conventional methods gradually depletes, reliance on synthetic data is expected to increase."


Because synthetic data is artificially created, it also draws criticism. Since it is not real, quality issues may arise. If poorly designed synthetic data is used for AI training, the resulting model is highly likely to reflect reality inaccurately. If erroneous data is reproduced and reused in AI training, it can lead to performance degradation, distortion, and hallucination, in which AI models give inaccurate answers.


Kim Myung-joo, President of the International Artificial Intelligence Ethics Association and Director of the Barun AI Research Center at Seoul Women’s University, explained, "There are experimental results showing that when next-generation AI models use synthetic data created by AI, their performance may decline compared to before," adding, "If AI models using synthetic data dominate, diversity may be lost." Kim further emphasized, "There is also a need for vigilance regarding the possibility that AI could homogenize human civilization."
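The loss of diversity described above can be illustrated with a toy simulation. This is not the experiment Kim refers to, and it involves no actual AI model: as an assumption for illustration, each "generation" simply fits a normal distribution to a small sample and then produces the next generation's data by sampling from that fit. The function name `simulate_generations` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_generations(n_samples=10, n_generations=200, seed=0):
    """Repeatedly refit a normal distribution to its own samples.

    Returns the spread (population standard deviation) of the data
    at each generation, starting with the original "real" data.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    spreads = [statistics.pstdev(data)]
    for _ in range(n_generations):
        # Fit the "model" to the current data ...
        mu = statistics.mean(data)
        sigma = statistics.pstdev(data)
        # ... then train the next generation only on the model's own output.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        spreads.append(statistics.pstdev(data))
    return spreads


spreads = simulate_generations()
print(f"spread of original data:        {spreads[0]:.4f}")
print(f"spread after {len(spreads) - 1} generations: {spreads[-1]:.3e}")
```

Because each generation sees only a finite sample of the previous generation's output, estimation error compounds and the spread of the data tends to shrink toward zero. The sketch only shows the general mechanism by which recycling generated data can erode diversity; it does not say how strongly the effect appears in real AI systems.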


© The Asia Business Daily(www.asiae.co.kr). All rights reserved.