Artificial Intelligence Suffering from Genetic Diseases [AI Answer Note]

Data Depletion: Relying on Synthetic Data Alone
Could Lead to Collapse Like the Habsburg Dynasty
Diversity and Human-Generated Information Remain Essential

Editor's Note: Examining failures is a shortcut to success. 'AI Error Notes' explores failure cases related to AI products, services, companies, and individuals.

The Habsburg family was a royal house that ruled Europe for nearly 600 years. Their secret was 'marriage.' They established their dominance by forming marital alliances with surrounding powers. Charles V of this family held more than 20 titles, including Holy Roman Emperor, King of Spain, King of Germany, Archduke of Austria, and Lord of the Netherlands. He was only 19 years old when he was elected emperor.


They tried to keep the royal bloodline 'pure.' Therefore, they repeatedly married within the family. Marriages between uncles and nieces, cousins, and other relatives were very common. This eventually led to genetic diseases.


A representative genetic characteristic resulting from consanguineous marriage was the "Habsburg jaw." It is a condition where the lower jaw protrudes abnormally, causing difficulties in pronunciation and chewing. Carlos II of Spain was one of the individuals in whom this feature appeared most severely.



Due to continuous consanguineous marriages over several generations, genetic diversity became very low, and health problems emerged. In the case of Carlos II, he suffered from severe physical and mental disabilities. He died without leaving any offspring, marking the end of the Spanish Habsburg dynasty. They rose through marriage and fell through marriage. This is a clear example of how crucial genetic diversity is for the sustainability of a species.


The Habsburg case is often referenced in the AI industry as well. In fact, there is even a term called 'Habsburg AI.' It is a metaphor drawn from the decline of the Habsburg dynasty through inbreeding, and it points to the problems that can arise when AI models rely excessively on 'synthetic data.'


Fast and Affordable Data: The Power of Synthetic Data
'Synthetic data' is a concept first proposed in 1993 by Donald Rubin, a professor in the Department of Statistics at Harvard University in the United States. Photo: Getty Image Bank

Synthetic data is artificially generated data that mimics real data. It is an alternative used when appropriate data for training is unavailable or when the cost of acquiring data is too high.


For example, it is useful in the development of autonomous vehicles. To train for collision avoidance, a variety of collision data is needed. However, such data is far scarcer than data for lane changes or sign recognition, because collisions occur much less frequently. In such cases, virtual driving and collision scenarios can be generated through computer simulation to obtain the necessary data. It is fast and inexpensive.


It is also possible to artificially augment specific situations or rare cases where real data is lacking. This helps balance biased data as well.


Privacy protection is another strength of synthetic data. AI can be trained with data that has similar characteristics without using actual personal information. For example, virtual patient data can be created based on real patient records, or virtual transaction data with similar patterns can be generated by analyzing actual transaction histories.


According to global market research firm Gartner, by around 2030, synthetic data will be used more than real data for AI training. Gartner predicted, “High-performance, high-quality AI development will be impossible without synthetic data.”


AI startup xAI, founded by Elon Musk, unveiled its AI chatbot 'Grok 3' last month. At the live-streamed launch event, xAI claimed that “Grok 3 outperformed Alphabet’s Google Gemini, Anthropic’s Claude, and OpenAI’s GPT-4o in math, science, and coding benchmark tests.”


The outstanding performance was enough to attract attention. Musk said, “Grok 3’s computational power is more than 10 times that of the previous version,” calling it “the smartest AI on Earth.” The xAI research team explained, “Grok 3 produces more sophisticated results than Grok 2 through large synthetic data sets, self-error correction, and reinforcement learning.” The company was founded in July 2023, released the first 'Grok' in November of the same year, and launched 'Grok 2' six months ago in August.


One of the secrets behind creating such a powerful AI in such a short time was the synthetic data cited by the xAI research team.


Questions Surrounding Grok: The Risks of Synthetic Data
Portrait image created by designer Martin Disley on the theme of 'Habsburg AI.' Source: Martin Disley Instagram

However, Grok 3 immediately faced criticism. While its performance is excellent, the very advantage that made Grok 3 outstanding, synthetic data, can also be a poison.


A research team at the University of Oxford in the UK published a paper in the international journal Nature in July last year, showing that the performance of AI trained on data generated by AI rather than by humans can decline sharply. The researchers first had an initial AI model generate text about 14th-century English church towers and buildings. They then repeated a feedback process in which each new model was trained on the text produced by the previous one. As the process repeated, the AI began producing bizarre results. Content about medieval architecture disappeared, the model responded in foreign languages without being prompted, and it even brought up stories about rabbits.


The researchers conceptualized this as 'model collapse.' When AI-generated information is repeatedly fed back into AI for training, the quality of the output degrades. Jathan Sadowski, a researcher at Monash University in Australia, named this phenomenon 'Habsburg AI,' likening it to the Habsburg family losing genetic diversity and collapsing due to continuous inbreeding.
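
The loss of diversity behind model collapse can be illustrated with a toy simulation. The sketch below is not the Oxford team's experiment. It assumes a hypothetical "model" that simply memorizes token frequencies, and a hypothetical corpus whose token frequencies follow a Zipf-like curve. Each generation is trained only on text sampled from the previous generation, so a rare token that fails to be sampled once can never come back.

```python
import random
from collections import Counter


def next_generation(corpus, rng):
    """Fit a toy frequency model to `corpus`, then sample a new corpus from it."""
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[token] for token in tokens]
    # The new corpus contains only tokens the previous generation produced.
    return rng.choices(tokens, weights=weights, k=len(corpus))


def simulate(vocab_size=200, corpus_size=500, generations=50, seed=0):
    """Return the number of distinct tokens in each generation's corpus."""
    rng = random.Random(seed)
    # Hypothetical "human" corpus: a few common tokens and many rare ones.
    tokens = list(range(vocab_size))
    zipf_weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    corpus = rng.choices(tokens, weights=zipf_weights, k=corpus_size)
    diversity = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        diversity.append(len(set(corpus)))
    return diversity


if __name__ == "__main__":
    history = simulate()
    print("distinct tokens in the first corpus:", history[0])
    print("distinct tokens in the last corpus:", history[-1])
```

In this toy setting the count of distinct tokens can only stay flat or fall from one generation to the next, which mirrors the way a closed bloodline can lose gene variants but never regain them.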


There are other potential risks surrounding synthetic data. It can amplify existing biases in the data rather than neutralize them. AI models using synthetic data may replicate or even strengthen biases present in the original data. The Financial Times (FT) described this as “the reason big tech invests huge amounts of money to obtain human-generated data.”


Data Depletion is Inevitable... Finding the Optimal Data Mix

Despite these risks, synthetic data is expected to remain an important tool in AI development. AI industry experts recognize the dangers of synthetic data but do not deny its value. It is important to appropriately mix real and synthetic data and strictly manage the generation process and quality of synthetic data.


Similar findings can be seen in the Oxford research paper mentioned earlier. When a small amount of human-generated data was mixed into synthetic data, the rate of AI model collapse decreased. Including just 10% human data significantly slowed down model collapse.
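
The stabilizing effect of a small share of human data can also be sketched with a toy model. The code below is an illustration under stated assumptions, not a reproduction of the paper: the "model" is a one-dimensional Gaussian fitted by maximum likelihood, and the function name `final_spread` and its parameters are hypothetical. Each generation is retrained on its own samples, optionally mixed with a fixed fraction of points drawn from the original human dataset.

```python
import math
import random


def fit_gaussian(data):
    """Toy 'training': maximum-likelihood mean and standard deviation."""
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(variance)


def final_spread(human_fraction, n=50, generations=1000, seed=0):
    """Return the model's standard deviation after repeated retraining."""
    rng = random.Random(seed)
    # Stand-in for human-generated data: samples from a standard normal.
    human_data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean, std = fit_gaussian(human_data)
    n_human = round(human_fraction * n)
    for _ in range(generations):
        # Most of each training set is the previous model's own output.
        synthetic = [rng.gauss(mean, std) for _ in range(n - n_human)]
        # The remainder is re-drawn from the original human data.
        anchor = rng.sample(human_data, n_human)
        mean, std = fit_gaussian(synthetic + anchor)
    return std


if __name__ == "__main__":
    print("synthetic only:", final_spread(0.0))
    print("10% human data:", final_spread(0.1))
```

With `human_fraction=0.0` the fitted spread shrinks toward zero over many generations, because small estimation errors compound with nothing to correct them. With `human_fraction=0.1` the human samples keep pulling the estimate back toward the original distribution. Real language models are far more complex, so this shows only the direction of the effect, not its size.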


Moreover, the reality that the amount of human-generated data is gradually depleting cannot be ignored. Ilya Sutskever, co-founder of OpenAI, compared data for AI model training to “finite fossil fuels” during a lecture in Vancouver, Canada, last year, stating, “The internet data that could help improve AI performance has already been exhausted.”


Synthetic data should not be treated as taboo simply because the 'AI genetic disease' is frightening. It is time to recognize clearly that synthetic data cannot completely replace human-generated real data, and to adopt a balanced perspective that weighs both its risks and its possibilities.


© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
