"Securing Data is Competitiveness"
Peeking into Competitors' Content
Pooling Data Together Bit by Bit
As concerns arise about potential data shortages, global IT companies are fiercely competing to secure data. Acquiring large volumes of high-quality data as quickly as possible directly translates to competitiveness in artificial intelligence (AI). Just as a person’s knowledge deepens by reading more books, AI becomes smarter by learning from more data.
According to IT industry sources on the 26th, OpenAI recently signed a content-licensing agreement with the social media platform Reddit, allowing it to train on the wide range of posts written by Reddit's 1.2 billion users. OpenAI has also signed content usage agreements with media outlets such as the Financial Times (FT) and the Wall Street Journal (WSJ).
OpenAI's decision in April to make ChatGPT usable without account registration or login was also an attempt to secure data: lowering the barrier to the service brings in more users, and with them more data.
It has also been reported that OpenAI recently discussed using transcripts of videos on Google's YouTube to train its next-generation model, GPT-5. OpenAI is securing data by every available means, treating even competitors such as Google as sources of information.
Apple has decided to invest $100 million (approximately 130 billion KRW) in securing AI training data. First, it plans to pay $50 million (approximately 67 billion KRW) to the global image and video content provider Shutterstock for data. It is also negotiating content licenses with media groups such as Condé Nast, publisher of Vogue and The New Yorker, as well as NBC News and IAC, which owns People magazine. Apple has reportedly offered at least $50 million for the use of years' worth of their articles.
In South Korea, too, companies have begun pooling data. The AI startup Upstage has partnered with more than 20 institutions and companies, including the National Information Society Agency (NIA) and Lotte Shopping, to create the '1T (1 trillion tokens) Club.' A token is the smallest unit of text that an AI model learns from. Partners that contribute more than 100 million Korean-language tokens receive Upstage's proprietary large language model (LLM) at a discount or share in related revenue.
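To make the "1 trillion tokens" figure concrete, the sketch below counts tokens with a naive whitespace split. This is an illustrative assumption only: production LLM tokenizers (e.g. byte-pair encoding) split text into subword units, so real counts differ.

```python
# Illustrative sketch only: real LLM tokenizers (such as BPE) split text
# into subword units, not whitespace-separated words, so this is a rough
# stand-in for how a corpus's token count would be tallied.

def count_tokens(text: str) -> int:
    """Approximate a token count with a naive whitespace split."""
    return len(text.split())

corpus = [
    "AI becomes smarter by learning from more data.",
    "Securing data is competitiveness.",
]
total = sum(count_tokens(line) for line in corpus)
print(total)  # 12
```

At this granularity, reaching 1 trillion tokens would require on the order of a hundred billion sentences, which is why the club pools contributions from many partners.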
Companies also create training data themselves through 'data augmentation' and 'data synthesis,' diversifying a dataset by modifying or combining existing examples. They likewise train on AI-generated synthetic data, and sometimes redesign model architectures so that AI can learn efficiently from limited data.
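A minimal sketch of what text data augmentation can look like: generating varied training examples from an existing sentence by perturbing it. The word-swap operation here is a simple illustrative choice, not any company's actual pipeline.

```python
import random

def augment(sentence: str, rng: random.Random) -> str:
    """Produce a variant of `sentence` by swapping two randomly chosen words.

    A toy augmentation: the vocabulary is unchanged, only word order varies,
    so one seed sentence yields several distinct training examples.
    """
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i, j = rng.sample(range(len(words)), 2)  # pick two distinct positions
    words[i], words[j] = words[j], words[i]  # swap them to vary word order
    return " ".join(words)

rng = random.Random(0)  # fixed seed for reproducibility
seed = "AI becomes smarter by learning from more data"
augmented = [augment(seed, rng) for _ in range(3)]
for variant in augmented:
    print(variant)
```

Real augmentation pipelines use richer operations (synonym replacement, back-translation, paraphrasing by another model), but the principle is the same: multiply limited data into more training examples.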
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
![[AI Data Shortage Crisis] ChatGPT Taps YouTube... Relentless Data Hunting Without Boundaries](https://cphoto.asiae.co.kr/listimglink/1/2024062512083028338_1719284910.jpg)

