"Pay for AI Training Data"
Data Ownership Debate Arises as Generative AI Emerges
The competition surrounding generative artificial intelligence (AI) has ignited a war over data ownership. Developers seeking to train AI with more data are clashing with companies aiming to charge fees for data usage. This tug-of-war, initially between tech companies holding the most data and those needing it, has expanded into conflicts with established media outlets such as news organizations.
"Only Big Tech Gets Richer"... Willing to Go to Court
On the 20th (local time), the U.S. IT magazine Wired reported that Stack Overflow plans to begin charging companies that develop large-scale AI models for data usage within the year. Stack Overflow is a community where over 20 million developers share programming information. Prashanth Chandrasekar, CEO of Stack Overflow, emphasized, "Platforms fueling large language models (LLMs) should be compensated," adding, "This will help maintain high-quality data and benefit future AI as well."
Reddit also announced that, starting in June, it will charge AI developers who use its conversation data commercially. Reddit is a U.S. social media platform visited by over 57 million users daily. Steve Huffman, CEO of Reddit, bluntly stated in an interview, "There is no need to give Reddit data for free to the world's largest companies."
Elon Musk, CEO of Twitter, has even threatened legal action, alleging that Microsoft (MS) illegally used Twitter data to train AI. The conflict began in February when Twitter started charging for access to its developer tools. After MS removed Twitter from its advertising platform, Twitter hinted at a lawsuit.
Established media outlets have also stepped forward. The News Media Alliance (NMA), representing about 2,000 North American news organizations including The New York Times, announced AI principles on the 20th. The principles state, "Unauthorized use of NMA content by generative AI constitutes property rights infringement," and "Such use must not be permitted without authorization and fair compensation must be provided." Simply put, if an AI's answer to a question amounts to reading out news articles, copyright fees must be paid.
Data Sources 'Opaque'... Monetizing Free Data
Until now, companies like OpenAI, which developed ChatGPT, as well as Google and Meta, have relied on data gathered online. Training generative AI requires vast amounts of data. They extensively used social media conversations, personal content, development code, news articles, and academic papers. Recently, The Washington Post analyzed datasets used in Google’s LLM with AI researchers and found that most came from patent information sites and online encyclopedias. Half of the top sites were news media, and over 500,000 personal blogs were included.
The issue is whether it is appropriate to use publicly available online content without permission. Starting with its latest model, GPT-4, OpenAI does not even disclose what data it used. These companies have also begun monetizing AI developed this way. For example, MS released 'GitHub Copilot,' an AI trained on code posted in open communities. The tool assists with coding and costs $19 per user per month. This has sparked backlash that big tech is getting richer by using data without compensation.
Experts see this as the start of a war over data ownership among IT companies. A fierce battle is underway between those who hold large amounts of data and those who want to use it for AI. Kim Myung-joo, professor of Information Security at Seoul Women’s University, said, "The atmosphere that tolerated data use for public purposes (new technology development) under the fair use principle is collapsing," adding, "Determining what data companies use and what rights are infringed is a very complex issue, so the debate will continue."
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
![[Big Tech Paradigm Shift]② Give Away Data for Free? ... Companies Presenting the Bill](https://cphoto.asiae.co.kr/listimglink/1/2023042708463512897_1682552795.jpg)
![[Big Tech Paradigm Shift]② Give Away Data for Free? ... Companies Presenting the Bill](https://cphoto.asiae.co.kr/listimglink/1/2023042416112899330_1682320289.jpg)

