- Paper on Real-Time Broadcast Conversation Analysis Published in World’s Leading NLP Conference ‘EMNLP’
- Sogang University and KAIST Researchers Officially Announce Hate Speech Dataset ‘KOLD’
AI training data platform SelectStar (co-CEOs Kim Se-yeop and Shin Ho-wook) announced on the 28th that its co-authored paper, "Analyzing Norm Violations in Real-Time Live-Streaming Chat," was officially presented at EMNLP 2022. EMNLP (Empirical Methods in Natural Language Processing) is considered one of the top three global conferences in the field of natural language processing, alongside ACL and NAACL.
The paper studies the chat messages that viewers post during live internet broadcasts. SelectStar co-authored it with NLP-specialized startup SoftlyAI and researchers from the University of Southern California (USC).
The research team focused on the fact that most existing AI systems for detecting hate speech have been trained only on "asynchronous conversation data" collected from social media services. Asynchronous conversations are exchanges in which participants do not engage at the same time, such as emails and comments. The study found that AI trained on asynchronous conversations achieved less than 35% accuracy when filtering hate speech in real-time live-streaming chats.
To collect real-time conversation data, SelectStar processed 4,664 chat logs from the internet broadcasting platform Twitch into AI training data. Data processors reviewed the broadcast video and the chat logs from the preceding two minutes, then classified and annotated hate speech within each sentence according to the flow of the conversation.
Kim Min-woo, a researcher on SelectStar's AI team, stated, "Until now, there was no AI training data that classified harmful expressions in real-time streaming environments. Through this research, we were able to confirm how and to what extent the data distributions differ between asynchronous conversation situations and real-time streaming situations."
The KOLD dataset, which SelectStar built together with researchers from Sogang University and KAIST, was also officially presented at EMNLP 2022. KOLD (Korean Offensive Language Dataset) is a Korean hate speech dataset consisting of 40,429 hate comments collected from platforms such as Naver News and YouTube, along with their annotations. SelectStar provided a data processing environment to 3,124 workers through its crowdsourcing platform CashMission.
SelectStar CEO Kim Se-yeop said, "The datasets created by SelectStar have been introduced not only at CVPR and NeurIPS, the most prestigious conferences in the AI field, but also at EMNLP. SelectStar is the only company in the domestic AI training data market whose data quality and reliability have been proven sufficient for use in cutting-edge research."
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.


