Concerns Rise Over Renewed Independence Debate as Multimodal Competition Intensifies
"The second round of the government's 'Independent Artificial Intelligence (AI) Foundation Model (Dokpamo)' project will be a battle of 'multimodal' capabilities. The surviving teams, having witnessed Naver's elimination in the first round, will be closely watched to see if they avoid repeating the same mistakes."
According to the IT industry on January 28, while the elite teams competed with text-based large language models (LLMs) during the first selection phase, the next stage will focus on upgrading to multimodal models that can comprehensively understand photos and video, with some teams even attempting omnimodal models that also process speech. Since developing these capabilities independently in a short period is an even greater challenge than before, the teams selected in the first round, such as LG AI Research, SK Telecom, and Upstage, are developing their models while revealing as little of their strategies as possible.
Bohyung Han, a professor in the Department of Electrical and Computer Engineering at Seoul National University, said, "Each company may take a different development path. The first evaluation focused on language-based models, but Naver took an omnimodal approach to differentiate itself, which led to allegations of encoder plagiarism and ultimately to its elimination. The elite teams facing the second evaluation will therefore have to be even more meticulous."
While the existing teams built text-based models, the next challenge is to enable AI models to handle conversations that combine images, video, and speech, which will expose the technological gaps between companies. In particular, as the Naver Cloud case showed, the government has made clear that using pre-trained open-source weights will incur penalties, leaving competitors even busier as they prepare.
An AI industry insider commented, "Sustained reinvestment in a large language model (LLM) is essential to maintaining a high-quality model over time. In that respect, Naver, with its infrastructure, was considered a strong contender. But since Naver was disqualified for adopting the vision encoder from Alibaba's Qwen and using its pre-trained weights, other companies will face the same dilemma going forward."
An AI industry executive, who requested anonymity, said, "In the first round, Naver stood out by proposing an omnimodal approach while the other teams stayed text-based, but once the competition moves beyond text in the next stage, its rivals could run into the same issues Naver did. If the multimodal competition intensifies, controversies over independence at the encoder and module level could resurface."
More Challenging Than Text-Based LLMs... The Task of Meeting 'Independence' Standards
As the view spreads that the key variable in the second evaluation will be models that encompass images, speech, and video rather than text-based LLMs alone, both the major elite teams and the challengers are accelerating their multimodal preparations.
A senior official at LG AI Research said, "Since Exaone 1.0, we have been developing models with multimodal capabilities in mind and already possess related technologies. Unlike other consortiums that must now reconsider how to implement multimodality for the second evaluation, we are at a stage of advancing our accumulated technologies."
SK Telecom is also preparing multimodality as a major pillar of the second evaluation. The SKT elite team proved its performance in the first round with 'A.X K1,' a hyperscale LLM with over 500 billion parameters, and plans to apply multimodality in stages in the second phase, starting with image data. Beginning with features that recognize and summarize images of academic papers and work documents, it aims to expand to speech and video processing in the second half of the year.
Startups such as Upstage and Trillion Labs are also preparing for multimodality, but their approaches are more cautious. Kwon Soonil, Vice President of Upstage, said, "During development, we may decide that it is better to implement multimodality sooner, and that part is quite flexible. For now, the key is to create a model with a high level of knowledge." Lim Jeonghwan, CEO of Motif Technologies, who is reapplying with a focus on independence, emphasized, "We are the only startup with experience developing both high-performance LLMs and large multimodal models as foundation models."
Trillion Labs, which has declared its bid in the additional recruitment of elite teams, is likewise preparing its technology with multimodal models in mind and continues to hire more developers. A Trillion Labs representative said, "We have already been conducting internal research on multimodal models such as vision-language models (VLMs). If we are selected as an elite team and turn our focus to the second stage, multimodal models that incorporate non-text data such as images will inevitably become an important option."
© The Asia Business Daily (www.asiae.co.kr). All rights reserved.
![[The National AI Team Trapped by the 'Independence' Dilemma] Aftershocks from Naver... Intensifying Multimodal Competition ②](https://cphoto.asiae.co.kr/listimglink/1/2026012716181017555_1769498290.png)
![[The National AI Team Trapped by the 'Independence' Dilemma] Aftershocks from Naver... Intensifying Multimodal Competition ②](https://cphoto.asiae.co.kr/listimglink/1/2026012810121418606_1769562734.png)

