Concerns Rise Over Renewed Independence Debate as Multimodal Competition Intensifies
"The second round of the government's 'Independent AI Foundation Model (Dokpamo)' project will be a battle of 'multimodal' models. The surviving teams, having witnessed Naver's elimination in the first round, will be watching closely to ensure they do not repeat the same mistakes."
According to the IT industry on January 28, while the elite teams competed with text-based large language models (LLMs) in the first round, the next phase will require upgrading to multimodal models that can also process photos and videos, with some teams even attempting omnimodal models that understand speech. Since achieving this independently in a short period is even more challenging than the previous round, first-round winners such as LG AI Research, SK Telecom, and Upstage are developing their models while revealing as little of their strategies as possible.
Bohyung Han, Professor of Electrical and Computer Engineering at Seoul National University, stated, "Each company may have a different development process. In the first evaluation, the focus was on language-based models, but Naver attempted to differentiate itself by pursuing an omnimodal approach, which led to allegations of encoder plagiarism and ultimately resulted in its elimination. Therefore, the elite teams undergoing the second evaluation must be even more meticulous."
While the first-round teams focused on text-based models, the upcoming challenges will require AI models to handle conversations involving images, videos, and speech, which may expose differences in each company's technological capabilities. In particular, as the Naver Cloud case showed, the government has made clear that using pre-trained weights from open-source models will be penalized, putting even more pressure on the competitors.
An AI industry insider commented, "It is crucial to continuously reinvest in and maintain a high-performing model even after creating an LLM. In this respect, Naver, with its robust infrastructure, was considered a strong contender. However, since Naver was eliminated for adopting the vision encoder and weights from Alibaba Qwen, other companies will continue to grapple with similar concerns."
An AI industry representative, who requested anonymity, said, "During the first round, Naver proposed an omnimodal approach while the other teams submitted text-based models. However, other competitors may well face the same issues as Naver in subsequent stages. If the multimodal competition intensifies, the controversy over independence at the encoder and module level could resurface."
Implementing Multimodal Models Is More Challenging Than Text-Based LLMs... Meeting the 'Independence' Standard Remains a Key Task
As expectations grow that models encompassing images, speech, and video, beyond text-based LLMs, will be the key variable in the second evaluation, both leading teams and challengers are accelerating their preparations for multimodal capabilities.
A senior official at LG AI Research said, "Starting with Exaone 1.0, we have developed models with multimodality in mind and have already secured the related technologies. Unlike other consortiums that must now devise new approaches to multimodal implementation ahead of the second evaluation, we are at the stage of advancing technologies we have already accumulated."
SK Telecom is also preparing multimodality as a major focus for the second evaluation. The SKT elite team, which demonstrated its performance in the first round with 'A.X K1', an LLM with more than 500 billion parameters, plans to apply multimodality in stages starting with image data in the second phase. Beginning with features that recognize and summarize images from research papers and work documents, the team aims to expand to speech and video processing in the second half of the year.
Startups such as Upstage and Trillion Labs are also preparing for multimodality, but their approaches are more cautious. Soonil Kwon, Vice President of Upstage, said, "During development, we may decide that it is better to implement multimodality sooner, and that aspect remains flexible. For now, the key is to develop a model with a high level of knowledge." Lim Junghwan, CEO of Motif Technologies, who is reapplying with a focus on independence, emphasized, "We are the only startup with experience developing both high-performance LLMs and large-scale multimodal models as foundation models."
Trillion Labs, which has also thrown its hat into the ring for the additional selection of elite teams, is preparing its technology with multimodal models in mind and is actively hiring more developers. A Trillion Labs representative added, "Research on multimodal models such as vision-language models (VLMs) has already been conducted internally. If we are selected as an elite team and move into the second phase, multimodal models that combine non-text data such as images will inevitably become a key option."
© The Asia Business Daily (www.asiae.co.kr). All rights reserved.