First Announcement of the Independent AI Foundation Model Project
"Surpassing the Limits of Text"… Unveiling Omnimodal HyperCLOVA X
Top Scores in Major CSAT Subjects… Direct Photo Understanding Without Text Input
Nakho Sung, Head of Hyperscale AI Technology at Naver Cloud, is presenting at the first announcement event of the Ministry of Science and ICT's "Independent AI Foundation Model" project held on the 30th at COEX, Seoul. Photo by Yujin Park
"A large language model (LLM) is like a brain that has only read books and studied. It has a wealth of knowledge, but it has never seen, heard, or touched the real world."
Nakho Sung, Head of Hyperscale AI Technology at Naver Cloud, explained the limitations of existing LLMs in this way at the first announcement event of the Ministry of Science and ICT's "Independent AI Foundation Model" project, held on December 30 at COEX, Seoul. While these models excel at understanding text, they lack the sensory capabilities needed to solve complex real-world problems. This is precisely where Naver Cloud's newly introduced "Omnimodal HyperCLOVA X" begins.
On this day, Naver Cloud unveiled two open-source models: "HyperCLOVA X Seed 8B Omni," the first domestically developed model to adopt a native omnimodal structure, and "HyperCLOVA X Seed 32B Sync," which combines visual, audio, and tool-utilization capabilities with inference-based AI. The company announced its intention to "accelerate the realization of AI agents that anyone can use in daily life and industrial settings." Omnimodal refers to a single model capable of integrated understanding and generation across different data types, such as audio, images, and video.
The 8B Omni model can integratively understand context within a unified semantic space, even when information takes different forms. This makes it a next-generation technology with high applicability in real-world environments, where spoken and written language, as well as visual and audio information, are exchanged in complex ways. Naver Cloud emphasized that global big tech companies are also positioning omnimodal models as a core pillar of next-generation foundation models.
In particular, Sung cited the processing of graphs and charts, which frequently appear in industrial settings, as a representative example. He said, "Text-based LLMs cannot directly recognize graphs, requiring separate integration with technologies like OCR, which leads to semantic loss and additional implementation costs." He added that, in contrast, the omnimodal approach "can understand entire images and grasp the organic relationships within information, thereby reducing development and operational costs."
The 32B Sync model, also released by Naver Cloud, combines visual comprehension, voice conversation, and tool-use capabilities with inference-based AI to deliver an omnimodal agent experience capable of understanding and solving complex inputs and requests. According to Naver Cloud, the model performs within a similar range to leading global AI models on indices calculated by Artificial Analysis, a global AI evaluation firm, which aggregates ten major benchmarks spanning general knowledge, advanced reasoning, coding, and agentic tasks. The company added that it demonstrated particular competitiveness in practical areas such as Korean-language general knowledge, visual understanding, and tool-based agent performance.
The results of the national college entrance exam problem-solving were also released. Naver Cloud reported that the 32B Sync model achieved top-tier (Grade 1) results in all major subjects of this year's College Scholastic Ability Test, including Korean, mathematics, English, and Korean history, and even achieved perfect scores in English and Korean history. The company especially highlighted that the model could solve problems directly from photographs, without re-entering the questions as text. Sung stated, "Even when compared to much larger models, it demonstrates similar problem-solving capabilities, but with significantly lower development and operational costs, making it a much more cost-efficient model."
Sung added, "We have confirmed that expanding the sensory capabilities of AI horizontally, while simultaneously strengthening reasoning and inference skills, greatly enhances its ability to solve real-world problems. Building on a solid foundational structure, we will gradually scale up to develop AI that is not just large in size, but truly useful in practice."
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.