
Naver Completes Development of Next-Generation 'Omni-modal' AI, Set for Release as Early as This Month

Learning and Reasoning by Integrating All Information
A Turning Point for Generative AI Expected

Naver has virtually completed the development of 'Omni-modal,' a next-generation artificial intelligence (AI) technology that learns and infers all information at once, surpassing the level of understanding text, images, and voice individually, and plans to unveil it as early as the end of this month. Yonhap News

Naver has virtually completed the development of "Omni-modal," a next-generation artificial intelligence (AI) technology that learns and infers all information at once, going beyond understanding text, images, and voice individually. Unlike "multi-modal" models, which connect images or voice to words, omni-modal is distinguished by significantly greater speed and breadth of understanding. If released at the end of this month, it is expected to mark a turning point that could fundamentally change the structure of generative AI.


According to the ICT industry on December 19, Naver is preparing a new generative AI model based on its own AI platform, HyperCLOVA X. Rather than processing text, images, and voice separately, the model is built around an "omni-modality" structure that integrates and understands different types of information from the learning stage.


In South Korea's AI industry, competition in multi-modal technology has already become intense. Companies such as NC AI are achieving results in multi-modal AI that combines various data types (text, voice, images, and motion) for content creation. This approach involves processing each modality in a sophisticated way and then connecting them, which allows for rapid application in real-world services.


The omni-modal approach that Naver emphasizes differs from multi-modal in its aim. While multi-modal focuses on "effectively combining different types of information," omni-modal is designed so that text, images, voice, and video are understood simultaneously within a single recognition system from the learning stage. This enables comprehensive judgment that takes in situations, context, and environment. As a result, the model is seen not merely as an extension of HyperCLOVA X's capabilities, but as a fundamental redesign of the information processing method itself.
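To make the distinction concrete, the sketch below contrasts the two design patterns in simplified PyTorch code. It is a conceptual illustration only: the module names, dimensions, and layer choices are assumptions made for the example and do not describe Naver's actual HyperCLOVA X architecture.

```python
import torch
import torch.nn as nn

# Conceptual sketch only: a simplified contrast between late-fusion multi-modal
# and unified omni-modal processing. All names and sizes are hypothetical.

D = 512  # shared embedding width


class LateFusionMultiModal(nn.Module):
    """Multi-modal pattern: each modality is encoded separately,
    and the encoded results are combined (fused) afterwards."""

    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(300, D)    # stand-in for a text encoder
        self.image_encoder = nn.Linear(2048, D)  # stand-in for a vision encoder
        self.audio_encoder = nn.Linear(128, D)   # stand-in for an audio encoder
        self.fusion = nn.Linear(3 * D, D)        # late fusion of the three outputs

    def forward(self, text, image, audio):
        t = self.text_encoder(text)
        i = self.image_encoder(image)
        a = self.audio_encoder(audio)
        return self.fusion(torch.cat([t, i, a], dim=-1))


class UnifiedOmniModal(nn.Module):
    """Omni-modal pattern: every modality is projected into one shared token
    sequence and processed jointly by a single backbone from the start."""

    def __init__(self):
        super().__init__()
        self.to_tokens = nn.ModuleDict({
            "text": nn.Linear(300, D),
            "image": nn.Linear(2048, D),
            "audio": nn.Linear(128, D),
        })
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> tensor of shape (batch, seq, dim)
        tokens = [self.to_tokens[name](x) for name, x in inputs.items()]
        sequence = torch.cat(tokens, dim=1)  # one interleaved token stream
        return self.backbone(sequence)       # joint attention across modalities
```

The practical difference is where fusion happens: the multi-modal pattern merges outputs that were encoded independently, while the omni-modal pattern lets a single backbone attend across all modalities from the first layer onward.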


Omni-modal technology, by understanding various types of information such as images, voice, and video simultaneously, comes closer to the way humans perceive the world. While the performance of existing language-centric AI services has depended on how well questions are asked, an omni-modal model can grasp intent by synthesizing surrounding context and visual or auditory information, even when a question is not precisely formulated.


For example, if the model is trained not only on the Korean language but also on diverse image data such as Korean street scenes, K-pop artists, and trending fashion, it can implement an AI model with a deep understanding of Korean society and culture. Because it observes and learns about users in a multidimensional way, the service can evolve to deliver higher user satisfaction the more it is used.


Naver plans to first introduce a lightweight version of the omni-modal model, rather than a large-scale model. The strategy is to reliably verify the new development methodology and then gradually expand the model by deploying advanced GPUs and additional data. The new model's name has not yet been finalized.


This direction is also being concretized in the government's "independent AI foundation model" project. Naver Cloud, one of the five companies selected as a lead operator for the project, is developing the "Omni Foundation Model" by combining Naver's language and voice-based multi-modal technology with video AI technology from Twelve Labs, a specialist in the field.


The Naver Cloud consortium plans to provide AI services based on the Omni Foundation Model that anyone can easily experience. To this end, it will support individuals and companies in developing, registering, and distributing AI agents directly through an AI agent marketplace. The strategy also includes building a K-AI global export model based on its experience with sovereign AI, and releasing lightweight and inference-specialized models as open source to broaden their utility.


Overseas, AI models based on the omni-modal concept have already appeared. OpenAI's GPT-4o (Omni), released last year, is a generative AI that processes text, images, and voice in real time within a single model. Unlike previous approaches that required separate voice recognition, image processing, and language models, it is designed to handle everything within a unified model system, enabling more natural interactions.
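For reference, a single mixed-input request to such a unified model can look like the minimal example below, written against the publicly documented OpenAI Python SDK. The image URL is a placeholder, and the snippet is unrelated to Naver's forthcoming model.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# One request carries both text and an image; the same unified model handles
# both inputs without a separate vision-recognition pipeline in front of it.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this photo."},
                {
                    "type": "image_url",
                    # Placeholder URL used for illustration only.
                    "image_url": {"url": "https://example.com/street-scene.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```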


© The Asia Business Daily (www.asiae.co.kr). All rights reserved.
