A.X Encoder: Fast and Efficient Processing of Long Documents
A.X 4.0 VL Light: Superior Business Performance at Lower Cost
SK Telecom has unveiled universal document-interpretation technology for vision-language model (VLM) and large language model (LLM) training, built on its proprietary large language model A.X ("A-Dot-X").
On July 29, SKT released two models, the 'A.X Encoder' and the 'A.X 4.0 Vision Language Light (VL Light)', on the open-source platform Hugging Face. Both are freely available for academic research and commercial use.
SKT explained that the A.X Encoder was developed for use across the entire data processing pipeline behind the A.X models. In natural language processing, an encoder is a core component that converts input sentences into contextual representations, which in turn support a wide range of downstream tasks. By modeling the relationships among all the words in a sentence, it captures the overall meaning and context.
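For readers who want to experiment, the sketch below shows the standard Hugging Face pattern for extracting contextual token representations from an encoder. The repository ID "skt/A.X-Encoder-base" is an assumption for illustration; the official model card lists the actual identifier.

```python
# Minimal sketch of encoder feature extraction with Hugging Face transformers.
# The repo ID below is an assumption; check SKT's Hugging Face page for the real one.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "skt/A.X-Encoder-base"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

text = "SKT가 두 모델을 허깅페이스에 공개했다."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token: shape (batch, seq_len, hidden_size).
# Each vector reflects the whole sentence, not just the word in isolation.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```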
The A.X Encoder, with 149 million (149M) parameters, processes long documents quickly and efficiently, making it well suited to large-scale LLM training. On natural language understanding benchmarks it achieved a state-of-the-art (SOTA) average score of 85.47, surpassing the 80.19 posted by 'RoBERTa-base', the model the KLUE team released based on a global open-source architecture.
The A.X Encoder accepts inputs of up to 16,384 tokens and delivers up to three times the inference speed and twice the training speed of existing models. Where previous models handled 512 tokens, typically a few sentences or a paragraph, the A.X Encoder processes far larger contexts quickly and efficiently. SKT expects this large-scale, high-speed document processing technology to benefit not only LLM training but also a range of AI-based document processing tasks.
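To see why the longer window matters, the following sketch contrasts the chunking a 512-token encoder would require with a single pass inside a 16,384-token window. The repository ID and chunking parameters are illustrative assumptions, not the vendor's code.

```python
# Illustrative comparison: a 512-token encoder must split a long document into
# overlapping chunks, while a 16,384-token window can often encode it in one pass.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skt/A.X-Encoder-base")  # assumed repo ID

def chunk(ids, window=512, stride=384):
    """Overlapping windows a short-context (512-token) encoder would need."""
    return [ids[i:i + window] for i in range(0, len(ids), stride)]

document = "긴 보고서의 한 문단입니다. " * 2000  # stand-in for a long document
ids = tokenizer(document, add_special_tokens=False)["input_ids"]

chunks = chunk(ids)
print(f"{len(ids)} tokens -> {len(chunks)} chunks for a 512-token encoder;")
print("a 16,384-token window can encode the same document in a single pass.")
```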
The A.X 4.0 VL Light is a vision-language model (VLM) trained on a large-scale multimodal Korean dataset. It excels not only in understanding visual and linguistic information related to Korean but also in business applications such as interpreting tables, graphs, and manufacturing drawings. It was developed based on the A.X 4.0 Light model, which has 7 billion (7B) parameters.
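A minimal inference sketch in the common Hugging Face vision-language pattern is shown below. The repository ID, model class, and prompt format are assumptions; the actual interface is documented on the model card.

```python
# Hypothetical VLM inference sketch. Repo ID, model class, and prompt handling
# are assumptions; consult the published model card for the real interface.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "skt/A.X-4.0-VL-Light"  # assumed repository ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

image = Image.open("chart.png")  # e.g., a Korean business chart or table
prompt = "이 차트의 핵심 내용을 요약해 주세요."  # "Summarize the key points of this chart."

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=256)

print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```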
The A.X 4.0 VL Light achieved an average score of 79.4 on Korean visual benchmarks, outperforming the far larger Qwen2.5-VL-32B (73.4). On Korean text benchmarks, it ranked among the top domestic models with an average score of 60.2.
On K-Viscuit, a multimodal benchmark that evaluates understanding of Korean culture and context, it scored 80.2; on KoBizDoc, which focuses on complex document structures, charts, and tables, it scored 89.8. Notably, for the same Korean input it uses about 41% fewer text tokens than Qwen2.5-VL-32B, reducing processing costs.
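Token-efficiency claims of this kind can be sanity-checked by tokenizing the same Korean text with both models' tokenizers, as in the sketch below. The repository IDs are assumptions, and actual counts depend on the text and each model's vocabulary.

```python
# Sketch: compare how many tokens each tokenizer needs for the same Korean text.
# Repo IDs are assumptions; savings vary with the input text.
from transformers import AutoTokenizer

korean_text = "에스케이텔레콤은 두 모델을 허깅페이스에 공개했다."

for repo in ("skt/A.X-4.0-VL-Light", "Qwen/Qwen2.5-VL-32B-Instruct"):
    tok = AutoTokenizer.from_pretrained(repo)
    n = len(tok(korean_text)["input_ids"])
    print(f"{repo}: {n} tokens")
```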
Previously, SKT released two A.X 4.0 models (standard and lightweight) built through large-scale continual pre-training (CPT), as well as two A.X 3.1 models (standard and lightweight) developed from scratch. The company plans to further improve LLM usability and performance through upcoming releases, including an A.X 4.0 reasoning model.
Kim Taeyoon, head of Foundation Models at SK Telecom, stated, "Securing proprietary technology is key to sovereign AI. We will continue to strengthen our own capabilities and accelerate collaboration with consortium partners to achieve world-class AI competitiveness."
© The Asia Business Daily (www.asiae.co.kr). All rights reserved.