"Improved English-Based LLM" Demonstrates Potential for High-Performance Korean LLM Development

A domestic research team has demonstrated that it is possible to develop a high-performance Korean Large Language Model (LLM) by improving an English-based LLM, without incurring astronomical costs.


An LLM is an artificial intelligence (AI) model that learns from vast amounts of text data to understand and generate human language.


"Improved English-Based LLM" Demonstrates Potential for High-Performance Korean LLM Development (From left) Youngjun Son PhD candidate, Yeongyeong So PhD candidate, Chanwoo Park Master’s candidate, Jaejin Lee Professor (Principal Investigator), Jinpyo Kim PhD candidate, Jiheon Seok PhD candidate, Gyeongje Jo PhD candidate, Jongwon Park Master’s candidate, Jongmin Kim PhD candidate, and other research team members are posing for a commemorative photo. Provided by Professor Jaejin Lee’s research team at Seoul National University

According to the National Research Foundation of Korea on July 4, Professor Jaejin Lee's research team at Seoul National University has adapted the English-based language model "Llama" into "Llama-Thunder-LLM," a Korean-specialized language model, and has also developed "Thunder-Tok," a tokenizer dedicated to Korean, and the "Thunder-LLM Korean Benchmark," which objectively evaluates the performance of Korean LLMs. The team has released all of these resources online.


A tokenizer is a tool that splits a sentence into smaller units (tokens) so that a language model can process it. A benchmark is a tool for measuring and evaluating performance against a fixed standard, and benchmarks are used in fields ranging from computer hardware (HW) and software (SW) to management strategy.
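To make the idea concrete, the sketch below shows a sentence being split into tokens and mapped to integer IDs. It assumes the Hugging Face transformers library and uses the public "gpt2" tokenizer purely as an illustration; the article does not specify the team's tooling.

```python
# Minimal tokenizer illustration (assumes the Hugging Face "transformers"
# package; "gpt2" is an example tokenizer, not Thunder-Tok).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Large language models read text as tokens."
tokens = tokenizer.tokenize(text)   # the sentence split into subword units
ids = tokenizer.encode(text)        # the same units mapped to integer IDs

print(tokens)  # e.g. ['Large', 'Ġlanguage', 'Ġmodels', ...]
print(ids)
```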


AI technology utilizing LLMs is attracting global attention. In Korea as well, interest in developing language models specialized for Korean is gradually increasing.


Building an LLM requires massive amounts of data, yet in practice the available data falls far short of those requirements, and the development process incurs enormous costs. As a result, LLM development has mainly been carried out by large corporations and overseas big tech companies.


It has not been easy for small and medium research institutes or universities to conduct research and development related to LLMs.


The research team, by contrast, has achieved a breakthrough that overcomes these limitations. It independently handled every stage of language model training, from data collection to post-training, and demonstrated that a high-performance language model can be built with limited resources, much as China's LLM "DeepSeek" was.


Although the team utilized an open-source English model, the technologies they applied include all the techniques necessary for developing an independent model. This demonstrates that the research team possesses the technical capabilities to develop high-performance proprietary language models.


The "Llama-Thunder-LLM" developed by the research team is a Korean-specialized large language model built by collecting and preprocessing 3TB of Korean web data and then applying improvement techniques such as continual pre-training and post-training to the publicly released Llama model.


Continual pre-training refers to the process of expanding a model’s capabilities by training it with new data, while post-training means additional fine-tuning to enhance performance in specific tasks such as user question-and-answer interactions.
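As a rough illustration of these two stages, the sketch below continues next-token training of a base model on new raw text, which is the essence of continual pre-training. It assumes the Hugging Face transformers and datasets libraries; the model name, corpus file, and hyperparameters are placeholders, not the team's actual pipeline.

```python
# A minimal sketch of continual pre-training (assumed stack: Hugging Face
# transformers + datasets; "gpt2" and "korean_corpus.txt" are placeholders).
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token              # gpt2 lacks a pad token

# Continual pre-training: keep training the base model on new raw text
# (here, a hypothetical Korean corpus) with the same next-token objective.
raw = load_dataset("text", data_files={"train": "korean_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt-cpt", num_train_epochs=1),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()

# Post-training would then fine-tune this checkpoint on task-formatted
# examples (e.g. "Question: ...\nAnswer: ..." strings) with the same
# machinery, steering the model toward tasks such as answering questions.
```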


The tokenizer “Thunder-Tok,” which reflects the grammatical characteristics of Korean, reduced the number of tokens by 44% compared to the original Llama tokenizer, thereby improving both inference speed and training efficiency. Since current AI models generate one token at a time, reducing the number of generated tokens results in lower operating costs.
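The sketch below shows how such a token-count comparison can be measured: tokenize the same corpus with two tokenizers and compare totals. The two public tokenizers here are stand-ins chosen only because they segment Korean very differently; neither is the Llama tokenizer or Thunder-Tok.

```python
# Comparing token counts of two tokenizers on the same text (stand-in
# tokenizers; neither is the Llama tokenizer or Thunder-Tok).
from transformers import AutoTokenizer

baseline = AutoTokenizer.from_pretrained("gpt2")
candidate = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

corpus = ["한국어 문장을 토큰으로 나눕니다.", "Tokenizers split text into units."]

n_base = sum(len(baseline.tokenize(s)) for s in corpus)
n_cand = sum(len(candidate.tokenize(s)) for s in corpus)

# Fewer tokens per sentence means fewer generation steps, hence lower cost.
print(f"baseline: {n_base} tokens, candidate: {n_cand} tokens")
print(f"reduction: {1 - n_cand / n_base:.1%}")
```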


The “Thunder-LLM Korean Benchmark,” which includes a Korean evaluation dataset independently developed by the research team, provides a foundation for objectively and systematically evaluating the performance of Korean LLMs. A dataset refers to a structured collection of data used for purposes such as AI model training, testing, data visualization, research, or statistical analysis.
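In benchmark evaluation the core loop is simple: pose each question to the model, compare its answer to the reference, and report a score. The sketch below uses a plain exact-match metric and an invented two-field dataset format; the Thunder-LLM benchmark's actual schema and metrics are not described in the article.

```python
# A minimal benchmark-style evaluation loop (exact-match accuracy over a
# hypothetical question/answer dataset format).
def evaluate(answer_fn, dataset):
    correct = 0
    for example in dataset:
        prediction = answer_fn(example["question"])
        if prediction.strip() == example["answer"].strip():  # exact match
            correct += 1
    return correct / len(dataset)

# Usage with a trivial stand-in "model":
sample = [{"question": "2 + 2 = ?", "answer": "4"}]
print(evaluate(lambda q: "4", sample))  # -> 1.0
```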


Professor Jaejin Lee stated, “This research demonstrates that academia, not just large corporations or overseas big tech companies, can develop independent LLMs, and it is a meaningful achievement that contributes to Korea’s sovereign AI. The research team has made the Korean-based LLM, tokenizer, benchmark dataset, and the entire development process available online, laying the foundation for anyone to use them for follow-up or reproducibility studies.”


Sovereign AI is a term that combines “sovereign,” meaning autonomous or possessing sovereignty, with AI, and refers to AI systems that a particular country can operate and control independently within its own borders.


The research results have been made publicly available on the website of the “Center for Optimization of Hyper-scale AI Models and Platforms” so that anyone can freely use them.


Meanwhile, this research was supported by the Leading Research Center (ERC) program promoted by the Ministry of Science and ICT and the National Research Foundation of Korea.


© The Asia Business Daily (www.asiae.co.kr). All rights reserved.

