KAIST Develops Training Optimization Simulation for Ultra-Large AI Models Such as ChatGPT

A training optimization simulation that derives the optimal parallelization configuration, increasing GPU utilization and reducing training costs for ultra-large artificial intelligence (AI) models such as ChatGPT and DeepSeek, has been developed in Korea. The achievement is significant because training time and cost vary greatly depending on how the training workload is parallelized and distributed.


Large language models are trained on large-scale distributed systems equipped with tens of thousands of data-center GPUs; in the case of GPT-4, the cost of training the model is estimated to approach 140 billion KRW.


KAIST announced on the 13th that a research team led by Professor Minsoo Yoo of the Department of Electrical and Electronic Engineering, together with Samsung Electronics’ Samsung Advanced Institute of Technology, has developed a simulation framework (hereinafter vTrain) that can predict and optimize the training time of large language models (LLMs) on large-scale distributed systems.


Finding the optimal distributed training strategy is essential for improving the training efficiency of large language models. However, the vast number of possible strategies, and the enormous cost and time required to test each one on real hardware, make that search impractical.


This is why companies training large language models rely on only a handful of empirically verified strategies. The result is inefficient GPU utilization and unnecessary cost, yet without simulation technology for large-scale systems, companies have been unable to address the problem effectively.


Accordingly, the joint research team developed vTrain to accurately predict the training time of large language models and quickly explore various distributed parallelization strategies.
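
As an illustration only, the sketch below shows how a simulator such as vTrain could be used to search a parallelization space: it enumerates ways to split a fixed GPU budget into data-, tensor-, and pipeline-parallel degrees and keeps the configuration with the lowest predicted iteration time. The `predict_iteration_time` function is a hypothetical stand-in with a toy cost model, not the framework's actual predictor or API.

```python
from itertools import product

def predict_iteration_time(dp: int, tp: int, pp: int) -> float:
    """Hypothetical stand-in for a profiling-based predictor (toy cost model).

    A real simulator would estimate per-iteration time from measured kernel
    and communication profiles; this placeholder only makes the example run.
    """
    compute = 100.0 / (dp * tp * pp)           # ideal compute scaling
    comm = 0.5 * (tp - 1) + 0.3 * (dp - 1)     # toy communication overhead
    bubble = 2.0 * (pp - 1) / pp               # toy pipeline-bubble penalty
    return compute + comm + bubble

def search_parallel_configs(num_gpus: int):
    """Enumerate (data, tensor, pipeline) degrees whose product equals
    num_gpus and return the configuration with the lowest predicted time."""
    best = None
    for dp, tp, pp in product(range(1, num_gpus + 1), repeat=3):
        if dp * tp * pp != num_gpus:
            continue
        t = predict_iteration_time(dp, tp, pp)
        if best is None or t < best[1]:
            best = ((dp, tp, pp), t)
    return best

if __name__ == "__main__":
    config, t = search_parallel_configs(64)
    print(f"best (dp, tp, pp) = {config}, predicted iteration time = {t:.2f}")
```

Because the predictor is only a simulation, every candidate configuration can be evaluated in this way without occupying real GPUs, which is the point of the approach described above.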


(From left) Professor Minsoo Yoo, Ph.D. candidate Jehyeon Bang, and Dr. Yujeong Choi of the Department of Electrical and Electronic Engineering, KAIST. Provided by KAIST

When vTrain’s predictions were compared with actual training time measurements of various large language models in multi-GPU environments, it predicted training time with a mean absolute percentage error (MAPE) of 8.37% on a single node (8 A100 GPUs) and 14.73% across multiple nodes (up to 512 A100 GPUs).
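
For readers unfamiliar with the metric, the mean absolute percentage error quoted above compares predicted and measured training times as in the minimal sketch below; the numbers are purely illustrative and are not data from the study.

```python
def mape(measured: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error between measured and predicted values."""
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Illustrative values only (e.g., seconds per training iteration), not from the study.
measured_times = [12.0, 30.5, 55.0, 110.0]
predicted_times = [11.1, 33.0, 50.2, 118.0]

print(f"MAPE = {mape(measured_times, predicted_times):.2f}%")
```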


The joint research team expects that vTrain will contribute to establishing the optimal distributed training strategy in data center environments by providing the ability to quantitatively evaluate various parallelization techniques and predict training time.


Through this, GPU resources can be utilized as efficiently as possible, training costs can be reduced, and the efficiency of operating large-scale AI systems can be further enhanced, according to the joint research team.


In particular, the joint research team has open-sourced the vTrain framework along with more than 1,500 actual training time measurements so that AI researchers and companies can freely use them.


Professor Minsoo Yoo of KAIST said, “vTrain is a profiling-based simulation technique that can find training strategies that increase GPU utilization and reduce training costs compared with existing empirical methods,” adding, “By open-sourcing it, we expect companies to reduce the training costs of ultra-large AI models more efficiently.”


Meanwhile, this research was conducted with support from the National Research Foundation of Korea, the Institute for Information & Communications Technology Planning & Evaluation, and Samsung Electronics. The results were presented last November at MICRO, the international symposium on microarchitecture jointly organized by the IEEE and ACM.


© The Asia Business Daily (www.asiae.co.kr). All rights reserved.
