‘Financial Specialized Korean Corpus’
A Large-Scale Korean Language Dataset
Available for Application via the Financial Settlement Service Data Sharing Platform
The Financial Services Commission announced that, as a follow-up measure to the "Support Plan for the Use of Generative Artificial Intelligence (AI) in the Financial Sector" announced last December, it will provide a "financial specialized Korean corpus" starting from the 31st.
The financial specialized Korean corpus is a large-scale Korean language data set collected in a form that AI models can process, handle, and analyze various specialized knowledge in the financial sector.
Until now, financial companies have utilized commercial AI developed overseas for general users. However, there has been a lack of specialized data such as financial terms in Korean and Korean financial regulations, causing difficulties in performing financial specialized tasks.
The financial specialized Korean corpus is provided in various forms to be used for AI models' learning of financial expertise, improving the accuracy of responses, and evaluating performance and ethics. First, a training corpus is supported. The pre-training corpus for learning general knowledge in the financial sector utilized financial terminology dictionaries and general financial knowledge materials from the Financial Supervisory Service, the Korea Federation of Banks, and others. The additional training corpus for service development was built using domestic financial policy and system explanation materials, financial regulations and guidelines, and basic training materials from the Korea Insurance Training Institute.
A corpus for retrieval-augmented generation (RAG) is also supported to enable AI models to refer to the latest external information or specialized data to derive accurate answers. Additionally, an evaluation support corpus is provided to assess AI models' financial knowledge, reasoning ability, and potential harmfulness. This corpus, built separately from the training data, can be used to verify AI's objective performance and fairness.
The financial common domain corpus provided this time consists of 12,600 cases, totaling approximately 45 gigabytes (GB). It is composed of 6,700 cases for pre-training, 1,100 cases for additional training, 3,800 cases for retrieval-augmented generation, and 1,000 cases for evaluation support.
Those wishing to use it can apply and download it through the Financial Settlement Service's data sharing platform. It will be provided free of charge during this pilot project period, which runs until the end of June this year.
The Financial Services Commission plans to expand the types and scale of the corpus in the second half of this year by reflecting additional demands and opinions from financial companies. Starting next year, it will continuously consult with various original data holders and related organizations to support specialized corpora by industry sector.
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.



