The Ultimate Challenge in Korean... AI Chatbot Competitiveness Ultimately Depends on 'Korean Language Proficiency'

Naver AI Training Requires 'Korean Linguistics Knowledge'
AI Performance May Depend on Superiority in Pure Academic Fields

On one side is 'ChatGPT', the artificial intelligence (AI) chatbot from OpenAI; on the other is 'HyperCLOVA', developed by Naver, a major domestic IT company. Although it is difficult to compare the performance of these two giant machine intelligences on a precise one-to-one basis, Naver claims that HyperCLOVA is the best performer at least in Korean.

What is the source of Naver's confidence? Of course, HyperCLOVA has learned from a much larger amount of Korean data than ChatGPT. However, that alone does not guarantee a definite advantage. What makes HyperCLOVA the best Korean AI is none other than 'Korean linguistics'.

US ChatGPT vs KR HyperCLOVA


ChatGPT and HyperCLOVA cannot be compared on exactly the same footing. ChatGPT is a chatbot built on OpenAI's GPT-3 family of large natural language processing (NLP) models (GPT-3.5 at launch), while HyperCLOVA is a multipurpose NLP model integrated with various services such as translation, search assistance, and transcription.

At first glance, it is hard to rank GPT-3 and HyperCLOVA clearly. Model size, that is, the number of parameters, is the main determinant of an AI model's baseline performance, and the two are similar: GPT-3 has 175 billion parameters and HyperCLOVA has 204 billion, roughly 17% more.

Comparison between ChatGPT and HyperCLOVA

According to Naver, HyperCLOVA was trained on 6,500 times as much Korean data as GPT-3, but dataset size is not an absolute measure of AI accuracy. Google's chatbot 'Bard', ambitiously unveiled on the 6th (local time), was also trained on a vast amount of data, yet it gave an incorrect answer in its very first demonstration.

Naver AI Ideal for Korean Understanding... The Secret is 'Korean Linguistics'

Naver's Artificial Intelligence (AI) HyperCLOVA Logo / Photo by Naver

At the 2021 'Naver AI Now' conference, Naver emphasized HyperCLOVA as "the first large-scale Korean AI that best understands and uses the Korean language." Where does that confidence come from?

Clues can be found in the AI-related paper released by Naver in 2021 titled 'What Changes Will Large AI Models Bring? - Focused Study on HyperCLOVA'. According to the paper, much effort was devoted to optimizing AI models, mainly developed by research institutions in the US and UK, to fit the 'Korean language environment'.

ChatGPT, which understands the context of conversations to provide accurate answers, seems as if it 'understands the meaning of words', but contrary to common belief, computers do not comprehend language like humans do.

Instead, AI breaks human language down into minimal units and converts them into byte data that computers can recognize, then finds patterns in that data and combines the most appropriate words. The step of splitting text into these minimal units is called 'tokenization'. The biggest difference between HyperCLOVA and ChatGPT lies in their tokenization approaches.
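To make "converting language into byte data" concrete, here is a minimal Python sketch. It assumes UTF-8 encoding, which is an assumption for illustration rather than something the article states, and the function name is hypothetical. It shows that each Latin letter becomes one byte while each Hangul syllable becomes three, so Korean text looks quite different from English at the byte level.

```python
# Illustration only: how text becomes byte data before any pattern-finding.
# Assumes UTF-8 encoding; to_utf8_bytes is a hypothetical helper name.

def to_utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values for a string."""
    return list(text.encode("utf-8"))

english = "cat"
korean = "한국어"  # "Korean language"

print(len(english), "characters ->", len(to_utf8_bytes(english)), "bytes")
print(len(korean), "characters ->", len(to_utf8_bytes(korean)), "bytes")
print(to_utf8_bytes(english))
```

A tokenizer that starts from bytes or single letters therefore has to work harder to rebuild meaningful Korean units than it does for English.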

ChatGPT, mainly used in English-speaking countries, uses Byte Pair Encoding (BPE), a tokenization technique that builds tokens by repeatedly merging the symbols that most often appear next to each other. BPE is well suited to English, where meaningful words are formed by arranging letters of the alphabet in sequence.
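The idea behind BPE can be sketched in a few lines of Python. This is the textbook character-level version on a toy corpus, not OpenAI's actual tokenizer, which works on bytes and is trained on far more data. The function names and the sample corpus are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(w) for w in corpus.split())  # each word as letters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

On this toy corpus the frequent letter sequence "low" ends up as a single token after two merges, which is why the method fits languages where words are plain sequences of letters.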

Korean, on the other hand, combines words in ways that a method tuned for English cannot fully cover. Naver also found that with the existing BPE method, "some Korean characters like 'jek' cannot be included as tokens," and to overcome this limitation it developed a token-splitting method suited to Korean morphemes (the smallest meaningful units of language). This is what enabled HyperCLOVA to significantly outperform ChatGPT in understanding the meaning of Korean text.
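The article does not describe how Naver's method works internally, so the following Python sketch shows only the general idea of splitting on morpheme boundaries before tokens are formed. The particle list and both function names are hypothetical, and the rule is deliberately naive: a real morphological analyzer relies on a full dictionary and statistical models.

```python
# Hypothetical, tiny list of Korean particles (postpositions).
# This is NOT Naver's method; it only illustrates morpheme-aware splitting.
PARTICLES = ["에서", "에게", "으로", "은", "는", "이", "가", "을", "를", "에", "로", "의"]

def split_particle(word: str) -> list[str]:
    """Split a trailing particle off a word, trying longer particles first."""
    for p in sorted(PARTICLES, key=len, reverse=True):
        if len(word) > len(p) and word.endswith(p):
            return [word[: -len(p)], p]
    return [word]

def pre_tokenize(sentence: str) -> list[str]:
    """Split on spaces, then separate each word's stem from its particle."""
    units = []
    for word in sentence.split():
        units.extend(split_particle(word))
    return units

# "I met a friend at school"
print(pre_tokenize("학교에서 친구를 만났다"))
```

Here "학교에서" (at school) separates into the noun "학교" and the particle "에서". The same naive rule would wrongly split a word such as "나이" (age), which is exactly the kind of case where knowledge of Korean linguistics is needed.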

The Paradox of AI... Pure Academic Disciplines May Determine Performance

OpenAI Natural Language Processing Model ChatGPT / Photo by Yonhap News

The case of HyperCLOVA shows how important Korean linguistics experts are in cutting-edge AI development. Although the AI model itself is built on mathematics and programming, enabling AI to analyze human language requires in-depth knowledge of the language itself. Such pure academic knowledge underlying AI learning is called 'domain knowledge'.

As AI becomes active across research, industry, arts, and other fields, the importance of such 'domain experts' is expected to expand further. Professor Jin-ho Park of Chung-Ang University's Humanities Content Research Institute pointed out this trend in his 2019 paper titled 'The Role of Domain Knowledge in Deep Learning-Based Natural Language Processing'.

Professor Park stated, "To prove the continued importance of linguists even in the deep learning era, I developed a Korean morpheme analyzer," adding, "Although segmenting Korean into meaningful units is not easy, reframing the segmentation task as a classification problem made it easier to solve with machine learning." He concluded, "This experiment shows that linguists' knowledge remains important even in the deep learning era." Excellent AI does not rely solely on IT or semiconductors; it flourishes only when it rests on an outstanding foundation in the pure academic disciplines.
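The article gives no details of Professor Park's analyzer, so the Python sketch below shows just one common way to turn segmentation into classification, offered as an assumption rather than his method. Every character receives a label, "B" if it begins a new segment and "I" if it continues one, and a classifier then only has to predict one label per character. The function names are hypothetical and no classifier is trained here.

```python
def to_labels(morphemes: list[str]) -> tuple[str, list[str]]:
    """Turn a segmented word into (raw text, per-character B/I labels).

    Assumes every morpheme is a non-empty string.
    """
    text = "".join(morphemes)
    labels = []
    for m in morphemes:
        labels.extend(["B"] + ["I"] * (len(m) - 1))
    return text, labels

def from_labels(text: str, labels: list[str]) -> list[str]:
    """Rebuild the segments from per-character labels."""
    segments = []
    for ch, label in zip(text, labels):
        if label == "B" or not segments:
            segments.append(ch)   # start a new segment
        else:
            segments[-1] += ch    # continue the current segment
    return segments

# Training data: a linguist supplies the correct segmentation.
text, labels = to_labels(["학교", "에서"])
print(text, labels)

# Prediction time: a classifier would output the labels; decoding is mechanical.
print(from_labels(text, labels))
```

In this framing the linguist's contribution is the labeled examples and the choice of units, while the machine learning model handles the per-character prediction.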