Lee Hwak-young, chair of the cause investigation subcommittee of the Kakao Emergency Response Committee (CEO of Grep), explains the cause of the service disruption at If Kakao on the 7th. (Photo captured from the If Kakao online video)
[Asia Economy Reporter Seungjin Lee] Kakao, which suffered a large-scale service disruption after the fire at the Pangyo data center on October 15, has delivered a painful self-assessment. The company concluded that the disruption was made worse by a lack of redundancy between data centers and the absence of a control tower to lead the crisis response.
At the developer conference "If Kakao Dev 2022 (If Kakao)," held at 11 a.m. on the 7th, Kakao presented its analysis of the service disruption caused by the fire at the Pangyo SK C&C data center and disclosed measures to prevent a recurrence.
Lee Hwak-young, CEO of Grep and chair of the Kakao Emergency Response Committee’s cause investigation subcommittee, said at the event, “Based on my experience with Kakao’s services and infrastructure, we were able to grasp the situation relatively quickly,” and identified several causes.
Lee, now CEO of Grep, previously worked at Samsung SDS, Freechal, and NHN, and served as Chief Technology Officer (CTO) at Kakao in 2007. He was appointed chair of the cause investigation subcommittee because he combines deep knowledge of Kakao’s services with an outsider’s perspective.
The first cause he cited was insufficient ‘redundancy between data centers.’ He said, “Even if an entire data center went down, recovery would have been faster had every system been duplicated at another data center. However, some systems were duplicated only within the Pangyo data center (SK), which delayed recovery from the disruption.”
The second cause was a lack of redundancy in the operational management tools used to develop and manage services. He explained, “The systems that store and manage container images, along with some monitoring tools, became unusable because of the fire, which made recovery significantly more difficult.”
Insufficient personnel and resources to respond to a total data center failure were also cited as causes. The subcommittee chair said, “Because of a shortage of available personnel, it took time to normalize the systems even after power was restored to the center,” adding, “With KakaoTalk and Kakao Work unavailable, there was no communication channel for conveying important matters or sharing decisions.”
He added, “Because there were not enough spare resources to stand in for the entire Pangyo data center, recovery could not be completed until power was restored and all systems were normalized.”
The absence of a control tower to oversee the response to the disruption was also pointed out. He said, “Although the Kakao community responded to the disruption at the same time, no company-wide organization had been set up in advance to coordinate the overall response and support collaboration.”
He stated, “The current cause analysis report has been submitted to the emergency committee,” and added, “We have set higher goals than before and are making efforts so that Kakao’s services can regain trust and be loved again by users.”
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.

