Tone and Sentences Can Be Reproduced from a 3-Second Voice Sample
Ministry of Science and ICT to Promote Institutionalization of Voice Watermarking
A post has gone viral warning people not to speak first when answering a call from an unknown number, because even a few words can be exploited for voice phishing using artificial intelligence (AI).
On the 20th, a post titled "I Avoided Voice Phishing Thanks to My Professor" on the anonymous university community 'Everytime' drew attention.
In the post, the writer recounted, "I answered the phone, but the caller did not say a word. I was about to say 'Hello,' but I remembered my professor's advice from class: 'If you get a call from an unknown number and the caller says nothing, never speak first,' so I hung up immediately."
The writer explained, "If I had spoken at that moment, they would have recorded my voice to scam my family," and added, "If it weren't for my professor, it could have been a disaster."
Professor Jo Sooyoung of Sookmyung Women's University, who was mentioned in the post, told the Hankyoreh by phone on the 19th, "During my lecture on 'The Fourth Industrial Revolution and Law,' I mentioned this as one way to guard against voice phishing schemes that are growing more sophisticated with advancing technology." She added, "Recently, voice phishing rings have been recording call audio and combining it with other text to synthesize new speech for use in threats."
Professor Jo said, "Even two or three short words like 'Hello' or 'Who is this?' can be exploited." Criminals use the cloned voice to demand money from the victim's family or friends, claiming urgent situations such as traffic accidents.
'Deep voice' technology clones a specific person's voice from a short audio clip: an AI deep-learning model learns the voice and then generates words the person never actually spoke, using text-to-speech (TTS) and similar methods.
According to McAfee, a U.S. computer security company, a voice sample of just three seconds is enough to reproduce a person's speech patterns and sentences to a considerable degree. Real and synthesized voices are already hard to tell apart, and as the technology grows more sophisticated, verifying authenticity becomes even harder.
In fact, in October 2021, a bank in the United Arab Emirates (UAE) was deceived by a deep voice mimicking an executive of a company it regularly dealt with and transferred 35 million dollars (about 42 billion KRW). In March last year, parents in Canada sent 21,000 Canadian dollars (about 20 million KRW) after being fooled by a faked voice of their son.
The Ministry of Science and ICT plans to institutionalize voice watermarking as a countermeasure against deep voice. The voice watermark, devised by Resemble AI, a U.S. AI startup that provides voice generation services, analyzes the sound waves of a voice and automatically embeds identifying information at frequencies below them. Because the watermark is hard to distinguish from the real sound and is interwoven with voice information at similar frequencies, it is difficult to remove.
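The article does not detail Resemble AI's actual algorithm, which is proprietary. As a rough illustration of the general idea behind inaudible audio watermarking, the toy Python sketch below mixes a faint, key-derived pseudorandom signal into an audio buffer and later detects it by correlation — a classic spread-spectrum approach, not Resemble AI's method. All function names, the key, and the amplitude are illustrative assumptions.

```python
import math
import random

def keyed_sequence(key: int, n: int) -> list:
    """Pseudorandom +/-1 sequence derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed_watermark(samples: list, key: int, strength: float = 0.005) -> list:
    """Mix a faint keyed sequence into the audio; at this amplitude it is
    far below the voice signal and effectively inaudible."""
    mark = keyed_sequence(key, len(samples))
    return [s + strength * m for s, m in zip(samples, mark)]

def detect_watermark(samples: list, key: int) -> float:
    """Correlate the audio with the keyed sequence. A score near the
    embedding strength means the watermark is present; near zero, absent."""
    mark = keyed_sequence(key, len(samples))
    return sum(s * m for s, m in zip(samples, mark)) / len(samples)

# Toy "voice": one second of a 220 Hz tone at a 16 kHz sample rate.
voice = [0.1 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
marked = embed_watermark(voice, key=42)

print(detect_watermark(marked, key=42))  # well above zero: watermark detected
print(detect_watermark(voice, key=42))   # near zero: no watermark
```

This also hints at why such marks are hard to strip: the watermark is spread across the entire signal at amplitudes below the voice itself, so removing it without knowing the key would degrade the voice along with it.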
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.


