The Rise of Synthetic Audio Deepfakes

by | Jul 19, 2020 | Blog

Can Audio Deepfakes Really Fake a Human?

Audio deepfakes are the new frontier for business compromise schemes and are becoming a more common way for criminals to deceptively gain access to corporate funds. Nisos recently investigated and obtained an original synthetic audio deepfake used in a fraud attempt against a technology company. The deepfake took the form of a voicemail message from the company’s purported CEO, asking an employee to call back to “finalize an urgent business deal.” The recipient immediately found it suspicious and did not call the number, instead referring the message to the company’s legal department. As a result, the attack was not successful.

Nisos investigated the phone number the would-be attacker used and determined it belonged to a VOIP service with no owner registration information. It was likely acquired and used as a “burner” for this fraud attempt only. There was no actual voicemail message associated with the number, and for legal reasons we made no attempt at live contact with the owner of the phone number.

Deepfake Audio Analysis

Nisos analyzed the deepfake voicemail recording with an audio spectrogram tool called Spectrum3d. Looking for anomalies, we immediately noticed the high frequencies spiking repeatedly in the spectrogram (see graphic below). We initially suspected the deepfake creator had played audio over multiple channels to help mask the voice.

Spectrogram Analysis of Deepfake Audio

Graphic 1: Spectrogram analysis of the deepfake audio, displaying major inconsistencies in pitch and tone.
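
For readers who want to reproduce this kind of view without a dedicated tool, the idea behind a spectrogram can be sketched in plain Python. This is a minimal illustration, not a description of how Spectrum3d works internally: the function names, the frame size of 64 samples, and the hop of 32 samples are all assumptions chosen to keep the example small. It slices the signal into overlapping windowed frames and takes a discrete Fourier transform of each, so that repeated high-frequency spikes would show up as large values in the upper bins of successive rows.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Return the magnitudes of the first N/2 DFT bins of a real-valued frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_size=64, hop=32):
    """Split the signal into overlapping Hann-windowed frames and transform each.

    Each returned row is one time slice; each column is one frequency bin.
    frame_size and hop are illustrative values, not recommendations.
    """
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
        for i in range(frame_size)
    ]
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        rows.append(dft_magnitudes(frame))
    return rows


def peak_bins(spec):
    """Return the index of the strongest frequency bin in each time slice."""
    return [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each bin spans `sample_rate / frame_size` hertz, so with an 8,000 Hz recording and 64-sample frames a bin covers 125 Hz. A real analysis would use a much larger frame and an optimized FFT library; the direct transform here is only meant to make the mechanics visible.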

We also noticed the audio was very choppy and not consistent with a comparable recording of a human voice. When we played the audio back at 1.2x speed, it sounded more like a standard text-to-speech system. Most interesting, when we amplified the sound to look for background noise, we were unable to find any trace of it, which further indicated this was manipulated audio.
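
The background-noise check can be approximated numerically by measuring how quiet the quietest part of a recording gets. The sketch below is an illustration under stated assumptions, not the procedure Nisos used: the function names, the 160-sample frame, the -120 dB clamp, and the synthetic test signal are all hypothetical. The idea is that a recording made in a real room rarely drops to true digital silence between words, whereas generated audio may.

```python
import math
import random


def frame_levels_db(samples, frame_size=160, floor_db=-120.0):
    """Return the RMS level of each non-overlapping frame in dB relative to 1.0.

    Levels are clamped at floor_db so that exact digital silence has a finite value.
    """
    levels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        level = 20 * math.log10(rms) if rms > 0 else floor_db
        levels.append(max(level, floor_db))
    return levels


def noise_floor_db(samples, frame_size=160):
    """Estimate the noise floor as the level of the quietest frame."""
    return min(frame_levels_db(samples, frame_size))


def make_test_signal(noise_amplitude, seed=0):
    """Hypothetical stand-in for a recording: tone bursts separated by pauses."""
    rng = random.Random(seed)
    samples = []
    for n in range(3200):
        speaking = (n // 800) % 2 == 0  # alternate 800-sample bursts and pauses
        tone = 0.5 * math.sin(2 * math.pi * 200 * n / 8000) if speaking else 0.0
        samples.append(tone + rng.uniform(-noise_amplitude, noise_amplitude))
    return samples
```

Calling `noise_floor_db(make_test_signal(0.0))` models pauses with no background noise at all, while `noise_floor_db(make_test_signal(0.001))` models a faint room tone. A missing noise floor is only a hint, since noise gating or heavy compression on a genuine recording could produce the same result.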

We then compared the deepfake spectrogram with that of a “normal” human voice on a similar recording. In the human recording, the pitch and tone are visibly smoother, and faint background noise can be detected.

Spectrogram Analysis of ‘Normal’ Human Voice

Graphic 2: Spectrogram analysis of ‘normal’ human voice, displaying more consistent pitch and tone.

We were unable to determine the exact software or voice model used to create this deepfake. Doing so would have required a large enough sample of the attacker’s other deepfake audio files: likely tens if not hundreds of files, assuming the attacker made more than just this one. However, we note several complicating factors an actor would have to overcome to create a more realistic audio deepfake:

  1. Capturing high quality audio with little to no background noise.
  2. Staging the call so the audio is delivered in a realistic scenario (the tone of the person talking, the background noise, and the reason for the call), one in which the recipient wouldn’t feel the need to call the person back.
  3. Finding a way to leave a message, so they can avoid a live conversation.

It is likely the attacker in this scenario used a feature most cell/VOIP service providers offer: the ability to bypass ringing and go straight to voicemail using the `#` key.

Has This Happened Before?

The most famous use of deepfake synthetic audio technology in criminal fraud was a September 2019 incident involving a British energy company. The criminals reportedly used voice-mimicking software to imitate the British executive’s speech and trick his subordinate into sending hundreds of thousands of dollars to a secret account.

The managing director of this company, believing his boss was on the phone, followed orders to wire more than $240,000 to an account in Hungary.

Symantec security researchers reported in February on three cases of audio deepfakes used against private companies by impersonating the voice of the business’s CEO. The criminals reportedly trained machine learning engines on audio obtained from conference calls, YouTube, social media updates and even TED talks, to copy the voice patterns of company bosses.

They created audio deepfakes replicating the CEO’s voice and called senior members of the finance department to ask for funds to be sent urgently. There was no additional reporting on which companies these were, whether the techniques were successful, or whether Symantec was able to obtain recordings of the deepfakes themselves.

Without an actual digital capture of the audio and additional forensic analysis, it is unclear whether these attempts in fact used synthetic deepfake audio. Regardless, while the ability to generate synthetic audio extends an e-criminal’s toolkit, the criminal still has to use social engineering tactics effectively to induce someone into taking an action.

Criminals and potentially broader nation state actors also learn from each other, so as these high-profile cases gain more notoriety and success, we anticipate more illicit actors trying them and learning from others who have paved the way.

Additionally, as deepfakes become easier to create or purchase, and the quality of synthetic manipulation (both audio and video) increases, we anticipate wider deployment of these e-crime exploits. If a fraud operation requires a completely fabricated video or audio clip for maximum impact, and it is worth the money and resources, it will be used. However, Nisos researchers have not seen the ability to easily outsource this type of deepfake, whether for a single individual or for mass production.

Our researchers have contact with a few deepfake channels where we asked about this type of attack vector, and participants were unsure something like this would be possible in the near future. The central issue with audio deepfakes is capturing not only the person’s tone but also their specific speech mannerisms. Future scenarios will likely materialize, however, in which tools similar to a Yandex reverse image search (but for voice) could be used to gather numerous samples and then build and train a model that converts the source voice into the target voice.

What Can Be Done?

The most immediate action an employee can take, if they sense something suspicious in a voicemail (or any audio) instruction, is to call the person back directly using a known number and get them on the line.

Deepfake technology is not sophisticated enough to mimic an entire phone call with someone. Additionally, the company can exercise a series of ‘challenge questions’ using information that is not publicly known or conversation points that an actor could not readily answer, to vet the identity of the individual on the line.
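
The call-back and challenge-question steps can also be written down as a simple procedure. The sketch below is a hypothetical illustration, not an existing Nisos tool or a complete security control: the directory contents, the function names, and the example address and phone numbers are all made up. The point it shows is that the call-back number always comes from a trusted internal source, never from the suspicious message, and that a challenge answer is checked against something only the real person would know.

```python
import hmac

# Hypothetical trusted directory. In practice this would come from an HR or
# identity system, never from the suspicious message itself.
KNOWN_NUMBERS = {"ceo@example.com": "+1-555-0100"}


def choose_callback_number(claimed_identity, number_in_message, directory=KNOWN_NUMBERS):
    """Pick the number to call back and flag whether the message's number differs.

    The number left in the message is never returned; it is only compared.
    """
    known = directory.get(claimed_identity)
    if known is None:
        raise LookupError(f"no trusted number on file for {claimed_identity!r}")
    return known, known != number_in_message


def _normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def challenge_passed(expected_answer, given_answer):
    """Compare a challenge answer, ignoring case and extra whitespace."""
    return hmac.compare_digest(
        _normalize(expected_answer).encode("utf-8"),
        _normalize(given_answer).encode("utf-8"),
    )
```

For example, `choose_callback_number("ceo@example.com", "+1-555-0199")` returns the directory number together with a flag showing that the voicemail asked for a different one, which is itself a reason to escalate.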

This fraud scheme is a form of business email compromise (a more sophisticated, AI-developed version), in which the attacker typically pretends to be a senior executive at a company and gets a more susceptible ‘lower-level’ employee to send money to a bank account.

We anticipate that a deepfake audio message would be the first step in a series of social engineering attempts to get an employee to wire money to a specific location. Phishing emails, additional phone calls, or even deepfake videos purporting to authorize an action could be used in furtherance of the criminal scheme.

Deepfake audio may also be used for reasons ancillary to the ones listed above. For example, criminals can leave fake messages instructing employees to provide network or physical access to the company, allowing attackers to easily compromise the network or physical assets of the company.

The availability of this nascent but rapidly emerging technology underscores how critical it is for companies to develop security practices that encompass these measures. Any time an unusual incident occurs, and certainly when large financial transactions are involved, employees should be trained to put challenge questions to senior executives.