Microsoft Develops AI Model VALL-E 2 to Mimic Human Voice

With advancements in Artificial Intelligence (AI), Microsoft has developed an AI model capable of mimicking human voices. Named VALL-E 2, this text-to-speech generator can replicate a voice based on just a few seconds of audio.
VALL-E 2 is a zero-shot system, meaning it can imitate a speaker it never encountered during training, using only that short audio sample as a prompt.
According to Microsoft Research, VALL-E 2 can produce “accurate, natural speech in the exact voice of the original speaker, comparable to human performance.” It can synthesize complex sentences as well as short phrases. The model employs Repetition Aware Sampling and Grouped Code Modeling to achieve this.
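Both techniques operate on tokens, which here are the discrete audio codes the model generates step by step. Repetition Aware Sampling tracks how often a token has recurred in the recent output and, when it has repeated too often, switches to a different sampling strategy for that step. This keeps the model from getting stuck on recurring sounds or phrases, making the speech sound more natural. Grouped Code Modeling bundles the audio codes into groups so that the model handles a shorter sequence, resulting in faster outputs.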
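For readers who want a concrete picture, the two ideas can be sketched in a few lines of Python. This is a minimal illustration, not Microsoft's implementation: the function names, the default values for top_p, window, and threshold, and the use of plain lists of probabilities are all assumptions made for the example.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, threshold=0.1):
    """Pick the next token, falling back to plain random sampling if the pick repeats too often.

    All default values here are illustrative assumptions.
    """
    token = nucleus_sample(probs, top_p, rng)
    # Share of the last `window` steps already occupied by this token.
    ratio = history[-window:].count(token) / window
    if ratio > threshold:
        # Too repetitive: draw from the full distribution instead.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Pack a flat code sequence into consecutive groups, so one step covers group_size codes."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
```

With a group size of 2, for example, a sequence of six codes becomes three groups, so the model would take half as many steps to cover the same audio.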
While VALL-E 2’s ability to mimic human voices is groundbreaking, it raises concerns about potential misuse. Instances of “vishing” (voice phishing), in which scammers impersonate trusted individuals over the phone, have increased. There are also risks to election integrity and national security, such as a recent robocall that used an imitation of President Joe Biden’s voice to mislead voters.
Microsoft has stated that VALL-E 2 will remain a research project with no plans for public release. This decision comes amid scrutiny over Microsoft’s AI implementations, particularly regarding antitrust and data privacy issues.
