On Thursday, Microsoft Corp. introduced VALL-E, a text-to-speech artificial intelligence tool. According to multiple reports, VALL-E can imitate a person’s voice from a three-second audio sample.

VALL-E’s creators say it could be used for high-quality text-to-speech applications. It could also support speech editing, letting users change the words spoken in an existing recording.

The company describes VALL-E as a “neural codec language model” built on Meta’s EnCodec technology, which was released in October 2022.

Most text-to-speech methods synthesize speech by manipulating waveforms. VALL-E instead generates discrete audio codec codes from text and acoustic prompts, as Ars Technica explained.

VALL-E can mimic both a speaker’s voice and the “acoustic environment” of the sample audio. For example, if the sample sounds like a phone call, the generated audio will sound like a phone call too.

VALL-E analyzes how a person sounds, breaks that information into discrete “tokens,” and then draws on its training data to predict how the same voice would sound saying other phrases.

“To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively,” Microsoft wrote in a paper published on ArXiv.

“Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.”
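
The data flow described in the two quotes above can be sketched in a few lines of Python. This is only a toy illustration, not Microsoft’s code: the function names, the codebook size of 1,024 and the arithmetic standing in for the neural networks are all assumptions made for the example.

```python
import random

CODEBOOK_SIZE = 1024  # assumed token vocabulary size, for illustration only


def toy_language_model(phonemes, prompt_tokens, num_new_tokens, seed=0):
    """Hypothetical stand-in for the neural codec language model.

    Produces new acoustic tokens one at a time. Each token depends on the
    phoneme prompt (content), the enrolled prompt tokens (speaker) and the
    previously generated token. A real model would use a trained network.
    """
    rng = random.Random(seed)
    content_bias = sum(len(p) for p in phonemes)
    speaker_bias = sum(prompt_tokens)
    generated = []
    for _ in range(num_new_tokens):
        previous = generated[-1] if generated else 0
        noise = rng.randrange(CODEBOOK_SIZE)
        token = (content_bias + speaker_bias + previous + noise) % CODEBOOK_SIZE
        generated.append(token)
    return generated


def toy_codec_decoder(tokens):
    """Hypothetical stand-in for the codec decoder: tokens -> samples in [-1, 1)."""
    return [(t / CODEBOOK_SIZE) * 2.0 - 1.0 for t in tokens]


def synthesize(phonemes, prompt_tokens, num_new_tokens=8, seed=0):
    """Phoneme prompt + speaker prompt tokens -> acoustic tokens -> waveform."""
    acoustic_tokens = toy_language_model(phonemes, prompt_tokens, num_new_tokens, seed)
    return toy_codec_decoder(acoustic_tokens)


# Made-up inputs: a phoneme sequence and tokens from a short enrolled recording.
waveform = synthesize(["HH", "AH", "L", "OW"], [17, 402, 958])
```

The point of the sketch is the order of operations: the model never edits a waveform directly. It predicts tokens first, and only the decoder turns those tokens back into audio.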

How VALL-E works

The project’s researchers explained that VALL-E was trained on 60,000 hours of English-language speech from more than 7,000 speakers in Meta’s LibriLight audio library.

Microsoft offers dozens of VALL-E audio examples on its website. Among them is the three-second “Speaker Prompt” sample, which is the voice VALL-E must imitate.

Another section called the “Ground Truth” is a previously recorded version of the same speaker saying a specific phrase for reference purposes.

Meanwhile, the “Synthesis of Diversity” section shows that by adjusting the random seed used during the generation process, VALL-E can create variations in voice tone.
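
The effect of the random seed can be shown with a small, generic sampling example. It is a sketch of seeded sampling in general, not of VALL-E itself: the four-token vocabulary, the weights and the function name are invented for illustration.

```python
import random


def sample_tokens(weights, length, seed):
    """Draw `length` token ids; weights[i] is the relative likelihood of token i."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=length)


# Hypothetical likelihoods for a tiny four-token vocabulary.
weights = [0.5, 0.3, 0.15, 0.05]

take_a = sample_tokens(weights, 50, seed=1)
take_b = sample_tokens(weights, 50, seed=2)
take_a_again = sample_tokens(weights, 50, seed=1)
```

Here `take_a` and `take_a_again` are identical because they share a seed, while `take_b` follows the same likelihoods but lands on a different sequence. In a speech model, that kind of difference is what would be heard as a change in delivery.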

On the VALL-E GitHub page, the team demonstrates how the system works, with “mixed” results: some samples sound machine-like, while others sound realistic.

Microsoft is now working to improve the model by adding more training data and reducing unclear or missed words.

Potential threats

The company has opted not to make the code open source, possibly to avoid risks associated with AI that “can put words in someone’s mouth,” a source explains. It says it will adhere to its “Microsoft AI Principles” as it develops VALL-E further.

Several tech analysts have warned that “a powerful tool like VALL-E” could be dangerous. They are concerned that bad actors could use it to spread misinformation by impersonating politicians, journalists or other public figures.

VALL-E’s team acknowledged the potential risks and said they would apply the Microsoft AI Principles in future work.

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” they wrote.

“To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”

It remains to be seen whether VALL-E will ever be released commercially, whether to produce tailored celebrity voices or to imitate a specific person’s voice for a product advertisement.

Microsoft has made substantial investments in artificial intelligence, including in OpenAI, the company behind ChatGPT and DALL-E. In 2019, Microsoft invested $1 billion in OpenAI. A report published earlier this week said the software giant was considering investing an additional $10 billion in the company.

