ElevenLabs introduces AI Dubbing, translating video and audio into 20 languages


One year ago, ElevenLabs, a company founded by former employees of Google and Palantir, launched AI Dubbing, a dedicated product that translates all types of speech, including long-form content, into more than 20 languages.

This offering, available to all platform users, brings automation to audio and video dubbing, a field that has traditionally relied on manual processes.

More notably, it can break down language barriers for smaller content creators who lack the resources to hire human translators for global distribution.

Mati Staniszewski, CEO and co-founder of ElevenLabs, shared insights: “We have rigorously tested and refined this feature in collaboration with numerous content creators, enabling them to dub their content and enhance its accessibility to a broader audience. We foresee significant potential for independent creatives, spanning from video content and podcast producers to film and television studios.”

ElevenLabs asserts that this feature can produce high-quality translated audio within minutes (depending on content length) while preserving the original speaker’s voice, complete with their emotions and intonation.

However, in an era where AI is ubiquitous and enterprises are increasingly exploring language models to boost efficiency, ElevenLabs is not the sole player venturing into speech-to-speech translation.


How does AI Dubbing work?

AI-powered translation encompasses various intricate stages, including noise elimination, speech translation, and more. However, with ElevenLabs’ AI Dubbing tool, end-users are spared the complexities of these processes. They simply need to select the AI Dubbing tool on ElevenLabs, initiate a new project, choose the source and target languages, and upload the content file.

Once the content is uploaded, the tool autonomously identifies the number of speakers and commences the process, displaying a progress bar on the screen – a user-friendly experience akin to other online conversion tools. Upon completion, the transformed file is available for download and immediate use.
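The user-facing flow maps naturally onto a create-project, upload, poll, download pattern. The client below is purely hypothetical (the article does not describe ElevenLabs' actual API); it simply mirrors the steps a user takes in the UI:

```python
# Purely hypothetical client mirroring the UI steps the article lists:
# create a project, pick languages, upload a file, poll progress, download.
# This is NOT the real ElevenLabs API.

class DubbingProject:
    def __init__(self, source_lang: str, target_lang: str):
        self.source_lang = source_lang
        self.target_lang = target_lang
        self.progress = 0
        self.file = None

    def upload(self, path: str) -> None:
        # In the real product, uploading kicks off server-side processing.
        self.file = path

    def poll(self) -> int:
        """Advance the (simulated) progress bar and return percent done."""
        self.progress = min(100, self.progress + 50)
        return self.progress

    def download(self) -> str:
        if self.progress < 100:
            raise RuntimeError("dub not finished yet")
        return f"dubbed_{self.target_lang}_{self.file}"

project = DubbingProject(source_lang="en", target_lang="hi")
project.upload("episode.mp4")
while project.poll() < 100:
    pass
print(project.download())  # dubbed_hi_episode.mp4
```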

Behind the scenes, the tool leverages ElevenLabs’ exclusive approach. It first eliminates background noise and distinguishes between music, noise, and the actual spoken dialogue. It recognizes each speaker and transcribes their original language using a speech-to-text model. This text is then translated, adapted for length consistency, and voiced in the target language, all while preserving the speaker’s distinctive voice characteristics.

Subsequently, the translated speech is synchronized with the original music and background noise (initially removed from the file), preparing the dubbed output for utilization. This accomplishment stems from ElevenLabs’ extensive research in voice cloning, text and audio processing, and multilingual speech synthesis.
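The stages described above can be sketched as a simple pipeline. Everything here is a hypothetical illustration of the described flow with stubbed-out models; none of these functions correspond to ElevenLabs' actual internals:

```python
# Hypothetical sketch of the dubbing pipeline the article describes.
# Every stage is a stub; a real system would call dedicated models here.

def separate_tracks(audio):
    """Split the input into spoken dialogue vs. music/background noise."""
    return {"speech": audio, "background": "music+noise"}

def diarize(speech):
    """Identify who speaks when (toy: one speaker, one segment)."""
    return [{"speaker": "A", "segment": speech}]

def transcribe(segment):
    """Speech-to-text in the source language (stubbed)."""
    return "hello world"

def translate(text, target_lang):
    """Translate and adapt the text for length consistency (stubbed)."""
    return f"[{target_lang}] {text}"

def synthesize(text, speaker):
    """Voice the translated text in the original speaker's voice (stubbed)."""
    return f"audio<{speaker}:{text}>"

def dub(audio, target_lang="es"):
    tracks = separate_tracks(audio)
    dubbed_segments = []
    for turn in diarize(tracks["speech"]):
        text = transcribe(turn["segment"])
        translated = translate(text, target_lang)
        dubbed_segments.append(synthesize(translated, turn["speaker"]))
    # Re-mix the synthesized speech with the original background track.
    return {"speech": dubbed_segments, "background": tracks["background"]}

print(dub("raw_podcast.wav")["speech"])  # ['audio<A:[es] hello world>']
```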

For generating the final speech from translated text, the company utilizes its latest Multilingual v2 model, which currently supports over 20 languages, including Hindi, Portuguese, Spanish, Japanese, Ukrainian, Polish, and Arabic. This wide language support empowers users to globalize their content effectively.

Prior to introducing this end-to-end interface, ElevenLabs had separate tools for voice cloning and text-to-speech synthesis. To translate audio content, such as a podcast, into a different language, users had to create a voice clone on the platform, transcribe and translate the audio separately, and then employ the translated text file and the cloned speech to produce audio using the text-to-speech model. Furthermore, this approach only worked for content without substantial background music or noise.

Mati Staniszewski affirmed that the new dubbing feature will be accessible to all platform users, but certain character limits will apply, as with text-to-speech generation. Typically, one minute of AI Dubbing corresponds to approximately 3,000 characters, he noted.
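As a rough back-of-the-envelope check on that ratio, assuming the stated figure of roughly 3,000 characters per minute, the character budget for a clip of a given length can be estimated as:

```python
# Rough character-budget estimate for AI Dubbing, based on the article's
# stated ratio of ~3,000 characters per minute of audio.
# This is a hypothetical helper, not part of any official ElevenLabs SDK.
CHARS_PER_MINUTE = 3_000

def estimated_characters(duration_minutes: float) -> int:
    """Approximate characters consumed to dub a clip of this length."""
    return round(duration_minutes * CHARS_PER_MINUTE)

print(estimated_characters(10))  # a 10-minute podcast ≈ 30,000 characters
```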


AI-based voices are coming

While ElevenLabs is making significant strides in the realm of AI-driven voicing, it’s not the sole player in this field. A few weeks ago, Microsoft-backed OpenAI introduced multimodal capabilities to ChatGPT, enabling it to engage in conversations in response to voice prompts, much like Alexa.

In this case, speech-to-text and text-to-speech models are employed to convert audio, but the technology is currently accessible to select partners. OpenAI has restricted its availability to prevent misuse of these capabilities, with Spotify being one of the chosen partners. Spotify is using this technology to help podcasters translate their content into different languages while retaining their original voice.

Mati Staniszewski, from ElevenLabs, highlighted that their AI Dubbing tool stands out by offering translation for video or audio of any length, accommodating any number of speakers, and preserving their unique voices and emotions across more than 20 languages, delivering exceptional quality results.

Numerous other players are actively engaged in the AI-powered voice and speech synthesis space, including companies like MURF.AI, Play.ht, and WellSaid Labs.


Recently, Meta also introduced SeamlessM4T, an open-source multilingual foundational model that can comprehend nearly 100 languages from speech or text and produce translations as speech or text in near real time.

According to Market US, the global market for such tools reached $1.2 billion in 2022 and is projected to approach $5 billion by 2032, a compound annual growth rate (CAGR) of roughly 15.4%. The future of AI-driven voicing and speech synthesis is certainly promising and continues to attract significant attention and investment.
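The projection is internally consistent: compounding $1.2 billion at roughly 15.4% a year for the ten years from 2022 to 2032 lands close to the quoted $5 billion figure:

```python
# Sanity-check the Market US projection: $1.2B (2022) compounded at a
# ~15.4% CAGR over 10 years should approach the quoted ~$5B (2032).
base_2022 = 1.2   # market size in billions of USD
cagr = 0.154      # compound annual growth rate
years = 2032 - 2022

projected_2032 = base_2022 * (1 + cagr) ** years
print(f"${projected_2032:.2f}B")  # ≈ $5.03B
```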
