June 19, 2024

Video-to-Audio AI Tool Is Here: Google Announced V2A AI Model

Add realistic sound effects, dialogue, and soundtracks to your AI-generated videos.

by

Jim Clyde Monge

In the past few weeks, we’ve seen a wave of text-to-video and image-to-video tools like Google Veo, Kuaishou’s Kling, Luma Labs’ Dream Machine, and the newly announced Runway Gen-3 Alpha.

These AI video tools generate impressive results, but they share a common limitation: they are all silent.

No dialogue, no soundtrack, and no sound effects.

Today, Google shared an update about an internal technology they are developing that can generate audio from video input.

What is Google V2A?

Google’s video-to-audio (V2A) combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action.

V2A not only creates realistic sound effects and dialogue that match the characters and tone of a video, but it can also generate soundtracks for various traditional footage, including archival material, silent films, and more.

Examples

Here are five examples that the Google DeepMind team shared in a blog post:

1. Drums

Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd

2. Cars

Prompt for audio: cars skidding, car engine throttling, angelic electronic music

3. Wolf

Prompt for audio: Wolf howling at the moon

4. Underwater Jellyfish

Prompt for audio: jellyfish pulsating under water, marine life, ocean

5. Horror Scene

Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete

These are wild!

While there are limitations, such as artifacts and distortions, the overall quality of the output is still enough to significantly enhance the video experience.

It’s high time these AI-generated videos are paired with an audio generator, and V2A is a promising step in that direction.

How it works

Google experimented with various approaches to find the most scalable AI architecture for audio generation, and the diffusion-based method provided the most realistic results for synchronizing video and audio.

Diffusion is a technique in which an AI model is trained to reconstruct clean content from random “noise.” Image and video generators use it to recompose visuals (still or moving) from pixelated noise, based on concepts learned from annotated image or video and text pairs. V2A applies the same idea to audio.

The V2A system begins by encoding video input into a compressed form. Using a diffusion model, the audio is iteratively refined from random noise, guided by the visual input and natural language prompts to generate synchronized, realistic audio. The final audio output is then decoded, turned into an audio waveform, and combined with the video.
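The three stages described above (encode, iteratively refine from noise, decode) can be illustrated with a toy sketch. This is not Google's implementation, which has not been published as code. Every function name here is hypothetical, the "encoder" just averages pixel values, the text prompt is reduced to a single made-up number (prompt_gain), and toy_denoiser stands in for what would be a trained neural network in a real system.

```python
import math
import random

def encode_video(frames):
    # Toy encoder (assumption): compress each frame, a list of pixel
    # values in 0..1, into a single number (its mean brightness).
    return [sum(frame) / len(frame) for frame in frames]

def toy_denoiser(latent, video_code, prompt_gain):
    # Stand-in for a trained network (assumption): returns an estimate of
    # the clean audio latent, conditioned on the video code and the prompt.
    return [prompt_gain * v for v in video_code]

def generate_audio_latent(video_code, prompt_gain, steps=20, seed=0):
    rng = random.Random(seed)
    # Start from random noise, one value per video frame.
    latent = [rng.gauss(0.0, 1.0) for _ in video_code]
    for step in range(steps):
        estimate = toy_denoiser(latent, video_code, prompt_gain)
        remaining = steps - step
        # Move a fraction of the way toward the estimate; the last step
        # (remaining == 1) lands on it.
        latent = [x + (e - x) / remaining for x, e in zip(latent, estimate)]
    return latent

def decode_to_waveform(latent, samples_per_frame=8, freq=2.0):
    # Toy decoder (assumption): each latent value sets the amplitude of a
    # short sine segment, so the "sound" follows the video frame by frame.
    wave = []
    for amp in latent:
        for n in range(samples_per_frame):
            wave.append(amp * math.sin(2 * math.pi * freq * n / samples_per_frame))
    return wave

# Three tiny "frames" of three pixels each.
frames = [[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [0.4, 0.5, 0.6]]
video_code = encode_video(frames)
latent = generate_audio_latent(video_code, prompt_gain=0.5)
waveform = decode_to_waveform(latent)
```

In this sketch the brighter middle frame ends up with the loudest audio segment, which is a crude analogue of the generated sound tracking the on-screen action. A real system would replace each toy function with a learned model.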

To enhance audio quality and guide the model toward specific sounds, the researchers incorporated AI-generated annotations with detailed sound descriptions and transcripts of spoken dialogue during the training process. This allows the technology to associate specific audio events with various visual scenes based on the provided annotations or transcripts.

For more details, check out Google DeepMind’s blog post on V2A.

Limitations

Despite the advancements, there are still several limitations Google is working to address:

Dependency on Video Quality: The audio output quality is highly dependent on the quality of the video input. Artifacts or distortions in the video, which are outside the model’s training distribution, can lead to a noticeable drop in audio quality.
Lip Synchronization Issues: For videos involving speech, V2A attempts to generate speech from input transcripts and synchronize it with characters’ lip movements. However, the paired video generation model may not be conditioned on transcripts, creating a mismatch that often results in uncanny lip-syncing.
Alignment of Audio and Video Elements: While V2A doesn’t require manual alignment of generated sound with the video, refining the synchronization of different sound elements, visuals, and timings remains a challenge.

The team working on this tech says that further research is underway to address these limitations and enhance the capabilities of the V2A system.

Final Thoughts

Despite being in the preview phase, the initial results of Google’s V2A technology are already impressive. Video generators are advancing at an unprecedented pace, and it’s high time these AI videos are paired with an audio generator.

I can’t wait to hear the audio of all the memes people are generating with AI video generators.

However, the timeline for public access to V2A remains unclear. According to Google, the technology will have to undergo rigorous testing before it is made available to the public:

“Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.”

Nevertheless, it’s encouraging to know that such technology is in development, and we could soon see AI video generators seamlessly integrating audio.

