AI tools
January 26, 2024

Google Raises The Text-to-Video Bar with Lumiere

Google’s new approach to text-to-video AI models.

John Paul Ada

Lumiere is a new AI model from Google which generates beautiful, smooth, and high quality video.

It uses an architecture they call a Space-Time U-Net:

Space-Time U-Net architecture (from the project page)

Okay, so what??

Alright, chill. Let me explain, simply.

Previous Approaches

Previous AI video generation approaches, in simple terms, first generated images showing what the video would look like at certain points in time. Then a second model created the transitions between those images by filling in multiple intermediate frames.

For example, the AI can create the image at the start of the video and another image at the end of the video. Then another AI will fill in the gap in between to complete the video.

It’s like one of those flip books!

A lot of times, this results in rough and awkward transitions.
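To make that pipeline concrete, here's a toy NumPy sketch of the two-stage idea: a stand-in "keyframe generator" and a linear blend standing in for the interpolation model. The function names and the linear interpolation are my own illustration, not the actual models those systems use:

```python
import numpy as np

def generate_keyframes(num_keyframes, height, width, seed=0):
    # Stand-in for a text-to-image model: a few sparse "keyframe" images.
    rng = np.random.default_rng(seed)
    return rng.random((num_keyframes, height, width, 3))

def interpolate_frames(keyframes, frames_between):
    # Stand-in for the second model: fill the gap between consecutive
    # keyframes with linearly blended in-between frames.
    frames = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, frames_between, endpoint=False):
            frames.append((1 - t) * a + t * b)
    frames.append(keyframes[-1])
    return np.stack(frames)

keys = generate_keyframes(num_keyframes=3, height=4, width=4)
video = interpolate_frames(keys, frames_between=8)
print(video.shape)  # (17, 4, 4, 3): 2 gaps x 8 blended frames + final keyframe
```

Each stage only ever sees a slice of the video, which is one intuition for why the stitched-together result can look choppy.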

Lumiere’s Approach

Lumiere, on the other hand, does it differently. (Because if it did it the same way, what's the point of this article, yeah?)

The biggest change is this: instead of creating images then filling in the gap in between them,

Lumiere creates all of the frames all at once.

This creates a noticeably smoother and higher quality video.
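The key mechanism, per the paper, is that the Space-Time U-Net downsamples and upsamples the whole video volume in both space *and* time, so the model reasons about the entire clip at once instead of one frame at a time. Here's a minimal NumPy sketch of just that down/up-sampling idea; the pooling and nearest-neighbour upsampling are my simplified stand-ins, not Lumiere's learned layers:

```python
import numpy as np

def spacetime_downsample(video, ft=2, fs=2):
    # Average-pool the full (frames, H, W, C) volume along time (ft)
    # and space (fs), compressing both axes at once.
    t, h, w, c = video.shape
    v = video[: t - t % ft, : h - h % fs, : w - w % fs]
    v = v.reshape(t // ft, ft, h // fs, fs, w // fs, fs, c)
    return v.mean(axis=(1, 3, 5))

def spacetime_upsample(video, ft=2, fs=2):
    # Nearest-neighbour upsampling back to the original resolution.
    return video.repeat(ft, axis=0).repeat(fs, axis=1).repeat(fs, axis=2)

clip = np.random.default_rng(1).random((16, 8, 8, 3))  # the whole clip at once
coarse = spacetime_downsample(clip)    # (8, 4, 4, 3)
restored = spacetime_upsample(coarse)  # (16, 8, 8, 3)
print(coarse.shape, restored.shape)
```

Because every frame passes through the same compressed representation, temporal consistency comes for free rather than being stitched in afterwards.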

Check out the demo video below to see what I mean.

Demo Video

Demo video from the project page.


This new approach works great not only on text-to-video, but also:

  • Image-to-Video
  • Image-to-Video (Cinemagraph)
  • Video-to-Video Style Copy
  • Video-to-Video with Text Prompts
  • Video In-Painting


Figure on the model performance (from the paper)

Based on their user study (take self-reported benchmarks with a grain of salt), their new approach outperforms the baseline models on video quality in both text-to-video and image-to-video modes, as well as on text alignment.

Currently it can only generate five seconds of video, but that length is good enough for most single shots.

My Thoughts

Honestly, this approach is pretty interesting. I'm just a tad disappointed that they did not release the model itself, because I would LOVE to play around with it. I've already been burned by video-to-video AI that had rough and awkward transitions 🥲

They also use a pre-trained text-to-image model as part of their process, so I'm thinking they could try out the more compute-efficient Hourglass architecture.


