Hot on the heels of Meta’s Make-A-Video, Google said on Wednesday that it too has built an AI-powered text-to-video system. This one is called Imagen Video.
We dare say that last week’s public reveal of Make-A-Video spurred the Big G to suddenly shout about its own competing system, lest it appear to have fallen behind Mark Zuckerberg’s team. Or maybe Meta found out about Google’s planned announcement and rushed to spoil it with its own revelation. It seems too much of a coincidence.
Given a text prompt such as “Sprouts in the shape of text ‘Imagen Video’ coming out of a fairytale book. Smooth video,” Google’s software generates a sequence of images to create a short clip, as shown below.
Prompt: "Sprouts in the shape of text 'Imagen Video' coming out of a fairytale book." Model Output: pic.twitter.com/FVgnM0UAAn
— Durk Kingma (@dpkingma) October 5, 2022
There are numerous other examples of entirely fabricated footage that the model has produced from prompts, such as “a teddy bear walking through New York City” or “incredibly detailed science fiction scene set on an alien planet, view of a marketplace. pixel art.”
Imagen Video builds on Google’s previous text-to-image system, Imagen, which launched in May. Instead of a single still image, however, Imagen Video generates multiple frames to create a video.
Text-to-video systems are more computationally expensive to train and run than text-to-image systems. Imagen Video, for example, is made up of seven models. It not only has to generate a frame from its text prompt, it also has to predict what the following frames should be so that they form a coherent animation, with each frame a slight evolution of the previous one. A series of merely related frames played back in sequence would look like a mess.
“Imagen Video generates high-resolution video using Cascaded Diffusion Models,” according to a Google research note.
“The first step is to take a text prompt and encode it into textual embeddings using a T5 text encoder.
“A baseline video diffusion model then generates 16-frame video at 24×48 resolution and three frames per second; this is followed by multiple Temporal Super-Resolution (TSR) and Spatial Super-Resolution (SSR) models to upsample and generate a final 128-frame video at 1280×768 resolution and 24 frames per second – resulting in 5.3 seconds of high-definition video.”
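The numbers in that description can be checked with a little arithmetic. The sketch below is only an illustration built from the figures quoted above, not Google's code; the helper `clip_duration` and the variable names are our own, and nothing here models the individual TSR or SSR stages, whose per-stage settings the note does not give.

```python
from fractions import Fraction


def clip_duration(frames: int, fps: int) -> Fraction:
    """Return the running time in seconds of a clip with `frames` frames at `fps`."""
    return Fraction(frames, fps)


# Figures quoted in the research note: (frame count, frames per second, pixels per frame).
base_frames, base_fps, base_pixels = 16, 3, 24 * 48
final_frames, final_fps, final_pixels = 128, 24, 1280 * 768

base_seconds = clip_duration(base_frames, base_fps)
final_seconds = clip_duration(final_frames, final_fps)

# Overall growth across the whole cascade (all upsampling stages combined).
frame_factor = Fraction(final_frames, base_frames)
pixel_factor = Fraction(final_pixels, base_pixels)

print(f"base clip:  {float(base_seconds):.2f} s")
print(f"final clip: {float(final_seconds):.2f} s")
print(f"frames grow {frame_factor}x, pixels per frame grow about {float(pixel_factor):.0f}x")
```

Both clips work out to 16/3 seconds, which is the 5.3 seconds Google quotes. On these figures, the upsampling stages do not make the clip any longer: they fill in eight times as many frames over the same stretch of time and add far more pixels to each one.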
Like Meta’s Make-A-Video, Google’s Imagen Video produces output that is a bit fuzzy. The edges of objects are blurry and the resolution is not that great yet. However, research and development of generative visual models is progressing rapidly, and it will only be a matter of time before a new architecture creates sharper, higher-resolution fake videos that run for longer.
An example given by Google of how Imagen Video generates a clip frame by frame
These models show that computers are good at learning the logical order of events, well enough to simulate things like a water balloon popping or ice cream melting. Google Brain’s boffins, in a non-peer-reviewed research paper [PDF], described Imagen Video’s output as “temporally coherent” and “well aligned with the given prompt.”
An internal Google dataset of 14 million video-text samples and 60 million image-text pairs, along with the publicly available LAION-400M image-text dataset, was used to train Imagen Video.
“Generative video models can be used to positively impact society, for example by amplifying and enhancing human creativity. However, these generative models can also be misused, for example to generate fake, hateful, explicit or harmful content,” the researchers said. The LAION-400M dataset is also known to contain pornographic and other types of problematic images.
Although the team has applied content filters to block unsavory text prompts or images in model-generated videos, Imagen Video still tends to create content with “social bias and stereotypes” in it, and it’s not yet safe to experiment with. “We have decided not to release the Imagen Video model or its source code until these concerns are resolved,” they concluded.
So, like Meta’s toy, Imagen Video is not available to the general public, which arguably makes these public reveals more of a recruiting tool (hey, come and work on cool stuff like this) than anything else at the moment. ®
https://www.theregister.com/2022/10/06/google_ai_imagen_video/ Imagen Video • The Register