Meta Movie Gen Guide: How It Works, Examples, Comparison
Meta recently released Meta Movie Gen, a significant and somewhat unexpected entry in the text-to-video generation landscape. I don't think Meta was on many people's radar when it comes to video generation.
Given the competitive environment, with several high-profile companies such as OpenAI already working on robust models like Sora, Meta's release is noteworthy.
The model performs well across various tasks, outperforming or matching the quality of offerings from established players like Runway Gen3, LumaLabs, and, notably, OpenAI's Sora.
In this blog, I'm going to cover what Meta Movie Gen is, how it works, its capabilities and limitations, and the safety considerations surrounding its use.
If you want to get an overview of the video generation field, I recommend going over this article on top AI video generators.
What Is Meta Movie Gen?
Meta Movie Gen is a collection of foundational models for generating various types of media, including text-to-video, text-to-audio, and text-to-image. It consists of four models:
- Movie Gen Video
- Movie Gen Audio
- Personalized Movie Gen Video
- Movie Gen Edit
Movie Gen Video model
Movie Gen Video is a 30-billion-parameter model designed to create images and videos based on text descriptions. The model can generate high-quality videos up to 16 seconds long that match the given text prompt. It can produce content in different sizes, resolutions, and lengths.
Here’s an example of a video generated with the prompt:
“A sloth with pink sunglasses lays on a donut float in a pool. The sloth is holding a tropical drink. The world is tropical. The sunlight casts a shadow.”
The water reflections in the video are remarkable. I also found the shadows on the sloth's face quite interesting. Small details like these often go unnoticed when executed well, but they become jarring and break the illusion when done poorly.
Movie Gen Audio model
Movie Gen Audio is a 13-billion-parameter model that can create video soundtracks that match the content of a video. It can also generate audio from a text prompt.
It generates high-quality, 48kHz audio that syncs with videos of different lengths, even several minutes long. The model can also produce ambient sounds, matching unseen sources in the video, sound effects that sync with actions, and fitting background music.
Here’s an example of the music and sound effects generated for one of the videos using the prompt:
“Rustling leaves and snapping twigs, with an orchestral music track.”
Note that the mood of the music track wasn’t present in the prompt. The model was able to pick up the mood from the video's content.
Personalized Movie Gen Video model
The Personalized Movie Gen Video model can generate videos of a chosen person based on a text description and an image of that person. The video keeps the person's identity intact while following the text prompt.
The following example was generated by combining the selfie of the person with the prompt:
“A woman DJ spins records on a rooftop in LA. She is wearing a pink jacket and giant headphones. There is a cheetah next to the woman. The background is a cityscape.”
In my opinion, this model stands out as their crown jewel because it shows huge potential if incorporated into their other products, such as Instagram. Imagine users creating videos featuring themselves in various scenarios, all generated from a single image and a text prompt. This could lead to an explosion of creative and engaging content.
Movie Gen Edit model
The Movie Gen Edit model lets us make detailed and creative changes to both real and generated videos using only text instructions.
The model has the potential to revolutionize video editing workflows across various industries. Its ability to accurately understand and execute text-based editing instructions could significantly improve the editing process, making it faster, more efficient, and accessible to a wider range of users. This technology could benefit filmmakers, content creators, educators, and anyone who works with video content.
How Does Movie Gen Video Work?
The training of the Movie Gen Video model involved several key components: data collection and preparation, the training process itself, fine-tuning for enhanced quality, and the upsampling techniques used to achieve high-resolution outputs.
Data and pre-processing
The Movie Gen Video model was trained on a massive dataset containing hundreds of millions of video-text pairs and over a billion image-text pairs.
Each video in the dataset undergoes a rigorous curation process that involves filtering for visual quality, motion characteristics, and content relevance. The filtering process aims to select videos with non-trivial motion, single-shot camera work, and a diverse range of concepts, including a significant portion featuring humans.
Source: Movie Gen research paper
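To make this gating step concrete, here is a hypothetical sketch of the kind of rule-based filter such a curation pipeline might apply to each clip. The field names and thresholds are invented for illustration; Meta's actual pipeline is far more involved and also relies on learned models.

```python
# Hypothetical rule-based gate applied to each candidate clip. The field
# names and thresholds are invented for illustration; the real pipeline is
# far more involved and also relies on learned models.
def keep_clip(clip: dict) -> bool:
    return (
        clip["aesthetic_score"] >= 0.5       # visual quality filter
        and clip["motion_score"] >= 0.3      # drop near-static clips
        and clip["num_shots"] == 1           # keep single-shot camera work
        and not clip["has_overlay_text"]     # avoid clips dominated by text
    )

sample = {"aesthetic_score": 0.8, "motion_score": 0.6,
          "num_shots": 1, "has_overlay_text": False}
print(keep_clip(sample))  # True
```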
High-quality captions are very important for effectively training the model. Movie Gen Video uses the LLaMa3-Video model to generate detailed captions that accurately describe the video content. These captions go beyond simply labeling objects; they include descriptions of actions, camera movements, and even lighting information. This rich textual information helps the model learn a more nuanced understanding of visual storytelling.
Training process
The training process for Movie Gen Video is divided into several stages to improve efficiency and model scalability.
The model is first trained on the text-to-image task using lower-resolution images. This "warm-up" phase allows the model to learn fundamental visual concepts before tackling the more complex task of video generation.
The model is then trained jointly on both the text-to-image and text-to-video tasks, progressively increasing the resolution of the input data. This joint training approach helps the model use the abundance and diversity of image-text datasets while also learning the complexities of video generation.
Source: Movie Gen research paper
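As a rough illustration of this staged curriculum, here is a hypothetical training schedule expressed as a simple config. The stage names, resolutions, and ordering are assumptions made for the example, not Meta's exact recipe.

```python
# Illustrative staged schedule; stage names, resolutions, and ordering are
# assumptions for the example, not Meta's exact recipe.
training_stages = [
    {"stage": "warm-up",        "tasks": ["text-to-image"],                  "resolution": 256},
    {"stage": "joint low-res",  "tasks": ["text-to-image", "text-to-video"], "resolution": 256},
    {"stage": "joint high-res", "tasks": ["text-to-image", "text-to-video"], "resolution": 768},
]

for stage in training_stages:
    print(f"{stage['stage']}: {', '.join(stage['tasks'])} at {stage['resolution']}px")
```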
To handle the computational demands of video data, Movie Gen Video uses a Temporal Autoencoder (TAE) to compress the videos in both space and time and to decompress them afterward. This is analogous to zipping and unzipping files on a computer, allowing the model to work with a more manageable representation of the video data.
Source: Movie Gen research paper
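The sketch below shows the core idea with a toy temporal autoencoder built from strided 3D convolutions. It is not Meta's architecture; the real TAE is much larger and uses different compression factors, but the encode-compress-decode pattern is the same.

```python
import torch
import torch.nn as nn

# Toy temporal autoencoder: strided 3D convolutions compress the video in
# time and space, transposed convolutions reverse the compression. This is
# not Meta's architecture; it only illustrates the encode/decode pattern.
class TinyTAE(nn.Module):
    def __init__(self, latent_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):          # video: [batch, channels, frames, height, width]
        latents = self.encoder(video)  # compressed representation the generator works on
        return self.decoder(latents), latents

video = torch.randn(1, 3, 16, 64, 64)        # 16 frames of 64x64 RGB
recon, latents = TinyTAE()(video)
print(latents.shape)   # torch.Size([1, 16, 4, 16, 16]) -- 4x smaller along each axis
```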
Movie Gen Video uses a training objective called Flow Matching. Instead of directly generating the final video, flow matching guides the model to gradually transform a sample of random noise into the desired video output, step by step.
Flow Matching involves predicting the velocity of samples in the latent space rather than directly predicting the samples themselves. This method is more efficient and performant than traditional diffusion-based methods.
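Here is a minimal sketch of a flow matching training step, assuming the common linear interpolation path between noise and data; the paper's exact schedule and conditioning differ. The `model` below is a toy stand-in for the actual video backbone, which would also take the timestep and text embedding as inputs.

```python
import torch
import torch.nn as nn

# Minimal flow matching training step, assuming a linear interpolation path
# between noise and data. `model` is a toy stand-in for the real video backbone,
# which would also condition on the timestep and the text embedding.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x1 = torch.randn(8, 16)            # "clean" latents (the data we want to reach)
x0 = torch.randn_like(x1)          # pure Gaussian noise (the starting point)
t = torch.rand(x1.shape[0], 1)     # random timestep in [0, 1] per sample

xt = (1 - t) * x0 + t * x1         # point on the straight path from noise to data
target_velocity = x1 - x0          # velocity that moves along that path

pred_velocity = model(xt)          # the network predicts the velocity at (xt, t)
loss = ((pred_velocity - target_velocity) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```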
Fine-tuning
To improve the quality and aesthetics of the generated videos, the model undergoes supervised fine-tuning using a smaller set of manually curated videos with high-quality captions. This is like an artist refining their technique through expert guidance, leading to more visually appealing and realistic video outputs.
The final Movie Gen Video model is created by averaging multiple models trained with different datasets, hyperparameters, and pretraining checkpoints. This technique helps combine the strengths of different models, leading to more robust and reliable performance.
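In code, this kind of checkpoint averaging (sometimes called "model souping") is just an element-wise mean over the saved weights. A minimal sketch, with placeholder file names:

```python
import torch

# Hypothetical sketch of checkpoint averaging ("model souping"): the final
# weights are the element-wise mean of several fine-tuned checkpoints.
# The file names below are placeholders, not real artifacts.
paths = ["finetune_run_a.pt", "finetune_run_b.pt", "finetune_run_c.pt"]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

averaged = {
    key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    for key in state_dicts[0]
}
torch.save(averaged, "movie_gen_averaged.pt")
```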
Upsampling
The model produces videos at a resolution of 768 pixels and uses upsampling to increase the resolution to full HD 1080p, resulting in sharper and more visually appealing videos. This approach is commonly used because it is computationally efficient: the base model processes fewer tokens while the pipeline still produces high-resolution outputs.
The Spatial Upsampler functions as a video-to-video generation model. It takes the lower-resolution video as input, upsamples it to the target resolution using bilinear interpolation, and then encodes it into a latent space using a frame-wise Variational Autoencoder (VAE).
A latent space model then generates the latents of the HD video, conditioned on the encoded latents of the lower-resolution video. Finally, the VAE decoder transforms the resulting HD video latents back into the pixel space, producing the final high-resolution output.
Source: Movie Gen research paper
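In the rough sketch below, only the bilinear resize maps to a single standard call; the frame-wise VAE and the latent upsampling model are left as placeholders, and the frame count and aspect ratio are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Rough sketch of the upsampling pipeline described above. Only the bilinear
# resize uses a real API call; the VAE encode/decode and the latent-space
# model are placeholders. Frame count and size are illustrative.
low_res = torch.randn(1, 3, 4, 432, 768)             # [batch, channels, frames, H, W]

# Step 1: bilinear interpolation of each frame to the 1080p target size.
frames = low_res.permute(0, 2, 1, 3, 4).flatten(0, 1)        # -> [frames, C, H, W]
upsampled = F.interpolate(frames, size=(1080, 1920),
                          mode="bilinear", align_corners=False)
print(upsampled.shape)                               # torch.Size([4, 3, 1080, 1920])

# Step 2 (placeholder): a frame-wise VAE encodes `upsampled` into latents,
# a latent model refines them conditioned on the low-res latents, and the
# VAE decoder maps the refined latents back to pixels.
```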
Evaluating the Movie Gen Model
Evaluating a text-to-video model is more challenging than a text-to-image model because of the added temporal dimension. The Meta team used three main measures to evaluate the model:
- Alignment with the prompt
- Quality and consistency
- Realness and aesthetics
These criteria are evaluated manually by human raters who grade the videos along each axis.
Alignment with the prompt
Prompt alignment measures how well the generated video matches the input text prompt, including descriptions of the subject, motion, background, and other details. This is further broken down into:
- Subject match: Focuses on the alignment of subject appearance, background, lighting, and style.
- Motion match: Assesses the alignment of motion-related descriptions.
Quality and consistency
The quality and consistency of the generated video are evaluated independently of the text prompt. This measure is broken down into:
- Frame consistency: Checks for inconsistencies or artifacts in object appearance, lighting, and background across frames.
- Motion completeness: Assesses whether the video contains a sufficient amount of motion, particularly for unusual subjects or activities.
- Motion naturalness: Judges the realism and naturalness of the motion, considering aspects like limb movement, facial expressions, and adherence to physics.
- Overall quality: A holistic judgment that considers the balance of the previous three sub-axes to determine the overall goodness of the video.
Realness and aesthetics
Evaluators were asked to rate the model's ability to produce photorealistic and aesthetically pleasing videos. This measure is divided into:
- Realness: Assesses how closely the video resembles a real video, or for fantastical prompts, how well it mimics a realistic art style.
- Aesthetics: Judges the visual appeal of the video based on factors like content, lighting, color, and camera effects.
Evaluation benchmark
The benchmark data was designed to evaluate video generations comprehensively. It includes 1,000 prompts that cover various testing aspects, such as human activity, animals, nature, physics, and unusual subjects or activities.
Their benchmark is over three times larger than those used in previous studies. Each prompt is tagged with a motion level (high, medium, low) to assess generation quality across different motion intensities.
The evaluation uses the entire prompt set and examines specific testing metrics, particularly focusing on how well models handle unusual subjects and motions to test their generalization capabilities.
The following diagram shows a breakdown of the prompts used in the evaluation. On the left, we are given the distribution of the concepts of the prompts. On the right side, we see common nouns and verbs used inside the prompts (the bigger the word, the more frequent it is):
Source: Movie Gen research paper
The table below shows the results of the human evaluation of the videos produced by different models across several evaluation criteria. The numbers represent net win rates, calculated as (Win % - Loss %). Values range from -100 (complete loss) to 100 (complete win), with 0 indicating a tie.
Source: Movie Gen research paper
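As a quick illustration of how these numbers are computed, here is the net win rate for a made-up head-to-head comparison:

```python
# Net win rate for a made-up head-to-head comparison between two models.
def net_win_rate(wins: int, losses: int, ties: int) -> float:
    total = wins + losses + ties
    return 100 * (wins - losses) / total   # (Win % - Loss %), in [-100, 100]

print(net_win_rate(wins=55, losses=30, ties=15))   # 25.0 -> a clear preference
```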
Meta Movie Gen vs. Sora vs. Runway Gen 3
According to their experiments, Movie Gen Video demonstrates strong performance in text-to-video generation, outperforming several existing models, including commercial systems like Runway Gen3 and LumaLabs, based on human evaluations using the Movie Gen Video Bench. Specifically, Movie Gen Video shows significant advantages in:
- Overall quality: Evaluators consistently rated Movie Gen's outputs as having higher overall quality, particularly in terms of visual fidelity, motion naturalness, and alignment with the text prompts.
- Motion naturalness and consistency: Movie Gen exhibits a significant edge in generating videos with natural and consistent motion, outperforming Runway Gen3 and OpenAI's Sora in these aspects. This strength is likely attributed to the model's ability to learn and simulate real-world physics through its Flow Matching training objective and the extensive use of video data during training.
- Realness and aesthetics: Movie Gen also excels in generating videos with higher realness and aesthetic appeal compared to Runway Gen3, LumaLabs, and Kling1.5.
While Movie Gen generally outperforms other models, Kling1.5 sometimes generates videos with larger motion magnitudes, although at the cost of distortions and inconsistencies. This highlights the trade-off between motion completeness and visual quality: Movie Gen Video prioritizes generating realistic and consistent motion over simply maximizing movement magnitude.
Below are some screenshots from the Meta Connect 2024 presentation showcasing a few comparison examples of video generation prompts:
Source: Movie Gen research paper
Here are some comparison examples of video personalization where a photo is used to generate a personalized video:
Source: Movie Gen research paper
Finally, the screenshots below showcase comparison examples of video editing:
Source: Movie Gen research paper
Overall, it seems that Meta Movie Gen has raised the bar and set a new standard for text-to-video models. The evaluation procedure highlighted in the paper appears fair, but, of course, it's hard to know for sure unless these models become publicly available and neutral third parties can compare them.
Model Limitations
While Meta Movie Gen performs impressively across media generation tasks, there are still areas for improvement.
The model sometimes struggles with complex scenes involving intricate geometry, object manipulation, and realistic physics simulations. For example, generating convincing interactions between objects, accurately depicting state transformations (like melting or shattering), and realistically simulating the effects of gravity or collisions can still pose challenges for the models.
Audio synchronization can be problematic in certain scenarios. Specifically, when dealing with motions that are visually small, occluded, or require a high level of visual understanding to generate the corresponding sound, the models may struggle to achieve perfect synchronization. Examples of such challenges include accurately synchronizing footsteps with a walking person, generating appropriate sounds for objects that are partially hidden from view, and recognizing subtle hand movements on a guitar to produce the correct musical notes.
Additionally, the Movie Gen Audio model, as currently designed, does not support voice generation.
Safety Considerations and Model Release
Realistic AI-generated videos bring significant risks and ethical challenges that must be considered. These risks include the manipulation of trust and the spread of misinformation through deepfakes, which could damage reputations, misinform the public, and harm businesses.
On one hand, it could be an Instagram feature that lets me create a fun video of myself to share with friends. On the other hand, the same feature would make it possible to create a fake video of someone else. As these technologies improve, it may become impossible to distinguish generated videos from real ones.
AI-generated content poses privacy threats, potentially leading to extortion, blackmail, or cyberbullying.
Moreover, these AI tools could undermine art and copyright by mimicking unique art styles without credit. Solutions involve spreading awareness and education about media literacy, investing in detection technology, establishing legal and ethical frameworks, and employing watermarking for authentication.
While AI video generation holds great potential, it requires careful handling to maintain trust and authenticity.
Meta has been open with its AI developments by open-sourcing the Llama models. However, for safety reasons, the Meta Movie Gen models aren’t going to be released yet. According to their article, multiple improvements and safety considerations are required before deploying them.
Conclusion
In this blog post, we introduced Meta's Movie Gen models, a suite of AI models for generating and editing videos using text prompts.
We explored the four distinct models, their capabilities, and how they work. Through a comparative analysis, we showcased how Meta Movie Gen stacks up against other AI video generation tools.
While Meta Movie Gen has set a new benchmark in the field, the technology's potential misuse raises significant concerns. Meta has acknowledged these risks and is taking a cautious approach to releasing these models.
To learn more about Meta AI’s ecosystem, I recommend these blogs:
FAQs
When will Meta Movie Gen be released?
Meta hasn't announced a specific release date for Movie Gen. It's currently focused on research and development, aiming to improve the model's capabilities and address potential issues before a wider release. While there's no official timeline, I think it's possible that Meta might initially integrate Movie Gen features into their existing platforms like Instagram or Facebook, rather than releasing the full model directly.
How long can Meta Movie Gen videos be?
Currently, Meta Movie Gen can generate videos up to 16 seconds long. This limitation might be due to the computational demands of generating longer, high-quality videos. Future developments could potentially extend the maximum video length.
Will Meta Movie Gen be open-sourced?
Meta hasn't confirmed whether Movie Gen will be open-sourced. However, they have a history of open-sourcing some of their AI models, like Llama. It's possible that they might choose to open-source parts of Movie Gen or release it under a specific license to encourage research and development by the broader AI community.
What's the video quality of Meta Movie Gen?
Meta Movie Gen generates videos in 1080p HD resolution, providing high-quality output suitable for various applications.