If you are keeping up to date with the revolutionary changes that
artificial intelligence is bringing to the table, then you probably already
know about generative AI. Generative AI models create content by “inferring”
what the next word or output should be. This is how image generation
models like DALL·E 3 and Stable Diffusion work: you input text and
get an image as a response. But did you know that the same may soon be
possible for video? That is the promise of models like OpenAI’s new
“Sora”, which has already demonstrated an incredible level of video quality.
Keep reading to learn more about generative AI for video and what it might
bring in the future.
What is Generative AI Video?
Generative AI video refers to artificial intelligence models that take text
instructions and convert them into video output. Sora is one such video
generation model. According to OpenAI, “Sora is an AI model that can create
realistic and imaginative scenes from text instructions.” It can generate
videos of up to one minute in length. The Sora model has made waves in the
AI field since its showcase on February 15, 2024, when OpenAI first released
high-definition videos demonstrating what AI video generation is capable of,
opening the doors for future models to be trained in similar ways.
How does Generative AI Video Work?
The Sora model works by training jointly on videos and images of varying resolutions, aspect ratios, and durations. It is trained on “spacetime” data from these videos (that is, data that captures both the image content of each frame and its position in time), which allows it to generate videos of different sizes with surprisingly accurate continuity.

Before this, creating an AI model that could generate video was a difficult problem: models would rapidly lose track of their context, and objects would “disappear” or “appear” out of thin air. This issue still needs to be fully addressed, but the current improvements are promising, and the generated video content is very believable.
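OpenAI has not released Sora’s code, so any implementation details are speculative. Still, the “spacetime patch” idea can be illustrated with a minimal sketch: the snippet below splits a video tensor into small space-time blocks and flattens each block into a token-like vector, much as an LLM tokenizes text. The patch sizes, the trimming step, and the function name are illustrative assumptions, not Sora’s actual parameters.

```python
import numpy as np

def extract_spacetime_patches(video, patch_t=2, patch_h=16, patch_w=16):
    """Split a video tensor into flattened "spacetime" patch tokens.

    video: array of shape (frames, height, width, channels).
    Returns: array of shape (num_patches, patch_t * patch_h * patch_w * channels).
    """
    t, h, w, c = video.shape
    # Trim so each axis divides evenly into patches (a simplification; the
    # real model reportedly handles variable resolutions and durations).
    t, h, w = t - t % patch_t, h - h % patch_h, w - w % patch_w
    video = video[:t, :h, :w]
    # Split each axis into (blocks, block_size) pairs...
    blocks = video.reshape(
        t // patch_t, patch_t,
        h // patch_h, patch_h,
        w // patch_w, patch_w, c,
    )
    # ...group the block indices together, then flatten each block into
    # a single token-like vector.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return blocks.reshape(-1, patch_t * patch_h * patch_w * c)

# Example: a 16-frame 64x64 RGB clip becomes 8 x 4 x 4 = 128 patch tokens.
clip = np.random.rand(16, 64, 64, 3).astype(np.float32)
print(extract_spacetime_patches(clip).shape)  # (128, 1536)
```

Representing video as a sequence of patch tokens is what lets a transformer treat clips of different shapes uniformly: longer or larger videos simply become longer token sequences.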
Applications for Generative AI Video
The first use case that comes to mind for an AI video generation model is
generating stock video. The time, money, and effort required to create
simple videos will be dramatically reduced, and their quality improved.
Other uses are also interesting to see. OpenAI showcases methods for video
editing through prompts, which let you “request” changes to an original
video, such as changing its style or even completely changing the geographic
location in which an event takes place. They also demonstrate combining
and transitioning between videos. In the future, generative AI video
capabilities will take the power of video editing and effects to a new
level while greatly reducing the effort involved.
However, OpenAI also mentions some further-reaching use cases. For example, they
state that such models could offer a “promising path towards building general
purpose simulators of the physical world”. Such an application could benefit
many areas, from physics simulations used in research to improved graphics in
video games and movies with lower computing requirements.
Limitations
OpenAI’s Sora is an incredible model that has achieved groundbreaking results
in generative AI video; however, it is important to remember that there are
limitations:
- Expressiveness: Some specific requirements can be difficult to convey through text alone.
- Duration: The output is currently limited to 1-minute-long videos.
- Physics: The model struggles to simulate some physical interactions, such as glass breaking or objects affecting each other (like a person taking a bite of food), and showcased videos contain bizarre motion in objects such as boats on water, bicycles, and walking people.
- Closed Model: Currently, Sora is not available for public use. Even after it is released as a service, based on OpenAI's recent track record, it is most likely never going to be open source, meaning we will probably not be able to download the model and run it on our own servers. Though an open-source equivalent is unlikely to appear soon, given the massive amount of computation required, we are crossing our fingers for a breakthrough.
FAQs
- What is Sora’s underlying technology?
- Sora uses a transformer-based AI model, trained on spacetime video and image data, to create a text-conditional diffusion model. Where LLMs have text tokens, Sora has visual "spacetime" patches (see the sketch after these FAQs).
- How long are the videos generated by Sora?
- According to OpenAI’s latest statements, Sora can generate videos of up to one minute in length (with no audio).
- Can Sora handle complex prompts?
- Sora was trained on labeled videos with highly descriptive captions. This enables it to generate high-quality videos that accurately follow user prompts.
- Is Sora available for public use?
- Currently, the Sora model is closed and not available for public use. There is still no known date for its release.
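To make “text-conditional diffusion” a little more concrete, here is a deliberately simplified sketch of the generation loop: start from pure noise over the patch latents and repeatedly remove the noise a model predicts, conditioned on the text prompt. The predict_noise function is a hypothetical placeholder for the trained transformer, and the update rule is a crude stand-in for real diffusion samplers; none of this is Sora’s actual code.

```python
import numpy as np

def predict_noise(latents, text_embedding, step):
    # Hypothetical placeholder for the trained transformer denoiser;
    # it returns zeros so the sketch runs end to end without a model.
    return np.zeros_like(latents)

def denoise_video(text_embedding, steps=50, num_patches=128, dim=1536, seed=0):
    """Toy text-conditional diffusion loop over spacetime-patch latents.

    Starts from pure noise and repeatedly subtracts the noise the model
    predicts, conditioned on the text prompt. The update rule is a crude
    stand-in for real samplers such as DDPM/DDIM.
    """
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal((num_patches, dim))  # start from pure noise
    for step in reversed(range(steps)):
        latents = latents - predict_noise(latents, text_embedding, step) / steps
    return latents  # a decoder would then turn these latents into frames

# Example: "generate" latents for a prompt embedding (all zeros here).
latents = denoise_video(text_embedding=np.zeros(512))
print(latents.shape)  # (128, 1536)
```

The key point of the sketch is the conditioning: the same denoising loop produces different videos depending on the text embedding passed in, which is what makes the model “text-to-video” rather than just a video generator.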