OpenAI's Sora turns AI prompts into photorealistic videos


We already know that OpenAI's chatbots can pass the bar exam without going to law school. Now, just in time for the Oscars, a new OpenAI app called Sora hopes to help you master cinema without going to film school. For now, Sora is a research product, going out to a select few creators and a number of security experts who will red-team it for vulnerabilities. OpenAI plans to make it available to all would-be auteurs at some unspecified date, but it decided to preview it in advance.

From giants like Google to startups like Runway, other companies have already revealed text-to-video AI projects. But OpenAI says Sora is distinguished by its striking photorealism – something I haven't seen in its competitors – and its ability to produce longer clips, up to a minute, compared with the brief snippets of other models. The researchers I spoke to wouldn't say how long it takes to render one of those videos, but when pressed, they said it was more in the “going out for a burrito” ballpark than “taking a few days off.” If the select examples I've seen are to be believed, it's worth the effort.

OpenAI didn't allow me to enter my own prompts, but it did share four examples of Sora's power. (None came close to the alleged one-minute limit; the longest was 17 seconds.) The first came from an elaborate prompt that sounded like the setup of an obsessive screenwriter: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”

AI-generated video created with OpenAI's Sora.

Courtesy of OpenAI

The result is a sweeping view of what is unmistakably Tokyo, in that magical moment when snowflakes and cherry blossoms coexist. A virtual camera, as if attached to a drone, follows a couple as they slowly stroll through a street scene. One of the passersby is wearing a mask. The street runs along a riverbank; on the left, carts rumble by, and on the right, a row of small shops has shoppers drifting in and out.

It's not perfect. Only when you watch the clip a few times do you realize that the main characters – a couple strolling down the snow-covered sidewalk – would have faced a dilemma had the virtual camera kept running. The sidewalk they occupy seems to dead-end; they would have had to step over a small guardrail onto a strange parallel walkway on their right. Despite that minor glitch, the Tokyo example is a stunning exercise in world-building. Going forward, production designers will debate whether this is a powerful ally or a job destroyer. Moreover, the people in this video – who are generated entirely by a digital neural network – aren't shown in close-up, and they don't do any emoting. But the Sora team says that in other instances it has produced fake actors showing real emotions.

The other clips are also impressive, particularly one asking for “an animated scene of a little fluffy monster kneeling near a red candle,” along with some detailed stage directions (“wide eyes and open mouth”) and a description of the desired vibe of the clip. Sora produces a Pixar-esque creature that seems to have DNA from a Furby, a Gremlin, and Sully from Monsters, Inc. I remember when that latter film came out, Pixar talked a lot about how difficult it was to create the ultra-complex texture of a monster's fur as the creature moved around. It took all of Pixar's wizards months to get it right. OpenAI's new text-to-video machine…did just that.

“It learns about 3D geometry and consistency,” says Tim Brooks, a research scientist on the project, of that achievement. “We didn't bake that in – it entirely emerged from seeing a lot of data.”

AI-generated video created with the prompt, “The animated scene shows a close-up of a small fluffy monster kneeling near a melting red candle. The art style is 3D and realistic, with a focus on lighting and textures. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and an open mouth. Its posture and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.”

Courtesy of OpenAI

As impressive as the visuals are, Sora's most surprising abilities are the ones it hasn't been explicitly trained for. Powered by a version of the diffusion model behind OpenAI's DALL-E 3 image generator, as well as the transformer-based engine of GPT-4, Sora doesn't just churn out videos that fulfill the demands of the prompt – it does so in a way that shows an emergent grasp of cinematic grammar.
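OpenAI hasn't published Sora's code, but the diffusion approach it builds on is well documented in general terms: a model learns to reverse a process that gradually corrupts data with noise, so generation becomes iterative denoising. As a rough, hand-rolled sketch of that idea only – not OpenAI's implementation, and with the denoiser replaced by a toy function that a real system would learn – consider:

```python
import math
import random

random.seed(0)

# Toy 1-D "diffusion": a clean signal is progressively noised (the forward
# process), then iteratively refined back toward structure (the reverse
# process). In a real video model, the hand-written denoise_step below
# would be a trained neural network such as a transformer.

def forward_noise(x, t, total_steps):
    """Blend the signal toward pure Gaussian noise as t approaches total_steps."""
    alpha = 1.0 - t / total_steps  # fraction of the original signal kept
    return [alpha * v + (1 - alpha) * random.gauss(0, 1) for v in x]

def denoise_step(x, target):
    """One reverse step: nudge the noisy sample partway toward the clean
    target. A trained model would *predict* this direction from data
    instead of being handed the answer, as this toy version is."""
    return [v + 0.5 * (c - v) for v, c in zip(x, target)]

clean = [math.sin(i / 4) for i in range(16)]       # stand-in "ground truth"
noisy = forward_noise(clean, t=9, total_steps=10)  # heavily noised sample

x = noisy
for _ in range(10):          # iterative refinement, one small step at a time
    x = denoise_step(x, clean)

err_before = sum((a - b) ** 2 for a, b in zip(noisy, clean))
err_after = sum((a - b) ** 2 for a, b in zip(x, clean))
print(err_after < err_before)  # refinement recovered the structure
```

The point of the sketch is only the shape of the loop: many small denoising steps, each conditioned on what the model believes the output should look like. In Sora's case, OpenAI says that belief is shaped by the text prompt and, apparently, by patterns of geometry and shot composition absorbed from training data.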

This translates into a flair for storytelling. In another video, created with the prompt “a gorgeously rendered paper world of a coral reef filled with colorful fish and sea creatures,” Sora builds a narrative thrust with its camera angles and timing, says Bill Peebles, another researcher on the project. “There are actually multiple shot changes – these aren't stitched together, but generated by the model all at once,” he says. “We didn't tell it to do that, it just did it automatically.”

AI-generated video created with the prompt, “A gorgeously rendered paper world of a coral reef teeming with colorful fish and sea creatures.” Courtesy of OpenAI

In another example I didn't see, Sora was prompted to produce a tour of a zoo. “It started with the zoo's name on a big sign, gradually panned down, and then had a number of shot changes to show the different animals that live at the zoo,” says Peebles. It wasn't explicitly instructed to do this in a cinematic manner – it just did.

One feature of Sora that the OpenAI team did not show, and may not release for quite some time, is the ability to generate videos from a single image or a sequence of frames. “This is going to be another really great way to improve storytelling abilities,” says Brooks. “You can draw exactly what you have in your mind and then bring it to life.” OpenAI is aware that this feature also has the potential to produce deepfakes and misinformation. “We're going to be very careful about all the safety implications of this,” says Peebles.