New email: Product Hunt, February 15th, 2024—Sam Altman just launched “Sora by OpenAI”
Hey there, Yael!
Sam Altman just launched Sora by OpenAI - Create minute-long videos using text prompts
Sora is an AI model that can create realistic and imaginative scenes from text instructions.
OpenAI will now let you create videos from text prompts. Sora is capable of creating “complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background,” according to OpenAI’s introductory blog post.
The company also notes that the model can understand how objects “exist in the physical world,” as well as “accurately interpret props and generate compelling characters that express vibrant emotions.”
To be honest, the demos on the page are so realistic that for a moment they feel like they came straight out of a movie. Sora can generate entire videos all at once, or extend generated videos to make them longer.
Until not long ago, many claimed AI would never be able to handle creative tasks.
Now it seems we are about to move past that limitation.
“By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily,” OpenAI explains.
“The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.” ChatGPT and LLMs can do anything (or look like they can).
Where are we heading? What can we do with them? How do we know?
So, what is important now?
Do we move to chatbots as a magical general-purpose interface, or do we unbundle them back into single-purpose software?
The quality is outstanding, but how long do these renders take to generate?
Almost a decade ago, the new machine learning meant that speech recognition and natural language processing were good enough to build a generalized, open-ended input.
We could ask anything, but in practice the system could only answer 10, 20, or 50 things, each of which had to be built by hand, one by one, by someone at Google, Amazon, or Apple. Fast forward to now, and impressive clips like these will soon be everywhere.
“All are generated by models without any modifications, highlighting their ability to create realistic and imaginative scenes.”
—OpenAI
The understandable issue is that this can rapidly become intoxicating. Just a year ago, early image generators were outputting garbled smudges; now, in a few sentences, we can generate Pixar-looking 3D cartoons and landscapes. This brings us to two new problems: a product problem and a science problem.
Back to the point: we can ask anything, and the system will try to answer, but it might be wrong; and even if it answers correctly, the answer might not be the right way to achieve what we want or need (that might be the bigger problem).
The science problem: last year (2023), the news was all about ‘error rates’ and ‘hallucinations’. The breakthrough of LLMs is that they are statistical models that can be built by machine at a large scale, instead of deterministic models that (today) must be built by hand and don’t scale. The goal behind Sora (and others) is to develop models that understand and simulate real-world dynamics, helping to solve practical problems. But even if the models can generate videos from still images, fill in missing frames in an existing video, or extend it, we will still need a human check. So the struggle to accurately simulate the physics of a complex scene will remain for a while.
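To make that contrast concrete, here is a toy sketch of the two approaches. Everything in it (names, data, logic) is a hypothetical illustration, not any real product’s code: a deterministic assistant whose every answer is hand-built, versus a statistical one that will answer anything, but only probably correctly.

```python
from collections import Counter

# Deterministic, hand-built: every supported question is coded one by one.
# Exact when it matches, useless outside the table, and it doesn't scale.
HAND_BUILT_ANSWERS = {
    "what's the weather": "Checking the weather service...",
    "set a timer": "Timer set.",
}

def deterministic_assistant(prompt: str) -> str:
    return HAND_BUILT_ANSWERS.get(prompt.lower(), "Sorry, I can't help with that.")

# Statistical, machine-built: learn word-following frequencies from a corpus and
# answer anything, but only probabilistically. This is where error rates and
# hallucinations come from.
def statistical_assistant(prompt: str, corpus: list[str]) -> str:
    last = prompt.lower().split()[-1]
    followers = Counter(
        b
        for text in corpus
        for a, b in zip(text.lower().split(), text.lower().split()[1:])
        if a == last
    )
    return followers.most_common(1)[0][0] if followers else "..."
```

The first function never surprises you and never generalizes; the second generalizes to any input, but its answers are only as good as the statistics it learned.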
The product problem: we get false impressions of certainty. LLMs can now produce near-perfect natural language, which tends to hide the flaws in the underlying model, and that alters the reality we perceive: the product questions are much broader than the error rate. The Sora-generated demos include an aerial scene of California during the gold rush, a video that looks as if it were shot from inside a Tokyo train, and others. How do we present and package uncertainty?
An example of inaccurate physical modeling and unnatural object “morphing”:
The things to know
Improvements in text-to-video models and offshoots like image-to-video and video-to-video are gaining steam. What’s next for AI video?
What is now possible to do with Sora. The system:
Understands the physical world, ensuring characters and scenes behave in a believable manner.
Maintains visual quality and consistency throughout the video.
Generates detailed videos from complex prompts, like a stylish woman walking in Tokyo or a movie trailer.
Utilizes a diffusion model and transformer architecture for superior scaling, capable of extending videos and animating still images accurately (a conceptual sketch follows this list).
Users can guide what Sora remembers and forgets, enhancing personalization in future interactions.
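Since OpenAI has not published Sora’s architecture or code, the following is only a minimal conceptual sketch of what a diffusion-transformer denoising step over video “spacetime patches” could look like. Every module, shape, and name below is a hypothetical illustration, not OpenAI’s implementation.

```python
import torch
import torch.nn as nn

class TinyVideoDiffusionTransformer(nn.Module):
    """Toy diffusion transformer: predicts the noise added to video patch tokens."""

    def __init__(self, patch_dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.noise_head = nn.Linear(patch_dim, patch_dim)  # per-patch noise estimate

    def forward(self, noisy_patches, text_embedding):
        # noisy_patches: (batch, num_spacetime_patches, patch_dim)
        # text_embedding: (batch, 1, patch_dim), a stand-in for prompt conditioning
        tokens = torch.cat([text_embedding, noisy_patches], dim=1)
        hidden = self.backbone(tokens)
        return self.noise_head(hidden[:, 1:, :])  # drop the prompt token, keep patch outputs

model = TinyVideoDiffusionTransformer()
noisy = torch.randn(1, 32 * 16, 256)     # e.g. 32 frames x 16 patches per frame
prompt = torch.randn(1, 1, 256)          # stand-in for a text encoder's output
denoised = noisy - model(noisy, prompt)  # one toy denoising step
```

In practice, a step like this would be iterated many times, starting from pure noise and conditioned on a real text encoder’s output; attending over patches from all frames at once is what would let the same backbone extend videos or animate a still image.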
Back to the question: what's next for AI video?
Legitimate questions abound. Where should we draw the line between privacy and justice? And between protecting personal freedom and protecting society?
Wherever we draw the line, it will gradually move toward reduced privacy, to compensate for the fact that evidence is getting easier to fake.
Once AI can generate fully realistic fake videos of people committing crimes, will you vote for a system where the government tracks everyone’s whereabouts at all times?
It could provide society with an ironclad alibi if needed.
Where should we move the needle, and accept (or deny) that innovation is part of human evolution?
Explore more
“AI generated videos just changed forever” by Marques Brownlee on YouTube
Share your thoughts
How do you think artificial intelligence is transforming the world?
Please take a moment to comment and share your thoughts.
Continue exploring
📌 AI case studies
You are receiving this email because you signed up for Wild Intelligence by Yael Rozencwajg. Thank you for your interest in our newsletter!
AI case studies are part of Wild Intelligence’s approaches and strategies.
We share tips to help you lead, launch and grow your sustainable enterprise.
Become a premium member and get our tools to start building your AI-based enterprise.