Sora Team Interview: How Was It Built? How Long Does Generation Take? When Can We Use It?

Author: 人人都是产品经理 | Published: 2024-03-19

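A while ago, Sora's core team gave an interview that revealed a lot of previously undisclosed information. I listened to the recording four times, produced a verbatim English transcript, and translated it; each translated passage below is followed by the original English.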

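Host: It's a real honor to have you all take time out of your busy schedules for this conversation.

Before we start, could you each introduce yourselves: your name and what you're responsible for?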

First of all thank you guys for joining me. I imagine you’re super busy, so this is much appreciated. If you don’t mind, could you go one more time and give me your names and then your roles at OpenAI.

Bill Peebles:

Bill Peebles. I lead the Sora project at OpenAI.

My name is Bill Peebles. I’m a lead on Sora here at OpenAI.

Tim Brooks:

Tim Brooks. I lead research on the Sora project.

My name is Tim Brooks. I’m also a research lead on Sora.

Aditya Ramesh:

Aditya. Same here, I'm also a lead.

I’m Aditya. I lead the Sora team.

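Host:

I know a bit about Sora, mostly from the launch materials, the website, and the demo videos you released, and it's really impressive. Could you briefly explain how Sora actually works? We've covered DALL-E and diffusion before, but honestly I can't quite figure out the principle behind Sora.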

Okay, so I’ve reacted to Sora. I saw the announcement and the website and all those prompts and example videos that it made that you guys gave, and it was super impressive. Can you give me a super concise breakdown of how exactly it works? Cause we’ve explained DALL-E before and diffusion before, but how does Sora make videos?

Bill Peebles:

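Put simply, Sora is a generative model. Many impressive generative models have appeared in recent years, from language models like the GPT family to image generation models like DALL-E.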

Yeah, at a high level Sora is a generative model, so there have been a lot of very cool generative models over the past few years, ranging from language models like the GPT family to image generation models like DALL-E.

Bill Peebles:

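Sora is a model built specifically for generating video. By analyzing massive amounts of video data, it has learned to generate video of all kinds of real and virtual scenes.

Specifically, it borrows the diffusion-based approach of DALL-E while also using the architecture of the GPT family of language models. In other words, Sora is trained much like DALL-E, but its architecture is closer to the GPT family.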

Sora is a video generation model, and what that means is it looks at a lot of video data and learns to generate photorealistic videos. The exact way it does that kind of draws techniques from both diffusion-based models like DALL-E as well as large language models like the GPT family. It’s kind of like somewhere in between; it’s trained like DALL-E, but architecturally it looks more like the GPT family. But at a high level, it’s just trained to generate videos of the real world and of digital worlds and of all kinds of content.
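
As an editorial illustration of what "trained like DALL-E" can mean (this is not from the interview): a diffusion model is commonly trained by adding noise to a clean sample and asking a network to predict that noise. The sketch below shows this generic noise-prediction objective on a short list of numbers standing in for one flattened patch. The names `add_noise`, `denoising_loss`, and `zero_predictor` are hypothetical, and Sora's actual objective and network have not been disclosed; in a real system the predictor would be a large transformer rather than the stand-in used here.

```python
import math
import random

def add_noise(clean, alpha_bar, rng):
    """Forward diffusion: blend clean values with Gaussian noise.

    alpha_bar in (0, 1] is the share of signal variance that is kept.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(clean, noise)]
    return noisy, noise

def denoising_loss(predict_noise, clean, alpha_bar, rng):
    """Mean squared error between the true noise and the model's guess."""
    noisy, noise = add_noise(clean, alpha_bar, rng)
    guess = predict_noise(noisy, alpha_bar)
    return sum((g - e) ** 2 for g, e in zip(guess, noise)) / len(noise)

def zero_predictor(noisy, alpha_bar):
    """Hypothetical stand-in for the network: always guesses 'no noise'."""
    return [0.0 for _ in noisy]

rng = random.Random(0)
tokens = [0.5, -1.0, 0.25, 0.0]  # assumed stand-in for one flattened patch
loss = denoising_loss(zero_predictor, tokens, alpha_bar=0.7, rng=rng)
print(f"loss of the untrained stand-in: {loss:.3f}")
```

Training would repeat this step over many samples and noise levels, adjusting the predictor to lower the loss.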

Host:

It sounds like Sora, like other generative models, creates content based on its training data. So what is Sora trained on?

It creates a huge variety of stuff, kind of the same way the other models do, based on what it’s trained on. What is Sora trained on?

Tim Brooks:

We can't go into much detail on that 😊

But broadly, it includes publicly available data and data that OpenAI has licensed.

We can’t go into much detail on it, but it’s trained on a combination of data that’s publicly available as well as data that OpenAI has licensed.

Tim Brooks:

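One thing is worth sharing, though:

Previously, image and video models were typically trained at a single fixed size. We trained Sora on videos of different durations, aspect ratios, and resolutions.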

One innovation that we had in creating Sora was enabling it to train on videos at different durations, as well as different aspect ratios and resolutions. And this is something that’s really new. So previously, when you trained an image or video generation model, people would typically train them at a very fixed size like only one resolution, for example.

Tim Brooks:

As for how: we take all kinds of images and videos, whether widescreen, vertical, short, high resolution, or low resolution, and cut them all into small pieces.

But what we do is we take images, as well as videos, of all kinds: wide aspect ratios, tall, long videos, short videos, high resolution, low resolution, and we turn them all into these small pieces we call patches.

Tim Brooks:

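Then we can train the model on different numbers of patches, depending on the size of the input video.

This lets our model learn from a wider variety of data and also generate content at different resolutions and sizes.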

And then we’re able to train on videos with different numbers of patches, depending on the size of the input, and that allows our model to be really versatile to train on a wider variety of data, and also to be used to generate content at different resolutions and sizes.
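
As an editorial illustration of the patch idea described above (this is not from the interview): the sketch below cuts videos of different durations and aspect ratios into fixed-size spacetime patches, so each input yields a different number of patches while every patch has the same length. The function names `patchify` and `blank_video`, the patch sizes, and the nested-list video representation are all assumptions made for the example; Sora's actual patch sizes and preprocessing have not been published in this interview.

```python
from itertools import product

def patchify(video, pt=2, ph=4, pw=4):
    """Split a video into flattened spacetime patches.

    video is a nested list indexed as video[frame][row][col].
    pt, ph, pw are the patch sizes in frames, rows, and columns
    (illustrative values). Edges that do not fill a whole patch are dropped.
    """
    t, h, w = len(video), len(video[0]), len(video[0][0])
    patches = []
    # Walk over the top-left-front corner of every complete patch.
    for t0, y0, x0 in product(range(0, t - pt + 1, pt),
                              range(0, h - ph + 1, ph),
                              range(0, w - pw + 1, pw)):
        patch = [video[t0 + dt][y0 + dy][x0 + dx]
                 for dt, dy, dx in product(range(pt), range(ph), range(pw))]
        patches.append(patch)
    return patches

def blank_video(frames, height, width):
    """Hypothetical helper: an all-zero single-channel video."""
    return [[[0.0] * width for _ in range(height)] for _ in range(frames)]

wide = patchify(blank_video(4, 8, 16))        # short widescreen clip
tall = patchify(blank_video(8, 16, 8))        # longer vertical clip
still = patchify(blank_video(1, 8, 8), pt=1)  # an image as a one-frame video

for name, patches in [("wide", wide), ("tall", tall), ("still", still)]:
    print(name, len(patches), "patches of length", len(patches[0]))
```

Because every input becomes a plain sequence of equal-sized patches, a sequence model can in principle consume clips of any size; only the sequence length changes.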

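Host:

You've been using, building, and developing it for a while now, so maybe you can clear something up for me.

I make videos myself, so I can imagine there's a lot to handle here: lighting, reflections, all kinds of physical motion, moving objects, and so on.

So here's my question: as it stands, what do you think Sora is good at, and where does it still fall short? For example, I saw one video where a hand had six fingers.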

You’ve had access to using it, building it, developing it for some time now. And obviously, there’s a, maybe not obviously, but there’s a ton of variables with video. Like I make videos, I know there are lighting, reflections, you know, all kinds of physics and moving objects and things involved. What have you found that Sora in its current state is good at? And maybe there are things that are specifically weaknesses, like I’ll show the video that I asked for in a second, where there are six fingers on one hand. But what have you seen as the particular strengths and weaknesses of what it’s making?

Tim Brooks:

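Sora is especially good at photorealistic video, and the videos can be long, up to a minute, which is far ahead of what was possible before.

But it still falls short in some areas. As you mentioned, Sora doesn't handle hand details well yet, and its rendering of physics is also lacking. For example, it did not do well in a previously released 3D printer video. It can also struggle in specific cases, such as a requested camera trajectory over time. So Sora still needs to improve on some physical phenomena and on motion or trajectories that unfold over time.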

It definitely excels at photo realism, which is a big step forward. And the fact that the videos can be so long, up to a minute, is really a leap from what was previously possible. But some things it still struggles with. Hands in general are a pain point, as you mentioned, but also some aspects of physics. And like in one of the examples with the 3D printer, you can see it doesn’t quite get that right. And also, if you ask for a really specific example like camera trajectory over time, it has trouble with that. So some aspects of physics and of the motion or trajectories that happen over time, it struggles with.

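Host:

It's really interesting to see Sora do so well in certain areas.

As you mentioned, some videos are very fine in their lighting, reflections, and even close-ups and textures. That reminds me of DALL-E, because you can likewise ask Sora for a style such as shot on 35mm film, or a DSLR look with a blurred background.

But these videos don't have sound yet. I wonder whether adding AI-generated sound to AI-generated video is especially challenging, or much more complicated than I imagine. How far do you think we are from that?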

It’s really interesting to see the stuff it does well, because like you said, there are those examples of really good photorealism with lighting and reflections and even close-ups and textures. And just like DALL-E, you can give it styles like shot in 35mm film or shot, you know, like from a DSLR with a blurry background. There are no sounds in these videos, though. I’m super curious if it would be a gigantic extra lift to add sound to these, or if it’s more complicated than I’m realizing. How far does it feel like you are from being able to also have AI-generated sound in an AI-generated video?

Bill Peebles:

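It's hard to say exactly how long this kind of thing will take; it is less about technical difficulty than about priorities and scheduling.

Our top priority right now is to make the video generation model stronger. After all, earlier AI-generated videos topped out at about four seconds, with poor quality and frame rates. So that is where most of our effort goes for now.

Of course, we also think video would be even better with sound. But for now, Sora is focused mainly on video generation.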

It’s hard to give exact timelines with these kinds of things. For the first version, we were really focused on pushing the capabilities of video generation models forward, because before this, you know, a lot of AI-generated video was like 4 seconds of pretty low frame rate and the quality wasn’t great. So that’s where a lot of our effort so far has been. We definitely agree though that, you know, adding in these other kinds of content would make videos way more immersive. So it’s something that we’re definitely thinking about. But right now, Sora is mainly just a video generation model and we’ve been focused on pushing the capabilities in that domain, for sure.

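Host:

You've put a huge amount of work into Sora, and its progress is plain to see. I'm curious how you decided it was good enough to show the world.

Like DALL-E, it wowed everyone at launch, which must have been a moment to remember. Also, in areas where Sora already does well, how do you decide what to improve next? Are there any criteria or reference points?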

So okay, DALL-E has improved a lot over time. It’s gotten better, it’s improved in a lot of ways and you guys are constantly developing and working towards making Sora better. First of all, how did you get to the point where you’d gotten good enough with it that you knew it was ready to share with the world and we had this mic drop moment? And then how do you know how to keep moving forward and making things that it’s better at?

Tim Brooks:

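You may have noticed that we haven't officially released Sora yet; instead we've been sharing videos through channels such as our blog, Twitter, and TikTok. The main reason is that, before it is truly ready, we want more feedback from users to understand how this technology can bring value to people, and we also need to understand what safety work remains. That will set the direction for our future research.

Sora is not yet mature, and it has not been integrated into ChatGPT or any other platform. We will keep improving it based on the feedback we collect, but the specifics are still to be worked out.

By showing it publicly we hope to get more feedback, for example safety input from safety experts and creative ideas from artists. That will be the focus of our work going forward.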
