Notes on video models:
We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.; they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.
The competition has gone wild: within barely a dozen hours, OpenAI and Google released blockbuster results back to back, and those in China still awake went through a rollercoaster of a night. OpenAI has just unveiled its first text-to-video model, Sora. In short, AI video is about to change completely. Sora can not only create scenes that are both realistic and imaginative from text instructions, it can also generate videos up to a full minute long, in a single continuous shot. AI video tools such as Runway Gen 2 and Pika are still struggling to maintain coherence beyond a few seconds, while OpenAI has set an epoch-making record: across a 60-second single take, the protagonist and the background characters remain strikingly consistent, and subjects stay remarkably stable through arbitrary camera moves.
Generating videos. The model is adapted from ControlNet, with three new mechanisms:

1. Cross-frame attention: adds fully cross-frame interaction in the self-attention modules. It introduces interaction between all frames by mapping the latent frames at all timesteps into the Q, K, V matrices, unlike Text2Video-Zero, where every frame attends only to the first frame.
2. The interleaved-frame smoother mechanism reduces the flickering effect by applying frame interpolation on alternated frames. At each timestep t, the smoother interpolates the even or odd frames to smooth their corresponding three-frame clips. Note that the number of frames decreases over time after the smoothing steps.
3. The hierarchical sampler enables long videos with temporal consistency under memory constraints. A long video is split into multiple short clips, and a keyframe is selected from each. The model pre-generates these keyframes with fully cross-frame attention for long-term coherency, and each corresponding short clip is synthesized sequentially, conditioned on those keyframes.

Figure 15: Overview of ControlVideo. Source: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
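To make mechanisms 1 and 2 concrete, here is a minimal NumPy sketch. It is an illustration only, not ControlVideo's actual implementation: the function names, tensor shapes, single-head attention, and the neighbour-averaging form of interpolation are all assumptions.

```python
import numpy as np

def full_cross_frame_attention(frames, W_q, W_k, W_v):
    """Sketch of fully cross-frame self-attention (assumed single head).

    frames: (T, N, D) latent frames: T frames, N tokens per frame, D channels.
    All T*N tokens attend to one another, unlike Text2Video-Zero,
    where every frame attends only to the first frame.
    """
    T, N, D = frames.shape
    tokens = frames.reshape(T * N, D)           # flatten all frames into one sequence
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = Q @ K.T / np.sqrt(D)               # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return (weights @ V).reshape(T, N, D)

def interleaved_frame_smoother(frames, t):
    """Sketch of the interleaved-frame smoother: depending on the parity of
    timestep t, replace even or odd frames with the average of their two
    neighbours (a simple stand-in for the paper's frame interpolation)."""
    frames = frames.copy()
    start = 1 if t % 2 == 0 else 2              # alternate parity across timesteps
    for i in range(start, len(frames) - 1, 2):
        frames[i] = 0.5 * (frames[i - 1] + frames[i + 1])
    return frames
```

The key point of the first function is that attention is computed over the flattened (T*N)-token sequence, so every frame can exchange information with every other frame; the smoother then suppresses flicker by pulling alternated frames toward their temporal neighbours.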