All models are trained on ImageNet at an input resolution of 256x256. Each model downsamples the image by a factor of 16 to a 16x16 spatial grid, giving a latent representation of 16x16xK bits per image.
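As a rough illustration of the bit budget described above, the sketch below computes the latent size for a given K. The function name `latent_bits`, the example value K = 18, and the comparison against an uncompressed 8-bit RGB image are assumptions made for illustration; the source does not state them.

```python
def latent_bits(image_size: int = 256, latent_size: int = 16, k_bits: int = 18) -> dict:
    """Bit budget of a latent_size x latent_size x K-bit representation.

    k_bits is a hypothetical value of K chosen only for this example.
    """
    if image_size % latent_size != 0:
        raise ValueError("image_size must be a multiple of latent_size")

    downsample_factor = image_size // latent_size   # 256 // 16 = 16
    positions = latent_size * latent_size           # 16 * 16 = 256 spatial positions
    total_bits = positions * k_bits                 # 16 x 16 x K bits per image

    # Assumption: the raw image is 3-channel RGB at 8 bits per channel.
    raw_bits = image_size * image_size * 3 * 8

    return {
        "downsample_factor": downsample_factor,
        "positions": positions,
        "total_bits": total_bits,
        "raw_bits": raw_bits,
        "compression_ratio": raw_bits / total_bits,
    }


stats = latent_bits()
print(stats)
```

With the assumed K = 18, the latent occupies 256 positions x 18 bits = 4608 bits (576 bytes) per image; other values of K scale this linearly.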
Dynamic videos are available for viewing on our project page. Measured at 256x256 resolution with 16 frames, 25 DDIM steps, text-image CFG 7.5, and camera CFG 1.0. Compared with CamCo-like method (Plucker ...