See4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting




TL;DR

See4D generates high-quality 4D scenes from a single unposed video by combining depth-guided lifting, spline-based virtual camera planning, and view-conditional diffusion in a temporally consistent, auto-regressive framework.

No Pose? No Problem

  • Given an unposed source video, See4D spline-interpolates a trajectory of virtual camera poses and estimates per-frame depth to lift each frame into 3D.
  • Depth-guided forward warping produces intermediate latents which, together with source latents, are processed by our view-conditional diffusion model.
  • The model is further augmented with spatiotemporal attention and noise-adaptive conditioning in an auto-regressive framework, enabling the synthesis of the desired target-view sequence.
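
The camera-planning step in the first bullet can be illustrated with a small sketch. It interpolates camera positions only, using a uniform Catmull-Rom spline, which is one common choice. The spline family See4D actually uses, the handling of rotations, and the names `catmull_rom` and `plan_camera_path` are assumptions made for illustration, not details taken from this page.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Point on a uniform Catmull-Rom segment between p1 and p2, t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b
               + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (3 * b - a - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def plan_camera_path(key_positions, steps_per_segment):
    """Densely sample a smooth path through hypothetical key camera positions.

    The path passes through every key position. Endpoints are duplicated so
    the first and last segments have the four control points they need.
    """
    if len(key_positions) < 2:
        raise ValueError("need at least two key positions")
    pts = [key_positions[0], *key_positions, key_positions[-1]]
    path = []
    for i in range(1, len(pts) - 2):
        for s in range(steps_per_segment):
            t = s / steps_per_segment
            path.append(catmull_rom(pts[i - 1], pts[i], pts[i + 1], pts[i + 2], t))
    path.append(tuple(float(x) for x in key_positions[-1]))
    return path


# Three illustrative key positions (x, y, z); 4 samples per segment.
keys = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
for position in plan_camera_path(keys, steps_per_segment=4):
    print(position)
```

A full camera pose also needs an orientation; interpolating rotations (for example with quaternion slerp) is left out of this sketch.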

[Figure: source video and its estimated depth]
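
Depth-guided forward warping, as described in the bullets above, can be sketched as follows. This is a minimal pixel-space illustration under stated assumptions: a pinhole camera with shared intrinsics, nearest-pixel splatting, and a z-buffer to resolve collisions. See4D is described as producing intermediate latents, and its actual warping operator is not specified on this page; the function name `forward_warp` and its parameters are hypothetical.

```python
def forward_warp(image, depth, intrinsics, rotation, translation):
    """Splat source pixels into a target view; return (warped, mask).

    image, depth: H x W nested lists (depth > 0, in the source camera frame).
    intrinsics: (fx, fy, cx, cy) of a pinhole camera shared by both views.
    rotation (3x3) and translation (3,) map source-camera coordinates
    to target-camera coordinates.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(image), len(image[0])
    warped = [[None] * width for _ in range(height)]
    z_buffer = [[float("inf")] * width for _ in range(height)]

    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            # Lift the pixel to a 3D point in the source camera frame.
            point = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Move the point into the target camera frame.
            x, y, z_t = (
                sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            if z_t <= 0:
                continue  # behind the target camera
            # Project and snap to the nearest target pixel.
            u_t = round(fx * x / z_t + cx)
            v_t = round(fy * y / z_t + cy)
            inside = 0 <= u_t < width and 0 <= v_t < height
            if inside and z_t < z_buffer[v_t][u_t]:
                z_buffer[v_t][u_t] = z_t  # keep the closest surface
                warped[v_t][u_t] = image[v][u]

    # True where a source pixel landed; False marks holes left for inpainting.
    mask = [[pixel is not None for pixel in row] for row in warped]
    return warped, mask


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
image = [[1, 2, 3], [4, 5, 6]]
depth = [[1.0] * 3 for _ in range(2)]
warped, mask = forward_warp(image, depth, (1.0, 1.0, 0.0, 0.0),
                            identity, (1.0, 0.0, 0.0))
print(warped)
print(mask)
```

In this toy call the translation shifts every pixel one column to the right, so the first column of the target view receives no source pixel. Such holes are what the warp mask records and what the diffusion model is asked to fill.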

Video-to-Video Generation

[Figure: two examples, each showing six warp masks and the six corresponding generated views]


4D Generation


[Videos: five examples, each showing the source video, its estimated depth, three warped videos (Warp Video 1 to 3), and the three corresponding generated videos (Generated Video 1 to 3)]


4D Reconstruction

[Interactive viewer: click and drag to rotate the reconstructed 4D scene]



Applications



  Movie Clips

[Videos: two examples, each showing the source video, its estimated depth, and four generated videos (Generated Video 1 to 4)]


  Robot Grasping

[Videos: two examples, each showing the source video, its estimated depth, and four generated videos (Generated Video 1 to 4)]


  Autonomous Driving

[Videos: two examples, each showing the source video, its estimated depth, and four generated videos (Generated Video 1 to 4)]

Benchmark


4D Reconstruction on iPhone Dataset


Metric: PSNR (↑, the higher the better)

Method             Venue        Apple  Block  Paper   Spin  Teddy    Avg
GCD                ECCV'24       9.82  12.30   9.75  10.37  11.61  10.77
ViewCrafter        arXiv'24     10.19  10.28  10.63  11.15  11.50  10.75
Shape-of-Motion    arXiv'24     11.06  11.72  11.93  11.28  10.42  11.28
DaS                SIGGRAPH'25  10.02  11.64  10.27  11.11  11.82  10.97
ReCamMaster        arXiv'25     10.96  12.67  11.88  12.25  12.37  12.02
TrajectoryCrafter  arXiv'25     13.88  14.21  14.89  14.51  13.73  14.24
See4D              Ours         13.98  14.67  15.24  14.72  14.22  14.56
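
For readers unfamiliar with the metric, PSNR is derived from the mean squared error between a rendered frame and its ground truth. The sketch below assumes images stored as nested lists with values in [0, `max_val`]; the function name `psnr` is illustrative, and the benchmark's exact evaluation protocol (resolution, masking, averaging over frames) is not specified on this page.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    pred_flat = [x for row in pred for x in row]
    target_flat = [x for row in target for x in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("images must have the same number of pixels")
    mse = sum((p - t) ** 2 for p, t in zip(pred_flat, target_flat)) / len(pred_flat)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] image gives an MSE of about 0.01.
print(psnr([[0.5, 0.5]], [[0.6, 0.6]]))
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.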

Metric: SSIM (↑, the higher the better)

Method             Venue        Apple  Block  Paper   Spin  Teddy    Avg
GCD                ECCV'24      0.215  0.458  0.398  0.324  0.385  0.356
ViewCrafter        arXiv'24     0.245  0.427  0.344  0.308  0.372  0.339
Shape-of-Motion    arXiv'24     0.197  0.446  0.425  0.319  0.357  0.349
DaS                SIGGRAPH'25  0.217  0.388  0.356  0.312  0.381  0.331
ReCamMaster        arXiv'25     0.264  0.454  0.471  0.344  0.401  0.387
TrajectoryCrafter  arXiv'25     0.285  0.528  0.482  0.380  0.411  0.417
See4D              Ours         0.309  0.555  0.514  0.399  0.434  0.442

Metric: LPIPS (↓, the lower the better)

Method             Venue        Apple  Block  Paper   Spin  Teddy    Avg
GCD                ECCV'24      0.738  0.590  0.535  0.576  0.629  0.614
ViewCrafter        arXiv'24     0.750  0.615  0.521  0.533  0.606  0.605
Shape-of-Motion    arXiv'24     0.879  0.601  0.486  0.560  0.650  0.635
DaS                SIGGRAPH'25  0.732  0.593  0.520  0.551  0.608  0.601
ReCamMaster        arXiv'25     0.683  0.537  0.491  0.545  0.572  0.566
TrajectoryCrafter  arXiv'25     0.612  0.479  0.471  0.518  0.513  0.519
See4D              Ours         0.581  0.455  0.439  0.501  0.486  0.492


4D Generation on VBench


                   Frame Consistency             Temporal Consistency  Frame Quality
Method             Subj. Consist  Back. Consist  Flick      Smooth     Image Quality  Aesthetic Quality
DaS                89.44          91.69          96.11      95.58      50.64          37.49
ReCamMaster        90.56          93.42          95.11      98.12      52.47          39.65
TrajectoryCrafter  89.61          92.55          92.78      93.49      52.20          37.62
See4D              92.18          94.63          96.66      97.87      53.15          41.35