See4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Dongyue Lu^1* Ao Liang^1* Tianxin Huang² Xiao Fu³ Yuyang Zhao¹ Baorui Ma⁴
Liang Pan⁵ Wei Yin⁶ Lingdong Kong^1,† Wei Tsang Ooi¹ Ziwei Liu⁷

¹NUS ²HKU ³CUHK ⁴THU ⁵Shanghai AI Laboratory ⁶Horizon Robotics ⁷NTU

TL;DR

See4D enables high-quality 4D scene generation from a single unposed video via a trajectory-to-camera formulation, leveraging depth-warped conditioning and spatio-temporal autoregression to produce coherent 4D scenes.

From Casual Video to 4D

Given an unposed source video, See4D spline-interpolates a trajectory of virtual camera poses and estimate per-frame depth to lift each frame into 3D.
Depth-guided forward warping produces intermediate latents which, together with source latents, are processed by our view-conditional diffusion model.
The model is further augmented with spatiotemporal attention and noise-adaptive conditioning in an auto-regressive framework, enabling the synthesis of the desired target-view sequence.

Source Video

Depth

Video-to-Video Generation

Warp Mask 1

Warp Mask 2

Warp Mask 3

Warp Mask 4

Warp Mask 5

Warp Mask 6

Gen View 1

Gen View 2

Gen View 3

Gen View 4

Gen View 5

Gen View 6

Warp Mask 1

Warp Mask 2

Warp Mask 3

Warp Mask 4

Warp Mask 5

Gen View 1

Gen View 2

Gen View 3

Gen View 4

Gen View 5

4D Generation

Source Video

Warp Video 1

Warp Video 2

Warp Video 3

Depth

Generated Video 1

Generated Video 2

Generated Video 3

Source Video

Warp Video 1

Warp Video 2

Warp Video 3

Depth

Generated Video 1

Generated Video 2

Generated Video 3

Source Video

Warp Video 1

Warp Video 2

Warp Video 3

Depth

Generated Video 1

Generated Video 2

Generated Video 3

Source Video

Warp Video 1

Warp Video 2

Warp Video 3

Depth

Generated Video 1

Generated Video 2

Generated Video 3

Source Video

Warp Video 1

Warp Video 2

Warp Video 3

Depth

Generated Video 1

Generated Video 2

Generated Video 3

4D Reconstruction

Click & Drag

Applications

Movie Clips

Source Video

Generated Video 1

Generated Video 3

Depth

Generated Video 2

Generated Video 4

Source Video

Generated Video 1

Generated Video 3

Depth

Generated Video 2

Generated Video 4

Robot Grasping

Source Video

Generated Video 1

Generated Video 3

Depth

Generated Video 2

Generated Video 4

Source Video

Generated Video 1

Generated Video 3

Depth

Generated Video 2

Generated Video 4

Autonomous Driving

Source Video

Generated Video 1

Generated Video 3

Depth

Generated Video 2

Generated Video 4

Source Video

Generated Video 1

Generated Video 3

Depth

Generated Video 2

Generated Video 4

Benchmark

4D Reconstruction on iPhone Dataset

Metric: PSNR (↑, the higher the better)

Method	Venue	Apple	Block	Paper	Spin	Teddy	Avg
GCD	ECCV'24	9.82	12.30	9.75	10.37	11.61	10.77
ViewCrafter	arXiv'24	10.19	10.28	10.63	11.15	11.50	10.75
Shape-of-Motion	arXiv'24	11.06	11.72	11.93	11.28	10.42	11.28
DaS	SIGGRAPH'25	10.02	11.64	10.27	11.11	11.82	10.97
ReCamMaster	arXiv'25	10.96	12.67	11.88	12.25	12.37	12.02
TrajectoryCrafter	arXiv'25	13.88	14.21	14.89	14.51	13.73	14.24
See4D	Ours	13.98	14.67	15.24	14.72	14.22	14.56

Metric: SSIM (↑, the higher the better)

Method	Venue	Apple	Block	Paper	Spin	Teddy	Avg
GCD	ECCV'24	0.215	0.458	0.398	0.324	0.385	0.356
ViewCrafter	arXiv'24	0.245	0.427	0.344	0.308	0.372	0.339
Shape-of-Motion	arXiv'24	0.197	0.446	0.425	0.319	0.357	0.349
DaS	SIGGRAPH'25	0.217	0.388	0.356	0.312	0.381	0.331
ReCamMaster	arXiv'25	0.264	0.454	0.471	0.344	0.401	0.387
TrajectoryCrafter	arXiv'25	0.285	0.528	0.482	0.380	0.411	0.417
See4D	Ours	0.309	0.555	0.514	0.399	0.434	0.442

Metric: LPIPS (↓, the lower the better)

Method	Venue	Apple	Block	Paper	Spin	Teddy	Avg
GCD	ECCV'24	0.738	0.590	0.535	0.576	0.629	0.614
ViewCrafter	arXiv'24	0.750	0.615	0.521	0.533	0.606	0.605
Shape-of-Motion	arXiv'24	0.879	0.601	0.486	0.560	0.650	0.635
DaS	SIGGRAPH'25	0.732	0.593	0.520	0.551	0.608	0.601
ReCamMaster	arXiv'25	0.683	0.537	0.491	0.545	0.572	0.566
TrajectoryCrafter	arXiv'25	0.612	0.479	0.471	0.518	0.513	0.519
See4D	Ours	0.581	0.455	0.439	0.501	0.486	0.492

4D Generation on VBench

Method	Frame Consistency		Temporal Consistency		Frame Quality
Method	Subj. Consist	Back. Consist	Flick	Smooth	Image Quality	Aesthetic Quality
DaS	89.44	91.69	96.11	95.58	50.64	37.49
ReCamMaster	90.56	93.42	95.11	98.12	52.47	39.65
TrajectoryCrafter	89.61	92.55	92.78	93.49	52.20	37.62
See4D	92.18	94.63	96.66	97.87	53.15	41.35