Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis.
In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions describing the specific camera movements, the interaction with the scene, and the directorial intent.
Building on this comprehensive and diverse dataset, we further train GenDoP, an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation conditioned on text guidance and RGBD inputs. Extensive experiments demonstrate that, compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our code and data will be publicly available.
Overview. Top: DataDoP data construction. Given RGB video frames, we extract RGBD images and camera poses, then tag the pose sequence with different motion categories (shown in different colors). With an LLM, we generate two types of captions from the motion tags and RGBD inputs: the Motion Caption describes the camera movements, while the Directorial Caption describes the camera movements along with their interaction with the scene and the directorial intent. Bottom: Our GenDoP method supports multi-modal inputs for trajectory creation. The generated camera sequence can be readily applied to various video generation tasks, including text-to-video (T2V) and image-to-video (I2V) generation. GenDoP paves the way for future advancements in camera-controlled video generation.
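To illustrate the tagging step, the sketch below derives per-frame motion tags from consecutive camera poses. It is a hedged approximation rather than the exact DataDoP labeling rule: the pose convention (camera-to-world matrices with x right, y down, z forward), the thresholds `t_thresh` and `r_thresh`, and the tag names are assumptions made for illustration.

```python
# Hedged sketch of per-frame motion tagging from camera poses (not the exact
# DataDoP rule). Assumption: camera-to-world 4x4 poses; consecutive poses are
# compared, and dominant translation/rotation components above a threshold
# become tags.
import numpy as np

def motion_tags(poses, t_thresh=0.01, r_thresh=0.5):
    """poses: (N, 4, 4) camera-to-world matrices. Returns a list of tag sets,
    e.g. {'forward', 'yaw_left'}; an empty set is reported as {'static'}."""
    tags = []
    for prev, cur in zip(poses[:-1], poses[1:]):
        rel = np.linalg.inv(prev) @ cur          # motion expressed in the previous camera frame
        t = rel[:3, 3]                           # relative translation (x right, y down, z forward)
        yaw = np.degrees(np.arctan2(rel[0, 2], rel[2, 2]))  # rotation about the vertical axis
        frame_tags = set()
        if t[2] > t_thresh:  frame_tags.add('forward')
        if t[2] < -t_thresh: frame_tags.add('backward')
        if t[0] > t_thresh:  frame_tags.add('right')
        if t[0] < -t_thresh: frame_tags.add('left')
        if yaw > r_thresh:   frame_tags.add('yaw_right')   # sign convention is an assumption
        if yaw < -r_thresh:  frame_tags.add('yaw_left')
        tags.append(frame_tags if frame_tags else {'static'})
    return tags
```

Runs of identical tag sets can then be merged into the colored motion segments shown in the overview figure and passed, together with the RGBD frames, to the LLM captioning step.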
| Dataset | Traj Type | Domain | #Vocab | #Sample | #Frame | #Avg (s) |
| --- | --- | --- | --- | --- | --- | --- |
| MVImgNet | Object/Scene-Centric | Captured | - | 22K | 6.5M | 10 |
| RealEstate10K | Object/Scene-Centric | YouTube | - | 79K | 11M | 5.5 |
| DL3DV-10K | Object/Scene-Centric | Captured | - | 10K | 51M | 85 |
| CCD | Tracking | Synthetic | 48 | 25K | 4.5M | 7.2 |
| E.T. | Tracking | Film | 1790 | 115K | 11M | 3.8 |
| DataDoP (Ours) | Free-Moving | Film | 8698 | 29K | 11M | 14.4 |
DataDoP Statistics. We compare the DataDoP dataset to other datasets containing camera trajectories. DataDoP is a large dataset focusing on artistic, free-moving trajectories, each accompanied by high-quality caption annotations. The provided captions detail the camera movements, their interactions with scene content, and the underlying directorial intent. To capture more intricate camera movements, each video clip spans 10-20 seconds, averaging 14.4 seconds.
Motion Caption: The camera moves forward and remains static for the rest of the duration.
Directorial Caption: The camera moves forward, then remains static, focusing closely on the subject's expressions and gestures. This highlights their interactions without any background distractions.
Motion Caption: The camera moves left while yawing left, continues moving left, and then moves left while yawing right.
Directorial Caption: The camera moves left while yawing left, showcasing the desert landscape and hills. It continues left, then shifts to yaw right, highlighting the terrain's contours and depth.
DataDoP Cases. We showcase several representative cases from our dataset. Each case includes the original video frames, camera trajectory visualization, and corresponding motion and directorial captions.
Our Auto-regressive Generation Model. Our model accepts multi-modal inputs and generates trajectories conditioned on them. By formulating the task as auto-regressive next-token prediction, the model generates the trajectory sequentially, with each new pose prediction conditioned on the previously generated camera states and the input conditions.
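To make the formulation concrete, below is a minimal sketch of such a decoder-only trajectory generator in PyTorch. It is not the GenDoP implementation: the pose tokenization (a fixed number of discrete tokens per pose), the vocabulary size, the BOS token id, and the way text/RGBD conditions are prepended as embedding vectors are all assumptions made for illustration.

```python
# Minimal sketch of auto-regressive camera-trajectory generation (assumed
# design, not the official GenDoP code). Each camera pose is quantized into
# `tokens_per_pose` discrete tokens; condition embeddings (text/RGBD) are
# prepended to the token sequence.
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8,
                 tokens_per_pose=8, max_poses=120):
        super().__init__()
        self.tokens_per_pose = tokens_per_pose
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_poses * tokens_per_pose + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # used with a causal mask
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, cond_emb, pose_tokens):
        """cond_emb: (B, Lc, D) condition embeddings (e.g., text/RGBD features).
        pose_tokens: (B, Lt) discrete pose tokens generated so far."""
        tok = self.token_emb(pose_tokens)
        pos = self.pos_emb(torch.arange(tok.size(1), device=tok.device))
        x = torch.cat([cond_emb, tok + pos], dim=1)
        # Causal mask: each position attends only to the conditions and earlier tokens.
        L = x.size(1)
        mask = torch.triu(torch.ones(L, L, device=x.device, dtype=torch.bool), 1)
        h = self.decoder(x, mask=mask)
        return self.head(h[:, cond_emb.size(1):])  # logits at pose-token positions

    @torch.no_grad()
    def generate(self, cond_emb, n_poses, temperature=1.0):
        # BOS token id 0 is an assumption for this sketch.
        tokens = torch.zeros(cond_emb.size(0), 1, dtype=torch.long,
                             device=cond_emb.device)
        for _ in range(n_poses * self.tokens_per_pose):
            logits = self.forward(cond_emb, tokens)[:, -1] / temperature
            nxt = torch.multinomial(logits.softmax(-1), 1)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens[:, 1:]  # drop BOS; detokenize into camera poses downstream
```

Training such a model would follow the standard language-model recipe: cross-entropy between the predicted logits and the next ground-truth pose token, with the condition embeddings fixed in front of the sequence.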
Additional Results. We demonstrate how TrajectoryCrafter applies our generated trajectories to camera-controlled video generation, ensuring the resulting sequences align with the provided camera descriptions and motion criteria.
@misc{zhang2025gendopautoregressivecameratrajectory,
title={GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography},
author={Mengchen Zhang and Tong Wu and Jing Tan and Ziwei Liu and Gordon Wetzstein and Dahua Lin},
year={2025},
eprint={2504.07083},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.07083},
}