Note: inference can easily exhaust GPU video memory (VRAM).

Tora is a trajectory-oriented DiT framework that jointly integrates text, vision, and trajectory conditions to generate videos. Specifically, Tora consists of a trajectory extractor (TE), a spatiotemporal DiT, and a motion-guided fusion module (MGF). The TE encodes arbitrary trajectories into hierarchical spatiotemporal motion blocks using a 3D video-compression network; the MGF injects these motion blocks into the DiT blocks to generate consistent videos that follow the trajectory. The design fits seamlessly with the scalability of DiT, allowing precise control over the dynamics of video content across different durations, aspect ratios, and resolutions. Extensive experiments demonstrate that Tora achieves high motion fidelity while closely simulating the motion of the physical world.
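To make the pipeline concrete, here is a minimal NumPy sketch of the idea: a trajectory is rasterized into a spatiotemporal volume, downsampled into a coarser "motion block" (standing in for the 3D compression network), and then added into DiT hidden states (standing in for the fusion step). All function names, shapes, and the additive fusion are illustrative assumptions, not Tora's actual implementation.

```python
import numpy as np

def trajectory_to_motion_volume(points, frames=8, height=16, width=16):
    # Rasterize a 2D trajectory (one (x, y) in [0, 1] per frame) into
    # a sparse spatiotemporal volume - a toy stand-in for the input
    # that a trajectory extractor would consume.
    vol = np.zeros((frames, height, width), dtype=np.float32)
    for t, (x, y) in enumerate(points[:frames]):
        vol[t, int(y * (height - 1)), int(x * (width - 1))] = 1.0
    return vol

def compress_3d(vol, factor=2):
    # Stand-in for the 3D video-compression network: average-pool
    # jointly over time, height, and width to get a coarser block.
    f, h, w = vol.shape
    return vol.reshape(f // factor, factor, h // factor, factor,
                       w // factor, factor).mean(axis=(1, 3, 5))

def motion_guided_fusion(dit_hidden, motion_block, scale=1.0):
    # Toy fusion: add the motion signal to the DiT hidden states.
    # (The real MGF conditions DiT blocks in a learned way.)
    return dit_hidden + scale * motion_block

traj = [(t / 7.0, t / 7.0) for t in range(8)]   # diagonal path
vol = trajectory_to_motion_volume(traj)          # fine level: (8, 16, 16)
coarse = compress_3d(vol)                        # coarse level: (4, 8, 8)
hidden = np.zeros_like(coarse)                   # pretend DiT activations
fused = motion_guided_fusion(hidden, coarse)
print(vol.shape, coarse.shape, fused.shape)      # (8, 16, 16) (4, 8, 8) (4, 8, 8)
```

The hierarchy here is just two levels (fine and coarse); the real system learns multi-scale motion features end to end rather than pooling.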

https://github.com/alibaba/Tora?tab=readme-ov-file#-inference