FantasyTalking, a project jointly launched by Alibaba and Beijing University of Posts and Telecommunications, is another major breakthrough in digital human technology. From just one ID photo, it can generate digital human videos with vivid expressions and natural movements.



Three major innovative modules
Audio-visual alignment strategy: Captures the global correlation between the audio track and facial expressions, body movements, and background dynamics
Facial cross-attention: Locks onto facial features using only 3% of the model's parameters; identity drift over a 10-minute video stays below 0.3% (see the sketch after this list)
Motion intensity modulation network: Independently controls 22 facial/body amplitude parameters (e.g., eyebrow height, shoulder swing frequency)
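The article does not reproduce FantasyTalking's actual architecture, but the cross-attention and intensity-modulation ideas above can be illustrated with a minimal PyTorch sketch. Everything below is an assumption: the class names, the dimensions (`latent_dim`, `id_dim`), and the sigmoid-gain mapping in `MotionIntensityModulator` are illustrative, not the project's real layers.

```python
import torch
import torch.nn as nn

class FaceCrossAttention(nn.Module):
    """Hypothetical sketch: video latents attend to identity (face) embeddings,
    re-injecting the reference face at each block. Not FantasyTalking's code."""

    def __init__(self, latent_dim: int = 1024, id_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=id_dim, vdim=id_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, video_tokens: torch.Tensor, id_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from video latents (B, T, latent_dim); keys/values come
        # from the identity embedding (B, N, id_dim).
        attended, _ = self.attn(video_tokens, id_tokens, id_tokens)
        return self.norm(video_tokens + attended)


class MotionIntensityModulator(nn.Module):
    """Hypothetical sketch: map per-channel intensity controls (the article
    mentions 22 channels, e.g., eyebrow height, shoulder swing) to a gain
    applied to motion features. The mapping itself is an assumption."""

    def __init__(self, num_channels: int = 22, feat_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(num_channels, feat_dim)

    def forward(self, motion_feats: torch.Tensor, intensity: torch.Tensor) -> torch.Tensor:
        # intensity: (B, 22) in [0, 1]; broadcast a per-feature gain over time.
        gain = torch.sigmoid(self.proj(intensity)).unsqueeze(1)
        return motion_feats * gain
```

A dedicated cross-attention layer is parameter-light because only the new query/key/value projections are added on top of the frozen backbone, which is consistent with the roughly 3% parameter overhead reported above.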
Breakthrough in generation effects
Supports 9 generation modes spanning close-up/half-body/full-body framing, front/side views, and dynamic backgrounds
Covers multiple styles including real person, cartoon, and animal, with lip-sync error under 40 ms (a rough way to estimate such an offset is sketched after this list)
360° surround-view generation with realistic details such as fluttering hair and neck wrinkles
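For context on the sub-40 ms claim, here is one rough, assumption-laden way to estimate audio-visual offset: cross-correlate per-frame audio energy with a mouth-openness signal extracted from the video. The function name and both input signals are hypothetical; the article does not describe the project's actual evaluation protocol (which is likely SyncNet-style).

```python
import numpy as np

def estimate_lipsync_offset_ms(audio_energy: np.ndarray,
                               mouth_openness: np.ndarray,
                               fps: float = 25.0,
                               max_lag_frames: int = 10) -> float:
    """Hypothetical sketch: find the lag (in frames) that best aligns
    per-frame audio energy with mouth openness, then convert to ms.
    Both signals are assumed to be sampled at the video frame rate."""
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    m = (mouth_openness - mouth_openness.mean()) / (mouth_openness.std() + 1e-8)
    lags = list(range(-max_lag_frames, max_lag_frames + 1))
    scores = []
    for lag in lags:
        if lag >= 0:
            s = float(np.dot(a[lag:], m[:len(m) - lag]))
        else:
            s = float(np.dot(a[:lag], m[-lag:]))
        scores.append(s)
    best_lag = lags[int(np.argmax(scores))]
    # Note: at 25 fps this only resolves offsets in 40 ms steps; verifying
    # a sub-40 ms claim would need finer temporal resolution.
    return best_lag * 1000.0 / fps
```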
Performance comparison advantages
In a benchmark comparison against OmniHuman-1, it leads in motion continuity (CIDEr ↑18%) and identity preservation (SSIM ↑23%); an SSIM-based identity check is sketched below
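To ground the SSIM figure, here is a small example of how an SSIM-based identity-preservation score might be computed with scikit-image. The function name, the face-crop inputs, and the frame-sampling choices are assumptions, since the article does not specify the evaluation setup.

```python
import numpy as np
from skimage.metrics import structural_similarity

def identity_ssim(reference_face: np.ndarray,
                  generated_faces: list[np.ndarray]) -> float:
    """Hypothetical sketch: score identity preservation as the mean SSIM
    between a reference face crop and face crops from generated frames.
    Assumes uint8 RGB arrays of identical shape (H, W, 3)."""
    scores = [
        structural_similarity(reference_face, frame,
                              channel_axis=-1, data_range=255)
        for frame in generated_faces
    ]
    return float(np.mean(scores))
```

Higher mean SSIM across sampled frames would indicate the generated face stays closer to the reference photo over the course of the video.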



Model download link: https://pan.quark.cn/s/184684a6d030