
MMAudio can generate synchronized audio based on video and/or text inputs. Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets. Additionally, a synchronization module aligns the generated audio with the video frames.