With this ComfyUI workflow, we can feed in an audio clip (such as a song or speech) and a reference image to generate a high-quality video that is lip-synced to the audio.

All you need is a reference image and a sound file. I have included two versions of the workflow, but I focus mainly on the low-VRAM version, which uses GGUF models and, surprisingly, delivered better quality in my tests.