https://github.com/bytedance/DreamFit


Diffusion models for garment-centric human generation from text or image prompts have attracted considerable attention for their immense application potential. However, existing methods face a dilemma: lightweight approaches, such as adapters, tend to produce inconsistent garment textures, while fine-tuning-based methods are costly to train and struggle to preserve the generalization ability of the pre-trained diffusion model, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything Dressing Encoder specifically tailored for garment-centric human generation.

DreamFit has three main advantages:

- Lightweight training: the proposed adaptive attention and LoRA modules reduce DreamFit to only 83.4M trainable parameters (see the sketch after this list).
- Anything dressing: the model handles a wide variety of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios.
- Plug-and-play: DreamFit is designed to integrate seamlessly with any community control plugin for diffusion models, ensuring easy compatibility and minimizing adoption barriers.
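As a rough illustration of the lightweight-training idea, the following PyTorch sketch wraps a frozen attention projection with a generic LoRA-style low-rank adapter so that only a small number of parameters are trained. This is a minimal sketch of the general technique, not DreamFit's actual Anything Dressing Encoder or adaptive attention implementation; the rank, dimensions, and class names are assumptions.

```python
# Minimal LoRA-style adapter sketch (illustrative, not DreamFit's exact design).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: d -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: r -> d
        nn.init.zeros_(self.up.weight)       # start as an identity-preserving update
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage: wrap a pre-trained query/key/value projection so that only the
# low-rank adapter (a small fraction of the parameters) receives gradients.
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj, rank=16)
y = lora_proj(torch.randn(2, 77, 768))
print(sum(p.numel() for p in lora_proj.parameters() if p.requires_grad))  # trainable params only
```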
To further enhance generation quality, DreamFit leverages a pre-trained large multimodal model (LMM) to enrich prompts with fine-grained clothing descriptions, narrowing the prompt gap between training and inference. We conducted comprehensive experiments on 768 × 512 high-resolution benchmarks and in-the-wild images. DreamFit outperforms all existing methods, demonstrating its leading advantage in garment-centric human generation.
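As a hedged sketch of what such prompt enrichment could look like at inference time, the snippet below appends an LMM-generated garment caption to the user prompt. The function names, question template, and interface are illustrative assumptions, not DreamFit's released API.

```python
# Illustrative sketch of LMM-based prompt enrichment (assumed interface):
# append a fine-grained garment description to the user prompt so that
# inference prompts resemble the detailed captions used in training.
from typing import Callable
from PIL import Image

def enrich_prompt(user_prompt: str,
                  garment_image: Image.Image,
                  caption_garment: Callable[[Image.Image, str], str]) -> str:
    """Concatenate the user prompt with an LMM-generated clothing description."""
    question = "Describe this garment's color, material, pattern, and cut in one sentence."
    garment_desc = caption_garment(garment_image, question)  # hypothetical LMM call
    return f"{user_prompt}, wearing {garment_desc}"

# Usage (with any image question-answering backend plugged in as caption_garment):
# prompt = enrich_prompt("a woman walking on a beach",
#                        Image.open("dress.png"),
#                        my_lmm_caption)
```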