
Sana: Efficient High-Resolution Image Synthesis Using Linear Diffusion Transformers
Powered by DC-AE with a 32× latent space.
Tips: prompts in English, Chinese, and emojis are supported.
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that compresses images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, along with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is highly competitive with modern giant diffusion models (e.g., Flux-12B), being 20× smaller and 100× faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 image. Sana enables content creation at low cost. Code and models will be publicly released.
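To make the effect of the 32× autoencoder concrete, the sketch below compares latent token counts for 8× and 32× compression factors at the resolutions mentioned above. It assumes one token per latent spatial position (patch size 1), which is an illustrative assumption and may differ from the exact patchification used in Sana.

```python
# Illustrative token-count arithmetic only; assumes one token per latent
# spatial position (patch size 1), which may differ from Sana's exact setup.
def latent_tokens(resolution: int, compression: int) -> int:
    side = resolution // compression   # latent spatial side length
    return side * side                 # one token per latent position

for res in (1024, 4096):
    t8, t32 = latent_tokens(res, 8), latent_tokens(res, 32)
    print(f"{res}x{res}: 8x AE -> {t8} tokens, 32x AE -> {t32} tokens "
          f"({t8 // t32}x fewer)")
```

Under this assumption, a 1024×1024 image yields 16,384 tokens with an 8× AE but only 1,024 tokens with a 32× AE, a 16× reduction that carries over to 4096×4096.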
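The Linear DiT point rests on replacing softmax attention, whose cost grows quadratically with the token count N, with a linear-attention form whose cost grows linearly in N. Below is a minimal PyTorch sketch of a generic ReLU-based linear attention next to vanilla attention; it illustrates the complexity difference and is not presented as Sana's exact implementation.

```python
import torch
import torch.nn.functional as F

def vanilla_attention(q, k, v):
    # q, k, v: (batch, tokens, dim); the N x N attention matrix makes
    # compute and memory scale as O(N^2).
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic ReLU feature-map linear attention (illustrative formulation).
    # The (dim x dim) summary k^T v is aggregated first, so cost is O(N * dim^2).
    q, k = F.relu(q), F.relu(k)
    kv = k.transpose(-2, -1) @ v                                   # (batch, dim, dim)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps    # (batch, tokens, 1)
    return (q @ kv) / z
```

Because the k-v summary is independent of the number of tokens, the per-token cost stays roughly constant as resolution, and therefore token count, grows, which is why linear attention pays off most at high resolutions.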