Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

1Yandex Research        2HSE University        3MIPT        4Skoltech

Abstract

This work presents Switti, a scale-wise transformer for text-to-image (T2I) generation. Starting from existing next-scale prediction autoregressive (AR) models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that the self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart that enables ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
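To make the guidance-disabling idea concrete, here is a minimal, illustrative sketch of scale-wise sampling where classifier-free guidance is applied only at low-resolution scales. The `model` callable, the embeddings, the scale schedule, and the cutoff value are hypothetical placeholders for illustration, not the released Switti API:

```python
def sample_scalewise(model, text_emb, null_emb, scales=(1, 2, 4, 8, 16),
                     cfg_scale=6.0, cfg_cutoff=8):
    """Toy scale-wise sampler: predict one token map per scale, coarse to fine.

    Guidance is applied only while `res <= cfg_cutoff`; at higher scales the
    unconditional forward pass is skipped entirely, which is where the ~20%
    sampling speedup in the abstract comes from.
    """
    tokens = []  # token maps predicted so far, coarsest first
    for res in scales:
        cond = model(tokens, text_emb, res)            # conditional logits
        if res <= cfg_cutoff and cfg_scale > 1.0:
            uncond = model(tokens, null_emb, res)      # unconditional logits
            # classifier-free guidance: push logits away from the unconditional
            logits = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
        else:
            logits = cond                              # no extra forward pass
        # greedy choice stands in for sampling the next-scale token map
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens
```

In practice each "token" would be a full 2D map of discrete codes sampled from the logits, but the control flow — one model call per scale, plus a second call only at guided scales — is the relevant structure.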

Human evaluation


Switti vs. competing AR and diffusion-based models.



Inference performance evaluation


Comparison of models’ 512×512 image generation time.



Automated Metrics


Quantitative comparison of Switti to competing AR and diffusion-based models.

The best model is highlighted in red, the second-best in blue, and the third-best in yellow according to the respective automated metric.

BibTeX


      @article{voronov2024switti,
        title={Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis}, 
        author={Anton Voronov and Denis Kuznedelev and Mikhail Khoroshikh and Valentin Khrulkov and Dmitry Baranchuk},
        year={2024},
        eprint={2412.01819},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.01819}
      }