Abstract: Although text-to-image models excellently create realistic images from text, they struggle with long-form text due to token limits in the pretrained text encoder. In this paper, we propose ...