TY - GEN
T1 - Towards Language-Free Training for Text-to-Image Generation
AU - Zhou, Yufan
AU - Zhang, Ruiyi
AU - Chen, Changyou
AU - Li, Chunyuan
AU - Tensmeyer, Chris
AU - Yu, Tong
AU - Gu, Jiuxiang
AU - Xu, Jinhui
AU - Sun, Tong
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - One of the major challenges in training text-to-image generation models is the need of a large number of highquality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time- and cost-consuming. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied in fine-tuning pretrained models, which saves both training time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with around only 1% of the model size and training data size relative to the recently proposed large DALL-E model.
AB - One of the major challenges in training text-to-image generation models is the need of a large number of highquality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time- and cost-consuming. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied in fine-tuning pretrained models, which saves both training time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with around only 1% of the model size and training data size relative to the recently proposed large DALL-E model.
KW - Image and video synthesis and generation
KW - Vision + language
UR - https://www.scopus.com/pages/publications/85137221576
U2 - 10.1109/CVPR52688.2022.01738
DO - 10.1109/CVPR52688.2022.01738
M3 - Conference contribution
AN - SCOPUS:85137221576
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 17886
EP - 17896
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Y2 - 19 June 2022 through 24 June 2022
ER -