
Towards Language-Free Training for Text-to-Image Generation

  • Yufan Zhou
  • Ruiyi Zhang
  • Changyou Chen
  • Chunyuan Li
  • Chris Tensmeyer
  • Tong Yu
  • Jiuxiang Gu
  • Jinhui Xu
  • Tong Sun
  • SUNY Buffalo
  • Adobe Research
  • Microsoft USA

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

176 Scopus citations

Abstract

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data size relative to the recently proposed large DALL-E model.
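
The abstract states that text features are generated from image features in CLIP's joint space, but it does not say how. The sketch below illustrates one plausible scheme under stated assumptions: a unit-norm "pseudo text feature" is obtained by adding a small, norm-scaled Gaussian perturbation to an image feature. The function names, the `noise_level` parameter, and the perturbation rule itself are illustrative assumptions, and a random vector stands in for a real CLIP image embedding, so this should not be read as the authors' implementation.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Return a unit-norm stand-in for a text feature (hypothetical helper).

    Assumption: a text feature for an image lies close to the image
    feature in the joint space, so the image feature is perturbed with
    Gaussian noise whose length is noise_level times the feature's length.
    """
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in image_feature]
    noise_norm = math.sqrt(sum(n * n for n in noise))
    feat_norm = math.sqrt(sum(v * v for v in image_feature))
    # Rescale the noise so its length is noise_level * ||image_feature||.
    scale = noise_level * feat_norm / noise_norm
    perturbed = [v + scale * n for v, n in zip(image_feature, noise)]
    return l2_normalize(perturbed)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Stand-in for a CLIP image embedding (a real pipeline would call the encoder).
rng = random.Random(0)
image_feature = [rng.gauss(0.0, 1.0) for _ in range(512)]
text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=rng)
similarity = cosine_similarity(image_feature, text_like)
```

In a training loop, a vector like `text_like` would take the place of the caption embedding as the conditioning input, which is why no text data would be needed. A larger `noise_level` moves the pseudo feature further from the image feature.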

Original language: English
Title of host publication: Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher: IEEE Computer Society
Pages: 17886-17896
Number of pages: 11
ISBN (Electronic): 9781665469463
State: Published - 2022
Event: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: Jun 19 2022 - Jun 24 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Conference

Conference: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory: United States
City: New Orleans
Period: 06/19/22 - 06/24/22

Keywords

  • Image and video synthesis and generation
  • Vision + language
