TEXT-IMAGE DE-CONTEXTUALIZATION DETECTION USING VISION-LANGUAGE MODELS

  • Mingzhen Huang
  • Shan Jia
  • Ming Ching Chang
  • Siwei Lyu
  • SUNY Buffalo
  • SUNY Albany

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Scopus citations

Abstract

Text-image de-contextualization, which uses inconsistent image-text pairs, is an emerging form of misinformation that is drawing increasing attention because of the serious threat it poses to information authenticity. Because the content is real but semantically mismatched across modalities, detecting de-contextualization is a challenging problem in media forensics. Inspired by recent advances in vision-language models, which learn powerful relationships between images and text, we apply these models to the media de-contextualization detection task. Two popular models, CLIP and VinVL, are evaluated and compared on several news and social media datasets to assess their performance in detecting image-text inconsistency. We also summarize interesting observations and shed light on the use of vision-language models in de-contextualization detection.

Original language: English
Title of host publication: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 8967-8971
Number of pages: 5
ISBN (Electronic): 9781665405409
DOIs
State: Published - 2022
Event: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Hybrid, Singapore
Duration: May 22, 2022 to May 27, 2022

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2022-May
ISSN (Print): 1520-6149

Conference

Conference: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
Country/Territory: Singapore
City: Hybrid
Period: 05/22/22 to 05/27/22

Keywords

  • de-contextualization
  • online misinformation
  • out-of-text detection
  • text-image inconsistency
