SibNet: Sibling convolutional encoder for video captioning

  • SUNY Buffalo
  • Snap Inc.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

57 Scopus citations

Abstract

Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it using natural language. Unlike previous work, which encodes video information using a single flow, we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning, which uses a two-branch architecture to collaboratively encode videos. The first branch, the content branch, encodes the visual content information of the video via an autoencoder, and the second branch, the semantic branch, encodes the semantic information via visual-semantic joint embedding. The two branches are then combined with a soft-attention mechanism and fed into an RNN decoder to generate captions. Because SibNet explicitly captures both content and semantic information, the proposed method can better represent the rich information in videos. Extensive experiments on the YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.
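
To make the combination step concrete, here is a minimal sketch of soft attention applied to two feature sequences, one per branch. It is not the authors' implementation: the abstract does not say how attention scores are computed or how the attended branches are merged, so the dot-product scoring, the concatenation, and every name below (`soft_attention`, `fuse_branches`, `decoder_state`, the toy feature values) are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_attention(query, features):
    """Return the attention-weighted average of `features` and the weights.

    Scores are plain dot products with `query`. This scoring function is an
    assumption; the abstract does not specify one.
    """
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def fuse_branches(query, content_feats, semantic_feats):
    """Attend over each branch separately, then concatenate the two contexts.

    Concatenation is a hypothetical merge rule used only for this sketch.
    """
    content_ctx, content_w = soft_attention(query, content_feats)
    semantic_ctx, semantic_w = soft_attention(query, semantic_feats)
    return content_ctx + semantic_ctx, content_w, semantic_w


# Toy inputs: three time steps per branch, two dimensions per feature vector.
decoder_state = [1.0, 0.0]
content_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic_feats = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]

fused, content_weights, semantic_weights = fuse_branches(
    decoder_state, content_feats, semantic_feats
)
```

In a captioning model, a vector like `fused` would be recomputed at each decoding step from the decoder's current hidden state and passed to the RNN along with the previously generated word.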

Original language: English
Title of host publication: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
Publisher: Association for Computing Machinery, Inc
Pages: 1425-1434
Number of pages: 10
ISBN (Electronic): 9781450356657
State: Published - Oct 15 2018
Event: 26th ACM Multimedia conference, MM 2018 - Seoul, Korea, Republic of
Duration: Oct 22 2018 – Oct 26 2018

Publication series

Name: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference

Conference

Conference: 26th ACM Multimedia conference, MM 2018
Country/Territory: Korea, Republic of
City: Seoul
Period: 10/22/18 – 10/26/18

Keywords

  • Autoencoder
  • Video captioning
  • Visual-semantic joint embedding
