TY - GEN
T1 - SibNet
T2 - 26th ACM Multimedia conference, MM 2018
AU - Liu, Sheng
AU - Ren, Zhou
AU - Yuan, Junsong
N1 - Publisher Copyright:
© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
PY - 2018/10/15
Y1 - 2018/10/15
N2 - Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it using natural language. Different from previous work that encodes video information using a single flow, in this work, we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning, which utilizes a two-branch architecture to collaboratively encode videos. The first content branch encodes the visual content information of the video via autoencoder, and the second semantic branch encodes the semantic information by visual-semantic joint embedding. Then both branches are effectively combined with soft-attention mechanism and finally fed into a RNN decoder to generate captions. With our SibNet explicitly capturing both content and semantic information, the proposed method can better represent the rich information in videos. Extensive experiments on YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.
AB - Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it using natural language. Different from previous work that encodes video information using a single flow, in this work, we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning, which utilizes a two-branch architecture to collaboratively encode videos. The first content branch encodes the visual content information of the video via autoencoder, and the second semantic branch encodes the semantic information by visual-semantic joint embedding. Then both branches are effectively combined with soft-attention mechanism and finally fed into a RNN decoder to generate captions. With our SibNet explicitly capturing both content and semantic information, the proposed method can better represent the rich information in videos. Extensive experiments on YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.
KW - Autoencoder
KW - Video captioning
KW - Visual-semantic joint embedding
UR - https://www.scopus.com/pages/publications/85058233487
U2 - 10.1145/3240508.3240667
DO - 10.1145/3240508.3240667
M3 - Conference contribution
AN - SCOPUS:85058233487
T3 - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
SP - 1425
EP - 1434
BT - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc
Y2 - 22 October 2018 through 26 October 2018
ER -