DEFORMABLE VISTR: SPATIO TEMPORAL DEFORMABLE ATTENTION FOR VIDEO INSTANCE SEGMENTATION

  • SUNY Buffalo
  • InnoPeak Technology Inc

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

The video instance segmentation (VIS) task requires classifying, segmenting, and tracking object instances over all frames in a video clip. Recently, VisTR [1] was proposed as an end-to-end transformer-based VIS framework and demonstrated state-of-the-art performance. However, VisTR is slow to converge during training, requiring around 1000 GPU hours due to the high computational cost of its transformer attention module. To improve training efficiency, we propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that attends only to a small, fixed set of key spatio-temporal sampling points around a reference point. This gives Deformable VisTR a computational cost that is linear in the size of the spatio-temporal feature maps. Moreover, it achieves performance on par with the original VisTR using 10× fewer GPU training hours. We validate the effectiveness of our method on the YouTube-VIS benchmark. Code is available at https://github.com/skrya/DefVIS.
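The sampling idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that): the function names, the single-head, single-query setup, the trilinear interpolation, and the border clamping are all assumptions made for illustration. In a real model the offsets and attention logits would be predicted from the query by learned layers; here they are simply passed in.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def trilinear_sample(feat, t, y, x):
    """Sample feat of shape (T, H, W, C) at a fractional (t, y, x) location.

    Coordinates are clamped to the feature-map borders (an assumption).
    """
    T, H, W, _ = feat.shape
    t, y, x = np.clip(t, 0, T - 1), np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    t0, y0, x0 = int(np.floor(t)), int(np.floor(y)), int(np.floor(x))
    t1, y1, x1 = min(t0 + 1, T - 1), min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dt, dy, dx = t - t0, y - y0, x - x0
    out = np.zeros(feat.shape[-1])
    # Blend the 8 neighbouring grid cells with trilinear weights.
    for ti, wt in ((t0, 1 - dt), (t1, dt)):
        for yi, wy in ((y0, 1 - dy), (y1, dy)):
            for xi, wx in ((x0, 1 - dx), (x1, dx)):
                out += wt * wy * wx * feat[ti, yi, xi]
    return out

def deformable_attention(feat, ref, offsets, logits):
    """Attend to K sampling points around one reference point.

    feat:    (T, H, W, C) spatio-temporal feature map
    ref:     (3,) reference point as (t, y, x)
    offsets: (K, 3) sampling offsets (hypothetical inputs; learned in practice)
    logits:  (K,) unnormalised attention scores (hypothetical inputs)
    """
    weights = softmax(logits)
    out = np.zeros(feat.shape[-1])
    for (dt, dy, dx), w in zip(offsets, weights):
        out += w * trilinear_sample(feat, ref[0] + dt, ref[1] + dy, ref[2] + dx)
    return out

def attention_pairs(T, H, W, K):
    """Query-key pairs for dense attention vs. K-point deformable attention."""
    n = T * H * W
    return n * n, n * K
```

Each query touches only K sampled locations regardless of clip length or resolution, so the total number of query-key interactions grows as T·H·W·K rather than (T·H·W)², which is the linear scaling the abstract refers to. `attention_pairs` makes that comparison concrete for a given feature-map size.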

Original language: English
Title of host publication: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3303-3307
Number of pages: 5
ISBN (Electronic): 9781665405409
State: Published - 2022
Event: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022 - Hybrid, Singapore
Duration: May 22 2022 to May 27 2022

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2022-May
ISSN (Print): 1520-6149

Conference

Conference: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022
Country/Territory: Singapore
City: Hybrid
Period: 05/22/22 to 05/27/22

Keywords

  • deformable convolution
  • efficient framework
  • video instance segmentation
