OV-VIS: Open-Vocabulary Video Instance Segmentation

Haochen Wang, Cilin Yan, Keyan Chen, Xiaolong Jiang, Xu Tang, Yao Hu, Guoliang Kang, Weidi Xie*, Efstratios Gavves

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Conventionally, the goal of Video Instance Segmentation (VIS) is to segment and categorize objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation (OV-VIS), which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark OV-VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1196 diverse categories, surpassing the category size of existing datasets by more than an order of magnitude. Third, we propose a transformer-based OV-VIS model, OV2Seg+, which associates per-frame segmentation masks with a memory-induced transformer and classifies objects in videos with a voting module given language guidance. In addition, to monitor progress, we set up evaluation protocols for OV-VIS and propose a set of strong baseline models to facilitate future endeavors. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg+. The dataset and code are released at https://github.com/haochenheheda/LVVIS. The competition website is available at https://www.codabench.org/competitions/1748.
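The abstract describes associating per-frame segmentation masks across time via a memory of instance embeddings. The sketch below is a heavily simplified, hypothetical illustration of that idea (not the paper's OV2Seg+ architecture): each track keeps a running memory embedding, per-frame instance embeddings are greedily matched to memory by cosine similarity, and matched memories are updated with a momentum term. The class `MemoryTracker`, the similarity threshold, and the momentum value are all assumptions for illustration only.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-8)

class MemoryTracker:
    """Toy memory-based association of per-frame instance embeddings.

    Hypothetical simplification of memory-induced tracking: real systems
    (like OV2Seg+) use learned transformer queries, not greedy matching.
    """
    def __init__(self, momentum=0.8, threshold=0.5):
        self.memory = []          # one running embedding per track
        self.momentum = momentum  # weight kept for the old memory
        self.threshold = threshold

    def update(self, frame_embeddings):
        """Match this frame's embeddings to tracks; return track ids."""
        ids, used = [], set()
        for emb in frame_embeddings:
            best, best_sim = None, self.threshold
            for tid, mem in enumerate(self.memory):
                if tid in used:
                    continue
                sim = cosine(emb, mem)
                if sim > best_sim:
                    best, best_sim = tid, sim
            if best is None:
                # No sufficiently similar track: start a new one.
                best = len(self.memory)
                self.memory.append(list(emb))
            else:
                # Momentum update of the matched track's memory.
                m = self.momentum
                self.memory[best] = [m * a + (1 - m) * b
                                     for a, b in zip(self.memory[best], emb)]
            used.add(best)
            ids.append(best)
        return ids

tracker = MemoryTracker()
print(tracker.update([[1.0, 0.0], [0.0, 1.0]]))  # frame 1: new tracks
print(tracker.update([[0.9, 0.1], [0.1, 0.9]]))  # frame 2: re-identified
```

Under these assumptions, the two objects in frame 2 are assigned the same track ids as in frame 1, since their embeddings stay close to the stored memories.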

Original language: English
Pages (from-to): 5048-5065
Number of pages: 18
Journal: International Journal of Computer Vision
Volume: 132
Issue number: 11
State: Published - Nov 2024

Keywords

  • Large-vocabulary video segmentation dataset
  • Open-vocabulary
  • Transformer
  • Video instance segmentation
