Introduction
Multiple-object video object segmentation is a challenging task, especially in the zero-shot case, where no object mask is given at the initial frame and the model must discover the objects to segment along the sequence. In our work, we propose a Recurrent network for multiple-object Video Object Segmentation (RVOS) that is fully end-to-end trainable. Our model incorporates recurrence in two different domains: (i) the spatial domain, which allows it to discover the different object instances within a frame, and (ii) the temporal domain, which allows it to keep the segmented objects coherent over time. We train RVOS for zero-shot video object segmentation and are the first to report quantitative results on the DAVIS-2017 and YouTube-VOS benchmarks. Furthermore, we adapt RVOS for one-shot video object segmentation by using the masks obtained at previous time steps as inputs to the recurrent module. Our model reaches results comparable to the state of the art on the YouTube-VOS benchmark and outperforms all previous video object segmentation methods that do not use online learning on the DAVIS-2017 benchmark. Moreover, our model achieves faster inference than previous methods, reaching 44 ms/frame on a P100 GPU.
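As an illustration of the one-shot adaptation described above, here is a minimal PyTorch sketch (not the official implementation) of how a mask predicted at the previous time step could be fed back to the recurrent module. The function name, shapes, and the channel-concatenation strategy are our own assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def build_recurrent_input(encoder_feats, prev_mask):
    """Concatenate the mask predicted at time t-1 (or the ground-truth mask
    given at t=0 in the one-shot setting) to the encoder features of frame t,
    so that the recurrent module is conditioned on the previous prediction.

    encoder_feats: (B, C, H, W) features of the current frame.
    prev_mask:     (B, 1, H0, W0) binary or soft mask from the previous step.
    """
    # Resize the previous mask to the spatial resolution of the features.
    prev_mask = F.interpolate(prev_mask, size=encoder_feats.shape[-2:],
                              mode="bilinear", align_corners=False)
    return torch.cat([encoder_feats, prev_mask], dim=1)  # (B, C+1, H, W)
```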
If you find this work useful, please consider citing:
Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques and Xavier Giro-i-Nieto. “RVOS: End-to-End Recurrent Network for Video Object Segmentation”, CVPR 2019.
@InProceedings{Ventura_2019_CVPR,
  author    = {Ventura, Carles and Bellver, Miriam and Girbau, Andreu and Salvador, Amaia and Marques, Ferran and Giro-i-Nieto, Xavier},
  title     = {RVOS: End-to-End Recurrent Network for Video Object Segmentation},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2019}
}
Model
Our proposed architecture, in which RNNs operate in both the spatial and temporal domains. We show an example where each predicted instance mask is displayed in a different color.
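As a rough illustration of this double recurrence, the following is a minimal PyTorch sketch (ours, not the paper's code): a ConvLSTM cell is unrolled over the object instances within each frame (spatial recurrence) while each instance's state is carried over from the previous frame (temporal recurrence). The cell design, the hidden size, and the way the two hidden states are merged (a simple sum) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: an LSTM whose gates are computed by a conv."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.conv(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def segment_clip(frame_feats, n_instances, cell, head, hid_ch=64):
    """frame_feats: list of (B, C, H, W) encoder features, one per frame.
    Returns one predicted soft mask per instance per frame."""
    B, _, H, W = frame_feats[0].shape
    zeros = frame_feats[0].new_zeros(B, hid_ch, H, W)
    # One (h, c) state per instance, carried across frames (temporal recurrence).
    states = [(zeros, zeros) for _ in range(n_instances)]
    all_masks = []
    for feats in frame_feats:            # temporal loop over frames
        h_spatial = zeros                # state of the previous instance in this frame
        frame_masks, new_states = [], []
        for i in range(n_instances):     # spatial loop over instances
            h_temporal, c_temporal = states[i]
            # Merge spatial and temporal hidden states (simple sum: an assumption).
            h, c = cell(feats, (h_spatial + h_temporal, c_temporal))
            h_spatial = h
            new_states.append((h, c))
            frame_masks.append(torch.sigmoid(head(h)))  # one mask per instance
        states = new_states
        all_masks.append(frame_masks)
    return all_masks

# Example usage with toy shapes:
cell = ConvLSTMCell(in_ch=256, hid_ch=64)
head = nn.Conv2d(64, 1, kernel_size=1)   # projects a hidden state to a mask logit
feats = [torch.randn(1, 256, 32, 56) for _ in range(4)]  # a 4-frame clip
masks = segment_clip(feats, n_instances=3, cell=cell, head=head)
```

Unrolling over instances means the number of predicted objects is not fixed by the architecture, which is what makes the zero-shot setting possible.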
Results
Results on the YouTube-VOS validation set for the semi-supervised task (one-shot):
Results on the DAVIS-2017 test-dev set for the semi-supervised task (one-shot):
Results on the YouTube-VOS validation set for the unsupervised task (zero-shot):
Examples
Qualitative results on the YouTube-VOS validation set for the semi-supervised task (one-shot):
Qualitative results on the DAVIS-2017 test-dev set for the semi-supervised task (one-shot):
Qualitative results on the YouTube-VOS validation set for the unsupervised task (zero-shot):
Qualitative results on the DAVIS-2017 test-dev set for the unsupervised task (zero-shot):
Poster
Code
We implement our models using PyTorch.
Source code is available here.
Acknowledgements
We would like to thank our technical support team, as well as the following institutions and funding bodies:
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used in this work.
The Scene Understanding and Artificial Intelligence (SUnAI) group at Universitat Oberta de Catalunya (UOC) is an SGR17 Preconsolidated Research Group recognized by the Catalan Government (Generalitat de Catalunya) through its AGAUR office.
The Emerging Technologies for Artificial Intelligence group at the Barcelona Supercomputing Center is part of SGR-2017 1414, sponsored by the Catalan Government (Generalitat de Catalunya) through its AGAUR office.
The Image Processing Group at the UPC is an SGR17 Consolidated Research Group recognized and sponsored by the Catalan Government (Generalitat de Catalunya) through its AGAUR office.
This research was supported by Industrial Doctorates 2017-DI-064 and 2017-DI-028 from the Government of Catalonia.
This work has been developed in the framework of projects TIN2015-66951-C2-2-R, TIN2015-65316-P and TEC2016-75976-R, funded by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).