Recent advances in deep learning have brought significant progress in visual grounding tasks such as language-guided video object segmentation. However, collecting large datasets for these tasks is expensive in terms of annotation time, which represents a bottleneck. To this end, we propose a novel method, namely SynthRef, for generating synthetic referring expressions for target objects in an image (or video frame), and we also present and disseminate the first large-scale dataset with synthetic referring expressions for video object segmentation. Our experiments demonstrate that by training with our synthetic referring expressions one can improve the ability of a model to generalize across different datasets, without any additional annotation cost. Moreover, our formulation allows its application to any object detection or segmentation dataset.
If you find this work useful, please consider citing:
Ioannis Kazakos, Carles Ventura, Miriam Bellver, Carina Silberer, and Xavier Giro-i-Nieto. “SynthRef: Generation of Synthetic Referring Expressions for Object Segmentation”, NAACL 2021 Visually Grounded Interaction and Language (ViGIL) Workshop.
We would like to thank Yannis Kalantidis for his enriching discussions and guidance in this work:
We want to thank our wonderful technical support staff:
|We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan Z and Titan X used in this work.|
|This work has been developed in the framework of project TEC2016-75976-R, financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).|