Publication
This work explores how temporal regularization can help or hurt saliency prediction in egocentric videos, depending on the viewer's behavior. Our study is based on the new EgoMon dataset, which consists of seven videos recorded by three subjects in both free-viewing and task-driven setups. We compute frame-based saliency predictions over the frames of each video clip, as well as a temporally regularized version, both based on deep neural networks. Our results indicate that the NSS saliency metric improves during task-driven activities, but clearly drops during free-viewing. Encouraged by the good results in task-driven activities, we also compute and publish the saliency maps for the EPIC Kitchens dataset.
Find the full paper on arXiv or download the PDF directly from here.
If you find this work useful, please consider citing:
Panagiotis Linardos, Eva Mohedano, Monica Cherto, Cathal Gurrin, Xavier Giro-i-Nieto. “Temporal Saliency Adaptation in Egocentric Videos”, Extended abstract at the ECCV Workshop on Egocentric Perception, Interaction and Computing (EPIC), 2018.
@inproceedings{Linardos2018videosalgan,
  title={Temporal Saliency Adaptation in Egocentric Videos},
  author={Panagiotis Linardos and Eva Mohedano and Monica Cherto and Cathal Gurrin and Xavier Giro-i-Nieto},
  journal={arXiv preprint arXiv:1808.09559},
  year={2018}
}
Model
Our work is based on SalGAN, a computational saliency model that predicts human fixations on still images. In terms of architecture, we add a convolutional LSTM layer on top of the frame-based saliency predictions, so that the prediction for each frame is regularized by the previous ones.
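The sketch below illustrates the general idea rather than the exact layer used in the paper: a generic ConvLSTM cell run over a clip of frame-wise saliency maps. The class name, hidden size, and output convolution are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell (illustrative; not the paper's exact layer)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution jointly produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


def temporally_regularize(frame_saliency, cell, out_conv):
    """Run the ConvLSTM over per-frame saliency maps shaped (B, T, 1, H, W)."""
    b, t, _, height, width = frame_saliency.shape
    h = frame_saliency.new_zeros(b, cell.hidden_channels, height, width)
    c = frame_saliency.new_zeros(b, cell.hidden_channels, height, width)
    outputs = []
    for step in range(t):
        h, c = cell(frame_saliency[:, step], (h, c))
        outputs.append(torch.sigmoid(out_conv(h)))  # back to a 1-channel saliency map
    return torch.stack(outputs, dim=1)


# Usage sketch with random tensors standing in for SalGAN frame-wise predictions.
cell = ConvLSTMCell(in_channels=1, hidden_channels=32)
out_conv = nn.Conv2d(32, 1, kernel_size=1)
clip = torch.rand(2, 8, 1, 192, 256)
smoothed = temporally_regularize(clip, cell, out_conv)
print(smoothed.shape)  # torch.Size([2, 8, 1, 192, 256])
```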
Dataset
The EgoMon Gaze & Video dataset can be downloaded as a single file or by components:
- Full dataset (22G)
- Gaze Data (xlsx) (2.2G)
- Gaze Data (csv) (1.9G)
- Narrative (Only for the botanic gardens) (57M)
- Clean Videos (5.7G)
- Videos with overlaid gaze fixations (5G)
More qualitative examples can be found on this site.
Results
We evaluated SalGAN with and without our temporal regularization on different datasets:
- Performance on DHF1K
- Performance on DHF1K and EgoMon
- Analytical results on the two types of EgoMon recordings
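For reference, the NSS metric reported in these evaluations is the mean of the z-scored predicted saliency values at the ground-truth fixation locations. Below is a minimal NumPy sketch; the function name and the epsilon guard are ours, not taken from the paper's evaluation code.

```python
import numpy as np

def nss(saliency_map, fixation_map, eps=1e-8):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s = saliency_map.astype(np.float64)
    s = (s - s.mean()) / (s.std() + eps)   # z-score the predicted map
    fixations = fixation_map.astype(bool)  # binary map of fixated pixels
    return float(s[fixations].mean())

# Toy usage: a prediction concentrated on the fixated region scores high.
pred = np.zeros((10, 10)); pred[4:6, 4:6] = 1.0
fix = np.zeros((10, 10), dtype=bool); fix[5, 5] = True
print(nss(pred, fix))
```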
When it comes to visual attention, there is not always a direct relationship between actions and fixations. For example, a person can easily carry an object in her hand and put it on the table without looking at it. The daily art of cooking, on the other hand, is a series of object-manipulation tasks that require hand-eye coordination. Actions such as cutting onions or pouring a liquid into a bottle are hard to accomplish without using both hands and eyes in coordination. For that reason, we expect the saliency maps of the videos to bring the model closer to the features most closely linked to the tasks carried out by the subjects during the EPIC Kitchens dataset acquisition.
You may download saliency maps from here:
- EPIC-Kitchens (SalGAN) (25G)
- EPIC-Kitchens (+convLSTM) (737M)
- EgoMon (SalGAN) (216M)
- EgoMon (+convLSTM) (93M)
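To inspect the downloaded maps, assuming they are stored as grayscale image files aligned one-to-one with the video frames (an assumption about the archive layout; the paths below are placeholders), they can be blended onto the RGB frames with OpenCV:

```python
import cv2

# Placeholder paths; adjust to wherever the downloaded frames and maps live.
frame = cv2.imread("frames/P01_01/frame_0000000001.jpg")
saliency = cv2.imread("saliency/P01_01/frame_0000000001.png", cv2.IMREAD_GRAYSCALE)

# Resize the map to the frame resolution and blend it as a heatmap overlay.
saliency = cv2.resize(saliency, (frame.shape[1], frame.shape[0]))
heatmap = cv2.applyColorMap(saliency, cv2.COLORMAP_JET)
overlay = cv2.addWeighted(frame, 0.6, heatmap, 0.4, 0)
cv2.imwrite("overlay.jpg", overlay)
```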
Presentation
Poster
Download the PDF here
Code
This project was developed with Python 3.6.5 and PyTorch 0.4.0. To download and install PyTorch, please follow the official guide.
Acknowledgements
We especially want to thank our technical support team:
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan Z and Titan X used in this work.
The Image Processing Group at the UPC is a SGR17 Consolidated Research Group recognized by the Government of Catalonia (Generalitat de Catalunya) through its AGAUR office.
This work has been developed in the framework of projects TEC2013-43935-R and TEC2016-75976-R, financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).