Temporal Regularization of Saliency Maps

in Egocentric Videos

Monica Cherto



Universitat Politècnica de Catalunya

Dublin City University


This work explores how temporal regularization of saliency maps in egocentric videos can have a positive or a negative impact on saliency prediction depending on the viewer behavior. Our study is based on the new EgoMon dataset, which consists of seven videos recorded by three subjects in both free-viewing and task-driven setups. We compute a frame-based saliency prediction over the frames of each video clip, as well as a temporally regularized version based on deep neural networks. Our results indicate that the NSS saliency metric improves during task-driven activities, but that it clearly drops during free-viewing. Encouraged by the good results on task-driven activities, we also computed and published the saliency maps for the EPIC Kitchens dataset.

Find the full paper on arXiv or download the PDF directly from here.

If you find this work useful, please consider citing:

Panagiotis Linardos, Eva Mohedano, Monica Cherto, Cathal Gurrin, Xavier Giro-i-Nieto. “Temporal Saliency Adaptation in Egocentric Videos”, Extended abstract at the ECCV Workshop on Egocentric Perception, Interaction and Computing (EPIC), 2018.

@article{linardos2018temporal,
  title={Temporal Saliency Adaptation in Egocentric Videos},
  author={Panagiotis Linardos and Eva Mohedano and Monica Cherto and Cathal Gurrin and Xavier Giro-i-Nieto},
  journal={arXiv preprint arXiv:1808.09559},
  year={2018}
}

Our work is based on SalGAN, a computational model of saliency to predict human fixations on still images. In terms of architecture, we have added a convolutional LSTM layer on top of the frame-based saliency predictions.
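To illustrate the idea of a convolutional LSTM layer stacked on top of frame-based saliency predictions, here is a minimal PyTorch sketch. The cell implementation, hidden size, and the `smooth_sequence` helper are simplified assumptions for illustration; they are not the trained SalGAN weights or the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: the gates are computed with
    convolutions, so the spatial structure of the saliency map is kept."""
    def __init__(self, in_ch, hidden_ch, kernel=3):
        super().__init__()
        # A single conv produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel, padding=kernel // 2)
        self.hidden_ch = hidden_ch

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # update cell memory
        h = o * torch.tanh(c)           # new hidden state
        return h, c

def smooth_sequence(sal_maps, cell, out_conv):
    """Temporally regularize a sequence of frame-based saliency maps
    of shape (T, 1, H, W) by running them through the ConvLSTM cell."""
    T, _, H, W = sal_maps.shape
    h = torch.zeros(1, cell.hidden_ch, H, W)
    c = torch.zeros_like(h)
    out = []
    for t in range(T):
        h, c = cell(sal_maps[t:t + 1], (h, c))
        out.append(torch.sigmoid(out_conv(h)))  # back to a 1-channel map
    return torch.cat(out, dim=0)

cell = ConvLSTMCell(in_ch=1, hidden_ch=8)
out_conv = nn.Conv2d(8, 1, 1)            # project hidden state to a saliency map
maps = torch.rand(5, 1, 64, 64)          # 5 frame-based predictions (dummy data)
smoothed = smooth_sequence(maps, cell, out_conv)  # same shape, temporally linked
```

Each output frame now depends on the hidden state carried over from previous frames, which is what introduces the temporal regularization.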



The EgoMon Gaze & Video dataset can be downloaded as a single file, or by components:

- Tobii glasses used for gaze data recording
- Narrative Clip equipment
- Example of a clean video
- Example of a video with overlaid gaze fixations

More qualitative examples can be found on this site.


We evaluated SalGAN with and without our temporal regularization on different datasets:

Performance on DHF1K

Performance on DHF1K and EgoMon

Detailed results on the two types of EgoMon recordings (free-viewing and task-driven).
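For reference, the NSS (Normalized Scanpath Saliency) metric used in these evaluations can be computed as below. This is a standard formulation written as a small standalone sketch, not the exact evaluation code from the project.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: z-score the predicted map, then
    average its values at the ground-truth fixation locations.
    `saliency` is an (H, W) float map, `fixations` an (H, W) binary mask."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations > 0].mean())

# Toy example: a map peaked exactly where the fixation lies scores high.
sal = np.zeros((4, 4)); sal[1, 2] = 1.0
fix = np.zeros((4, 4)); fix[1, 2] = 1
print(nss(sal, fix))  # ≈ 3.87: the fixation falls on the peak
```

Higher NSS means the predicted saliency concentrates on the true fixations; values near zero indicate chance-level prediction.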

When it comes to visual attention, there is not always a direct relationship between actions and fixations. For example, a person can easily carry an object in her hand and put it on the table without looking at it. The daily art of cooking, on the other hand, is a series of object-manipulation tasks that require hand-eye coordination. Actions such as cutting onions or pouring a liquid into a bottle are hard to accomplish without using both hands and eyes in coordination. For that reason, we expect that using the saliency maps of the video will bring the model closer to the features that are most intimately linked to the tasks carried out by the subjects during the EPIC Kitchens dataset acquisition.

Examples of EPIC Kitchens frames with their saliency maps. The second row corresponds to the vanilla SalGAN predictions and the third row to the augmented SalGAN predictions.

You may download saliency maps from here:

EPIC Kitchens (SalGAN) (25G) EPIC Kitchens (+convLSTM) (737M)

EgoMon (SalGAN) (216M) EgoMon (+convLSTM) (93M)


Download the PDF here


This project was developed with Python 3.6.5 and PyTorch 0.4.0. To download and install PyTorch, please follow the official guide.
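As a quick sanity check (a hypothetical snippet, not part of the original repository), you can confirm which versions your environment provides:

```python
import sys

import torch

# The project targets Python 3.6.5 and PyTorch 0.4.0; newer versions
# may require minor adaptations to the code.
print("python:", ".".join(str(v) for v in sys.version_info[:3]))
print("torch: ", torch.__version__)
print("cuda available:", torch.cuda.is_available())
```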



We especially want to thank our technical support team:

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan Z and Titan X used in this work.
The Image Processing Group at the UPC is a SGR17 Consolidated Research Group recognized by the Government of Catalonia (Generalitat de Catalunya) through its AGAUR office.
This work has been developed in the framework of projects TEC2013-43935-R and TEC2016-75976-R, financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).