Two-stream recurrent convolutional neural networks for video saliency estimation


Video saliency estimation has recently attracted growing research interest, as its applications span a wide range of domains. Traditional video saliency prediction methods based on hand-crafted visual features are slow and often ineffective. In this paper, we propose a real-time, end-to-end saliency estimation model that combines two-stream convolutional neural networks, proceeding from a global view to a local view. In the global view, the temporal stream CNN extracts inter-frame features from the optical flow map, while the spatial stream CNN extracts intra-frame information. In the local view, we adopt recurrent connections to refine local details by correcting the saliency map step by step. We evaluate our model, TSRCNN, on three video saliency estimation datasets; it shows not only highly competitive performance but also near real-time GPU processing time of 0.088 s, compared with other state-of-the-art methods.
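The two-stream design described in the abstract can be sketched as follows. This is a minimal illustrative PyTorch model, not the authors' actual TSRCNN: the layer sizes, channel counts, and number of refinement steps are assumptions, but it follows the stated structure, a spatial stream on the RGB frame, a temporal stream on a two-channel optical flow map, and recurrent refinement that corrects the saliency map step by step.

```python
# Hypothetical sketch of a two-stream recurrent saliency model (not the paper's exact network).
import torch
import torch.nn as nn

class TwoStreamSaliency(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        # Spatial stream: intra-frame appearance features from the RGB frame.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Temporal stream: inter-frame motion features from a 2-channel optical flow map.
        self.temporal = nn.Sequential(
            nn.Conv2d(2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Recurrent refinement head: takes fused features plus the current
        # saliency estimate and emits a corrected saliency map.
        self.refine = nn.Conv2d(2 * hidden + 1, 1, 3, padding=1)

    def forward(self, frame, flow, steps=3):
        # Global view: fuse the two streams along the channel dimension.
        feats = torch.cat([self.spatial(frame), self.temporal(flow)], dim=1)
        # Local view: refine the saliency map step by step.
        sal = torch.zeros_like(frame[:, :1])
        for _ in range(steps):
            sal = torch.sigmoid(self.refine(torch.cat([feats, sal], dim=1)))
        return sal

model = TwoStreamSaliency()
frame = torch.randn(1, 3, 64, 64)   # RGB frame
flow = torch.randn(1, 2, 64, 64)    # optical flow map (dx, dy)
saliency = model(frame, flow)       # (1, 1, 64, 64), values in [0, 1]
```

The fusion strategy (channel concatenation) and the refinement recurrence shown here are one plausible instantiation; the paper's network may fuse the streams and unroll the recurrent connections differently.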

2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB)
Li Song
Professor, IEEE Senior Member