Viewport Prediction for Panoramic Video with Multi-CNN

Abstract

Real-time interaction has become increasingly important, and panoramic video has grown steadily more popular. In this paper, we study the problem of predicting the Field of View (FoV) at a future moment while a viewer is watching a dynamic, immersive panoramic video. Existing methods either estimate the future viewing area from the previous viewing trajectory or predict the FoV from salient regions in the video frames. Here, we design a new model to predict the viewing points at future moments. First, we predict a point from the viewer's previous viewing trajectory using an LSTM (Long Short-Term Memory) network. Meanwhile, each panoramic video frame is mapped in advance to six patches via cube-map projection, and a modified VGG-16 network performs saliency detection on each patch; the six saliency maps are then combined into a single saliency map, which a three-layer convolutional neural network further refines. Finally, the saliency map of the corresponding moment is combined with the point predicted by the LSTM and fed into a two-layer fully connected network to produce the final predicted point. Experimental results show that our model achieves higher prediction accuracy than traditional prediction algorithms and outperforms the same model without the second CNN.
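The cube-map step mentioned above splits each equirectangular frame into six square face patches. As a minimal sketch of how such a projection can be set up (the paper does not give its exact face layout or naming, so the face orientations below are illustrative assumptions), each pixel of a face is mapped to a 3D ray and then to spherical longitude/latitude, from which the equirectangular frame can be sampled:

```python
import numpy as np

def face_to_sphere_coords(face, size):
    """Map pixel coordinates of one cube face to (lon, lat) on the sphere.

    face: one of 'front', 'back', 'left', 'right', 'top', 'bottom'
          (hypothetical naming; the paper only states that frames are
          split into 6 patches by cube map).
    size: edge length of the square face in pixels.
    Returns (lon, lat) arrays in radians, lon in [-pi, pi],
    lat in [-pi/2, pi/2].
    """
    # Normalized face coordinates in (-1, 1), sampled at pixel centers.
    t = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(t, -t)  # v grows upward in image space
    one = np.ones_like(u)
    # Assumed face orientations: +x is the viewing direction of 'front'.
    if face == 'front':
        x, y, z = one, u, v
    elif face == 'back':
        x, y, z = -one, -u, v
    elif face == 'right':
        x, y, z = -u, one, v
    elif face == 'left':
        x, y, z = u, -one, v
    elif face == 'top':
        x, y, z = -v, u, one
    else:  # bottom
        x, y, z = v, u, -one
    # Convert the 3D ray to spherical longitude/latitude.
    lon = np.arctan2(y, x)
    lat = np.arcsin(z / np.sqrt(x**2 + y**2 + z**2))
    return lon, lat
```

Sampling the equirectangular frame at these `(lon, lat)` coordinates (e.g. via bilinear interpolation) yields the six patches that are fed to the per-patch VGG-16 saliency network; the center of the 'front' face lands at `(lon, lat) = (0, 0)`.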

Publication
2019 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB)
Li Song
Professor, IEEE Senior Member

Professor and Doctoral Supervisor; Deputy Director of the Institute of Image Communication and Network Engineering at Shanghai Jiao Tong University; Double-Appointed Professor of the Institute of Artificial Intelligence and the Collaborative Innovation Center of Future Media Network; Deputy Secretary-General of the China Video User Experience Alliance and head of its standards group.