Real-time interaction has become increasingly important, and panoramic video has grown steadily in popularity. In this paper, we study the problem of predicting a viewer's future Field of View (FoV) while the viewer is watching a dynamic panoramic immersive video. Existing methods either estimate the future viewing area from the viewer's previous trajectory or predict the FoV from salient regions in the video frames. Here, we design a new model to predict viewing points at future moments. First, we predict a point from the viewer's previous viewing trajectory using an LSTM (Long Short-Term Memory) network. In parallel, each panoramic video frame is mapped to six patches by cube-map projection, and a modified VGG-16 network performs saliency detection on each patch; the six saliency maps are then combined into a single saliency map, which is refined by a three-layer convolutional neural network. Finally, the saliency map of the corresponding moment, together with the point predicted by the LSTM, is fed into a two-layer fully connected network to produce the final predicted point. Experimental results show that our model achieves higher prediction accuracy than traditional prediction algorithms and outperforms the variant of our model without the second CNN.
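To make the final fusion step concrete, the sketch below shows a minimal NumPy version of the two-layer fully connected head that combines the LSTM-predicted point with features from the refined saliency map. The layer sizes, the pooled saliency-feature dimension, and the stubbed inputs are all illustrative assumptions, not the actual values or weights used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Standard ReLU activation."""
    return np.maximum(0.0, x)

class FusionHead:
    """Hypothetical two-layer FC head: [LSTM point, saliency features] -> final point.

    Dimensions are assumptions for illustration; the paper does not
    specify hidden sizes or the saliency feature dimension here.
    """
    def __init__(self, sal_dim=64, hidden=32):
        in_dim = 2 + sal_dim  # (longitude, latitude) plus flattened saliency features
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 2))
        self.b2 = np.zeros(2)

    def forward(self, point, sal_feat):
        x = np.concatenate([point, sal_feat])
        h = relu(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2  # final predicted (longitude, latitude)

# Stub inputs standing in for the two upstream branches of the model.
lstm_point = np.array([0.3, -0.1])      # point from the LSTM trajectory branch (stub)
sal_feat = rng.normal(size=64)          # pooled refined-saliency features (stub)

head = FusionHead()
final_point = head.forward(lstm_point, sal_feat)
print(final_point.shape)  # (2,)
```

In a trained system the two stub inputs would come from the LSTM branch and the refined saliency map respectively, and the head's weights would be learned end to end rather than randomly initialized.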