Recently, research has emphasized the need for video saliency estimation, given its broad range of applications. Traditional video saliency prediction methods based on hand-crafted visual features are slow and yield limited accuracy. In this paper, we propose a real-time, end-to-end saliency estimation model that combines two-stream convolutional neural networks, operating from a global view to a local view. In the global view, the temporal-stream CNN extracts inter-frame features from the optical flow map, while the spatial-stream CNN extracts intra-frame information. In the local view, we adopt recurrent connections to refine local details by correcting the saliency map step by step. We evaluate our model, TSRCNN, on three video saliency estimation datasets; it achieves not only highly competitive performance but also near-real-time GPU processing of 0.088 s per frame compared with other state-of-the-art methods.
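The two-stream design with recurrent refinement described above can be sketched at a very high level. The sketch below is illustrative only: the single-filter "streams", additive fusion, and three refinement steps are stand-in assumptions, not the paper's actual TSRCNN architecture.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'same' 2-D convolution for an odd-sized kernel."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H, W = 16, 16
frame = rng.random((H, W))  # intra-frame appearance (spatial-stream input)
flow = rng.random((H, W))   # optical-flow magnitude (temporal-stream input)

# Stand-ins for learned filters (a real model would train deep CNN weights).
k_spatial = rng.standard_normal((3, 3)) * 0.1
k_temporal = rng.standard_normal((3, 3)) * 0.1
k_refine = rng.standard_normal((3, 3)) * 0.1

# Global view: each stream yields a coarse feature map; the two
# are fused into an initial saliency estimate.
spatial_feat = conv2d(frame, k_spatial)
temporal_feat = conv2d(flow, k_temporal)
saliency = sigmoid(spatial_feat + temporal_feat)

# Local view: recurrent refinement corrects the map step by step,
# feeding the previous estimate into the next correction.
for _ in range(3):
    saliency = sigmoid(saliency + conv2d(saliency, k_refine))

print(saliency.shape)  # per-pixel saliency in (0, 1)
```

The key idea the sketch captures is that spatial and temporal evidence are computed independently and fused globally, after which a recurrent loop repeatedly re-reads the current saliency map to sharpen local detail.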