Two-stream deep encoder-decoder architecture for fully automatic video object segmentation


We propose a two-stream deep encoder-decoder architecture for fully automatic video object segmentation. The two streams, ImSeg-Stream (for static image segmentation) and MoSeg-Stream (for optical flow segmentation), share an identical encoder-decoder architecture. The encoder generates a low-resolution mask with accurate locations and smooth boundaries, while the decoder refines the details of the initial mask and enlarges its resolution by progressively integrating lower-level features. Finally, the two streams learn to integrate with each other for better results. Moreover, to address the shortage of video object segmentation datasets, we propose a seeking strategy to generate a large-scale handcrafted dataset for training. Experiments on two standard datasets demonstrate that the proposed method outperforms most state-of-the-art methods in both segmentation accuracy and run time.
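
The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the authors' network: every function name below is hypothetical, the blending weights are fixed constants rather than learned parameters, and plain nested lists stand in for feature maps. It only shows the data flow, in which a low-resolution mask is enlarged step by step while lower-level features are blended in, and the two stream outputs are then combined.

```python
# Toy illustration only: names and fixed weights are assumptions,
# not the architecture or parameters from the paper.

def upsample2x(mask):
    """Nearest-neighbour 2x upsampling of a 2-D list of floats."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def refine(coarse, skip, alpha=0.5):
    """One decoder step: enlarge the coarse mask, then blend in a
    lower-level feature map that already has the larger size."""
    up = upsample2x(coarse)
    return [[(1 - alpha) * u + alpha * s for u, s in zip(up_row, skip_row)]
            for up_row, skip_row in zip(up, skip)]

def decode(coarse, skips):
    """Apply refinement steps progressively, coarsest skip first."""
    mask = coarse
    for skip in skips:
        mask = refine(mask, skip)
    return mask

def fuse(im_mask, mo_mask, w=0.5):
    """Combine the two stream outputs with a fixed weight.
    (The paper learns this integration; a constant is used here.)"""
    return [[w * a + (1 - w) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(im_mask, mo_mask)]

# Usage: a 1x1 coarse mask refined to 2x2 by one skip feature map.
im_out = decode([[1.0]], [[[1.0, 0.0], [0.0, 1.0]]])
mo_out = [[0.0, 0.0], [0.0, 0.0]]
fused = fuse(im_out, mo_out)
```

In a real implementation the blending and fusion weights would be convolutional layers trained end to end, and the inputs to the two streams would be an RGB frame and an optical flow field respectively.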

2017 IEEE Visual Communications and Image Processing (VCIP)
Li Song
Professor, IEEE Senior Member

Professor and doctoral supervisor; Deputy Director of the Institute of Image Communication and Network Engineering at Shanghai Jiao Tong University; double-appointed professor at the Institute of Artificial Intelligence and the Collaborative Innovation Center of Future Media Network; Deputy Secretary-General of the China Video User Experience Alliance and head of its standards group.