Video Object Segmentation By Jointly Tracking Foreground and Background
This report presents an efficient method for semi-supervised video object segmentation -- the problem of identifying the foreground pixels occupied by a target object, which is specified by a ground-truth mask in the first video frame. While state-of-the-art methods achieve segmentation accuracy above 80%, they run relatively slowly, at fewer than 10 frames per second, which limits their use in many domains. In addition, the accuracy of existing approaches typically suffers when the target is occluded by moving background objects. We address these two shortcomings of prior work with a novel deep architecture that jointly and efficiently tracks both the foreground and the background in the video. Our key hypothesis is that explicitly tracking the dynamic background of the target object helps improve segmentation in cases of target occlusion. We propose two deep neural networks with identical architectures that run in parallel: one for foreground object segmentation, and the other for background segmentation. Their outputs are integrated by a third network that fuses the initial foreground and background segmentations into a more accurate target object segmentation. We evaluate various configurations of the proposed architecture on the DAVIS 2016 dataset. Our results support the key hypothesis: jointly tracking the dynamic foreground and background indeed outperforms a baseline that tracks only the target object. On DAVIS 2016, our method achieves 70.61% accuracy while running at over 100 frames per second.
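The fusion idea in the abstract can be illustrated with a minimal sketch. The report uses a learned fusion network; the closed-form pixelwise combination below is only an assumption for illustration, showing how a background probability map can suppress spurious foreground responses. The function name `fuse` and the toy probability maps are hypothetical.

```python
def fuse(fg_prob, bg_prob, eps=1e-8):
    """Combine foreground and background probability maps pixelwise.

    A pixel scores high when evidence for foreground (fg_prob) and
    against background (1 - bg_prob) outweighs the reverse. Inputs are
    2D lists of floats in [0, 1]; this stands in for the report's
    learned fusion network (an assumption, not the actual method).
    """
    fused = []
    for fg_row, bg_row in zip(fg_prob, bg_prob):
        row = []
        for fg, bg in zip(fg_row, bg_row):
            num = fg * (1.0 - bg)          # evidence for foreground
            den = num + (1.0 - fg) * bg + eps
            row.append(num / den)
        fused.append(row)
    return fused

# Tiny 2x2 example: the background network is confident the second
# column is background, which suppresses the foreground network's
# spurious 0.6 response there.
fg = [[0.9, 0.6],
      [0.8, 0.4]]
bg = [[0.1, 0.9],
      [0.2, 0.7]]
mask = [[p > 0.5 for p in row] for row in fuse(fg, bg)]
# mask -> [[True, False], [True, False]]
```

The sketch makes the key hypothesis concrete: a pixel the foreground network is unsure about is resolved by strong background evidence, which is exactly the occlusion case the architecture targets.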
Major Advisor: Sinisa Todorovic
Committee: Fuxin Li
Committee: Leonard Coop
Thursday, June 6, 2019 at 10:00am to 12:00pm
Kelley Engineering Center, 1007
110 SW Park Terrace, Corvallis, OR 97331