Pixel- and Frame-level Video Labeling using Spatial and Temporal Convolutional Networks
This dissertation addresses the problem of video labeling at both the frame and pixel levels using deep learning. For pixel-level video labeling, we have studied two problems: i) spatiotemporal video segmentation, and ii) boundary detection and boundary flow estimation.

For spatiotemporal video segmentation, we have developed the recurrent temporal deep field (RTDF). RTDF is a conditional random field (CRF) that combines a deconvolutional neural network and a recurrent temporal restricted Boltzmann machine (RTRBM), so that all components can be trained jointly end-to-end. We have derived a mean-field inference algorithm to jointly predict all latent variables in both the RTRBM and the CRF. For boundary detection and boundary flow estimation, we have proposed a fully convolutional Siamese network (FCSN). The FCSN first estimates object boundaries in two consecutive frames, and then predicts boundary correspondences between the two frames.

For frame-level video labeling, we have specified a temporal deformable residual network (TDRN) for temporal action segmentation. TDRN computes two parallel temporal streams: i) a residual stream that analyzes video information at its full temporal resolution, and ii) a pooling/unpooling stream that captures long-range visual cues. The former facilitates local, fine-scale action segmentation, and the latter uses multiscale context to improve the accuracy of frame classification.

All of our networks have been empirically evaluated on challenging benchmark datasets and compared with the state of the art. Each of the above approaches outperformed the state of the art at the time of our evaluation.
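As a point of reference for the RTDF's inference step, the following is a minimal PyTorch sketch of generic mean-field updates for a pairwise CRF. The tensor shapes, the `coupling` and `adjacency` arguments, and the fixed iteration count are illustrative assumptions; this does not reproduce the dissertation's joint RTRBM+CRF derivation.

```python
import torch
import torch.nn.functional as F

def mean_field(unary, coupling, adjacency, iters=5):
    """Generic mean-field updates for a pairwise CRF (illustrative only).
    unary:     (N, C) unary scores per site and label (higher = more likely)
    coupling:  (C, C) label-compatibility penalties
    adjacency: (N, N) pairwise weights between sites
    """
    q = F.softmax(unary, dim=1)            # initialize marginals from unaries
    for _ in range(iters):
        # Expected pairwise penalty each site receives from its neighbors.
        msg = adjacency @ q @ coupling     # (N, C)
        q = F.softmax(unary - msg, dim=1)  # coordinate-wise mean-field update
    return q

# Usage with hypothetical sizes: 5 sites, 3 labels, Potts-style coupling.
unary = torch.randn(5, 3)
coupling = torch.ones(3, 3) - torch.eye(3)
adjacency = torch.rand(5, 5)
marginals = mean_field(unary, coupling, adjacency)
```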
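To make the Siamese design concrete, here is a minimal PyTorch sketch of an FCSN-style network: a shared convolutional encoder processes two consecutive frames, a boundary head predicts per-pixel boundary probability in each frame, and a correspondence head predicts a flow field from the concatenated features. The layer widths, kernel sizes, and head designs are illustrative assumptions, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class SiameseBoundaryNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared convolutional encoder applied to both frames (Siamese weights).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Per-frame boundary head: one boundary logit per pixel.
        self.boundary_head = nn.Conv2d(64, 1, 1)
        # Correspondence head: consumes both frames' features concatenated
        # and predicts a 2-channel flow field along the boundaries.
        self.flow_head = nn.Conv2d(128, 2, 1)

    def forward(self, frame_t, frame_t1):
        f_t = self.encoder(frame_t)
        f_t1 = self.encoder(frame_t1)   # identical weights for both frames
        b_t = torch.sigmoid(self.boundary_head(f_t))
        b_t1 = torch.sigmoid(self.boundary_head(f_t1))
        flow = self.flow_head(torch.cat([f_t, f_t1], dim=1))
        return b_t, b_t1, flow

# Usage on a pair of consecutive frames:
net = SiameseBoundaryNet()
x_t, x_t1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
b_t, b_t1, flow = net(x_t, x_t1)
```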
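Similarly, TDRN's two parallel temporal streams can be sketched as follows, assuming precomputed per-frame features of shape (batch, channels, time). Ordinary 1D convolutions stand in for TDRN's deformable ones, and the channel widths, pooling factor, and single-level pooling/unpooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamTemporalNet(nn.Module):
    def __init__(self, in_channels=64, num_classes=10):
        super().__init__()
        # Residual stream: operates at full temporal resolution.
        self.res_conv = nn.Conv1d(in_channels, in_channels, 3, padding=1)
        # Pooling/unpooling stream: downsample, convolve, upsample to
        # aggregate long-range temporal context.
        self.ctx_conv = nn.Conv1d(in_channels, in_channels, 3, padding=1)
        self.classifier = nn.Conv1d(2 * in_channels, num_classes, 1)

    def forward(self, x):                   # x: (B, C, T)
        res = F.relu(x + self.res_conv(x))  # fine-scale, full resolution
        ctx = F.max_pool1d(x, 4)            # coarse temporal resolution
        ctx = F.relu(self.ctx_conv(ctx))
        ctx = F.interpolate(ctx, size=x.shape[-1], mode='linear',
                            align_corners=False)
        # Fuse both streams and predict an action label per frame.
        return self.classifier(torch.cat([res, ctx], dim=1))

# Usage: 128 frames of hypothetical 64-d per-frame features.
features = torch.randn(1, 64, 128)
frame_logits = TwoStreamTemporalNet()(features)  # (1, 10, 128)
```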
Major Advisor: Sinisa Todorovic
Committee: Fuxin Li
Committee: Raviv Raich
Committee: Xiaoli Fern
GCR: Bo Zhao
Friday, December 7, 2018, from 1:00pm to 3:00pm
Kelley Engineering Center, 1005
110 SW Park Terrace, Corvallis, OR 97331