Curriculum Learning for Recurrent Video Object Segmentation
Related papers
YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
ArXiv, 2018
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions for the problem. End-to-end sequential learning to explore spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets, i.e., even the largest video segmentation dataset contains only 90 short video clips. To solve this problem, we build a new large-scale video object segmentation dataset called YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. This is by far the largest video object segmentation dataset to our knowledge and has been released at this http URL. We further evaluate several existing state-of-the-art video object segmentation algorithms on our dataset.
Make One-Shot Video Object Segmentation Efficient Again
ArXiv, 2020
Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video. In the semi-supervised setting, the first mask of each object is provided at test time. Following the one-shot principle, fine-tuning VOS methods train a segmentation model separately on each given object mask. However, the VOS community has recently deemed such test-time optimization, and its impact on the test runtime, unfeasible. To mitigate the inefficiencies of previous fine-tuning approaches, we present efficient One-Shot Video Object Segmentation (e-OSVOS). In contrast to most VOS approaches, e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN. The one-shot test runtime and performance are optimized without a laborious and handcrafted hyperparameter search. To this end, we meta-learn the model initialization and learning rates for the test-time optimization. To achieve optimal learning behavior, ...
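The test-time step this abstract describes is easy to sketch. Below is a minimal, hypothetical inner loop in PyTorch: `meta_lrs` stands in for the meta-learned per-parameter learning rates, and the outer meta-training loop that produces them (and the model initialization) is omitted. This illustrates the idea of meta-learned step sizes for one-shot fine-tuning, not e-OSVOS's actual code.

```python
import torch
import torch.nn.functional as F

def one_shot_finetune(model, meta_lrs, first_frame, first_mask, steps=10):
    """Fine-tune a segmentation model on the given first-frame mask.

    meta_lrs: one meta-learned learning-rate tensor per model parameter,
    produced by an outer meta-training loop that is not shown here.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(steps):
        logits = model(first_frame)                      # (1, 1, H, W) mask logits
        loss = F.binary_cross_entropy_with_logits(logits, first_mask)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            # One SGD step per parameter, scaled by its meta-learned rate.
            for p, g, lr in zip(params, grads, meta_lrs):
                p -= lr * g
    return model
```

Because the step sizes are learned rather than hand-tuned, the loop needs only a handful of iterations, which is what keeps the one-shot test runtime manageable.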
Recurrent Convolutional Neural Networks for Object-Class Segmentation of RGB-D Video
Object-class segmentation is a computer vision task which requires labeling each pixel of an image with the class of the object it belongs to. Deep convolutional neural networks (DNNs) are able to learn and exploit the local spatial correlations required for this task. They are, however, restricted by their small, fixed-sized filters, which limits their ability to learn long-range dependencies. Recurrent Neural Networks (RNNs), on the other hand, do not suffer from this restriction. Their iterative interpretation allows them to model long-range dependencies by propagating activity. This property might be especially useful when labeling video sequences, where both spatial and temporal long-range dependencies occur. In this work, we propose novel RNN architectures for object-class segmentation. We investigate three ways to consider past and future context in the prediction process by comparing networks that process the frames one by one with networks that have access to the whole sequence. We evaluate our models on the challenging NYU Depth v2 dataset for object-class segmentation and obtain competitive results.
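A convolutional GRU is one minimal way to realize the recurrent idea: the hidden state is itself a feature map, so activity propagated from frame to frame carries temporal context while preserving spatial structure. The PyTorch sketch below is illustrative only (the cell design, channel counts, and class count are assumptions, not the paper's architecture) and labels a toy RGB-D sequence frame by frame.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gates and candidate state are convolutions,
    so the hidden state stays a spatial feature map."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Frame-by-frame labeling: the hidden state carries context across time.
cell = ConvGRUCell(in_ch=4, hid_ch=16)        # RGB-D input: 4 channels
head = nn.Conv2d(16, 13, 1)                   # e.g. 13 object classes (assumed)
frames = torch.randn(8, 1, 4, 64, 64)         # (T, B, C, H, W) toy sequence
h = torch.zeros(1, 16, 64, 64)
for x in frames:
    h = cell(x, h)
    logits = head(h)                          # per-pixel class scores per frame
```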
Object Class Segmentation of RGB-D Video using Recurrent Convolutional Neural Networks
Object class segmentation is a computer vision task which requires labeling each pixel of an image with the class of the object it belongs to. Deep convolutional neural networks (DNNs) are able to learn and take advantage of the local spatial correlations required for this task. They are, however, restricted by their small, fixed-sized filters, which limits their ability to learn long-range dependencies. Recurrent Neural Networks (RNNs), on the other hand, do not suffer from this restriction. Their iterative interpretation allows them to model long-range dependencies by propagating activity. This property is especially useful when labeling video sequences, where both spatial and temporal long-range dependencies occur. In this work, a novel RNN architecture for object class segmentation is presented. We investigate several ways to train such a network. We evaluate our models on the challenging NYU Depth v2 dataset for object class segmentation and obtain competitive results.
FASSVid: Fast and Accurate Semantic Segmentation for Video Sequences
Entropy
Most methods for real-time semantic segmentation do not take temporal information into account when working with video sequences. This is counter-intuitive in real-world scenarios, where the main application of such methods is precisely to process frame sequences as quickly and accurately as possible. In this paper, we address this problem by exploiting the temporal information provided by previous frames of the video stream. Our method leverages a previous input frame as well as the previous output of the network to enhance the prediction accuracy of the current input frame. We develop a module that obtains feature maps rich in change information. Additionally, we incorporate the previous output of the network into all the decoder stages as a way of increasing the attention given to relevant features. Finally, to properly train and evaluate our methods, we introduce CityscapesVid, a dataset specifically designed to benchmark semantic video segmentation networks. ...
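As a rough illustration of this recipe (the previous frame and the previous output reused to guide the current prediction), here is a hedged PyTorch sketch. The module name, channel sizes, and fusion scheme are invented for illustration and do not correspond to FASSVid's actual design.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Fuse the current frame with the previous frame and the previous
    prediction, so the decoder can attend to regions that changed."""
    def __init__(self, feat_ch=32, n_classes=19):
        super().__init__()
        self.enc = nn.Conv2d(3, feat_ch, 3, padding=1)
        # Change-aware features from the difference of frame encodings.
        self.change = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        # Decoder stage conditioned on the previous softmax output.
        self.dec = nn.Conv2d(feat_ch + n_classes, n_classes, 3, padding=1)

    def forward(self, cur_frame, prev_frame, prev_logits):
        f_cur, f_prev = self.enc(cur_frame), self.enc(prev_frame)
        change = self.change(f_cur - f_prev)      # where did the scene move?
        prev_prob = prev_logits.softmax(dim=1)    # recycle the last prediction
        return self.dec(torch.cat([f_cur + change, prev_prob], dim=1))
```

At inference the module runs in a rolling fashion: each frame's logits are cached and fed back as `prev_logits` for the next frame, so the extra cost over a single-frame network is one lightweight encoding of the previous frame.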