Unsupervised learning of supervoxel embeddings for video segmentation
We present an algorithm for learning a feature representation for video segmentation. Standard video segmentation algorithms rely on similarity measurements to group related pixels; the contribution of our paper is an unsupervised method for learning the feature representation on which this similarity is computed. The representation is defined over video supervoxels: an embedding framework learns a feature mapping in an unsupervised fashion such that supervoxels with similar context receive similar embeddings. Based on the learned representation, similar supervoxels can then be merged into spatio-temporal segments. Experimental results demonstrate the effectiveness of the learned supervoxel embedding on standard benchmark data.
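The abstract does not specify the embedding objective or the merging criterion, so the following is a minimal, illustrative sketch of such a pipeline rather than the paper's actual method. It assumes that a supervoxel's "context" is its spatio-temporal adjacency, trains embeddings with a simple attract/repel objective on adjacent versus random supervoxel pairs, and forms segments by merging adjacent supervoxels whose embedding similarity exceeds a threshold. All function names, hyperparameters, and the 0.8 threshold are hypothetical.

```python
# Illustrative sketch only: context = adjacency, attract/repel embedding
# objective, greedy threshold-based merging. Not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

def train_embeddings(n_supervoxels, adjacency, dim=16, lr=0.05, epochs=50):
    """Learn embeddings so supervoxels with shared context (here: adjacent
    supervoxels) end up close, and random pairs are pushed apart."""
    emb = rng.normal(scale=0.1, size=(n_supervoxels, dim))
    for _ in range(epochs):
        for i, j in adjacency:                   # positive (context) pair
            k = rng.integers(n_supervoxels)      # random negative sample
            emb[i] += lr * (emb[j] - emb[i])     # attract the pair
            emb[j] += lr * (emb[i] - emb[j])
            emb[i] -= lr * 0.1 * (emb[k] - emb[i])  # repel the negative
    # unit-normalize so a dot product equals cosine similarity
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def merge_segments(emb, adjacency, threshold=0.8):
    """Greedily merge adjacent supervoxels with similar embeddings into
    spatio-temporal segments, using union-find to track groups."""
    parent = list(range(len(emb)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x
    for i, j in adjacency:
        if emb[i] @ emb[j] > threshold:          # cosine of unit vectors
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(emb))]    # segment label per supervoxel

# Toy usage: 6 supervoxels forming two spatio-temporal neighborhoods;
# the expected outcome is two segment groups, {0, 1, 2} and {3, 4, 5}.
adjacency = [(0, 1), (1, 2), (3, 4), (4, 5)]
emb = train_embeddings(6, adjacency)
print(merge_segments(emb, adjacency))
```

In a real system the adjacency pairs would come from a supervoxel oversegmentation of the video, and the threshold would control how aggressively weakly similar neighbors are merged; the toy example only illustrates the two-stage structure of embedding learning followed by similarity-based merging.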