Extracting Visual Patterns from Deep Learning Representations

Vector-space word representations derived from neural network models can capture linguistic regularities, enabling semantic operations through vector arithmetic. In this paper, we explore an analogous approach applied to images. We define a methodology to obtain large, sparse vectors from individual images and image classes, using a pre-trained model of the GoogLeNet architecture. We evaluate the resulting vector space after processing 20,000 ImageNet images, and find it to be highly correlated with WordNet lexical distances. Further exploration of the image representations shows how semantically similar elements cluster in that space despite large visual variance (e.g., 118 kinds of dogs), and how the space distinguishes abstract classes of objects without supervision (e.g., living things from non-living things). Finally, we consider vector arithmetic, and find it to be related to image concatenation (e.g., "horse cart - horse = rickshaw"), image overlap ("Pan...
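As a minimal sketch of the vector-arithmetic idea described above, consider cosine-similarity nearest-neighbour search over class vectors. The hand-crafted 3-D vectors below are illustrative stand-ins for the paper's large sparse GoogLeNet activations (the real class vectors would be aggregated over many ImageNet images per class); the class names and component structure are assumptions chosen so the "horse cart - horse = rickshaw" example resolves as in the abstract.

```python
import numpy as np

def normalize(v):
    # Unit-length vector, so dot products become cosine similarities.
    return v / np.linalg.norm(v)

# Toy component directions standing in for visual features.
horse = np.array([1.0, 0.0, 0.0])
cart = np.array([0.0, 1.0, 0.0])
person = np.array([0.0, 0.0, 1.0])

# Hypothetical class vectors: a horse cart combines horse and cart
# features; a rickshaw combines cart and person features.
classes = {
    "horse_cart": normalize(horse + cart),
    "horse": normalize(horse),
    "rickshaw": normalize(cart + person),
    "dog": normalize(horse * 0.5 + person * 0.5),
}

def nearest(query, vocab, exclude=()):
    # Cosine-similarity nearest neighbour over the class vectors.
    q = normalize(query)
    return max((name for name in vocab if name not in exclude),
               key=lambda name: float(q @ vocab[name]))

# Vector arithmetic in the style of "horse cart - horse = ?"
result = nearest(classes["horse_cart"] - classes["horse"],
                 classes, exclude=("horse_cart", "horse"))
print(result)  # -> rickshaw
```

Subtracting the horse vector leaves a direction dominated by the cart component, which the rickshaw vector shares; with real sparse activations the same nearest-neighbour query would run over thousands of dimensions.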