Vision-Language Fusion for Object Recognition

Associating words to visually recognized objects

2004

Using associative memories and sparse distributed representations, we have developed a system that can learn to associate words with objects, with properties such as colors, and with actions. This system is used in a robotics context to enable a robot to respond to spoken commands like "bot show plum" or "bot put apple to yellow cup". The scenario is a robot close to one or two tables on which there are certain kinds of fruit and/or other simple objects. We can demonstrate part of this scenario, in which the task is to find certain fruits in a complex visual scene according to spoken or typed commands. This involves parsing and understanding simple sentences and relating the nouns to concrete objects sensed by the camera and recognized by a neural network from the visual input.
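
As a rough, non-authoritative sketch of the kind of mechanism this abstract describes, the following Python snippet implements a Willshaw-style binary associative memory linking sparse word codes to sparse object codes; the vector sizes, codes, and threshold rule are illustrative assumptions, not the parameters of the actual system.

    import numpy as np

    # Minimal sketch of a binary (Willshaw-style) associative memory that
    # links sparse word codes to sparse object codes. Vector sizes, codes,
    # and the retrieval threshold are illustrative assumptions.
    N_WORD, N_OBJ, ACTIVE = 256, 256, 8
    rng = np.random.default_rng(0)

    def sparse_code(n_units, n_active):
        """Random binary vector with a fixed small number of active units."""
        v = np.zeros(n_units, dtype=np.uint8)
        v[rng.choice(n_units, size=n_active, replace=False)] = 1
        return v

    # One sparse code per word and per object (e.g. "plum" -> plum percept).
    words = {w: sparse_code(N_WORD, ACTIVE) for w in ["plum", "apple", "cup"]}
    objects = {o: sparse_code(N_OBJ, ACTIVE) for o in ["plum", "apple", "cup"]}

    # Hebbian (clipped) learning: store every co-occurring word/object pair.
    W = np.zeros((N_OBJ, N_WORD), dtype=np.uint8)
    for name in words:
        W |= np.outer(objects[name], words[name])

    def recall(word):
        """Retrieve the stored object code most consistent with a spoken word."""
        s = W @ words[word]                                # dendritic sums
        out = (s >= words[word].sum()).astype(np.uint8)    # threshold at input activity
        # report the stored object whose code overlaps best with the recall
        return max(objects, key=lambda o: int(out @ objects[o]))

    print(recall("plum"))   # expected: "plum"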

Making a robotic scene representation accessible to feature and label queries

2011 IEEE International Conference on Development and Learning (ICDL), 2011

We present a neural architecture for scene representation that stores semantic information about objects in the robot's workspace. We show how this representation can be queried through low-level features such as color and size, through feature conjunctions, and through symbolic labels. This is possible by binding the different feature dimensions through space and integrating these space-feature representations with an object recognition system. Queries lead to the activation of a neural representation of previously seen objects, which can then be used to drive object-oriented action. The representation is continuously linked to sensory information and autonomously updates when objects are moved or removed.
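
The following toy Python sketch illustrates the three query types described above (by feature, by feature conjunction, by symbolic label); it uses explicit records rather than the paper's neural space-feature binding, and all names and values are made up.

    from dataclasses import dataclass

    # Toy sketch of a queryable scene memory. The actual system binds
    # feature dimensions through space with neural representations; here
    # we keep explicit records so the query types are easy to see.
    @dataclass
    class SceneObject:
        label: str        # symbolic label from the recognition system
        position: tuple   # (x, y) in workspace coordinates
        color: str        # low-level feature: color category
        size: float       # low-level feature: size in cm

    scene = [
        SceneObject("apple", (0.2, 0.4), "red", 7.0),
        SceneObject("cup",   (0.5, 0.1), "yellow", 9.0),
        SceneObject("plum",  (0.7, 0.3), "red", 4.0),
    ]

    def query(scene, **criteria):
        """Return objects matching every given feature or label criterion
        (a feature conjunction when several criteria are supplied)."""
        hits = []
        for obj in scene:
            if all(getattr(obj, key) == value for key, value in criteria.items()):
                hits.append(obj)
        return hits

    print(query(scene, color="red"))                 # feature query
    print(query(scene, color="red", label="plum"))   # feature + label conjunction
    print(query(scene, label="cup"))                 # symbolic label query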

Object Recognition Using Dialogues and Semantic Anchoring

2013

This report explains in detail the implemented system, comprising a robot and a sensor network deployed in a test apartment in an elderly residence area. The report focuses on the creation and maintenance (anchoring) of the connection between the semantic information present in the dialogue and the perceived physical objects in the home. Semantic knowledge about concepts and their correlations is retrieved from online resources and ontologies, e.g. WordNet, while sensor information is provided by cameras distributed in the apartment.
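
A minimal sketch of the anchoring idea, assuming a hand-made concept-similarity table in place of the WordNet-based knowledge the system actually uses; the class and function names here are hypothetical.

    # Minimal sketch of symbol-percept anchoring: a dialogue symbol such as
    # "mug" is connected to a perceived object and the connection is kept
    # up to date as new perceptions arrive. The similarity table stands in
    # for ontology-based concept similarity; all values are made up.
    CONCEPT_SIMILARITY = {
        ("mug", "cup"): 0.9,
        ("sofa", "couch"): 0.95,
    }

    def similarity(symbol, perceived_label):
        if symbol == perceived_label:
            return 1.0
        return CONCEPT_SIMILARITY.get((symbol, perceived_label), 0.0)

    class AnchorStore:
        def __init__(self, threshold=0.8):
            self.threshold = threshold
            self.anchors = {}   # symbol -> (percept_id, last_position)

        def update(self, symbol, percepts):
            """Acquire or re-acquire the anchor for a dialogue symbol given
            the current list of (percept_id, label, position) tuples."""
            best = max(percepts, key=lambda p: similarity(symbol, p[1]), default=None)
            if best and similarity(symbol, best[1]) >= self.threshold:
                self.anchors[symbol] = (best[0], best[2])
            return self.anchors.get(symbol)

    store = AnchorStore()
    percepts = [("obj-1", "cup", (1.2, 0.3)), ("obj-2", "table", (0.0, 0.0))]
    print(store.update("mug", percepts))   # anchors "mug" to obj-1 (the cup)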

Translating Images to Words for Recognizing Objects in Large Image and Video Collections

Lecture Notes in Computer Science, 2006

We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach treats object recognition as the translation of visual elements to words, similar to the translation of text from one language to another. The visual elements, represented in feature space, are categorized into a finite set of blobs. The correspondences between the blobs and the words are learned using a method adapted from Statistical Machine Translation. Once learned, these correspondences can be used to predict words corresponding to particular image regions (region naming), to predict words associated with entire images (auto-annotation), or to associate speech transcript text with the correct video frames (video alignment). We present our results on the Corel data set, which consists of annotated images, and on the TRECVID 2004 data set, which consists of video frames associated with speech transcript text and manual annotations.
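
A compact sketch of how blob-word translation probabilities can be estimated with an IBM Model 1-style EM procedure, in the spirit of the Statistical Machine Translation adaptation the abstract describes; the toy corpus and uniform initialization are assumptions, not data from the paper.

    from collections import defaultdict

    # Sketch of IBM Model 1-style EM for learning p(word | blob).
    # Each "image" pairs a set of blob ids (quantized region features)
    # with its annotation words; the toy data below is made up.
    corpus = [
        (["blob_sky", "blob_grass"], ["sky", "grass"]),
        (["blob_sky", "blob_water"], ["sky", "water"]),
        (["blob_grass", "blob_tiger"], ["grass", "tiger"]),
    ]

    blobs = {b for blobs_, _ in corpus for b in blobs_}
    words = {w for _, words_ in corpus for w in words_}

    # Uniform initialization of translation probabilities t(word | blob).
    t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}

    for _ in range(20):                       # EM iterations
        count = defaultdict(float)            # expected counts c(word, blob)
        total = defaultdict(float)            # expected counts c(blob)
        for image_blobs, image_words in corpus:
            for w in image_words:
                norm = sum(t[b][w] for b in image_blobs)
                for b in image_blobs:         # E-step: soft alignment of w to blobs
                    p = t[b][w] / norm
                    count[(w, b)] += p
                    total[b] += p
        for b in blobs:                       # M-step: re-estimate t(word | blob)
            for w in words:
                t[b][w] = count[(w, b)] / total[b] if total[b] else t[b][w]

    # Region naming: the most probable word for a given blob.
    print(max(t["blob_tiger"], key=t["blob_tiger"].get))   # expected: "tiger"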

Object Learning with Natural Language in a Distributed Intelligent System – A Case Study of Human-Robot Interaction

Foundations and Practical Applications of Cognitive Systems and Information Processing - Proceedings of the First International Conference on Cognitive Systems and Information Processing (CSIP 2012), 2012

The development of humanoid robots, both for helping humans and for understanding the human cognitive system, is of significant interest in science and technology. How to bridge the large gap between the needs of natural human-robot interaction and the capabilities of current humanoid platforms is an important but open question. In this paper we describe a system for teaching a robot through a natural-language dialogue about its real environment, in real time. For this, we integrate a fast object recognition method for the NAO humanoid robot with a hybrid ensemble learning mechanism. A qualitative analysis shows the effectiveness of our system.

Learning to Recognize Novel Objects in One Shot through Human-Robot Interactions in Natural Language Dialogues

Being able to quickly and naturally teach robots new knowledge is critical for many future open-world human-robot interaction scenarios. In this paper we present a novel approach to using natural language context for one-shot learning of visual objects, where the robot is immediately able to recognize the described object. We describe the architectural components and demonstrate the proposed approach on a robotic platform in a proof-of-concept evaluation.
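
One plausible (assumed) realization of the idea in miniature: when the dialogue names a new object, a single feature vector of the referenced image region is stored under that name, and later recognition is nearest-neighbour matching against the stored exemplars. The feature extractor and similarity threshold below are placeholders, not the paper's architecture.

    import numpy as np

    # Sketch of one-shot object learning driven by dialogue: a single
    # exemplar feature vector is stored under the name given in the
    # utterance, and recognition is nearest-neighbour matching.
    def extract_features(image_region):
        return np.asarray(image_region, dtype=float)   # placeholder embedding

    class OneShotRecognizer:
        def __init__(self, threshold=0.8):
            self.exemplars = {}          # object name -> stored feature vector
            self.threshold = threshold   # minimum cosine similarity to accept

        def teach(self, name, image_region):
            """Called when the dialogue introduces a new object by name."""
            self.exemplars[name] = extract_features(image_region)

        def recognize(self, image_region):
            """Return the best matching known name, or None if nothing is close."""
            query = extract_features(image_region)
            best_name, best_sim = None, self.threshold
            for name, ref in self.exemplars.items():
                sim = float(query @ ref / (np.linalg.norm(query) * np.linalg.norm(ref)))
                if sim > best_sim:
                    best_name, best_sim = name, sim
            return best_name

    rec = OneShotRecognizer()
    rec.teach("medkit", [0.9, 0.1, 0.1, 0.5])        # "This white box is a medkit."
    print(rec.recognize([0.85, 0.15, 0.12, 0.48]))   # expected: "medkit"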

Mixing Hierarchical Contexts for Object Recognition

Lecture Notes in Computer Science, 2011

Robust category-level object recognition is currently a major goal for the Computer Vision community. Intra-class and pose variations, as well as background clutter and partial occlusions, are some of the main difficulties in achieving this goal. Contextual information in the form of object co-occurrences and spatial constraints has been successfully applied to reduce the inherent uncertainty of the visual world. Recently, Choi et al. [5] proposed the use of a tree-structured graphical model to capture contextual relations among objects. Under this model there is only one possible fixed contextual relation among subsets of objects. In this work we extend Choi et al.'s approach by using a mixture model to handle the case in which contextual relations among objects depend on the scene type. Our experiments highlight the advantages of our proposal, showing that the adaptive specialization of contextual relations improves object recognition and object detection performance.
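
An illustrative sketch of the mixture idea: each scene type contributes its own contextual prior over object classes, detector scores are re-weighted by the mixture, and the scene classifier's posterior supplies the mixing weights. The flat co-occurrence priors below stand in for the paper's tree-structured models, and all numbers are invented.

    # Sketch of mixing scene-dependent contextual priors. Each scene type
    # is reduced to a flat prior over object classes, and the mixture
    # re-weights raw detector scores. All numbers are illustrative.
    context_priors = {
        "kitchen": {"cup": 0.4, "stove": 0.4, "monitor": 0.05, "keyboard": 0.15},
        "office":  {"cup": 0.2, "stove": 0.02, "monitor": 0.4, "keyboard": 0.38},
    }

    def contextual_scores(detector_scores, scene_posterior):
        """Combine per-class detector scores with a scene-dependent mixture
        of contextual priors: score(c) * sum_s p(s) * prior_s(c)."""
        out = {}
        for cls, score in detector_scores.items():
            mixed_prior = sum(scene_posterior[s] * context_priors[s].get(cls, 0.0)
                              for s in scene_posterior)
            out[cls] = score * mixed_prior
        return out

    # An ambiguous detection: "stove" and "monitor" score similarly on appearance.
    detector_scores = {"stove": 0.55, "monitor": 0.50, "cup": 0.30}
    scene_posterior = {"kitchen": 0.1, "office": 0.9}   # scene classifier output

    scores = contextual_scores(detector_scores, scene_posterior)
    print(max(scores, key=scores.get))   # expected: "monitor" in an office-like scene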

Context-Dependent Multi-Cue Object Recognition

2008

Object recognition is a fundamental capability for robots that are to assist humans in useful tasks. Numerous vision-based object recognition systems have achieved fast and accurate results in constrained environments. However, by depending on visual cues alone, these techniques are susceptible to variations in object views due to size, lighting, rotation, and pose, all of which cannot be avoided in real visual data. Thus, it is widely acknowledged that the general object recognition task remains very challenging.

Grounding language in perception for scene conceptualization in autonomous robots

In order to behave autonomously, it is desirable for robots to be able to use human supervision and to learn from different input sources (perception, gestures, verbal and textual descriptions, etc.). In many machine learning tasks, the supervision is directed specifically towards machines and hence is straightforward: clearly annotated examples. However, this is not always practical, and natural language has recently been found to be the preferred interface to robots. Moreover, the supervision might only be available in a rather indirect form, which may be vague and incomplete. This is frequently the case when humans teach other humans, since they may assume a particular context and existing world knowledge. We explore this idea here in the setting of conceptualizing objects and scene layouts. Initially the robot is trained by a human to recognize some objects in the world; armed with this acquired knowledge, it sets out to explore the world and learn higher-level concepts such as static scene layouts and environment activities. Here it has to exploit its learned knowledge and ground language in perception in order to use inputs from different sources that may contain overlapping as well as novel information. When exploring, we assume that the robot is given visual input without explicit type labels for objects, and that it has access to more or less generic linguistic descriptions of the scene layout. Thus our task is to learn the spatial structure of a scene layout and, simultaneously, visual models of objects it was not trained on. In this paper, we present a cognitive architecture and learning framework for robot learning through natural human supervision and from multiple input sources by grounding language in perception.
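
A small, assumed sketch of one way such grounding could proceed: visual clusters that match already-learned object models keep their labels, and the remaining terms of the linguistic scene description are hypothesized as labels for the still-unexplained clusters. The features, models, and threshold are illustrative only.

    import numpy as np

    # Sketch of grounding an indirect scene description in perception:
    # clusters explained by known object models keep their labels, and the
    # remaining description terms become label hypotheses for the rest.
    known_models = {                 # object name -> prototype feature vector
        "monitor": np.array([0.9, 0.1, 0.3]),
        "keyboard": np.array([0.2, 0.8, 0.1]),
    }

    def match_known(cluster_feat, threshold=0.9):
        for name, proto in known_models.items():
            sim = float(cluster_feat @ proto /
                        (np.linalg.norm(cluster_feat) * np.linalg.norm(proto)))
            if sim >= threshold:
                return name
        return None

    def ground_description(description_terms, clusters):
        """clusters: list of (cluster_id, feature_vector). Returns a label
        hypothesis for every cluster, reusing known models where possible."""
        labels, unexplained = {}, []
        for cid, feat in clusters:
            name = match_known(feat)
            if name:
                labels[cid] = name
            else:
                unexplained.append(cid)
        novel_terms = [t for t in description_terms if t not in labels.values()]
        for cid, term in zip(unexplained, novel_terms):   # naive 1-to-1 assignment
            labels[cid] = term + " (hypothesis)"
        return labels

    clusters = [("c1", np.array([0.88, 0.12, 0.28])),   # resembles the monitor model
                ("c2", np.array([0.1, 0.1, 0.9]))]      # unknown object
    print(ground_description(["monitor", "mug"], clusters))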

Learning Object Models on a Robot using Visual Context and Appearance Cues

Visual object recognition is an important challenge for the widespread deployment of mobile robots in real-world domains characterized by partial observability and unforeseen dynamic changes. This paper describes an algorithm that enables robots to use motion cues to identify (and focus on) a set of interesting objects, automatically extracting appearance-based and contextual cues from a small number of images to efficiently learn representative models of these objects. Object models learned from relevant image regions consist of: (a) relative spatial arrangement of gradient features; (b) graph-based models of neighborhoods of gradient features; (c) parts-based models of image segments; (d) color distribution statistics; and (e) probabilistic models of local context. An energy minimization algorithm and a generative model of information fusion use the learned models to reliably and efficiently recognize these objects in novel scenes. All algorithms are evaluated on wheeled robots in indoor and outdoor domains.
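
A minimal sketch of the simplest ingredients mentioned above, on assumed synthetic data: frame differencing supplies the motion cue that selects a candidate object region, and a color histogram of that region serves as a crude appearance model. The paper's richer cues (gradient arrangements, parts-based models, local context) and the energy-minimization fusion are not shown here.

    import numpy as np

    # Motion cue: frame differencing picks out a candidate object region;
    # a color histogram of that region is stored as a simple appearance model.
    def moving_region(prev_frame, frame, diff_thresh=30):
        """Bounding box (rmin, rmax, cmin, cmax) of pixels that changed."""
        moved = np.abs(frame.astype(int) - prev_frame.astype(int)).max(axis=2) > diff_thresh
        rows, cols = np.where(moved)
        if rows.size == 0:
            return None
        return rows.min(), rows.max(), cols.min(), cols.max()

    def color_histogram(frame, box, bins=8):
        """Normalized joint color histogram of the region as an appearance model."""
        r0, r1, c0, c1 = box
        patch = frame[r0:r1 + 1, c0:c1 + 1].reshape(-1, 3)
        hist, _ = np.histogramdd(patch, bins=(bins, bins, bins), range=[(0, 256)] * 3)
        return hist / hist.sum()

    # Two synthetic 64x64 RGB frames: a red "object" appears in the second one.
    prev_frame = np.zeros((64, 64, 3), dtype=np.uint8)
    frame = prev_frame.copy()
    frame[20:30, 40:50] = [200, 30, 30]

    box = moving_region(prev_frame, frame)
    model = color_histogram(frame, box)
    print(box, model.shape)   # region of the new object and its 8x8x8 color model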