A Probabilistic Approach to Unsupervised Induction of Combinatory Categorial Grammar in Situated Human-Robot Interaction

Towards Understanding Object-Directed Actions: A Generative Model for Grounding Syntactic Categories of Speech through Visual Perception

Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Australia, 2018

Creating successful human-robot collaboration requires robots to have high-level cognitive functions that allow them to understand human language and actions in space. To meet this target, the elusive challenge we address in this paper is understanding object-directed actions by grounding language in visual cues that represent the dynamics of human actions on objects, object characteristics (color and geometry), and spatial relationships between objects in a tabletop scene. The proposed probabilistic framework investigates unsupervised Part-of-Speech (POS) tagging to determine the syntactic categories of words and thereby infer the grammatical structure of language. The dynamics of object-directed actions are characterized through the locations of the human arm joints while manipulating objects, modeled with a Hidden Markov Model (HMM), together with the locations of objects represented as 3D point clouds. The point clouds corresponding to segmented objects encode the geometric features and spatial semantics of referents and landmarks in the environment. The proposed Bayesian learning model is successfully evaluated through interaction experiments between a human user and a Toyota HSR robot in space.
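As a rough illustration of the HMM-based tag inference described above, the following sketch decodes the most likely tag sequence for a toy instruction using hand-picked transition and emission probabilities. In the actual framework these parameters would be learned without supervision rather than specified by hand, and the tag set, vocabulary, and probability values here are purely hypothetical.

```python
# Minimal sketch: inferring hidden POS tags for a word sequence with an HMM.
# Tag set, vocabulary, and probabilities are illustrative toy values; in the
# paper's framework these parameters would be learned unsupervised.
import numpy as np

tags = ["VERB", "DET", "NOUN", "PREP"]          # hypothetical hidden states
vocab = {"put": 0, "the": 1, "cup": 2, "on": 3, "box": 4}

pi = np.array([0.7, 0.2, 0.05, 0.05])           # initial tag distribution
A = np.array([                                   # tag transition probabilities
    [0.05, 0.60, 0.25, 0.10],
    [0.02, 0.03, 0.90, 0.05],
    [0.05, 0.10, 0.10, 0.75],
    [0.05, 0.70, 0.20, 0.05],
])
B = np.array([                                   # word emission probabilities
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.02, 0.90, 0.03, 0.02, 0.03],
    [0.05, 0.05, 0.45, 0.05, 0.40],
    [0.05, 0.05, 0.05, 0.80, 0.05],
])

def viterbi(obs):
    """Most likely tag sequence for an observed word-index sequence."""
    T, N = len(obs), len(tags)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        # scores[i, j]: best score of being in tag i at t-1 and tag j at t
        scores = delta[t - 1][:, None] + np.log(A) + np.log(B[:, obs[t]])[None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

sentence = ["put", "the", "cup", "on", "the", "box"]
print(viterbi([vocab[w] for w in sentence]))
# e.g. ['VERB', 'DET', 'NOUN', 'PREP', 'DET', 'NOUN']
```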

Grounding language in perception for scene conceptualization in autonomous robots

In order to behave autonomously, it is desirable for robots to have the ability to use human supervision and learn from different input sources (perception, gestures, verbal and textual descriptions, etc.). In many machine learning tasks, the supervision is directed specifically towards machines and hence takes the form of straightforward, clearly annotated examples. However, this is not always practical, and recent findings indicate that the most preferred interface to robots is natural language. Moreover, the supervision might only be available in a rather indirect form, which may be vague and incomplete. This is frequently the case when humans teach other humans, since they may assume a particular context and existing world knowledge. We explore this idea here in the setting of conceptualizing objects and scene layouts. Initially the robot is trained by a human to recognize some objects in the world; armed with this acquired knowledge, it sets out to explore the world and learn higher-level concepts such as static scene layouts and environment activities. Here it has to exploit its learned knowledge and ground language in perception to use inputs from different sources that may contain overlapping as well as novel information. When exploring, we assume that the robot is given visual input without explicit type labels for objects, and that it has access to more or less generic linguistic descriptions of the scene layout. Thus our task is to learn the spatial structure of a scene layout and, simultaneously, visual models of objects it was not trained on. In this paper, we present a cognitive architecture and learning framework for robot learning through natural human supervision, using multiple input sources by grounding language in perception.

Natural Language Grounding and Grammar Induction for Robotic Manipulation Commands

2017

We present a cognitively plausible system capable of acquiring knowledge in language and vision from pairs of short video clips and linguistic descriptions. The aim of this work is to teach a robot manipulator how to execute natural language commands by demonstration. This is achieved by first learning a set of visual 'concepts' that abstract the visual feature spaces into concepts with human-level meaning; second, learning the mapping/grounding between words and the extracted visual concepts; and third, inducing grammar rules via a semantic representation known as Robot Control Language (RCL). We evaluate our approach against state-of-the-art supervised and unsupervised grounding and grammar induction systems, and show that a robot can learn to execute never-seen-before commands from pairs of unlabelled linguistic and visual inputs.
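A minimal sketch of the second step, mapping words to extracted visual concepts, is given below; the commands and concept labels are hypothetical toy data, and ranking concepts by co-occurrence with a word across clips is only a rough stand-in for the paper's grounding procedure.

```python
# Minimal sketch of the word-to-concept grounding step: each training pair
# couples the words of a command with the visual concepts detected in the clip.
# Commands and concept labels below are hypothetical toy data.
from collections import Counter, defaultdict

pairs = [
    ("pick up the red block",   {"action:grasp", "color:red", "shape:block"}),
    ("pick up the green block", {"action:grasp", "color:green", "shape:block"}),
    ("move the red ball left",  {"action:move", "color:red", "shape:ball", "dir:left"}),
]

cooc = defaultdict(Counter)   # word -> concept -> co-occurrence count
word_count = Counter()
for command, concepts in pairs:
    for word in command.split():
        word_count[word] += 1
        for c in concepts:
            cooc[word][c] += 1

def grounding(word):
    """Concepts ranked by P(concept | word), estimated from co-occurrence."""
    return [(c, n / word_count[word]) for c, n in cooc[word].most_common()]

print(grounding("red"))    # 'color:red' co-occurs with every use of "red"
print(grounding("block"))  # 'shape:block' and 'action:grasp' tie here, so more
                           # situations are needed to disambiguate
```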

Natural Language Acquisition and Grounding for Embodied Robotic Systems

Proceedings of the AAAI Conference on Artificial Intelligence

We present a cognitively plausible novel framework capable of learning the grounding in visual semantics and the grammar of natural language commands given to a robot in a tabletop environment. The input to the system consists of video clips of a manually controlled robot arm, paired with natural language commands describing the action. No prior knowledge is assumed about the meaning of words or the structure of the language, except that there are different classes of words (corresponding to observable actions, spatial relations, and objects and their observable properties). The learning process automatically clusters the continuous perceptual spaces into concepts corresponding to linguistic input. A novel relational graph representation is used to build connections between language and vision. In addition to grounding language in perception, the system induces a set of probabilistic grammar rules. The knowledge learned is used to parse new commands involving previously un...
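The clustering of continuous perceptual spaces into concepts could look roughly like the sketch below, which groups toy hue measurements with k-means; the feature choice, cluster count, and use of scikit-learn are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch of clustering a continuous perceptual space into discrete
# concepts, here the hue of tracked objects; features and cluster count are
# illustrative, not the paper's actual representation.
import numpy as np
from sklearn.cluster import KMeans

# Toy hue values (degrees) observed for objects across the video clips.
hues = np.array([[2.0], [5.0], [8.0],        # roughly "red"
                 [118.0], [122.0], [125.0],  # roughly "green"
                 [238.0], [242.0], [240.0]]) # roughly "blue"

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(hues)

# Each cluster index acts as an unnamed colour concept; the grounding step
# later associates these indices with words such as "red" or "blue".
print(kmeans.labels_)              # cluster assignment per observation
print(kmeans.predict([[121.0]]))   # a new observation falls in the "green" cluster
```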

A probabilistic approach to learning a visually grounded language model through human-robot interaction

2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010

Language is among the most fascinating and complex cognitive activities, and it develops rapidly from the early months of an infant's life. The aim of the present work is to provide a humanoid robot with the cognitive, perceptual and motor skills fundamental for the acquisition of a rudimentary form of language. We present a novel probabilistic model, inspired by findings in the cognitive sciences, that associates spoken words with their perceptually grounded meanings. The main focus is on acquiring the meaning of various perceptual categories (e.g. red, blue, circle, above, etc.), rather than specific world entities (e.g. an apple, a toy, etc.). Our probabilistic model is based on a variant of the multi-instance learning technique and enables a robotic platform to learn grounded meanings of adjective/noun terms. The system can be used to understand and generate appropriate natural language descriptions of real objects in a scene, and it has been successfully tested on the NAO humanoid robotic platform.
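To illustrate the generation direction mentioned at the end of the abstract, the sketch below composes a description from already-grounded perceptual categories; the category names, word mappings, and template are hypothetical placeholders rather than the paper's model.

```python
# Minimal sketch of the generation direction: once perceptual categories are
# grounded in words, an object's percepts can be verbalized. The category
# names and word mappings are hypothetical placeholders.
category_to_word = {
    "color:red": "red", "color:blue": "blue",
    "shape:circle": "circle", "shape:square": "square",
    "rel:above": "above", "rel:left_of": "to the left of",
}

def describe(obj_categories, landmark_phrase=None, relation=None):
    """Compose a simple noun phrase from grounded perceptual categories."""
    adjectives = [category_to_word[c] for c in obj_categories if c.startswith("color:")]
    nouns = [category_to_word[c] for c in obj_categories if c.startswith("shape:")]
    phrase = "the " + " ".join(adjectives + nouns)
    if relation and landmark_phrase:
        phrase += f" {category_to_word[relation]} {landmark_phrase}"
    return phrase

print(describe({"color:red", "shape:circle"},
               landmark_phrase="the blue square",
               relation="rel:above"))
# -> "the red circle above the blue square"
```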

Generating Grammars for Natural Language Understanding from Knowledge about Actions and Objects

International Conference on Robotics and Biomimetics (ROBIO), 2015

Many applications in the fields of Service Robotics and Industrial Human-Robot Collaboration require interaction with a human in a potentially unstructured environment. In many cases, a natural language interface can be helpful, but it requires powerful means of knowledge representation and processing, e.g., using ontologies and reasoning. In this paper we present a framework for the automatic generation of natural language grammars from ontological descriptions of robot tasks and interaction objects, and their use in a natural language interface. Robots can use this interface component locally or even share it through the RoboEarth framework in order to benefit from features such as referent grounding, ambiguity resolution, task identification, and task assignment.
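A minimal sketch of what grammar generation from such knowledge might look like is shown below, using a toy ontology of actions and objects and emitting simple context-free productions; the ontology content and grammar fragment are illustrative only and do not reflect the paper's ontological representation.

```python
# Minimal sketch of deriving grammar productions from a task/object ontology.
# The ontology content and the resulting CFG fragment are illustrative; the
# paper generates its grammars from richer ontological descriptions.
ontology = {
    "actions": {"grasp": ["grasp", "pick up"], "deliver": ["bring", "deliver"]},
    "objects": {"cup": ["cup", "mug"], "bottle": ["bottle"]},
}

def generate_grammar(onto):
    rules = ["COMMAND -> ACTION 'the' OBJECT"]
    for alternatives in onto["actions"].values():
        for phrase in alternatives:
            terminals = " ".join(f"'{tok}'" for tok in phrase.split())
            rules.append(f"ACTION -> {terminals}")
    for alternatives in onto["objects"].values():
        for phrase in alternatives:
            rules.append(f"OBJECT -> '{phrase}'")
    return "\n".join(rules)

print(generate_grammar(ontology))
# COMMAND -> ACTION 'the' OBJECT
# ACTION -> 'grasp'
# ACTION -> 'pick' 'up'
# ...
```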

A Bayesian Approach to Phrase Understanding through Cross-Situational Learning

Proceedings of the International Workshop on Visually Grounded Interaction and Language (ViGIL), in Conjunction with the 32nd Conference on Neural Information Processing Systems (NeurIPS), Canada, 2018

In this paper, we present an unsupervised probabilistic framework for grounding words (e.g., nouns, verbs, adjectives, and prepositions) through visual perception, and we discuss grammar induction in situated human-robot interaction, with the objective of enabling a robot to understand the underlying syntactic structure of human instructions so as to collaborate with users in space efficiently.
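A toy sketch of the cross-situational idea behind this framework is given below: a uniform prior over candidate meanings of a word is reweighted by the concepts observed each time the word is heard, so the posterior concentrates on the consistent meaning. The candidate meanings, situations, and likelihood values are assumptions for illustration.

```python
# Minimal sketch of cross-situational learning: the posterior over candidate
# meanings of a word is reweighted by the concepts present each time the word
# is heard. Candidate meanings, situations, and likelihoods are toy values.
candidates = ["color:red", "color:blue", "shape:ball", "rel:on"]
posterior = {m: 1.0 / len(candidates) for m in candidates}   # uniform prior

# Concepts visible in each situation where the word "red" was uttered.
situations = [
    {"color:red", "shape:ball"},
    {"color:red", "rel:on"},
    {"color:red", "color:blue", "shape:ball"},
]

for present in situations:
    # A candidate meaning is strongly supported if its concept is observed,
    # weakly otherwise (allowing for noise and missed detections).
    likelihood = {m: (0.9 if m in present else 0.1) for m in candidates}
    unnorm = {m: posterior[m] * likelihood[m] for m in candidates}
    z = sum(unnorm.values())
    posterior = {m: p / z for m, p in unnorm.items()}

print(max(posterior, key=posterior.get))   # -> 'color:red'
print(posterior)
```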

A Generative Framework for Multimodal Learning of Spatial Concepts and Object Categories: An Unsupervised Part-of-Speech Tagging and 3D Visual Perception Based Approach

Proceedings of the 7th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EPIROB), Portugal, 2017

Future human-robot collaboration will employ language to instruct a robot about specific tasks to perform in its surroundings. This requires the robot to be able to associate spatial knowledge with language in order to understand the details of an assigned task and behave appropriately in the context of interaction. In this paper, we propose a probabilistic framework for learning the meaning of spatial language concepts (spatial prepositions) and object categories based on visual cues representing spatial layouts and geometric characteristics of objects in a tabletop scene. The model investigates unsupervised Part-of-Speech (POS) tagging through a Hidden Markov Model (HMM) that infers the hidden tags corresponding to words. Spatial configurations and geometric characteristics of objects on the tabletop are described through 3D point cloud information that encodes the spatial semantics and categories of referents and landmarks in the environment. The proposed model is evaluated through human user interaction with a Toyota HSR robot, and the obtained results show that the model enables the robot to successfully engage in spatial interaction with the user.
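As a rough illustration of the kind of spatial cue that 3D point clouds can provide, the sketch below compares object centroids and maps the dominant offset to a preposition-like label; the thresholds, labels, and toy clouds are assumptions and not the paper's learned spatial concepts.

```python
# Minimal sketch of extracting a spatial-relation cue from segmented point
# clouds: compare object centroids and map the dominant offset to a
# preposition-like label. Thresholds and labels are illustrative only.
import numpy as np

def spatial_relation(referent_cloud, landmark_cloud):
    """Coarse spatial label for the referent relative to the landmark."""
    offset = referent_cloud.mean(axis=0) - landmark_cloud.mean(axis=0)
    dx, dy, dz = offset
    if dz > 0.05 and abs(dx) < 0.10 and abs(dy) < 0.10:
        return "on"
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "in front of" if dy > 0 else "behind"

# Toy point clouds (N x 3 arrays of x, y, z in metres).
rng = np.random.default_rng(0)
landmark = rng.normal([0.50, 0.00, 0.75], 0.01, size=(200, 3))   # a box
referent = rng.normal([0.50, 0.00, 0.85], 0.01, size=(200, 3))   # a cup above it

print(spatial_relation(referent, landmark))   # -> "on"
```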