End-to-end optimization of goal-driven and visually grounded dialogue systems (original) (raw)

Generative Deep Neural Networks for Dialogue: A Short Review

Researchers have recently started investigating deep neural networks for dialogue applications. In particular, generative sequence-to-sequence (Seq2Seq) models have shown promising results for unstructured tasks, such as word-level dialogue response generation. The hope is that such models will be able to leverage massive amounts of data to learn meaningful natural language representations and response generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting. An important challenge is to develop models that can effectively incorporate dialogue context and generate meaningful and diverse responses. In support of this goal, we review recently proposed models based on generative encoder-decoder neural network architectures, and show that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure.

Deep Reinforcement Learning for Dialogue Systems with Dynamic User Goals

2020

Dialogue systems have recently become a widely used system across the world. Some of the functionality offered includes application user interfacing, social conversation, data interaction, and task completion. Most recently, dialogue systems have been developed to autonomously and intelligently interact with users to complete complex tasks in diverse operational spaces. This kind of dialogue system can interact with users to complete tasks such as making a phone call, ordering items online, searching the internet for a question, and more. These systems are typically created by training a machine learning model with example conversational data. One of the existing problems with training these systems is that they require large amounts of realistic user data, which can be challenging to collect and label in large quantities. Our research focuses on modifications to user simulators that "change their mind" mid-episode with the goal of training more robust dialogue agents. We ...

Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models

Proceedings of the AAAI Conference on Artificial Intelligence

We investigate the task of building open domain, conversational dialogue systems based on large dialogue corpora using generative models. Generative models produce system responses that are autonomously generated word-by-word, opening up the possibility for realistic, flexible interactions. In support of this goal, we extend the recently proposed hierarchical recurrent encoder-decoder neural network to the dialogue domain, and demonstrate that this model is competitive with state-of-the-art neural language models and back-off n-gram models. We investigate the limitations of this and similar approaches, and show how its performance can be improved by bootstrapping the learning from a larger question-answer pair corpus and from pretrained word embeddings.

Recurrent Attention Network with Reinforced Generator for Visual Dialog

ACM Transactions on Multimedia Computing, Communications, and Applications, 2020

In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Since the question and the image are usually very complex, which makes it difficult for the question to be grounded with a single glimpse, the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm, which suffers from the lack of sentence-level training...

Semantic Guidance of Dialogue Generation with Reinforcement Learning

2020

Neural encoder-decoder models have shown promising performance for human-computer dialogue systems over the past few years. However, due to the maximum-likelihood objective for the decoder, the generated responses are often universal and safe to the point that they lack meaningful information and are no longer relevant to the post. To address this, in this paper, we propose semantic guidance using reinforcement learning to ensure that the generated responses indeed include the given or predicted semantics and that these semantics do not appear repeatedly in the response. Synsets, which comprise sets of manually defined synonyms, are used as the form of assigned semantics. For a given/assigned/predicted synset, only one of its synonyms should appear in the generated response; this constitutes a simple but effective semantic-control mechanism. We conduct both quantitative and qualitative evaluations, which show that the generated responses are not only higher-quality but also reflect ...

Neural, Neural Everywhere: Controlled Generation Meets Scaffolded, Structured Dialogue∗

2021

In this paper, we present the second iteration of Chirpy Cardinal, an open-domain dialogue agent developed for the Alexa Prize SGC4 competition. Building on the success of the SGC3 Chirpy, we focus on improving conversational flexibility, initiative, and coherence. We introduce a variety of methods for controllable neural generation, ranging from prefix-based neural decoding over a symbolic scaffolding, to pure neural modules, to a novel hybrid infilling-based method that combines the best of both worlds. Additionally, we enhance previous news, music and movies modules with new APIs, as well as make major improvements in entity linking, topical transitions, and latency. Finally, we expand the variety of responses via new modules that focus on personal issues, sports, food, and even extraterrestrial life! These components come together to create a refreshed Chirpy Cardinal that is able to initiate conversations filled with interesting facts, engaging topics, and heartfelt responses.

Sample Efficient Deep Reinforcement Learning for Dialogue Systems With Large Action Spaces

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018

In spoken dialogue systems, we aim to deploy artificial intelligence to build automated dialogue agents that can converse with humans. A part of this effort is the policy optimisation task, which attempts to find a policy describing how to respond to humans, in the form of a function taking the current state of the dialogue and returning the response of the system. In this paper, we investigate deep reinforcement learning approaches to solve this problem. Particular attention is given to actor-critic methods, off-policy reinforcement learning with experience replay, and various methods aimed at reducing the bias and variance of estimators. When combined, these methods result in the previously proposed ACER algorithm that gave competitive results in gaming environments. These environments however are fully observable and have a relatively small action set so in this paper we examine the application of ACER to dialogue policy optimisation. We show that this method beats the current state-of-the-art in deep learning approaches for spoken dialogue systems. This not only leads to a more sample efficient algorithm that can train faster, but also allows us to apply the algorithm in more difficult environments than before. We thus experiment with learning in a very large action space, which has two orders of magnitude more actions than previously considered. We find that ACER trains significantly faster than the current state-ofthe-art. Index Terms-deep reinforcement learning, spoken dialogue systems, Gaussian processes.

Learning to Plan and Realize Separately for Open-Ended Dialogue Systems

Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

Achieving true human-like ability to conduct a conversation remains an elusive goal for open-ended dialogue systems. We posit this is because extant approaches towards natural language generation (NLG) are typically construed as end-to-end architectures that do not adequately model human generation processes. To investigate, we decouple generation into two separate phases: planning and realization. In the planning phase, we train two planners to generate plans for response utterances. The realization phase uses response plans to produce an appropriate response. Through rigorous evaluations, both automated and human, we demonstrate that decoupling the process into planning and realization performs better than an end-to-end approach.

A Context-aware Convolutional Natural Language Generation model for Dialogue Systems

Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, 2018

Natural language generation (NLG) is an important component in spoken dialog systems (SDSs). A model for NLG involves sequence to sequence learning. State-of-the-art NLG models are built using recurrent neural network (RNN) based sequence to sequence models (Dušek and Jurcicek, 2016a). Convolutional sequence to sequence based models have been used in the domain of machine translation but their application as natural language generators in dialogue systems is still unexplored. In this work, we propose a novel approach to NLG using convolutional neural network (CNN) based sequence to sequence learning. CNN-based approach allows to build a hierarchical model which encapsulates dependencies between words via shorter path unlike RNNs. In contrast to recurrent models, convolutional approach allows for efficient utilization of computational resources by parallelizing computations over all elements, and eases the learning process by applying constant number of nonlinearities. We also propose to use CNN-based reranker for obtaining responses having semantic correspondence with input dialogue acts. The proposed model is capable of entrainment. Studies using a standard dataset shows the effectiveness of the proposed CNN-based approach to NLG.

Improving Generative Visual Dialog by Answering Diverse Questions

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al., 2017b) has explored a Q-BOT-A-BOT image-guessing game and shown that this 'self-talk' approach can lead to improved performance at the downstream dialogconditioned image-guessing task. However, this improvement saturates and starts degrading after a few rounds of interaction, and does not lead to a better Visual Dialog model. We find that this is due in part to repeated interactions between Q-BOT and A-BOT during selftalk, which are not informative with respect to the image. To improve this, we devise a simple auxiliary objective that incentivizes Q-BOT to ask diverse questions, thus reducing repetitions and in turn enabling A-BOT to explore a larger state space during RL i.e. be exposed to more visual concepts to talk about, and varied questions to answer. We evaluate our approach via a host of automatic metrics and human studies, and demonstrate that it leads to better dialog, i.e. dialog that is more diverse (i.e. less repetitive), consistent (i.e. has fewer conflicting exchanges), fluent (i.e. more human-like), and detailed, while still being comparably image-relevant as prior work and ablations. Our code is publicly available at github.com/vmurahari3/visdial-diversity.