Wenchao Du | University of Waterloo

Papers by Wenchao Du

Avoiding Overlap in Data Augmentation for AMR-to-Text Generation

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021

Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if the additional data overlaps with the target-side evaluation data, then training on the additional data amounts to training on answers from the test set. This inflates scores relative to real-world testing scenarios and confounds comparisons between models. We study the AMR dataset and Gigaword, which is popularly used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.
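The overlap-exclusion idea lends itself to a concrete illustration. Below is a minimal Python sketch of an n-gram-based filter between an augmentation corpus and evaluation targets; the n-gram criterion and all function names are illustrative assumptions, not the authors' actual procedure.

```python
# Minimal sketch of overlap filtering between an augmentation corpus and
# evaluation targets, as the abstract describes at a high level. The
# 4-gram criterion here is an illustrative assumption, not the authors'
# actual method.

def ngrams(text: str, n: int = 4) -> set:
    """Return the set of word n-grams in a sentence."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_overlap(extra_corpus, eval_targets, n=4):
    """Drop augmentation sentences sharing any n-gram with an eval target."""
    eval_ngrams = set()
    for target in eval_targets:
        eval_ngrams |= ngrams(target, n)
    return [s for s in extra_corpus if not (ngrams(s, n) & eval_ngrams)]

# Example: the second sentence shares a 4-gram with the eval target
# and would be excluded from the augmentation data.
eval_targets = ["the president met the delegation on tuesday"]
extra = ["markets rallied after the announcement",
         "the president met the delegation late in the day"]
print(filter_overlap(extra, eval_targets))  # keeps only the first sentence
```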

Learning to Order Graph Elements with Application to Multilingual Surface Realization

Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), 2019

Top-Down Structurally-Constrained Neural Response Generation with Lexicalized Probabilistic Context-Free Grammar

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019

We consider neural language generation under a novel problem setting: generating the words of a sentence according to the order of their first appearance in its lexicalized PCFG parse tree, in a depth-first, left-to-right manner. Unlike previous tree-based language generation methods, our approach (i) is top-down and (ii) explicitly generates syntactic structure at the same time. In addition, our method combines a neural model with a symbolic approach: word choice at each step is constrained by its predicted syntactic function. We applied our model to the task of dialog response generation and found that it significantly improves over a sequence-to-sequence baseline in terms of diversity and relevance. We also investigated the effect of lexicalization on language generation, and found that lexicalization schemes that give priority to content words have certain advantages over those focusing on dependency relations.
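As a concrete illustration of the generation order described above, here is a minimal sketch that linearizes a lexicalized parse tree depth-first, emitting each head word at its first appearance. The tree encoding and all names are illustrative assumptions, not the paper's model.

```python
# Minimal sketch of the generation order the abstract describes: words
# emitted in order of first appearance during a depth-first, left-to-right
# walk of a lexicalized parse tree. The tree encoding is an illustrative
# assumption, not the paper's actual data structure.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str            # nonterminal, e.g. "S" or "VP"
    head: str             # lexical head word attached to this node
    children: list = field(default_factory=list)

def dfs_word_order(node, seen=None):
    """Yield head words in order of first appearance, depth-first.

    Real sentences need token positions to distinguish repeated words;
    that bookkeeping is omitted here for brevity.
    """
    if seen is None:
        seen = set()
    if node.head not in seen:
        seen.add(node.head)
        yield node.head
    for child in node.children:
        yield from dfs_word_order(child, seen)

# "dogs chase cats": the head of S and VP is "chase", so it surfaces
# before the subject under this top-down order.
tree = Node("S", "chase", [
    Node("NP", "dogs"),
    Node("VP", "chase", [Node("NP", "cats")]),
])
print(list(dfs_word_order(tree)))  # ['chase', 'dogs', 'cats']
```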

Data Augmentation for Neural Online Chats Response Selection

Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, 2018

Data augmentation seeks to manipulate the available training data to improve the generalization ability of models. We investigate two data augmentation proxies, permutation and flipping, for the neural dialog response selection task, on various models over multiple datasets in both Chinese and English. Unlike standard data augmentation techniques, our method combines the original and synthesized data at prediction time. Empirical results show that our approach gains 1 to 3 recall-at-1 points over baseline models in both full-scale and small-scale settings.
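A minimal sketch of combining original and synthesized inputs at prediction time follows. The definitions of permutation and flipping used here (reordering context utterances, and swapping context/response roles) are guesses for illustration only, and `score` stands in for any trained matching model.

```python
# Minimal sketch of test-time combination of original and synthesized
# inputs for response selection. The exact meanings of "permutation" and
# "flipping" below are illustrative guesses, not the paper's definitions.

import itertools

def score(context, response):
    """Placeholder matching model: word overlap between context and response."""
    ctx_words = set(" ".join(context).split())
    return len(ctx_words & set(response.split()))

def combined_score(context, response):
    """Average the model score over the original and augmented views."""
    views = [(context, response)]
    # Permutation proxy: reorder the context utterances (fine for short contexts).
    for perm in itertools.permutations(context):
        views.append((list(perm), response))
    # Flipping proxy: swap context and response roles.
    views.append(([response], " ".join(context)))
    return sum(score(c, r) for c, r in views) / len(views)

context = ["where is my order", "it shipped yesterday"]
candidates = ["your order shipped yesterday", "have a nice day"]
best = max(candidates, key=lambda r: combined_score(context, r))
print(best)  # "your order shipped yesterday"
```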

Discovering Conversational Dependencies between Messages in Dialogs

Proceedings of the AAAI Conference on Artificial Intelligence, 2017

We investigate the task of inferring conversational dependencies between messages in one-on-one online chat, which has become one of the most popular forms of customer service. We propose a novel probabilistic classifier that leverages conversational, lexical, and semantic information. The approach is evaluated empirically on a set of customer service chat logs from a Chinese e-commerce website and outperforms heuristic baselines.
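To make the feature-based formulation concrete, here is a minimal sketch of a probabilistic classifier over message pairs using conversational and lexical features (a semantic feature such as embedding similarity would slot into the same vector). The feature set and the toy dialog are illustrative assumptions, not the paper's actual model or data.

```python
# Minimal sketch of a probabilistic classifier over message pairs, in the
# spirit of the abstract. Features and data are illustrative assumptions.

from sklearn.linear_model import LogisticRegression

def features(msg_a, msg_b):
    """Feature vector for 'does msg_b depend on msg_a?'."""
    gap = msg_b["turn"] - msg_a["turn"]              # conversational: distance
    same_speaker = float(msg_a["speaker"] == msg_b["speaker"])
    wa, wb = set(msg_a["text"].split()), set(msg_b["text"].split())
    overlap = len(wa & wb) / max(len(wa | wb), 1)    # lexical: Jaccard overlap
    return [gap, same_speaker, overlap]

# Toy dialog: message 1 replies to message 0; message 2 starts a new thread.
dialog = [
    {"turn": 0, "speaker": "user",  "text": "my package is late"},
    {"turn": 1, "speaker": "agent", "text": "sorry your package is delayed"},
    {"turn": 2, "speaker": "user",  "text": "also how do I change my address"},
]
X = [features(dialog[0], dialog[1]), features(dialog[0], dialog[2])]
y = [1, 0]

clf = LogisticRegression().fit(X, y)
# Probability that message 2 depends on message 1.
print(clf.predict_proba([features(dialog[1], dialog[2])])[0, 1])
```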

Boosting Dialog Response Generation

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Neural models have become one of the most important approaches to dialog response generation. However, they still tend to generate the most common and generic responses in the corpus. To address this problem, we designed an iterative training process and an ensemble method based on boosting. We combined our method with different training and decoding paradigms as the base model, including mutual-information-based decoding and reward-augmented maximum likelihood learning. Empirical results show that our approach can significantly improve the diversity and relevance of the responses generated by all base models, backed by both objective measurements and human evaluation.
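The boosting loop can be illustrated with a short sketch: train base models iteratively, upweight examples the current model handles poorly, and average the ensemble at prediction time. The toy model and the loss proxy below are illustrative stand-ins, not the paper's neural responders or its exact reweighting scheme.

```python
# Minimal sketch of a boosting-style training loop for response models:
# each round trains on reweighted data, then upweights poorly handled
# examples. The toy model and loss proxy are illustrative assumptions.

class ToyModel:
    """Scores a response by weighted co-occurrence seen in training."""
    def __init__(self, data, weights):
        self.table = {}
        for (ctx, resp), w in zip(data, weights):
            self.table[(ctx, resp)] = self.table.get((ctx, resp), 0.0) + w

    def score(self, ctx, resp):
        return self.table.get((ctx, resp), 0.0)

def boost(data, rounds=3):
    weights = [1.0] * len(data)
    models = []
    for _ in range(rounds):
        model = ToyModel(data, weights)
        # Loss proxy: a low score means the model handles the pair poorly.
        losses = [1.0 / (1.0 + model.score(c, r)) for c, r in data]
        total = sum(losses)
        # Upweight the examples the newest model got wrong.
        weights = [w * (1.0 + l / total) for w, l in zip(weights, losses)]
        models.append(model)
    return models

def ensemble_score(models, ctx, resp):
    """Average response scores across boosted models at decoding time."""
    return sum(m.score(ctx, resp) for m in models) / len(models)

data = [("hi", "hello"), ("hi", "hello"), ("how are you", "fine thanks")]
models = boost(data)
print(ensemble_score(models, "how are you", "fine thanks"))
```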
