Guanglu Wan - Profile on Academia.edu

Papers by Guanglu Wan

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

arXiv (Cornell University), May 27, 2024

Parameter quantization for Large Language Models (LLMs) has recently attracted increasing attention as a way to reduce memory costs and improve computational efficiency. Although early approaches have been widely adopted, existing methods suffer from poor performance in low-bit (such as 2- to 3-bit) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework that introduces three types of adaptive strategies for LLM quantization. First, we propose a K-Means clustering based algorithm that dynamically generates quantization centroids for each column of a parameter matrix. Second, we design an outlier-guided adaptive precision search strategy that dynamically assigns varying bit-widths to different columns. Finally, a dynamic outlier reservation scheme is developed to retain some parameters in their original floating-point precision, in exchange for boosted model performance. Experiments on various mainstream open-source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at .
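
A minimal sketch of CLAQ's first ingredient, per-column K-Means quantization of a weight matrix, under the assumption that each column gets its own centroid set as the abstract describes. Function and variable names are illustrative, not the paper's implementation.

```python
# Hedged sketch: per-column K-Means weight quantization.
import numpy as np
from sklearn.cluster import KMeans

def quantize_column(col: np.ndarray, n_bits: int = 3) -> np.ndarray:
    """Replace each weight in one column by its nearest K-Means centroid."""
    k = 2 ** n_bits                                 # e.g. 8 centroids for 3-bit codes
    km = KMeans(n_clusters=k, n_init=4).fit(col.reshape(-1, 1))
    codes = km.predict(col.reshape(-1, 1))          # integer code per weight
    return km.cluster_centers_[codes].reshape(-1)   # dequantized column

def quantize_matrix(W: np.ndarray, n_bits: int = 3) -> np.ndarray:
    # Each column is clustered independently, so centroids adapt column-wise.
    return np.stack([quantize_column(W[:, j], n_bits)
                     for j in range(W.shape[1])], axis=1)
```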

ECOD: A Multi-modal Dataset for Intelligent Adjudication of E-Commerce Order Disputes

Lecture Notes in Computer Science, Dec 31, 2022

Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter

arXiv (Cornell University), Sep 16, 2023

Multilingual intelligent assistants, such as ChatGPT, have recently gained popularity. To further expand the applications of multilingual artificial intelligence (AI) assistants and facilitate international communication, it is essential to enhance the performance of multilingual speech recognition, a crucial component of speech interaction. In this paper, we propose two simple and parameter-efficient methods, language prompt tuning and frame-level language adapter, to enhance language-configurable and language-agnostic multilingual speech recognition, respectively. Additionally, we explore the feasibility of integrating these two approaches using parameter-efficient fine-tuning methods. Our experiments demonstrate significant performance improvements across seven languages using our proposed methods.
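
A hedged sketch of what a frame-level language adapter could look like: a lightweight bottleneck applied to every encoder frame, mixed by per-frame language posteriors rather than a single utterance-level choice. Module names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrameLevelLanguageAdapter(nn.Module):
    def __init__(self, dim: int = 256, bottleneck: int = 64, n_langs: int = 7):
        super().__init__()
        # One tiny adapter per language; only these weights need fine-tuning.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_langs))
        self.lang_head = nn.Linear(dim, n_langs)  # frame-wise language logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim). Mix adapter outputs by per-frame
        # language probabilities, keeping the method language-agnostic.
        w = self.lang_head(x).softmax(dim=-1)                       # (B, T, L)
        outs = torch.stack([a(x) for a in self.adapters], dim=-1)   # (B, T, D, L)
        return x + (outs * w.unsqueeze(2)).sum(-1)                  # residual adapter
```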

CPPF: A contextual and post-processing-free model for automatic speech recognition

arXiv (Cornell University), Sep 12, 2023

ASR systems have become increasingly widespread in recent years. However, their textual outputs often require post-processing before they can be put to practical use. To address this issue, we draw inspiration from the multifaceted capabilities of LLMs and Whisper, and focus on integrating multiple text-processing tasks related to speech recognition into the ASR model itself. This integration not only shortens the multi-stage pipeline but also prevents the propagation of cascading errors, yielding direct generation of post-processed text. In this study, we focus on ASR-related processing tasks, including contextual ASR and multiple ASR post-processing tasks. To achieve this objective, we introduce the CPPF model, which offers a versatile and highly effective alternative to ASR processing. CPPF seamlessly integrates these tasks without any significant loss in recognition performance.
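
The abstract does not spell out the mechanism, so the following is only a hedged sketch of one common recipe for folding several text-processing tasks into a single ASR decoder: prepend a task token to the decoder prompt, Whisper-style. Every token name here is hypothetical.

```python
# Hypothetical task tokens selecting which post-processed text to generate.
TASK_TOKENS = {
    "transcribe": "<|transcribe|>",   # plain ASR output
    "punctuate": "<|punct|>",         # ASR + punctuation restoration
    "itn": "<|itn|>",                 # ASR + inverse text normalization
    "contextual": "<|context|>",      # ASR biased by a hotword list
}

def build_decoder_prompt(task: str, hotwords: list[str] | None = None) -> str:
    """Compose the decoder prefix that tells one shared model which
    post-processed variant of the transcript to emit directly."""
    prompt = TASK_TOKENS[task]
    if task == "contextual" and hotwords:
        prompt += " " + " ".join(hotwords) + " <|sep|>"
    return prompt

print(build_decoder_prompt("contextual", ["AISHELL", "Tedlium"]))
```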

A Task-oriented Dialog Model with Task-progressive and Policy-aware Pre-training

arXiv (Cornell University), Sep 30, 2023

Pre-trained conversation models (PCMs) have achieved promising progress in recent years. However, existing PCMs for task-oriented dialog (TOD) fall short of capturing the sequential nature of TOD-related tasks and of learning dialog policy information. To alleviate these problems, this paper proposes a task-progressive PCM with two policy-aware pre-training tasks. The model is pre-trained in three stages in which TOD-related tasks are progressively employed according to the task logic of the TOD system. A global policy consistency task is designed to capture the sequential relations of multi-turn dialog policies, and an act-based contrastive learning task is designed to capture similarities among samples sharing the same dialog policy. Our model achieves better results on both the MultiWOZ and In-Car end-to-end dialog modeling benchmarks with only 18% of the parameters and 25% of the pre-training data required by the previous state-of-the-art PCM, GALAXY. We make our code and data publicly available.
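
A hedged sketch of the act-based contrastive idea: pull together dialog representations that share the same policy (dialog act) and push apart the rest. This is a generic supervised-contrastive loss written for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def act_contrastive_loss(h: torch.Tensor, acts: torch.Tensor,
                         tau: float = 0.1) -> torch.Tensor:
    """h: (N, D) dialog embeddings; acts: (N,) integer dialog-act labels."""
    h = F.normalize(h, dim=-1)
    sim = h @ h.t() / tau                           # pairwise similarities
    same = acts.unsqueeze(0) == acts.unsqueeze(1)   # positives: same act
    eye = torch.eye(len(h), dtype=torch.bool)
    same &= ~eye                                    # drop self-pairs
    logprob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9),
                                    dim=1, keepdim=True)
    # Average log-likelihood of positives for each anchor that has any.
    pos_counts = same.sum(1).clamp(min=1)
    loss = -(logprob * same).sum(1) / pos_counts
    return loss[same.any(1)].mean()
```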

Exploiting Pseudo Future Contexts for Emotion Recognition in Conversations

Lecture Notes in Computer Science, Dec 31, 2022

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

arXiv (Cornell University), Mar 30, 2022

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior in order to integrate the ELM. While recent studies suggest that RNN-T learns only some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for estimating the ILM and may deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) that replaces this estimate with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on the English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. They show that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
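
A hedged sketch of the beam-search scoring rules being compared. The interpolation weights and exact ILM treatment vary across papers; these functions only illustrate the structure the abstract describes, with LODR keeping DR's shape but swapping in a weak low-order LM as the subtracted prior.

```python
def shallow_fusion(logp_rnnt: float, logp_elm: float,
                   lam_e: float = 0.3) -> float:
    # SF: simply add the external LM score.
    return logp_rnnt + lam_e * logp_elm

def density_ratio(logp_rnnt: float, logp_elm: float, logp_src_lm: float,
                  lam_e: float = 0.3, lam_s: float = 0.2) -> float:
    # DR: subtract a source-domain LM (here a full-context neural LM).
    return logp_rnnt + lam_e * logp_elm - lam_s * logp_src_lm

def lodr(logp_rnnt: float, logp_elm: float, logp_low_order: float,
         lam_e: float = 0.3, lam_l: float = 0.2) -> float:
    # LODR: same shape as DR, but the subtracted prior comes from a weak
    # low-order LM (e.g. a bi-gram), matching RNN-T's limited ILM.
    return logp_rnnt + lam_e * logp_elm - lam_l * logp_low_order
```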

A Low-Cost, Controllable and Interpretable Task-Oriented Chatbot: With Real-World After-Sale Services as Example

arXiv (Cornell University), May 12, 2022

Though widely used in industry, traditional task-oriented dialogue systems suffer from three bottlenecks: (i) difficult ontology construction (e.g., intents and slots); (ii) poor controllability and interpretability; and (iii) annotation hunger. In this paper, we propose to represent utterances with a simpler concept named Dialogue Action, upon which we construct a tree-structured TaskFlow and further build a task-oriented chatbot with the TaskFlow as its core component. A framework is presented to automatically construct a TaskFlow from large-scale dialogues and deploy it online. Our experiments on real-world after-sale customer services show that TaskFlow can satisfy the major needs, as well as reduce the developer burden effectively.
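
A minimal sketch of serving a tree-structured TaskFlow: classify the user utterance into a Dialogue Action, then walk the matching branch. The data layout and the tiny example flow are assumptions; the paper builds such trees automatically from large-scale dialogue logs.

```python
from dataclasses import dataclass, field

@dataclass
class FlowNode:
    action: str                       # Dialogue Action label for this turn
    reply: str                        # system response template
    children: dict[str, "FlowNode"] = field(default_factory=dict)

def step(node: FlowNode, user_action: str) -> FlowNode:
    """Advance one turn; fall back to the current node if the action is unseen."""
    return node.children.get(user_action, node)

# Tiny after-sale example flow (contents hypothetical).
root = FlowNode("greet", "Hi, how can I help with your order?", {
    "refund_request": FlowNode("ask_reason", "Sorry! What went wrong?", {
        "wrong_item": FlowNode("offer_refund", "We'll refund you right away."),
    }),
})
node = step(root, "refund_request")
print(node.reply)   # -> "Sorry! What went wrong?"
```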

Exploiting Pseudo Future Contexts for Emotion Recognition in Conversations

arXiv (Cornell University), Jun 27, 2023

With the extensive accumulation of conversational data on the Internet, emotion recognition in conversations (ERC) has received increasing attention. Previous efforts on this task mainly focus on leveraging contextual and speaker-specific features, or on integrating heterogeneous external commonsense knowledge. Among them, some heavily rely on future contexts, which, however, are not always available in real-life scenarios. This fact inspires us to generate pseudo future contexts to improve ERC. Specifically, for an utterance, we generate its future context with pre-trained language models, potentially containing extra beneficial knowledge in a conversational form homogeneous with the historical contexts. These characteristics make pseudo future contexts easy to fuse with historical contexts and historical speaker-specific contexts, yielding a conceptually simple framework that systematically integrates multiple contexts. Experimental results on four ERC datasets demonstrate our method's superiority. Further in-depth analyses reveal that pseudo future contexts can rival real ones to some extent, especially in relatively context-independent conversations.
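
A hedged sketch of the core trick: ask a pre-trained dialogue LM to continue the conversation, then treat the generated turns as "pseudo future context" for the target utterance. The model choice and decoding settings below are illustrative, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

def pseudo_future(history: list[str], n_future_turns: int = 2) -> list[str]:
    """Generate hypothetical next turns for an unfinished conversation."""
    text = tok.eos_token.join(history) + tok.eos_token
    ids = tok(text, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=60, do_sample=True, top_p=0.9,
                      pad_token_id=tok.eos_token_id)
    cont = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=False)
    return cont.split(tok.eos_token)[:n_future_turns]

future = pseudo_future(["I failed my exam today.", "Oh no, what happened?"])
# `future` would then be fused with historical context by the ERC model.
```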

Dialog-to-Actions: Building Task-Oriented Dialogue System via Action-Level Generation

arXiv (Cornell University), Apr 3, 2023

End-to-end generation-based approaches have been investigated and applied in task-oriented dialogue systems. However, in industrial scenarios, existing methods face bottlenecks of reliability (e.g., domain-inconsistent responses, the repetition problem) and efficiency (e.g., long computation time). In this paper, we propose a task-oriented dialogue system via action-level generation. Specifically, we first construct dialogue actions from large-scale dialogues and represent each natural language (NL) response as a sequence of dialogue actions. Further, we train a Sequence-to-Sequence model that takes the dialogue history as input and outputs a sequence of dialogue actions. The generated dialogue actions are transformed into verbal responses. Experimental results show that our lightweight method achieves competitive performance and has the advantage of reliability and efficiency.
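
A sketch of the pipeline's final step as described: the Seq2Seq model emits a short sequence of dialogue-action IDs, which a deterministic table expands into the verbal response. The action inventory below is hypothetical.

```python
ACTION_TO_TEXT = {
    "greet": "Hello!",
    "confirm_order": "I can see your order here.",
    "offer_refund": "I have issued a refund for you.",
    "closing": "Is there anything else I can help with?",
}

def actions_to_response(actions: list[str]) -> str:
    """Deterministically render a generated action sequence as text, which
    keeps responses in-domain and avoids free-form repetition problems."""
    return " ".join(ACTION_TO_TEXT[a] for a in actions)

print(actions_to_response(["greet", "confirm_order", "offer_refund"]))
```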

MUSIED: A Benchmark for Event Detection from Multi-Source Heterogeneous Informal Texts

arXiv (Cornell University), Nov 25, 2022

Event detection (ED) identifies and classifies event triggers from unstructured texts, serving as a fundamental task for information extraction. Despite the remarkable progress achieved in the past several years, most research efforts focus on detecting events from formal texts (e.g., news articles, Wikipedia documents, financial announcements). Moreover, the texts in each dataset are either from a single source or from multiple yet relatively homogeneous sources. With massive amounts of user-generated text accumulating on the Web and inside enterprises, identifying meaningful events in these informal texts, usually from multiple heterogeneous sources, has become a problem of significant practical value. As a pioneering exploration that expands event detection to scenarios involving informal and heterogeneous texts, we propose a new large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations from a leading e-commerce platform for food service. We carefully investigate the proposed dataset's textual informality and multi-source heterogeneity by inspecting data samples quantitatively and qualitatively. Extensive experiments with state-of-the-art event detection methods verify the unique challenges posed by these characteristics, indicating that multi-source informal event detection remains an open problem and requires further effort. Our benchmark and code are released at https://github.com/myeclipse/MUSIED.

Segment Augmentation and Prediction Consistency Neural Network for Multi-label Unknown Intent Detection

Proceedings of the 32nd ACM International Conference on Information and Knowledge Management

MUSIED: A Benchmark for Event Detection from Multi-Source Heterogeneous Informal Texts

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Event detection (ED) identifies and classifies event triggers from unstructured texts, serving as a fundamental task for information extraction. Despite the remarkable progress achieved in the past several years, most research efforts focus on detecting events from formal texts (e.g., news articles, Wikipedia documents, financial announcements). Moreover, the texts in each dataset are either from a single source or from multiple yet relatively homogeneous sources. With massive amounts of user-generated text accumulating on the Web and inside enterprises, identifying meaningful events in these informal texts, usually from multiple heterogeneous sources, has become a problem of significant practical value. As a pioneering exploration that expands event detection to scenarios involving informal and heterogeneous texts, we propose a new large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations from a leading e-commerce platform for food service. We carefully investigate the proposed dataset's textual informality and multi-source heterogeneity by inspecting data samples quantitatively and qualitatively. Extensive experiments with state-of-the-art event detection methods verify the unique challenges posed by these characteristics, indicating that multi-source informal event detection remains an open problem and requires further effort. Our benchmark and code are released at https://github.com/myeclipse/MUSIED.

Ambiguous Learning from Retrieval: Towards Zero-shot Semantic Parsing

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current neural semantic parsers mostly take supervised approaches, which require a considerable amount of expensive training data. As a result, minimizing supervision requirements has been one of the key challenges in semantic parsing. In this paper, we propose a Retrieval as Ambiguous Supervision framework, which can effectively collect high-coverage ambiguous supervision (i.e., the parse candidates of an utterance) via a retrieval system based on pre-trained language models. Then, by assuming the candidates contain the correct parses, the zero-shot task can be converted into an ambiguously supervised task. To improve the precision and coverage of such ambiguous supervision, we propose a confidence-driven self-training algorithm, in which a semantic parser is learned and exploited to disambiguate the candidates iteratively. Experimental results show that our approach significantly outperforms state-of-the-art zero-shot semantic parsing methods.
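
A hedged sketch of confidence-driven self-training over ambiguous supervision: keep only the retrieval candidates the current parser scores confidently, retrain, and repeat. `Parser`'s interface here is a stand-in, not the paper's actual classes.

```python
def self_train(parser, data, retrieve, threshold=0.9, rounds=3):
    """data: iterable of utterances; retrieve(u) -> candidate parses."""
    for _ in range(rounds):
        labeled = []
        for utt in data:
            candidates = retrieve(utt)                # ambiguous supervision
            scored = [(parser.score(utt, c), c) for c in candidates]
            conf, best = max(scored, key=lambda p: p[0])
            if conf >= threshold:                     # confidence-driven filter
                labeled.append((utt, best))           # disambiguated pair
        parser = parser.fit(labeled)                  # retrain on pseudo-labels
    return parser
```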

Dialog-to-Actions: Building Task-Oriented Dialogue System via Action-Level Generation

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

End-to-end generation-based approaches have been investigated and applied in task-oriented dialogue systems. However, in industrial scenarios, existing methods face bottlenecks of reliability (e.g., domain-inconsistent responses, the repetition problem) and efficiency (e.g., long computation time). In this paper, we propose a task-oriented dialogue system via action-level generation. Specifically, we first construct dialogue actions from large-scale dialogues and represent each natural language (NL) response as a sequence of dialogue actions. Further, we train a Sequence-to-Sequence model that takes the dialogue history as input and outputs a sequence of dialogue actions. The generated dialogue actions are transformed into verbal responses. Experimental results show that our lightweight method achieves competitive performance and has the advantage of reliability and efficiency.

Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The CTC model has been widely applied in many application scenarios because of its simple structure, excellent performance, and fast inference speed. The probability distribution predicted by a CTC model contains many peaks, each representing a non-blank token. The recognition latency of CTC models can be reduced by encouraging the model to predict peaks earlier. Existing methods for reducing latency require modifying the transition relationships between tokens in the forward-backward algorithm and the gradient calculation; some even depend on forced-alignment results provided by other pre-trained models. These methods are complex to implement. To reduce the peak latency, we propose a simple and novel method named peak-first regularization, which utilizes a frame-wise knowledge distillation function to force the probability distribution of the CTC model to shift left along the time axis, instead of directly modifying the calculation of the CTC loss and gradients. All experiments are conducted on the Chinese Mandarin dataset AISHELL-1. We verify the effectiveness of the proposed regularization on both streaming and non-streaming CTC models. The results show that the proposed method can reduce the average peak latency by about 100 to 200 milliseconds with almost no degradation in recognition accuracy.
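
A hedged sketch of one reading of peak-first regularization: a frame-wise self-distillation term, added to the ordinary CTC loss, in which frame t imitates a detached copy of frame t+1 so that the posterior shifts left along the time axis. The teacher/student pairing and the weight are assumptions.

```python
import torch
import torch.nn.functional as F

def peak_first_regularizer(log_probs: torch.Tensor) -> torch.Tensor:
    """log_probs: (T, B, V) frame-wise CTC log-posteriors.
    Frame t is taught to imitate (a gradient-free copy of) frame t+1,
    which nudges the whole distribution one frame earlier."""
    student = log_probs[:-1]                   # frames 0 .. T-2
    teacher = log_probs[1:].detach().exp()     # frames 1 .. T-1, no gradient
    return F.kl_div(student, teacher, reduction="batchmean")

def peak_first_ctc_loss(log_probs, targets, in_lens, tgt_lens, lam=0.1):
    # Standard CTC loss plus the left-shifting distillation term.
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens)
    return ctc + lam * peak_first_regularizer(log_probs)
```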

Covariance Regularization for Probabilistic Linear Discriminant Analysis

Cornell University - arXiv, Dec 6, 2022

Probabilistic linear discriminant analysis (PLDA) is commonly used in speaker verification systems to score the similarity of speaker embeddings. Recent studies improved the performance of PLDA in domain-matched conditions by diagonalizing its covariance. We suspect that such a brutal pruning approach could eliminate its capacity to model the dimension correlations of speaker embeddings, leading to inadequate performance under domain adaptation. This paper explores two alternative covariance regularization approaches, namely interpolated PLDA and sparse PLDA, to tackle the problem. Interpolated PLDA incorporates prior knowledge from cosine scoring to interpolate the covariance of PLDA. Sparse PLDA introduces a sparsity penalty to update the covariance. Experimental results demonstrate that both approaches outperform diagonal regularization noticeably under domain adaptation. In addition, the amount of in-domain data can be significantly reduced when training sparse PLDA for domain adaptation.
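
A hedged numpy sketch of the two covariance regularizers named in the abstract. The interpolation target (identity, as cosine scoring implicitly assumes) and the soft-thresholding operator are my reading of the description, not the paper's exact formulation.

```python
import numpy as np

def interpolated_plda_cov(cov: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the PLDA covariance toward the identity, i.e. toward the
    covariance implicitly assumed by cosine scoring."""
    return alpha * cov + (1.0 - alpha) * np.eye(cov.shape[0])

def sparse_plda_cov(cov: np.ndarray, lam: float = 0.01) -> np.ndarray:
    """Soft-threshold off-diagonal entries (a sparsity penalty) while
    keeping the diagonal intact: small cross-dimension correlations
    vanish, the large ones that matter for adaptation survive."""
    off = cov - np.diag(np.diag(cov))
    shrunk = np.sign(off) * np.maximum(np.abs(off) - lam, 0.0)
    return shrunk + np.diag(np.diag(cov))
```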

Label-free Knowledge Distillation with Contrastive Loss for Light-weight Speaker Recognition

Cornell University - arXiv, Dec 6, 2022

Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvements in recent research. However, it is impractical to deploy these models for on-device applications with constrained computational resources. On the other hand, lightweight models are highly desired in practice despite their sub-optimal performance. This research aims to improve lightweight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically require speaker labels to learn task-specific knowledge, owing to the inefficiency of conventional losses for distillation. To address this inefficiency and achieve label-free KD, we propose to employ the contrastive loss from self-supervised learning for distillation. Extensive experiments are conducted on a collection of public speech datasets from diverse sources. Results on lightweight SR models show that the proposed approach of label-free KD with contrastive loss consistently outperforms both conventional distillation methods and self-supervised learning methods by a significant margin.
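
A hedged sketch of contrastive distillation in the label-free spirit described: the teacher embedding of the same utterance is the positive and the rest of the batch are negatives (an InfoNCE-style objective), so no speaker labels are needed. The temperature and the absence of projection heads are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student: torch.Tensor, teacher: torch.Tensor,
                        tau: float = 0.07) -> torch.Tensor:
    """student, teacher: (N, D) embeddings of the same N utterances."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher.detach(), dim=-1)   # teacher gives no gradient
    logits = s @ t.t() / tau                    # (N, N) similarity matrix
    labels = torch.arange(len(s))               # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```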

Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization

Cornell University - arXiv, Nov 6, 2022

The CTC model has been widely applied in many application scenarios because of its simple structure, excellent performance, and fast inference speed. The probability distribution predicted by a CTC model contains many peaks, each representing a non-blank token. The recognition latency of CTC models can be reduced by encouraging the model to predict peaks earlier. Existing methods for reducing latency require modifying the transition relationships between tokens in the forward-backward algorithm and the gradient calculation; some even depend on forced-alignment results provided by other pre-trained models. These methods are complex to implement. To reduce the peak latency, we propose a simple and novel method named peak-first regularization, which utilizes a frame-wise knowledge distillation function to force the probability distribution of the CTC model to shift left along the time axis, instead of directly modifying the calculation of the CTC loss and gradients. All experiments are conducted on the Chinese Mandarin dataset AISHELL-1. We verify the effectiveness of the proposed regularization on both streaming and non-streaming CTC models. The results show that the proposed method can reduce the average peak latency by about 100 to 200 milliseconds with almost no degradation in recognition accuracy.

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Interspeech 2022

Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior in order to integrate the ELM. While recent studies suggest that RNN-T learns only some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for estimating the ILM and may deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) that replaces this estimate with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios on the English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. They show that LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
