Investigating linguistic knowledge in a maximum entropy token-based language model

A Maximum Entropy Approach for Semantic Language Modeling

2006

The conventional n-gram language model exploits only the immediate context of historical words without exploring long-distance semantic information. In this paper, we present a new information source extracted from latent semantic analysis (LSA) and adopt the maximum entropy (ME) principle to integrate it into an n-gram language model. With the ME approach, each information source serves as a set of constraints, which should be satisfied to estimate a hybrid statistical language model with maximum randomness. For comparative study, we also carry out knowledge integration via linear interpolation (LI). In the experiments on the TDT2 Chinese corpus, we find that the ME language model that combines the features of trigram and semantic information achieves a 17.9% perplexity reduction compared to the conventional trigram language model, and it outperforms the LI language model. Furthermore, in evaluation on a Mandarin speech recognition task, the ME and LI language models reduce the character error rate by 16.9% and 8.5%, respectively, over the bigram language model.
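The linear interpolation baseline and the perplexity measure mentioned in this abstract can be illustrated with a minimal sketch. The function names, the weight `lam`, and the toy distributions are assumptions made for illustration only; they are not taken from the paper, and the paper's ME model is not reproduced here.

```python
import math

def interpolate(p_ngram, p_semantic, lam):
    """Linearly interpolate two distributions over the same vocabulary.

    lam is a hypothetical mixing weight in [0, 1]; in practice it would be
    tuned on held-out data.
    """
    return {w: lam * p_ngram[w] + (1.0 - lam) * p_semantic[w] for w in p_ngram}

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each test token."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

# Toy, made-up next-word distributions for a single history.
p_ngram = {"market": 0.5, "stock": 0.3, "banana": 0.2}
p_semantic = {"market": 0.4, "stock": 0.5, "banana": 0.1}

p_li = interpolate(p_ngram, p_semantic, lam=0.7)
print(p_li)
print(perplexity([0.25, 0.25]))
```

Because both inputs are proper distributions and the weights sum to one, the interpolated result is also a proper distribution. An ME model instead treats each information source as constraints on a single exponential-family model, which is the contrast the abstract draws.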

Compact maximum entropy language models

Proceedings of the IEEE workshop on automatic …

In language modeling we are always confronted with a sparse data problem. The Maximum Entropy formalism allows complementary statistical properties of limited corpora to be fully integrated. The focus of the present paper is twofold. The new smoothing technique of LM-induced marginals is introduced and discussed. We then highlight the advantages resulting from a combination of robust features and show that the brute-force inclusion of too many constraints may deteriorate the performance due to overtraining effects. Very good LMs may be trained on the basis of pair correlations which are supplemented by heavily pruned n-grams. This is especially true if word and class based features are combined. Tests were carried out for the German Verbmobil task and on WSJ data. The test-set perplexities were reduced by 3-7% and the number of free parameters was reduced by 60-75%. At the same time overtraining effects are considerably reduced.

Non-deterministic stochastic language models for speech recognition

1995 International Conference on Acoustics, Speech, and Signal Processing

Stochastic language models for speech recognition have traditionally been designed and evaluated in order to optimize word accuracy. In this work, we present a novel framework for training stochastic language models by optimizing two different criteria appropriate for speech recognition and language understanding. First, the language entropy and salience measure are used for learning the relevant spoken language features (phrases). Secondly, a novel algorithm for training stochastic finite state machines is presented which incorporates the acquired phrase structure into a single stochastic language model. Thirdly, we show the benefit of our novel framework with an end-to-end evaluation of a large vocabulary spoken language system for call routing.

Statistical language modeling based on variable-length sequences

Computer Speech & Language, 2003

In natural language, and especially in spontaneous speech, people often group words to form phrases that become usual expressions. This is due either to phonological reasons (to make pronunciation easier) or to semantic reasons (to remember a phrase more easily by assigning a meaning to a block of words). Classical language models do not adequately take such phrases into account. A better approach is to model some word sequences as if they were individual dictionary elements. Sequences are treated as additional entries of the vocabulary, on which language models are computed.
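The idea of treating word sequences as additional vocabulary entries can be sketched as follows. This is only an illustration: the frequency threshold `min_count`, the greedy left-to-right merge, and the underscore-joined token format are assumptions, not the selection criterion used in the paper.

```python
from collections import Counter

def merge_frequent_pairs(sentences, min_count):
    """Rewrite adjacent word pairs seen at least min_count times as one token."""
    pair_counts = Counter()
    for sent in sentences:
        pair_counts.update(zip(sent, sent[1:]))
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            # Greedy left-to-right merge: the first matching pair wins.
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in frequent:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

# Toy corpus, invented for illustration.
corpus = [
    ["i", "would", "like", "a", "ticket"],
    ["i", "would", "like", "some", "tea"],
    ["would", "you", "like", "tea"],
]
merged = merge_frequent_pairs(corpus, min_count=2)
print(merged)
```

An ordinary n-gram model can then be estimated on the rewritten corpus, so that a merged unit such as `i_would` counts as a single history position. Repeating the merge on its own output would yield longer sequences.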

The hidden vector state language model

2005

The Hidden Vector State (HVS) model extends the basic Hidden Markov Model (HMM) by encoding each state as a vector of stack states but with restricted stack operations. The model uses a right branching stack automaton to assign valid stochastic parses to a word sequence from which the language model probability can be estimated. The model is completely data driven and is able to model classes from the data that reflect the hierarchical structures found in natural language. This paper describes the design and the implementation of the HVS language model [1], focusing on the practical issues of initialisation and training using Baum-Welch re-estimation whilst accommodating a large and dynamic state space. Results of experiments conducted using the ATIS corpus [2] show that the HVS language model reduces test set perplexity compared to standard class based language models.

Stochastic language models for speech recognition and understanding

1998

Stochastic language models for speech recognition have traditionally been designed and evaluated in order to optimize word accuracy. In this work, we present a novel framework for training stochastic language models by optimizing two different criteria appropriate for speech recognition and language understanding. First, the language entropy and salience measure are used for learning the relevant spoken language features (phrases). Secondly, a novel algorithm for training stochastic finite state machines is presented which incorporates the acquired phrase structure into a single stochastic language model. Thirdly, we show the benefit of our novel framework with an end-to-end evaluation of a large vocabulary spoken language system for call routing.

Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals

Interspeech 2017

Maximum Entropy (MaxEnt) language models are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework with a convex loss. MaxEnt models also have the advantage of scaling to large model and training data sizes. We present the following two contributions to MaxEnt training: (1) By leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of Automatic Speech Recognition (ASR); (2) A novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We evaluate the impact of these approaches on Google's state-of-the-art ASR for the task of voice-search transcription and dictation. Training 10B parameter models utilizing a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Also, human evaluations show significant improvements on a wide range of domains from using non-linguistic features. For example, adapting to geographical domains (e.g., US States and cities) affects about 4% of test utterances, with 2:1 win to loss ratio.
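How a MaxEnt model can combine a linguistic feature with a non-linguistic signal in one framework can be illustrated with a toy conditional log-linear model. The feature templates, weights, vocabulary, and the `signal` argument are all hypothetical and chosen for illustration; this is not the system described in the abstract.

```python
import math

def maxent_prob(weights, vocab, history, signal):
    """P(w | history, signal) proportional to exp(sum of active feature weights)."""
    scores = {}
    for w in vocab:
        # Hypothetical feature templates: one linguistic, one non-linguistic.
        active = [("bigram", history, w), ("geo", signal, w)]
        scores[w] = sum(weights.get(f, 0.0) for f in active)
    z = sum(math.exp(s) for s in scores.values())  # normalization constant
    return {w: math.exp(s) / z for w, s in scores.items()}

vocab = ["austin", "boston"]
# Made-up weights; a real model would learn them by minimizing a convex loss.
weights = {
    ("bigram", "near", "austin"): 0.2,
    ("geo", "TX", "austin"): 1.0,
}

p_tx = maxent_prob(weights, vocab, history="near", signal="TX")
p_ma = maxent_prob(weights, vocab, history="near", signal="MA")
print(p_tx, p_ma)
```

In this sketch the geographic feature raises the probability of `austin` only when the signal is `TX`, while the bigram feature applies in both cases. Features of different kinds simply contribute additive terms to the same score.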

Maximum-likelihood training of the PLCG-based language model

IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01., 2001

In [1] a parsing language model based on a probabilistic left-corner grammar (PLCG) was proposed and encouraging performance on a speech recognition task using the PLCG-based language model was reported. In this paper we show how the PLCG-based language model can be further optimized by iterative parameter reestimation on unannotated training data. The precalculation of forward, inner and outer probabilities of states in the PLCG network provides an elegant crosscut to the computation of transition frequency expectations, which are needed in each iteration of the proposed reestimation procedure. The training algorithm enables model training on very large corpora. In our experiments, test set perplexity is close to saturation after three iterations, 5 to 16% lower than initially. However, we observed no significant improvement in recognition accuracy after reestimation.

Language models: where are the bottlenecks?

AISB Quarterly, 1994

Statistical, parsing, database, and other methods of bringing contextual information to bear on the recognition task are described in a uniform framework in which the central data structure mediating between the recognition and the contextual components is a segment lattice, a directed graph that contains the alternative segments and their confidence/probability ranking. Explicit measures of the value of such segment lattices and the correctness of language models are proposed, and the dominant technologies are critically evaluated.