A progressive feature selection algorithm for ultra large feature spaces

Exploring features for identifying edited regions in disfluent sentences

Proceedings of the Ninth International Workshop on Parsing Technology - Parsing '05, 2005

This paper describes our effort on the task of edited region identification for parsing disfluent sentences in the Switchboard corpus. We focus on exploring feature spaces and selecting good features, starting with an analysis of the distributions of the edited regions and their components in the targeted corpus. We explore new feature spaces of a part-of-speech (POS) hierarchy and relaxed matching for rough copies in the experiments. These steps result in a 43.98% relative error reduction in F-score over an earlier best result in edited region detection when punctuation is included in both training and testing data [Charniak and Johnson 2001], and a 20.44% relative error reduction in F-score over the latest best result where punctuation is excluded from the training and testing data [Johnson and Charniak 2004].
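
In this literature, a "rough copy" is a reparandum that approximately repeats the words of the repair that follows it; relaxing the match (for instance, to POS tags) widens the feature space. A minimal illustrative sketch in Python, assuming token/POS pairs and a deliberately simple relaxation criterion that is not the paper's exact definition:

```python
def is_relaxed_rough_copy(reparandum, repair):
    """Check whether `reparandum` approximately repeats `repair`.

    Each argument is a list of (word, pos) pairs. A token matches if
    either the surface word or the POS tag agrees -- a deliberately
    relaxed criterion, illustrative of (not identical to) the paper's.
    """
    if len(reparandum) != len(repair):
        return False
    return all(
        w1 == w2 or p1 == p2
        for (w1, p1), (w2, p2) in zip(reparandum, repair)
    )

# Example: in "I want, I need a flight", "I want" roughly copies "I need"
print(is_relaxed_rough_copy([("I", "PRP"), ("want", "VBP")],
                            [("I", "PRP"), ("need", "VBP")]))  # True
```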

A Maximum Entropy Approach to Natural Language Processing

Computational Linguistics, 1996

The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the wide-scale application of this concept to real-world problems in statistical estimation and pattern recognition. In this paper, we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
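
A maximum entropy model of this kind takes the familiar conditional log-linear form (standard notation, not reproduced from the paper):

```latex
p_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)}
\exp\Big( \sum_{i} \lambda_i f_i(x, y) \Big),
\qquad
Z_{\lambda}(x) = \sum_{y'} \exp\Big( \sum_{i} \lambda_i f_i(x, y') \Big)
```

Among all distributions whose feature expectations match the empirical ones, this is the unique maximum entropy solution, and its parameters $\lambda$ are what a maximum-likelihood procedure such as the one the paper describes would estimate.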

Maximum entropy discrimination (MED) feature subset selection for speech recognition

2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003

In this paper we investigate the application of Maximum Entropy Discrimination (MED) feature selection to speech recognition problems. We compare the MED algorithm with a classical wrapper feature selection algorithm and propose a hybrid wrapper/MED algorithm. We evaluate the three approaches on a phoneme recognition task on the TIMIT database. Results show that the MED algorithm achieves error rates comparable to the wrapper algorithm while requiring a reduced computational cost. Furthermore, the use of a probabilistic framework shows that the MED algorithm yields very good results even with a very limited amount of data.
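
For contrast with MED, a classical wrapper performs greedy forward selection, retraining and scoring the recognizer for every candidate feature; this retraining loop is the source of the computational cost the paper reduces. A minimal sketch, where `train_and_score` is a placeholder assumption rather than any specific toolkit's API:

```python
def wrapper_forward_selection(features, train_and_score, k):
    """Greedy forward wrapper selection (assumes k <= len(features)).

    `train_and_score(subset)` is assumed to train a classifier on the
    given feature subset and return a validation score (higher = better).
    Each round adds the single feature that improves the score most.
    """
    selected = []
    for _ in range(k):
        candidates = [f for f in features if f not in selected]
        best = max(candidates, key=lambda f: train_and_score(selected + [f]))
        selected.append(best)
    return selected
```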

Processing highly variant language using incremental model selection

2012

This dissertation demonstrates a framework for incremental model selection and processing of highly variant speech transcripts and user-generated text. The system reduces natural language processing (NLP) ambiguity by segmenting text by domain, allowing domain-specific downstream processes to analyze each segment independently. A tokenized text input stream is received by the system. At every word, an Indicator Function calculates a quantitative feature signal, the Indicator Value Signal, which runs in parallel to the input stream. This feature signal is monitored for domain changes by an event controller, which segments the stream into feature chunks. The event controller can activate slowly over large spans of text, or rapidly and intrasententially. As the event controller indicates each domain change with an event signal, pipeline processes assigned to specific indicator function values are executed to process the segment and add additional feature signals to the feature signal stack. At the end of the pipeline, feature signals are unified to produce a single annotated output stream; a sketch of this control flow appears below.

To exemplify the framework, this dissertation makes three additional contributions. The first is a novel short-string language identification system that calculates the Indicator Value Signal. The second is a machine transliteration system to convert the Arabizi chat alphabet into Arabic script. The third is a modular part-of-speech tagger for multilingual code-mixing. The short-string language identification system extracts an n-gram and selects the closest language out of 373 reference languages using a Support Vector Machine (SVM) classifier trained on a matrix of language model measurements. This classifier learns patterns of similarity and divergence of a language's tokens across all reference languages, leading to high accuracy on in-domain n-grams from a legal corpus as well as out-of-domain tokens from an English-Egyptian Arabic code-mixing microblog corpus. The machine transliteration system converts Arabizi, a Latinized Arabic chat alphabet, into Arabic script in order to make existing NLP tools usable on Arabic chat text. A parallel, word-aligned corpus of the chat alphabet was collected from a dozen Arabic speakers; from this corpus we induced a probabilistic mapping of cross-dialect Arabizi characters to Arabic script and used it to train a highly accurate transducer. The multilingual part-of-speech tagger demonstrates the modularity of our framework: we find that segmenting language before tagging, and then applying single-language homogeneous language models, is competitive with multilingual heterogeneous tagging models. We compare the two approaches on a speech transcript of English-Spanish code-mixing.

In addition to language identification, we consider a range of alternative indicator functions, such as genre identification, entropy, and gender identification, which could add a language adaptation ability on top of existing NLP systems and provide a boost in accuracy and performance on variational processing. To summarize, this dissertation provides an architecture for NLP that allows for better handling of complicated language variation. To demonstrate the model, we introduce a short-string language identification system with state-of-the-art accuracy, the first research on machine transliteration for a chat alphabet, and a modular part-of-speech tagger for multilingual code-mixing.
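
A minimal sketch of the described control flow, with hypothetical names for the indicator function, change detector, and per-domain pipelines (the dissertation's actual interfaces are richer):

```python
def process_stream(tokens, indicator, detect_change, pipelines):
    """Segment a token stream by domain and dispatch each segment.

    `indicator(token)` returns the Indicator Value Signal for one token
    (here, a domain label), `detect_change(signal)` decides whether a
    domain boundary just occurred, and `pipelines[domain]` processes one
    segment. All three are placeholders for the thesis's components.
    """
    segment, signal = [], []
    for tok in tokens:
        signal.append(indicator(tok))
        if segment and detect_change(signal):
            yield pipelines[signal[-2]](segment)  # flush previous domain
            segment = []
        segment.append(tok)
    if segment:
        yield pipelines[signal[-1]](segment)

# Example change detector: fire whenever the signal value flips,
# i.e. a rapid, potentially intrasentential boundary.
detect = lambda s: len(s) > 1 and s[-1] != s[-2]
```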

Feature selection for a rich HPSG grammar using decision trees

Proceedings of the 6th Conference on Natural Language Learning - COLING-02, 2002

This paper examines feature selection for log-linear models over rich constraint-based grammar (HPSG) representations by building decision trees over features in corresponding probabilistic context-free grammars (PCFGs). We show that single decision trees do not make optimal use of the available information; ensembles of decision trees constructed over different feature subspaces show significant performance gains (14% parse selection error reduction). We compare the performance of the learned PCFG grammars and log-linear models over the same features.
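
A minimal sketch of an ensemble built over random feature subspaces, in the spirit of (but not identical to) the paper's construction, using scikit-learn decision trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subspace_ensemble(X, y, n_trees=10, subspace_frac=0.5, seed=0):
    """Train trees on random feature subspaces; predict by majority vote.

    Assumes non-negative integer class labels in `y`. The subspace
    sampling scheme is an illustrative assumption, not the paper's.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(subspace_frac * n_features))
    members = []
    for _ in range(n_trees):
        cols = rng.choice(n_features, size=k, replace=False)
        tree = DecisionTreeClassifier().fit(X[:, cols], y)
        members.append((cols, tree))

    def predict(X_new):
        votes = np.stack([t.predict(X_new[:, c]) for c, t in members])
        # majority vote across ensemble members, per sample
        return np.apply_along_axis(
            lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
    return predict
```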

A comparative study of parameter estimation methods for statistical natural language processing

2007

This paper presents a comparative study of five parameter estimation algorithms on four NLP tasks. Three of the five algorithms are well known in the computational linguistics community: Maximum Entropy (ME) estimation with L2 regularization, the Averaged Perceptron (AP), and Boosting. We also investigate ME estimation with L1 regularization using a novel optimization algorithm, and BLasso, which is a version of Boosting with Lasso (L1) regularization. We first investigate all of our estimators on two re-ranking tasks: a parse selection task and a language model (LM) adaptation task. Then we apply the best of these estimators to two additional tasks involving conditional sequence models: a Conditional Markov Model (CMM) for part-of-speech tagging and a Conditional Random Field (CRF) for Chinese word segmentation. Our experiments show that across tasks, three of the estimators (ME estimation with L1 or L2 regularization, and AP) are in a near statistical tie for first place.
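
The regularized ME objectives being compared differ only in the penalty term; in standard notation (not the paper's exact formulation):

```latex
L_2:\quad \hat{\lambda} = \arg\max_{\lambda}\;
\sum_{i} \log p_{\lambda}(y_i \mid x_i) - \frac{\|\lambda\|_2^2}{2\sigma^2},
\qquad
L_1:\quad \hat{\lambda} = \arg\max_{\lambda}\;
\sum_{i} \log p_{\lambda}(y_i \mid x_i) - \alpha \|\lambda\|_1
```

The L1 penalty drives many weights exactly to zero, yielding sparse models, which is also the motivation for pairing Boosting with a Lasso penalty in BLasso.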

Feature selection techniques for maximum entropy based biomedical named entity recognition

Journal of biomedical informatics, 2009

Named entity recognition is an extremely important and fundamental task of biomedical text mining. Biomedical named entities include mentions of proteins, genes, DNA, RNA, etc., which often have complex structures and are challenging to identify and classify. Machine learning methods like CRF, MEMM and SVM have been widely used for learning to recognize such entities from an annotated corpus. The identification of appropriate feature templates and the selection of the important feature values play a very important role in the success of these methods. In this paper, we provide a study of word clustering and selection-based feature reduction approaches for named entity recognition using a maximum entropy classifier. The identification and selection of features are largely done automatically, without using domain knowledge. The performance of the system is found to be superior to existing systems that do not use domain knowledge.
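
As one concrete instance of selection-based feature reduction, feature values can be ranked by a simple corpus statistic and pruned below a threshold; a minimal sketch assuming features are template=value strings counted over the training corpus (the frequency-threshold rule is an illustrative assumption, not the paper's exact criterion):

```python
from collections import Counter

def select_features(feature_stream, min_count=5):
    """Keep only feature values occurring at least `min_count` times.

    `feature_stream` yields feature strings such as "word=insulin" or
    "suffix3=ase" extracted from the annotated corpus; rare values are
    dropped to shrink the maximum entropy model's feature set.
    """
    counts = Counter(feature_stream)
    return {f for f, c in counts.items() if c >= min_count}
```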

Minimum Bayes error feature selection for continuous speech recognition

Advances in Neural Information Processing Systems, 2001

We consider the problem of designing a linear transformation $\theta \in \mathbb{R}^{p \times n}$, of rank $p \le n$, which projects the features of a classifier $x \in \mathbb{R}^n$ onto $y = \theta x \in \mathbb{R}^p$ so as to achieve minimum Bayes error (or probability of misclassification). Two avenues will be explored: the first is to maximize the $\theta$-average divergence between the class densities and the second is to minimize the union Bhattacharyya bound in the range of $\theta$. While both approaches yield similar performance in practice, they outperform standard LDA features and show a 10% relative improvement in the word error rate over state-of-the-art cepstral features on a large vocabulary telephony speech recognition task.
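
For Gaussian class-conditional densities in the projected space $y = \theta x$, the union Bhattacharyya bound has a closed form (a standard result, restated here rather than quoted from the paper):

```latex
P_{\text{err}} \le \sum_{i<j} \sqrt{P_i P_j}\, e^{-B_{ij}(\theta)},
\qquad
B_{ij}(\theta) = \tfrac{1}{8} (\mu_i - \mu_j)^{\top}
\Big(\tfrac{\Sigma_i + \Sigma_j}{2}\Big)^{-1} (\mu_i - \mu_j)
+ \tfrac{1}{2} \ln \frac{\big|\tfrac{\Sigma_i + \Sigma_j}{2}\big|}
{\sqrt{|\Sigma_i|\,|\Sigma_j|}}
```

where $P_i$ are class priors and $\mu_i = \theta \hat{\mu}_i$, $\Sigma_i = \theta \hat{\Sigma}_i \theta^{\top}$ are the projected class means and covariances; minimizing the bound over $\theta$ yields the feature transform.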

Domain-independent Classification of Automatic Speech Recognition Texts

2017

Call centers receive large volumes of incoming calls. The calls are regularly processed by an analytical system that helps people automatically inspect all the data. Such a system requires a classification module that can determine the topic of conversation for each call. Due to the high cost of manual annotation, the input for this module is the automatically transcribed calls. Hence, the texts (i.e., the automatic transcriptions) used for classification contain ill-transcribed words, which can influence the classification process. Another important point is that this module has special requirements: it should be domain-independent and easy to set up. The document classification task always requires an annotated data set for classifier training, but it is too costly to manually create an annotated training set for each domain. In this paper, we propose an approach to the classification of automatic speech recognition texts that allows the user to avoid full manual annotation and a...

A Simple Introduction to Maximum Entropy Models for Natural Language Processing

Many problems in natural language processing can be viewed as linguistic classification problems, in which linguistic contexts are used to predict linguistic classes. Maximum entropy models offer a clean way to combine diverse pieces of contextual evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic context. This report demonstrates the use of a particular maximum entropy model on an example problem, and then proves some relevant mathematical facts about the model in a simple and accessible manner. This report also describes an existing procedure called Generalized Iterative Scaling, which estimates the parameters of this particular model. The goal of this report is to provide enough detail to re-implement the maximum entropy models described in [Ratnaparkhi 1996; Reynar and Ratnaparkhi 1997; Ratnaparkhi 1997] and also to provide a simple explanation of the maximum entropy formalism.
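
The Generalized Iterative Scaling update has a compact form; assuming the features sum to a constant $C$ for every $(x, y)$ (the condition GIS requires), each iteration sets

```latex
\lambda_i^{(t+1)} = \lambda_i^{(t)} + \frac{1}{C}
\log \frac{\tilde{E}[f_i]}{E_{p^{(t)}}[f_i]}
```

where $\tilde{E}[f_i]$ is the empirical expectation of feature $f_i$ and $E_{p^{(t)}}[f_i]$ is its expectation under the current model; the ratio tends to 1 as the model's feature expectations converge to the data's.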