Fernando Pereira - Academia.edu
Papers by Fernando Pereira
Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05, 2005
A complex relation is any n-ary relation in which some of the arguments may be unspecified. We present here a simple two-stage method for extracting complex relations between named entities in text. The first stage creates a graph from pairs of entities that are likely to be related, and the second stage scores maximal cliques in that graph as potential complex relation instances. We evaluate the new method against a standard baseline for extracting genomic variation relations from biomedical text.
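The two stages described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the entity names `A` to `D`, the pairwise scores, the `threshold` value, and the use of a geometric mean as the clique score are all assumptions made for the example, and the pairwise scores would in practice come from a trained binary classifier.

```python
from itertools import combinations
from math import prod


def build_graph(pair_scores, threshold=0.5):
    """Stage 1: add an edge for each entity pair scored at or above threshold."""
    graph = {}
    for (a, b), score in pair_scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def clique_score(clique, pair_scores):
    """Stage 2: geometric mean of pairwise scores in the clique (an assumed choice)."""
    edges = [
        pair_scores.get((a, b), pair_scores.get((b, a)))
        for a, b in combinations(sorted(clique), 2)
    ]
    if not edges:
        return 0.0  # a single isolated entity has no pairwise evidence
    return prod(edges) ** (1.0 / len(edges))


# Hypothetical pairwise scores for four entity mentions.
pair_scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.7,
    ("C", "D"): 0.6, ("A", "D"): 0.2, ("B", "D"): 0.1,
}
graph = build_graph(pair_scores)
cliques = maximal_cliques(graph)
ranked = sorted(cliques, key=lambda c: clique_score(c, pair_scores), reverse=True)
```

With these made-up scores, the candidate relation instances are the cliques {A, B, C} and {C, D}. The second has fewer members, which corresponds to a relation with an unspecified argument.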
BMC Bioinformatics, Jan 7, 2006
The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining. We developed a named entity recognizer (MT...
Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05, 2005
We present an effective training algorithm for linearly-scored dependency parsers that implements online large-margin multi-class training (Crammer and Singer, 2003; Crammer et al., 2003) on top of efficient parsing techniques for dependency trees (Eisner, 1996). The trained parsers achieve a competitive dependency accuracy for both English and Czech with no language-specific enhancements.
PLoS Computational Biology, 2007
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
BMC Bioinformatics, 2005
Background: We present a model for tagging gene and protein mentions from text using the probabilistic sequence tagging framework of conditional random fields (CRFs). Conditional random fields model the probability P(t|o) of a tag sequence given an observation sequence directly, and have previously been employed successfully for other tagging tasks. The mechanics of CRFs and their relationship to maximum entropy are discussed in detail. Results: We employ a diverse feature set containing standard orthographic features combined with expert features in the form of gene and biological term lexicons to achieve a precision of 86.4% and recall of 78.7%. An analysis of the contribution of the various features of the model is provided.
Bioinformatics, 2004
Summary: VTag is an application for identifying the type, genomic location and genomic state-change of acquired genomic aberrations described in text. The application uses a machine learning technique called conditional random fields. VTag was tested with 345 training and 200 evaluation documents pertaining to cancer genetics. Our experiments resulted in 0.8541 precision, 0.7870 recall and 0.8192 F-measure on the evaluation set. Availability: The software is available at http://www.cis.upenn.edu/group/datamining/software_dist/biosfier/.
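The three figures reported in this abstract are mutually consistent if the F-measure is taken to be the balanced harmonic mean of precision and recall (F1). The abstract does not state the weighting, so the default `beta=1.0` below is an assumption, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for VTag on the evaluation set.
vtag_f = f_measure(0.8541, 0.7870)
print(round(vtag_f, 4))  # 0.8192
```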
Proceedings of the 1987 workshop on Theoretical issues in natural language processing, 1987