Finite-State Tokenization for a Deep Wolof LFG Grammar

The importance of precise tokenizing for deep grammars

2006

We present a non-deterministic finite-state transducer that acts as a tokenizer and normalizer for free text that is input to a broad-coverage LFG of German. We compare the basic tokenizer used in an earlier version of the grammar and the more sophisticated tokenizer that we now use. The revised tokenizer increases the coverage of the grammar in terms of full parses from 68.3% to 73.4% on sentences 8,001 through 10,000 of the TiGer Corpus.
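To illustrate what "non-deterministic" means for a tokenizer, here is a minimal sketch in Python. It is not the transducer from the paper: the function name `tokenizations` and the multiword lexicon are hypothetical, and the only ambiguity modeled is whether a known multiword unit is kept as one token or split into its parts. The point is that the tokenizer returns every alternative and leaves the choice to the grammar.

```python
def tokenizations(words, multiwords):
    """Return every tokenization of `words` (a list of strings).

    `multiwords` is a hypothetical lexicon: a set of tuples, each a
    multiword unit that may optionally be emitted as a single token.
    """
    if not words:
        return [[]]
    results = []
    # Option 1: emit the first word as a token of its own.
    for rest in tokenizations(words[1:], multiwords):
        results.append([words[0]] + rest)
    # Option 2: emit a known multiword unit starting here as one token.
    for n in range(2, len(words) + 1):
        if tuple(words[:n]) in multiwords:
            for rest in tokenizations(words[n:], multiwords):
                results.append([" ".join(words[:n])] + rest)
    return results


lexicon = {("in", "spite", "of")}
alternatives = tokenizations("in spite of rain".split(), lexicon)
for alt in alternatives:
    print(alt)
```

A real finite-state tokenizer would encode these alternatives compactly in a transducer rather than enumerating them as lists.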

Low-level mark-up and large-scale LFG grammar processing

2003

It is commonly believed that shallow mark-up techniques such as part-of-speech disambiguation or low-level phrase chunking provide useful information that can improve the performance of natural language processing systems, even those that ultimately require deeper levels of analysis. In this paper, we discuss three types of shallow mark-up: part of speech tagging, named entities, and labeled bracketing. We show how they were integrated into the ParGram LFG English grammar and report on the results of parsing the PARC700 sentences with each type of mark-up. We observed that named-entity mark-up improves both speed and accuracy and labeled brackets also can be beneficial, but that part-of-speech tags are not particularly useful.

Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages

2020

This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data. We describe a questionnaire which collects the necessary data to bootstrap the number grammar induction system and parameterize the verbalizer templates described in Ritchie et al. (2019), and a machine-readable data store which allows the data collected through the questionnaire to be supplemented by additional data from other sources. This system allows us to rapidly scale technologies such as ASR and TTS to more languages, including low-resource languages.

Context-free parsing with finite-state transducers

Proceedings of the 3rd South American Workshop on String Processing, 1996

This article is a study of an algorithm designed and implemented by Roche for parsing natural language sentences according to a context-free grammar. This algorithm is based on the construction and use of a finite-state transducer. Roche successfully applied it to a context-free grammar with a very large number of rules. In contrast, the complexity of parsing words according to context-free grammars is usually considered in practice as a function of one parameter: the length of the input sequence; the size of the grammar is generally taken to be a constant of a reasonable value. In this article, we first explain why a context-free grammar with correct lexical and grammatical coverage is bound to have a very large number of rules, and we review work related to this problem. Then we exemplify the principle of Roche's algorithm on a small grammar. We provide formal definitions of the construction of the parser and of the operation of the algorithm, and we prove that the parser can be built for a large class of context-free grammars and that it outputs the set of parse trees of the input sequence.

Pruning the Search Space of the Wolof LFG Grammar Using a Probabilistic and a Constraint Grammar Parser

2014

This paper presents a method for greatly reducing parse times in LFG by integrating a Constraint Grammar parser into a probabilistic context-free grammar. The CG parser is used in the pre-processing phase to reduce morphological and lexical ambiguity. Similarly, the c-structure pruning mechanism of XLE is used in the parsing phase to discard low-probability c-structures before f-annotations are solved. The experimental results show a considerable increase in parsing efficiency and robustness in the annotation of Wolof running text. The Wolof CG parser achieved an f-score of 90% for morphological disambiguation and a speedup of ca. 40%, while the c-structure pruning method increased the speed of the Wolof grammar by over 36%. On a small amount of data, CG disambiguation and c-structure pruning allowed for a speedup of 58%, albeit with a substantial drop in parse accuracy of 3.62.

A Morphological Analyzer For Wolof Using Finite-State Techniques

2012

This paper reports on the design and implementation of a morphological analyzer for Wolof. The main motivation for this work is to obtain a linguistically motivated tool using finite-state techniques. The finite-state technology is especially attractive in dealing with human language morphologies. Finite-state transducers (FST) are fast, efficient and can be fully reversible, enabling users to perform analysis as well as generation. Hence, I use this approach to construct a new FST tool for Wolof, as a first step towards a computational grammar for the language in the Lexical Functional Grammar framework. This article focuses on the methods used to model complex morphological issues and on developing strategies to limit ambiguities. It discusses experimental evaluations conducted to assess the performance of the analyzer with respect to various statistical criteria. In particular, I also wanted to create morphosyntactically annotated resources for Wolof, obtained by automatically an...

Efficient Morphological Parsing with a Weighted Finite State Transducer

This article describes a highly optimized algorithm and implementation of a deterministic weighted finite state transducer for morphological analysis. We show how various functionalities can be integrated into one machine without sacrificing performance or flexibility, while still maintaining applicability to various languages. The annotation schema used in this implementation maximizes interoperability and compatibility by using a direct mapping of tags from the GOLD ontology of linguistic concepts and features, providing possible extended processing scenarios.

Two parsing algorithms by means of finite state transducers

Proceedings of the 15th conference on Computational linguistics, 1994

We present a new approach, illustrated by two algorithms, for parsing not only Finite State Grammars but also Context Free Grammars and their extension, by means of finite state machines. The basis is the computation of a fixed point of a finite-state function, i.e. a finite-state transducer. Using these techniques, we have built a program that parses French sentences with a grammar of more than 200,000 lexical rules with a typical response time of less than a second. The first algorithm computes a fixed point of a non-deterministic finite-state transducer and the second computes a fixed point of a deterministic bidirectional device called a bimachine. These two algorithms point out a new connection between the theory of parsing and the theory of representation of rational transductions.
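The idea of parsing as a fixed-point computation can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm from the paper: the three-rule grammar is hypothetical, the "transducer" is approximated by a cascade of regular-expression substitutions, and the sketch is deterministic, so it returns one reduction rather than a set of parses. The rewriting function is applied repeatedly until its output equals its input.

```python
import re

# Hypothetical toy grammar: each rule rewrites a symbol sequence into a
# category label. The order is chosen so that several passes are needed.
RULES = [
    (re.compile(r"\bNP VP\b"), "S"),
    (re.compile(r"\bV NP\b"), "VP"),
    (re.compile(r"\bDet N\b"), "NP"),
]


def transduce(symbols):
    """Apply every rewrite rule once, in order (one transducer pass)."""
    for pattern, label in RULES:
        symbols = pattern.sub(label, symbols)
    return symbols


def parse_fixed_point(symbols, max_passes=50):
    """Iterate `transduce` until the output no longer changes."""
    for _ in range(max_passes):
        rewritten = transduce(symbols)
        if rewritten == symbols:
            return symbols
        symbols = rewritten
    raise RuntimeError("no fixed point reached within max_passes")


print(parse_fixed_point("Det N V Det N"))
```

On the input above, the first pass yields `NP V NP`, the second `NP VP`, the third `S`, and the fourth leaves the string unchanged, which is the fixed point.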
