A Flexible Approach for Text Processing Engineering

A typed applicative system for a language and text processing engineering

Journal of Innovation in Digital Ecosystems, 2014

In this paper, we present a flexible, modular, consistent, and coherent approach for language and text processing engineering. Each processing chain dedicated to text processing is regarded as a serial or parallel assembly of modules, each underlying a particular task a user wants to apply to a text. Users, according to their needs and perspectives, might want to build and validate their own processing chain by assembling a set of modules according to a certain configuration. We suggest a theoretical formal system based on the model of typed applicative grammars and combinatory logic. This approach provides a general framework in which users are able to build multiple language and text analysis processes according to their own objectives. It also systematizes the verification of the logical consistency of the sequence of modules in the assembly that characterizes a given processing chain.
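The consistency check described in this abstract can be illustrated with a minimal sketch. It is not the authors' formal system: the `Module` class, the string type labels (`Text`, `Tokens`, `Counts`), and the three example modules are all hypothetical, chosen only to show how a sequence of modules might be rejected when one module's output type does not match the next module's input type.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Module:
    """A processing module with a declared input type and output type (hypothetical)."""
    name: str
    in_type: str
    out_type: str
    run: Callable


def check_chain(modules: List[Module]) -> str:
    """Return the overall type 'A -> B' of a serial chain, or raise TypeError
    at the first pair of adjacent modules whose types do not match."""
    if not modules:
        raise ValueError("empty chain")
    for left, right in zip(modules, modules[1:]):
        if left.out_type != right.in_type:
            raise TypeError(
                f"{left.name} outputs {left.out_type}, "
                f"but {right.name} expects {right.in_type}"
            )
    return f"{modules[0].in_type} -> {modules[-1].out_type}"


def run_chain(modules: List[Module], data):
    """Validate the chain first, then apply each module in order."""
    check_chain(modules)
    for module in modules:
        data = module.run(data)
    return data


# Illustrative modules; names and type labels are assumptions, not from the paper.
tokenize = Module("tokenize", "Text", "Tokens", lambda s: s.split())
lowercase = Module("lowercase", "Tokens", "Tokens", lambda ts: [t.lower() for t in ts])
count = Module("count", "Tokens", "Counts", lambda ts: {t: ts.count(t) for t in ts})
```

With these definitions, `check_chain([tokenize, lowercase, count])` yields the chain type `Text -> Counts`, while placing `count` before `lowercase` raises a `TypeError` before any text is processed.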

Integration of Sequence of Computational Modules Dedicated to Text Analysis: a Combinatory Typed Approach

In informational terms, a module dedicated to processing information always has specific inputs and outputs. It describes a particular process constrained by specific rules. A processing chain can be a serial combination and/or a parallel combination of such modules. Thus, in an architecture of language engineering, each processing chain becomes a particular instantiation of all possible paths. A processing chain is built from a choice of modules underlying tasks that an engineer wants to apply to the text. In this paper, we present our theoretical model for the logical representation of processing chains, based on combinatory logic and a formal approach based on categorial grammars and applicative grammar, along with many cases of module configurations.
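Serial and parallel combination of modules can be sketched with two standard combinators from combinatory logic: B (composition, B f g x = f (g x)) and Φ (Φ h f g x = h (f x) (g x)). Whether the paper relies on exactly these combinators is an assumption here, and the example modules (`tokenize`, `count_tokens`, `longest_token`, `report`) are hypothetical.

```python
def B(f, g):
    """Composition combinator: B f g x = f (g x).
    Models a serial assembly in which g runs first and f consumes its output."""
    return lambda x: f(g(x))


def Phi(h, f, g):
    """Phi combinator: Phi h f g x = h (f x) (g x).
    Models a parallel assembly: f and g receive the same input, h merges the results."""
    return lambda x: h(f(x), g(x))


# Hypothetical modules used only for illustration.
def tokenize(text):
    return text.split()


def count_tokens(tokens):
    return len(tokens)


def longest_token(tokens):
    return max(tokens, key=len)


def report(n, longest):
    return {"tokens": n, "longest": longest}


# Serial chain: tokenize, then count.
serial = B(count_tokens, tokenize)

# Parallel branch over the same token list, merged by report,
# then placed in series after tokenize.
parallel = Phi(report, count_tokens, longest_token)
chain = B(parallel, tokenize)
```

Here `chain` is itself an ordinary function, so a composed chain can be reused as a module inside a larger assembly.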

GATE - a General Architecture for Text Engineering

1996

Much progress has been made in the provision of reusable data resources for Natural Language Engineering, such as grammars, lexicons, thesauruses. Although a number of projects have addressed the provision of reusable algorithmic resources (or 'tools'), takeup of these resources has been relatively slow. This paper describes GATE, a General Architecture for Text Engineering, which is a freely-available system designed to help alleviate the problem.

The TEA language; Design, Implementation and Justification of a new Generic Text Processing Programming Language

2024

Programming languages drive most, if not all, of modern problem-solving using computational methods and power. Research into new programming languages and methods is essential to the improvement of computational problem-solving methods, making the design, implementation, and application of automation to general or particular problem-solving ever easier, more accessible, and more performant. General Programming Languages are typically designed to be purely domain agnostic, meaning they can be applied in any field, for any kind of problem. However, this typically also makes them difficult to apply to problems where non-programmers, or even experts with little or no general programming skills, are expected to leverage programmatic problem solving, which is why Domain Specific Languages come into play; they are generally more fine-tuned towards improving human productivity and performance than that of the machine, while making particular, domain-oriented problems simpler to solve. In this research, we wish to design and then fully implement a new Domain Specific Programming Language called TEA, for generic problem-solving leveraging Text Processing methods. We anticipate that TEA shall open up new methods of solving important old and new problems spanning information security and processing, as well as data and art generation, to name but a few domains where we see its potential being exploited. This research shall follow the design science research method, with a focus on producing new knowledge about the design and implementation of a text-processing language, as well as producing useful artifacts for researchers and end-users interested in computational problem-solving leveraging programmatic text processing, such as an industry-ready implementation of the TEA language usable from any operating system and on any reasonable computer hardware.
Further, we anticipate the evaluation of the language using the SOE framework alongside other popular and older text-processing languages such as Sed and Awk. We shall also conduct a validation of the effectiveness of the language with at least 5 practical cases inspired by real-world problems.

Programming Language Engineering - a review of Text Processing Language Design, Implementation and Evaluation Methods

2024

Programming languages drive most, if not all, of modern problem-solving using computational methods and power. Research into new programming languages and methods is essential to the improvement of computational problem-solving, making the design, implementation, and application of automation to general or particular problem-solving ever easier, more accessible, and more performant. General-purpose Programming Languages (GPLs) are typically designed to be purely domain agnostic, meaning they can be applied in any field, for any kind of problem. However, this typically also makes them difficult to apply to problems where non-programmers, or even experts with little or no GPL programming skills, are required to leverage programmatic problem-solving capabilities, which is why Domain Specific Languages (DSLs) come into play; they are generally more fine-tuned towards improving human productivity and performance than that of the machine, while making particular, domain-oriented problems simpler to solve. In this paper, we review the literature concerning how to design and then fully implement a new DSL, with special focus on a DSL for generic problem-solving leveraging Text Processing methods: essentially, a Text Processing Language (TPL). We consider leveraging the design research paradigm and philosophy as a systematic framework for guiding research into the development of new TPLs. This work presents, for the first time, new unifying theory concerning general, but also TPL-specific, language engineering theory and guiding frameworks: UPLT, PLEF & PLE. We consider quantitative but also qualitative evaluation of programming languages. We also reintroduce the SOE framework for this purpose. Finally, we set the pace for future theoretical and practical research into the field of programming language engineering, especially with a focus on TPLs.

A formal specification of document processing

Mathematical and Computer Modelling, 1997

We propose a computational model of structured documents and their processing based on preferential attribute grammar schemes and grammar coordinations. Our grammar-based model can be viewed as a specification of composable structure transformations.

Toward a New Language Engineering

Twenty-Fourth International …, 2011

In informational terms, a module dedicated to process information always has specific inputs and outputs. It describes a particular process constrained by specific rules. A processing chain can be a serial combination or a parallel combination of such modules. ...

Toward an engineering discipline for grammarware

2005

Grammarware comprises grammars and all grammar-dependent software. The term grammar is meant here in the sense of all established grammar formalisms and grammar notations including context-free grammars, class dictionaries, and XML schemas as well as some forms of tree and graph grammars. The term grammar-dependent software refers to all software that involves grammar knowledge in an essential manner.

The design of the structure of the software system for processing text document corpus

Business Informatics

One of the most difficult tasks in the field of data mining is the development of universal tools for the analysis of texts written in the literary and business styles. A popular path in the development of algorithms for processing text document corpus is the use of machine learning methods that allow one to solve NLP (natural language processing) tasks. The basis for research in the field of natural language ...

Current issues in software engineering for natural language processing

2003

In Natural Language Processing (NLP), research results from software engineering and software technology have often been neglected. This paper describes some factors that add complexity to the task of engineering reusable NLP systems (beyond conventional software systems). Current work in the area of design patterns and composition languages is described and claimed relevant for natural language processing. The benefits of NLP componentware and barriers to reuse are outlined, and the dichotomies "system versus experiment" and "toolkit versus framework" are discussed. It is argued that in order to live up to its name language engineering must not neglect component quality and architectural evaluation when reporting new NLP research.