U-Compare: An Integrated Language Resource Evaluation Platform Including a Comprehensive UIMA Resource Library

Towards a Web-based Tool to Semi-automatically Compile, Manage and Explore Comparable and Parallel Corpora

2016

This article presents an ongoing project that aims to design and develop a robust and agile web-based application, named iCorpora, capable of semi-automatically compiling multilingual comparable and parallel corpora. Its main purpose is to increase the flexibility and robustness of the compilation, management and exploration of both comparable and parallel corpora. iCorpora intends to fulfil not only translators’ and interpreters’ needs, but also the needs of other professionals and laypeople, either by solving some of the usability problems found in the current compilation tools available on the market or by reducing their limitations and performance issues.

iCompileCorpora: A Web-based Application to Semi-automatically Compile Multilingual Comparable Corpora

This article presents an ongoing project that aims to design and develop a robust and agile web-based application capable of semi-automatically compiling monolingual and multilingual comparable corpora, which we named iCompileCorpora. The dimensions that comprise iCompileCorpora can be represented in a layered model comprising a manual, a semi-automatic and a Cross-Language Information Retrieval (CLIR) layer. This design option will not only increase the flexibility of the compilation process, but also hierarchically extend the manual layer features to the semi-automatic web-based layer and then to the semi-automatic CLIR layer. The manual layer presents the option of compiling monolingual or multilingual corpora. It will allow the manual upload of documents from a local or remote directory onto the platform. The second layer will permit the exploitation of either monolingual or multilingual corpora mined from the Internet. Because there is an increasing demand for systems that can cross language boundaries by retrieving information in various languages with just one query, the third layer aims to answer this demand by using CLIR techniques to find relevant information written in a language different from the one semi-automatically retrieved by the methodology used in the previous layer.

i-Publisher, i-Librarian and EUDocLib --- linguistic services for the Web

This paper presents three linguistically-aware online services built on top of the multilingual framework prepared for the ICT PSP EU-co-financed project ATLAS (Applied Technology for Language-Aided CMS). The framework intends to use state-of-the-art text processing methods to extract information and cluster documents. These building blocks provide the basis for advanced CMS functions such as automatic categorization or text summarization.

The corpus, its users and their needs: A user-oriented evaluation of COMPARA

International Journal of Corpus Linguistics, 2007

COMPARA is a bidirectional parallel corpus of English and Portuguese, currently with 3 million words. The corpus was launched in 2000 and at present it is possibly the largest edited parallel corpus publicly available on the Web, with roughly 6,000 corpus queries per month. This paper summarizes an analysis of six years of corpus use. We begin by looking at user studies for language resources, especially corpora, and then we provide a snapshot of COMPARA’s users and their behaviour based on log analysis. Particular emphasis is given to the language interface preferred by users (Portuguese and English are possible), the choice between the Simple and Complex Search modes, the reasons underlying null-results and behaviour after restricted output. The data has pointed us to cases where COMPARA’s Web interface can be improved, and provided insights about our users and the problems they face, although further studies that distinguish between different kinds of users remain necessary.

U-Compare: A modular NLP workflow construction and evaluation system

IBM Journal of Research and Development, 2011

During the development of natural language processing (NLP) applications, developers are often required to repeatedly perform certain tasks. Among these tasks, workflow comparison and evaluation are two of the most crucial because they help to discover the nature of NLP problems, which is important from both scientific and engineering perspectives. Although these tasks can potentially be automated, developers tend to perform them manually, repeatedly writing similar pieces of code. We developed tools to largely automate these subtasks. Promoting component reuse is another way to further increase NLP development efficiency. Building on the interoperability-enhancing Unstructured Information Management Architecture (UIMA) framework, we have collected a large library of interoperable resources, developed several workflow creation utilities, added a customizable comparison and evaluation system, and built visualization utilities. These tools are modularly designed to accommodate various use cases and potential reuse scenarios. By integrating all these features into our U-Compare system, we hope to increase NLP developer efficiency. Simple to use and directly runnable from a web browser, U-Compare has already found uses in a range of applications.
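
The kind of workflow comparison and evaluation described in this abstract can be illustrated with a minimal sketch. It is not U-Compare's implementation and does not use the UIMA API; the `Span` type, the `evaluate` and `compare_workflows` functions, the workflow names and the sample annotations are all hypothetical, and exact-match scoring of labelled spans is an assumed evaluation criterion.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A labelled text span (hypothetical stand-in for an annotation)."""
    begin: int
    end: int
    label: str


def evaluate(predicted: set[Span], gold: set[Span]) -> dict[str, float]:
    """Exact-match precision, recall and F1 of predicted spans against gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_workflows(
    outputs: dict[str, set[Span]], gold: set[Span]
) -> dict[str, dict[str, float]]:
    """Score every workflow's output against the same gold standard."""
    return {name: evaluate(spans, gold) for name, spans in outputs.items()}


# Invented sample data: one gold standard, two competing workflows.
gold = {Span(0, 5, "PROTEIN"), Span(10, 14, "PROTEIN"), Span(20, 27, "GENE")}
outputs = {
    "workflow_a": {Span(0, 5, "PROTEIN"), Span(10, 14, "PROTEIN")},
    "workflow_b": {
        Span(0, 5, "PROTEIN"),
        Span(20, 27, "GENE"),
        Span(30, 33, "GENE"),   # spurious span
        Span(10, 14, "GENE"),   # right boundaries, wrong label
    },
}

scores = compare_workflows(outputs, gold)
for name, metrics in sorted(scores.items()):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Scoring all workflows through one function against one gold standard is what removes the repeated, hand-written evaluation code the abstract mentions; a real system would also need partial-match and per-label options.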

Ninth Workshop on Building and Using Comparable Corpora Workshop Programme

2016

Comparable corpora are the most versatile and valuable resource for multilingual Natural Language Processing. The speaker will argue that comparable corpora can support a wider range of applications than has been demonstrated so far in the state of the art. The talk will present completed and ongoing work conducted by the speaker and colleagues from his research group where comparable corpora are employed for different tasks including but not limited to the identification of cognates and false friends, validation of translation universals, language change and translation of multiword expressions. Corpora have long been the preferred resource for a number of NLP applications and language users. They offer a reliable alternative to dictionaries and lexicographical resources which may offer only limited coverage. In the case of terminology, for instance, new terms are coined on a daily basis and dictionaries or other lexical resources, however up-to-date they are, cannot keep up with t...

An Integrated Digital Tool for Accessing Language Resources

lrec-conf.org

Language resources can be classified under several categories. To be able to query and operate on all (or most of) these categories using a single digital tool would be very helpful for a large number of researchers working on languages. We describe such a tool in this paper. It is ...

Atlas multilingual language processing platform

Abstract: This paper presents the ATLAS platform, a multilingual language processing framework integrating a common set of linguistic tools for a group of European languages (less-resourced: Bulgarian, Croatian, Greek, Polish and Romanian, together with English and German as reference languages). State-of-the-art NLP functionality offered by the platform allows for multilingual annotation of texts on lower levels (segmentation, morphosyntax), which in turn supports higher-level processing such as automated categorization, information extraction, machine translation or summarization. More elaborate annotation methods such as named entity extraction or multiword unit lemmatization are also available. Multilevel annotation of texts is governed by language processing chains constructed with the UIMA (Unstructured Information Management Architecture) industry standard. To demonstrate the capabilities of the framework, three linguistically-aware online services have been built on top of it: i-Publisher (a Web-based content management platform), i-Librarian (a digital library of scientific works) and EUDocLib (a site for browsing and searching through EUR-LEX documents). Keywords: linguistic tools, language resources, Web services, content management system, online services, UIMA

LiMoSINe Pipeline: Multilingual UIMA-based NLP Platform

Proceedings of ACL-2016 System Demonstrations, 2016

We present a robust and efficient parallelizable multilingual UIMA-based platform for automatically annotating textual inputs with different layers of linguistic description, ranging from surface-level phenomena all the way down to deep discourse-level information. In particular, given an input text, the pipeline extracts: sentences and tokens; entity mentions; syntactic information; opinionated expressions; relations between entity mentions; co-reference chains and wikified entities. The system is available in two versions: a standalone distribution enables design and optimization of user-specific sub-modules, whereas a server-client distribution allows for straightforward high-performance NLP processing, reducing the engineering cost for higher-level tasks.