Challenges of Terminology Extraction from Legal Spanish Corpora

Thesaurus Enhanced Extraction of Hohfeld's Relations from Spanish Labour Law

2021

In this paper we describe the design of an experiment to extract Hohfeld’s deontic relations from legal texts. Our approach intends to minimise the manual effort in the annotation process by expanding a set of initial annotations with the legal domain knowledge contained in thesauri represented in Semantic Web formats. With such annotations, we perform a set of iterations to train a deep learning relation extraction model. After analysing the results, we will adapt the process to work on the extraction of Hohfeld’s potestative relations. We also plan to use that model to recognise relations in unseen legal sub-domains.

Legal Terminology Extraction with the Termolator

Proceedings of the Natural Legal Language Processing Workshop 2021, 2021

Domain-specific terminology is ubiquitous in legal documents. Despite its potential utility in populating glossaries and ontologies, or as arguments in information extraction and document classification tasks, there has been limited work on legal terminology extraction. This paper describes some work to remedy this omission. In the described research, we make some modifications to the Termolator, a high-performing, open-source terminology extractor which has been tuned to scientific articles. Our changes are designed to improve the Termolator's results when applied to United States Supreme Court decisions. Unaltered and using the recommended settings, the original Termolator provides a list of terminology with a precision of 23% and 25% for the categories of economic activity (development set) and criminal procedures (test set) respectively. These were the most frequently occurring broad issues in the Washington University in St. Louis Database corpus, a database of Supreme Court decisions that have been manually classified by topic. Our contribution includes the introduction of several legal domain-specific filtration steps and changes to the web search relevance score; each incrementally improved precision, culminating in a combined precision of 63% and 65%. We also evaluated the baseline version of the Termolator on more specific subcategories and on broad issues with fewer cases. Our results show that a narrower scope and a smaller number of documents both significantly lower the precision. In both cases, the modifications to the Termolator improve precision.
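The precision figures above are the share of extracted candidates that are judged to be true terms. The following minimal sketch shows how a filtration step can raise that share on toy data. The filters (`has_no_digits`, `is_not_case_name`), the candidate list and the gold set are all hypothetical illustrations and are not the Termolator's actual filters or results.

```python
import re


def precision(candidates, gold_terms):
    """Share of extracted candidates that were judged to be true terms."""
    if not candidates:
        return 0.0
    hits = sum(1 for term in candidates if term in gold_terms)
    return hits / len(candidates)


# Hypothetical filters, for illustration only (not taken from the paper).
def has_no_digits(term):
    """Reject candidates containing digits, e.g. reporter citations."""
    return re.search(r"\d", term) is None


def is_not_case_name(term):
    """Reject candidates shaped like 'party v. party'."""
    return re.search(r"\bv\.\s", term) is None


def apply_filters(candidates, filters):
    """Keep only the candidates accepted by every filter."""
    return [term for term in candidates if all(f(term) for f in filters)]


# Toy data: six extracted candidates, three of which are true terms.
candidates = [
    "habeas corpus",
    "smith v. jones",
    "due process",
    "410 u.s. 113",
    "probable cause",
    "last year",
]
gold_terms = {"habeas corpus", "due process", "probable cause"}

filtered = apply_filters(candidates, [has_no_digits, is_not_case_name])

print(f"baseline precision: {precision(candidates, gold_terms):.2f}")
print(f"filtered precision: {precision(filtered, gold_terms):.2f}")
```

Each filter only removes candidates, so precision rises whenever the removed items are mostly non-terms. Recall can fall if a filter also removes true terms, which this sketch does not measure.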

Spanish and English terminological study based on comparable Corpus of Employment contracts

2016

The evolution of new technologies has transformed the way language is studied in the last few years. The use of electronic corpora has facilitated the task of both experts and learners, especially by speeding up the compilation and analysis of data. For this reason, this final work is based on a corpus linguistics methodology for the analysis of the specialised language of employment contracts in Spanish and English. The analysis includes glossaries, for Spanish and English, of the twenty most common terms in this type of document. Each of these terms includes a real example of use and our translation proposal.

Information Extraction from Legal Documents Using Linguistic Knowledge and Ontologies

Information extraction in legal texts is an important part of a broader set of enabling tools to assist users in accessing relevant information. Existing approaches have difficulty treating certain aspects of the text properly. Knowledge acquisition rules, based on the linguistic treatment of specific aspects of legal documents, would be useful for improving the results in this task. Additionally, domain knowledge representation can provide an even broader set of possibilities. This paper presents a model for addressing Information Extraction from texts in the legal domain in which both of the aforementioned aspects are considered. It outlines the proposed fundamental components, describes Brazilian law document use cases and discusses the methodology and initial results, as well as future work.

Spanish Legalese Language Model and Corpora

2021

There are many Language Models for the English language, owing to its worldwide relevance. However, for the Spanish language, even though it is widely spoken, there are very few Language Models, and those that exist are small and too general. Legalese could be thought of as a Spanish variant in its own right, as it is very complex in vocabulary, semantics and phrase understanding. For this work we gathered legal-domain corpora from different sources, generated a model and evaluated it on Spanish general-domain tasks. The model provides reasonable results in those tasks.

Automatic Access to Legal Terminology Applying Two Different Automatic Term Recognition Methods

Automatic term recognition (ATR) methods help to identify the most representative terms in a corpus automatically, saving time and making it possible to manage large amounts of data that could not be dealt with manually. This paper presents the evaluation of two ATR methods implemented on a 2.6 million-word legal corpus designed and compiled ad hoc: Keywords (Scott, 2008) and Chung's method (2003). Both techniques have been assessed as regards precision and recall. The results clearly show that Keywords is, by far, the more efficient of the two, recognizing 62% true terms out of the 2,000 items evaluated in this study.
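Keyword-style ATR ranks words by how much more frequent they are in a specialised corpus than in a general reference corpus. The sketch below illustrates that idea with a log-likelihood keyness score. The choice of statistic, the function names and the toy sentences are assumptions made for illustration, not the exact configuration evaluated in the paper.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of a word.

    a: frequency in the study corpus, b: frequency in the reference corpus,
    c: size of the study corpus, d: size of the reference corpus (in tokens).
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_study)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


def keywords(study_tokens, reference_tokens, top_n=10):
    """Rank words that are relatively more frequent in the study corpus."""
    study = Counter(study_tokens)
    reference = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        # Keep positive keywords only: over-represented in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    # Highest keyness first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Toy corpora (hypothetical): one legal sentence, one general sentence.
study_tokens = (
    "the court held that the defendant breached the contract "
    "and the claimant sought damages for breach of contract"
).split()
reference_tokens = (
    "the weather was fine and the children played in the park "
    "while the dog slept"
).split()

for word, score in keywords(study_tokens, reference_tokens, top_n=5):
    print(f"{word}\t{score:.2f}")
```

Function words such as "the" drop out because they are no more frequent in the study corpus than in the reference corpus. The ranked list would then be checked against manually validated terms to obtain precision and recall.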

Automatic Access to Legal Terminology Applying Two Different Automatic Term Recognition Methods

Procedia - Social and Behavioral Sciences, 2013

Legal Terminology for Translators: Company Law. A Bilingual Corpus-Driven Project

POLISSEMA – Revista de Letras do ISCAP, 2020

In a world where people, goods, services, companies, and capital move globally, legal translation plays a crucial role because legal documents and rules regulate all these exchanges. Legal translation is acknowledged as a daunting and time-consuming task, due to the culture-bound nature of legal terms and the complexity of legal language. Existing terminology resources do not provide translators with enough information to make informed decisions without extensive searches and concept comparison. The aim of this article is to present a corpus-driven bilingual (British English and European Portuguese) terminology project in the domain of company law (company incorporation) with a view to translating company types and incorporation documents, as well as to understanding and deciding on the most suitable strategies to find equivalents in legal translation. As far as methodology is concerned, we follow the typical workflow of bilingual or multilingual terminology projects, according to the literature on the subject as well as terminology standards. We resort to comparable corpora, semi-automatic term extraction tools and concordance tools, as well as terminology management software.

TERMitLEX: a legal terminology knowledge base for translators, interpreters and beyond

2020

The first terminology database of the University of Trieste – TERMit – dates back to 1996 and contains collections belonging to all sorts of specialised domains. This paper illustrates a project conducted by a team of linguists and lawyers at the Department of Legal, Language, Interpreting and Translation Studies. The aim of the project was to take TERMit more than a step forward, by developing a new terminological knowledge base, devoted exclusively to legal terminology. The project consisted of i) revising TERMit’s template; ii) updating the existing terminological collections; and iii) disseminating terminological data. The aim of our paper is to give a general overview of the project and to illustrate in more detail how the lawyers’ needs can influence the structure of a terminology database.

Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing

2020

Legal-ES is an open source resource kit for legal Spanish. It consists of a large scale Spanish corpus of open legal texts and different kinds of language models including word embeddings and topic models. The corpus includes over 1000 million words covering a collection of legislative and administrative open access documents in Spanish from different sources representing international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. For the word embeddings, gensim was used on the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models to obtain a convenient tool to understand the main topics in the corpus and to navigate through the documents exploiting the semantic similarity among documents. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of Spanish jurisdiction that have oc...
