Interactive Analysis and Visualisation of Annotated Collocations in Spanish (AVAnCES)

ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts

2022

Social media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, produced in natural contexts in any part of the world, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics, and a wide range of projects have successfully built corpora from social media. In this paper, we present the development and deployment of a linguistic corpus of Twitter posts in English, coming from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams. The information is presented through a range of powerful visualisations that let users explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.
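As a rough illustration of the kind of NLP annotation such a corpus exposes, the sketch below extracts n-gram counts from a tokenized post in plain Python. The tweet text is invented for the example and is not taken from the corpus, which the paper builds with a full annotation pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokenized tweet (a real pipeline would also add
# lemmas and morphosyntactic tags).
tokens = "breaking news the storm reaches the coast tonight".split()

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

Counting n-grams this way over the whole collection is what makes frequency-based visualisations of recurrent patterns possible.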

A Spanish E-dictionary of Collocations

Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019), 2019

We present a new e-dictionary of Spanish (in progress) called Diretes (DIccionario RETicular de ESpañol). It contains descriptions of collocations by means of Lexical Functions (LFs), both standard and non-standard, in the sense of the Meaning-Text Theory by Igor Mel'čuk. At present, Diretes contains about 50,000 collocations. This paper concentrates on the collocations in which the collocate is an adjectival or an adverbial phrase. These collocations are mostly extracted from the Práctico combinatorial dictionary of modern Spanish. We explain the structure of the e-dictionary, the types of information it contains and the way it is presented. We also show how the LF-interpreted collocations can be used in NLP applications. We demonstrate it with the SemETAP semantic analyzer, in which LFs are used to normalize semantic structures and make inferences.

A linguistic approach for determining the topics of Spanish Twitter messages

Journal of Information Science, 2014

The vast number of opinions and reviews posted on Twitter is helpful for making interesting findings about a given industry, but given the huge number of messages published every day, it is important to detect the relevant ones. In this respect, the Twitter search functionality is not a practical tool when we want to poll messages dealing with a given set of general topics. This article presents an approach to classifying Twitter messages into various topics. We tackle the problem from a linguistic angle, taking into account part-of-speech, syntactic and semantic information, and showing how language processing techniques should be adapted to deal with the informal language present in Twitter messages. The TASS 2013 General corpus, a collection of tweets specifically annotated for text analytics tasks, is used as the dataset in our evaluation framework. We carry out a wide range of experiments to determine which kinds of linguistic information have the greatest...

Corpus-based Methodology for an Online Multilingual Collocations Dictionary: First Steps

2021

This paper describes the first steps of a corpus-based methodology for the development of an online Platform for Multilingual Collocations Dictionaries (PLATCOL). The platform is designed to be customizable for different target audiences according to their needs. It covers the syntactic structures of collocations that fit into the following taxonomy: verbal, adjectival, nominal, and adverbial. Part of its design, layout and methodological procedures are based on the Bilingual Online Collocations Dictionary Platform (Orenha-Ottaiano, 2017). The methodology relies on the combination of automatic methods to extract candidate collocations (Garcia et al., 2019a) with careful post-editing performed by lexicographers. The automatic approaches take advantage of NLP tools to annotate large corpora with lemmas, PoS-tags and dependency relations in five languages (English, French, Portuguese, Spanish and Chinese). Using these data, we apply statistical measures (Evert et al., 2017; Garcia ...
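One association measure commonly used for ranking candidate collocations in work of this kind is pointwise mutual information. The abstract does not say which measures PLATCOL applies, so the sketch below is only a generic illustration, with toy counts invented for the example rather than taken from the project's corpora.

```python
import math

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information of a word pair from raw counts:
    log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    p_pair = pair_count / total
    return math.log2(p_pair / ((w1_count / total) * (w2_count / total)))

# Toy counts: the pair co-occurs 2 times, each word occurs 4 times,
# over 16 observation windows.
print(round(pmi(2, 4, 4, 16), 2))  # → 1.0
```

A positive PMI means the two words co-occur more often than their individual frequencies would predict, which is the statistical signature of a candidate collocation; lexicographers then post-edit the ranked list.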

Learning about Spanish dialects through Twitter

This paper maps the large-scale variation of the Spanish language by employing a corpus based on geographically tagged Twitter messages. Lexical dialects are extracted from an analysis of variants of tens of concepts. The resulting maps show linguistic variation on an unprecedented scale across the globe. We discuss the properties of the main dialects within a machine learning approach and find that varieties spoken in urban areas have an international character, in contrast to rural areas, where dialects show a more regional uniformity.

Methodological Approach to the Design of Digital Discourse Corpora in Spanish. Proposal of the CÓDICE Project

Procedia - Social and Behavioral Sciences, 2015

Having analyzed the current situation of Spanish corpora (and the scarce representation of digital communication in them) as well as corpora from different types of interactions on digital platforms (e-mail, chats, SMS), we noticed the need to create a repository of stable language samples to address this deficiency. Before implementing CÓDICE, an open and collaborative repository of language samples of digital discourse in Spanish, it is necessary to deal with the specific problems of compiling and transcribing this type of data. The present work addresses this challenge with two goals: 1) establishing common standards, mainly concerning contextual and situational factors, in order to facilitate sociopragmatic analysis, and 2) developing ethical standards to ensure the anonymization of participants.

Visual analytics: A novel Approach in corpus linguistics and the Nuevo Diccionario Histórico del Español

Abstract The aim of this article is to introduce visual analytics in corpus linguistics. This is a novel approach based on the integration of automated processes with unique human abilities, in a common effort to gain insight into complex problems involving vast amounts of data. Specifically, we intend to advance the application of information visualization techniques to the field of diachronic linguistics. The proposal of novel, highly interactive visual solutions is approached by means of the Computational Information Design methodology, an integral process that brings together fields such as information visualization, computational linguistics, data mining and graphic design to produce tools that truly support knowledge discovery and other general tasks of linguists. We discuss here the choices made for the design and development of interactive visual tools triggered by the creation of the New Spanish Historical Dictionary (Nuevo Diccionario Histórico del Español, NDHE).

La estrategia de comunicación de RSC de Iberdrola en Facebook y Twitter: Un análisis lingüístico y de contenido basado en corpus

2021

In our increasingly digitalized society, where sharing and discussing information is ever more accessible to the general public, research into corporate communication of Corporate Social Responsibility (CSR) content on social media is sorely needed. This study helps to fill this research gap by performing a corpus-based content and multimodal linguistic analysis of the way in which an IBEX 35 energy company, Iberdrola, communicates about its CSR policy on social media. The corpus consists of 438 posts on Twitter and 126 posts on Facebook. The results allow us to draw the following conclusions: 1) Iberdrola’s information strategy varies by social media channel: on Twitter the focus is on the environment, sustainability, social investment, stakeholders and the arts, while on Facebook the focus is directed at stakeholders, thus motivating them to engage...

27th Conference of the Spanish Society for Natural Language Processing

2011

Preface In the Iberian Peninsula, five official languages co-exist: Basque, Catalan, Galician, Portuguese and Spanish. Fostering multilinguality and establishing strong links among the linguistic resources developed for each language of the region is essential. Additionally, some of these languages lack published resources, which fosters a strong interrelation between them and better-resourced languages such as English and Spanish. In order to favour both the relations among the peninsular languages and those between them and foreign languages, multilingual NLP tools for different purposes need to be developed. Interesting research topics include, among others, the analysis of parallel and comparable corpora, the development of multilingual resources, and language analysis in bilingual environments and within dialectal variation. To address these tasks, statistical, linguistic and hybrid approaches are proposed. Therefore, the wor...

esTenTen, a Vast Web Corpus of Peninsular and American Spanish

Procedia Social and Behavioral Sciences, 2013

Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. As a response to that wish, Lexical Computing Ltd. has a programme to develop very large web corpora. In this paper we introduce the Spanish corpus, esTenTen, of 8 billion words and 19 different national varieties of Spanish. We investigate the distance between the national varieties as represented in the corpus, and examine in detail the keywords of Peninsular Spanish vs. American Spanish, finding a wide range of linguistic, cultural and political contrasts.
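A standard statistic for extracting the keywords of one variety against another, such as Peninsular vs. American Spanish, is Dunning's log-likelihood keyness. The abstract does not specify which measure the esTenTen comparison uses, so the sketch below is only an assumed, generic implementation, with invented frequencies for illustration.

```python
import math

def keyness(freq1, freq2, size1, size2):
    """Dunning log-likelihood of a word occurring freq1 times in a corpus
    of size1 tokens and freq2 times in a corpus of size2 tokens. Higher
    values mean the frequencies differ more than chance would predict."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

# Invented counts: a word much more frequent in variety 1 than variety 2.
print(round(keyness(500, 50, 1_000_000, 1_000_000), 1))
```

Ranking all words by this score and taking the top of each direction yields the keyword lists that expose the kinds of linguistic, cultural and political contrasts the paper reports.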