Ana Salgado | Faculdade de Ciências Sociais e Humanas - Universidade Nova de Lisboa
Papers by Ana Salgado
Langues & Parole
In this article, we present OntoDomLab-Med, an ontology of domain labels for the medical and health sciences. We built a taxonomy from the labels found in the list of abbreviations of the Dicionário da Língua Portuguesa Contemporânea of the Lisbon Academy of Sciences. Our goal is to link OntoDomLab-Med to selected dictionary entries encoded in TEI Lex-0, a markup scheme that is stricter than TEI and better suited to encoding dictionaries, in line with the FAIR principles. The ontology, built with Protégé and encoded in OWL, allows knowledge to be exported in an interoperable exchange format, so that it can be applied to different lexical resources to reference domains regardless of the language used. OntoDomLab-Med will be useful not only for searching information by domain, but will also allow lexicographers to be more consistent in their work of ...
De Gruyter eBooks, Dec 5, 2022
UIDB/00749/2020, UIDP/00749/2020. This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web standards. The results obtained are useful for the discussion within the community.
UID/LIN/03213/2013. MORDigital is a newly funded Portuguese lexicographic project that aims to produce high-quality and searchable digital versions of the first three editions (1789; 1813; 1823) of the Diccionario da Lingua Portugueza by António de Morais Silva, preserving and making accessible this important work of European heritage. This paper will describe the current state of the art, the project, its objectives and the methodology proposed, the latter of which is based on a rigorous linguistic analysis and will also include steps necessary for the ontologisation of knowledge contained in and relating to the text. A section will be dedicated to the various investigation domains of the project description. The output of the project will be made available via a dedicated platform.
UIDB/00749/2020, UIDP/00749/2020. Terms are a significant part of lexicographical nomenclatures in general language dictionaries. In this paper, we focus on how football terms are treated in three Academy Dictionaries (Portuguese, French, and Spanish) and draw some conclusions about the lexicographical decisions taken in the three languages. After identifying every position football players can have on the field, we verify whether the dictionaries above include these terms. We propose the TEI encoding of the term “defesa” (defence), which designates a position occupied by football players on the field. Bearing in mind concepts such as reusability and interoperability, we intend to present: 1) a comparison of football terms in the three dictionaries; 2) TEI Lex-0 dictionary encoding, a streamlined standard to facilitate interoperability; 3) a consistent TEI modelling and description of the microstructural elements of lexicographical entries. In the end, we draw some conclusions.
UIDB/00749/2020, UIDP/00749/2020. In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.
Over the course of the last few years, lexicography has witnessed the burgeoning of increasingly reliable automatic approaches supporting the creation of lexicographic resources such as dictionaries, lexical knowledge bases and annotated datasets. In fact, recent achievements in the field of Natural Language Processing and particularly in Word Sense Disambiguation have widely demonstrated their effectiveness not only for the creation of lexicographic resources, but also for enabling a deeper analysis of lexical-semantic data both within and across languages. Nevertheless, we argue that the potential derived from the connections between the two fields is far from exhausted. In this work, we address a serious limitation affecting both lexicography and Word Sense Disambiguation, i.e. the lack of high-quality sense-annotated data, and describe our efforts aimed at constructing a novel, entirely manually annotated parallel dataset in 10 European languages. For the purposes of the present p...
In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.
In this presentation, we focus on how football terms are treated in three Academy Dictionaries (Portuguese, French, and Spanish) and draw some conclusions about the lexicographical decisions taken in the three languages.
This paper describes some experiments made while encoding the first complete dictionary of the Academia das Ciências de Lisboa (DACL) in the context of TEI Lex-0, a community-based interchange format for lexical data aimed at facilitating the interoperability and reusability of lexical resources. Even though the original encoding of the DACL was based on TEI, we decided to switch to TEI Lex-0 because it allowed us to streamline our encoding. Our experiments show that even though TEI Lex-0 is stricter than TEI itself (allowing fewer elements and imposing certain constraints that are not present in plain TEI), it is fully capable of representing the complexities of the entry structure of the DACL. In the paper, we discuss the TEI Lex-0 encoding of the DACL, as well as the conversion methodology and the tools used for the automatic conversion from the original encoding. We are currently focusing on the macrostructural level, more precisely on the types of lexical units and on the writt...
The digital era has brought some challenges to lexicographers, but it has also brought new opportunities as part of the rise of information technology and, more recently, the emergence of digital humanities. This paper provides a description of LeXmart, the framework that supports the digital development of the Portuguese Academy of Sciences Dictionary. LeXmart is a framework of smart tools supporting lexicographers' work, ranging from a structural editor to a set of validation tools. Given that the dictionary is stored in eXist-DB, LeXmart is developed on top of its ecosystem, using W3C standard languages and exposing functionality that eXist-DB provides by default, namely a RESTful API.
Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.
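As a rough illustration of sense-level alignment annotated with a semantic relationship, the following Python sketch models one link between two senses. It is not taken from the paper or its repository: the class names, the sense identifiers and the example lemma are hypothetical, and only the four relationship types come from the abstract above.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # The four relationship types named in the abstract; the string
    # values are illustrative, not the dataset's actual labels.
    EQUIVALENT = "equivalent"
    BROADER = "broader"
    NARROWER = "narrower"
    RELATED = "related"


# Relation obtained when an alignment is read in the opposite direction.
INVERSE = {
    Relation.EQUIVALENT: Relation.EQUIVALENT,
    Relation.BROADER: Relation.NARROWER,
    Relation.NARROWER: Relation.BROADER,
    Relation.RELATED: Relation.RELATED,
}


@dataclass(frozen=True)
class SenseAlignment:
    lemma: str
    source_sense: str  # hypothetical sense id in the first dictionary
    target_sense: str  # hypothetical sense id in the second dictionary
    relation: Relation

    def inverted(self) -> "SenseAlignment":
        """Return the same link seen from the second dictionary."""
        return SenseAlignment(
            self.lemma, self.target_sense, self.source_sense, INVERSE[self.relation]
        )


# Hypothetical example: sense 1 in dictionary A is broader than sense 3 in B.
link = SenseAlignment("banco", "dictA:banco.1", "dictB:banco.3", Relation.BROADER)
print(link.inverted())
```

Storing the direction explicitly means a single annotation can be consumed from either dictionary's point of view without annotating the pair twice.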
In this article we describe the workflow implemented to convert a dictionary saved as a PDF file into an XML document and then import it into an XML-aware database, as well as the process for editing, adding and deleting entries. The conversion process was challenging given the format of the PDF file and the fine-grained detail of the XML schema that was used. For that, an iterative filtering approach was used. To store the dictionary we decided to use an XML-aware database (eXist-DB) that stores each dictionary entry as a separate resource. It can be queried using a web interface developed in XQuery. The lexicographers can edit entries using the oXygen XML editor, reading and storing them directly in the database. In order to guarantee incremental backups, a mechanism was defined to import the XML database into a Git repository. Finally, a couple of programs were created in order to prepare regular reports on the dictionary revision process, as well as to back it up in a GIT repos...
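The idea of storing each dictionary entry as a separate resource can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names (`dictionary`, `entry`, `form`, `def`), the `id` attribute and the sample data are hypothetical and do not reproduce the schema used in the article, and the sketch does not talk to eXist-DB at all; it only shows the splitting step.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real schema is far more fine-grained.
SAMPLE = """<dictionary>
  <entry id="casa"><form>casa</form><def>building for living in</def></entry>
  <entry id="livro"><form>livro</form><def>bound set of pages</def></entry>
</dictionary>"""


def split_entries(xml_text: str) -> dict[str, str]:
    """Map a resource name to the serialised XML of each entry."""
    root = ET.fromstring(xml_text)
    resources = {}
    for entry in root.iter("entry"):
        entry.tail = None  # drop the whitespace that separated entries
        resources[f"{entry.get('id')}.xml"] = ET.tostring(entry, encoding="unicode")
    return resources


resources = split_entries(SAMPLE)
for name in resources:
    print(name)
```

Keeping one resource per entry means an edit touches a single small document, which also makes per-entry change tracking in a version control system straightforward.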
Revista da Associação Portuguesa de Linguística, 2020
This paper presents the Digital Edition of the Vocabularies of the Academy of Sciences project, which aims to digitise the spelling vocabularies of the Lisbon Academy of Sciences (ACL) in order to create a digital lexicographic corpus bringing together the printed versions of all these lexicographical reference works: the 1940, 1947, 1970, and finally the 2012 editions. The first stage started with the Vocabulário Ortográfico da Língua Portuguesa [Orthographic Vocabulary of the Portuguese Language] (VOLP-1940), our case study. After digitising this vocabulary, the work described here focuses on the linguistic annotation of VOLP-1940 using eXtensible Markup Language (XML), an annotation metalanguage, and following the annotation directives of the Text Encoding Initiative (TEI), more specifically the application of TEI Lex-0, a new TEI sub-format. We aim to highlight the need for rigorous linguistic data processing in the creation of new lexical resources to increase the quality of t...
Slovenščina 2.0: empirical, applied and interdisciplinary research, 2020
The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered adequately and in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital representation of textual resources in the scholarly research community. In this paper, we use the Dictionary of the Portuguese Academy of Sciences as a case study for presenting our ongoing work on encoding polylexical units using TEI Lex-0, an initiative aimed at simplifying and streamlining the encoding of lexical data with TEI in order to improve interoperability. We introduce the notion of macro- and microstructural relevance to differentiate between polylexicals that serve as headwords for their own independent dictionary entries and those which appear inside entries for different headwords. We develop the notion of lexicographic transparency to distinguish between those units which ...
RILEX. Revista sobre investigaciones léxicas, 2019
The current digital revolution is opening new paths in the production and compilation of lexicographic resources, specifically general language dictionaries, which are now adapted, in both form and content, to the new needs of society at large and of their users in particular. Alongside the general lexicon, these works record, describe and define specialised vocabulary from different fields of knowledge. The number of terminological units included in the nomenclature of these resources tends to grow, given technological progress, the evolution of society and globalisation, since these units are privileged sources of lexical renewal and enrichment for linguistic systems. Accordingly, the subject-field labels that mark specialised vocabulary in monolingual dictionaries are the object of study of this work, whose purpose is to contribute...
This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web technologies. The results obtained are useful for the discussion within the community.
Presentation of the project Digital Edition of the Vocabularies of the Lisbon Academy of Sciences: the VOLP 1940.
Langues & Parole
Dans cet article, nous présentons OntoDomLab-Med, une ontologie des marques de domaines des scien... more Dans cet article, nous présentons OntoDomLab-Med, une ontologie des marques de domaines des sciences médicales et de la santé. Nous avons élaboré une taxonomie à partir des marques présentes dans la liste des abréviations du Dicionário da Língua Portuguesa Contemporânea de l’Académie des Sciences de Lisbonne. Notre objectif est de mettre en rapport OntoDomLab-Med et les entréessélectionnées du dictionnaire balisées en TEI Lex-0 – système de balisage plus stricte et plus adapté que TEI au codage des dictionnaires – en ligne avec les principes FAIR. L’ontologie construite avec Protégé et codifiée en OWL permet l’exportation des connaissances dans un format d’échange interopérable permettant que l’ontologie puisse être appliquée à différentes ressources lexicales pour référencer les domaines indépendamment de la langue utilisée.OntoDomLab-Med sera utile non seulement pour rechercher de l’information par domaine, mais permettra au lexicographe d’être plus cohérent dans son travail de ch...
De Gruyter eBooks, Dec 5, 2022
UIDB/00749/2020 UIDP/00749/2020This paper reports on an ongoing task of monolingual word sense al... more UIDB/00749/2020 UIDP/00749/2020This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web standards. The results obtained are useful for the discussion within the community.publishersversionpublishe
UID/LIN/03213/2013MORDigital is a newly funded Portuguese lexicographic project that aims to prod... more UID/LIN/03213/2013MORDigital is a newly funded Portuguese lexicographic project that aims to produce high-quality and searchable digital versions of the first three editions (1789; 1813; 1823) of the Diccionario da Lingua Portugueza by António de Morais Silva, preserving and making accessible this important work of European heritage. This paper will describe the current state of the art, the project, its objectives and the methodology proposed, the latter of which is based on a rigorous linguistic analysis and will also include steps necessary for the ontologisation of knowledge contained in and relating to the text. A section will be dedicated to the various investigation domains of the project description. The output of the project will be made available via a dedicated platform.publishersversionpublishe
UIDB/00749/2020 UIDP/00749/2020Terms are a significant part of lexicographical nomenclatures in g... more UIDB/00749/2020 UIDP/00749/2020Terms are a significant part of lexicographical nomenclatures in general language dictionaries. In this paper, we focus on how football terms are treated in three Academy Dictionaries – Portuguese, French, and Spanish – and draw some conclusions about the lexicographical decisions taken in the three languages. After identifying every position football players can have on the field, we verify whether the dictionaries above include these terms. We propose the TEI encoding of the term “defesa” (defence), which designates a position occupied by football players on the field. Bearing in mind concepts such as reusability and interoperability, we intend to present: 1) a comparison of football terms in the three dictionaries; 2) TEI Lex-0 dictionary encoding, a streamlined standard to facilitate interoperability; 3) a consistent TEI modelling and description of the microstructural elements of lexicographical entries. In the end, we draw some conclusions.publis...
UIDB/00749/2020 UIDP/00749/2020In this article, we will introduce two of the new parts of the new... more UIDB/00749/2020 UIDP/00749/2020In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework(LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, andPart 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the useof both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of thereference Portuguese dictionaryGrande Dicion ́ario Houaiss da L ́ıngua Portuguesa, part of a broader experiment comprisingthe analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the UnifiedModelling Language (UML) and also in a couple of cases in TEI.publishersversionpublishe
Over the course of the last few years, lexicography has witnessed the burgeoning of increasingly ... more Over the course of the last few years, lexicography has witnessed the burgeoning of increasingly reliable automatic approaches supporting the creation of lexicographic resources such as dictionaries, lexical knowledge bases and annotated datasets. In fact, recent achievements in the field of Natural Language Processing and particularly in Word Sense Disambiguation have widely demonstrated their effectiveness not only for the creation of lexicographic resources, but also for enabling a deeper analysis of lexical-semantic data both within and across languages. Nevertheless, we argue that the potential derived from the connections between the two fields is far from exhausted. In this work, we address a serious limitation affecting both lexicography and Word Sense Disambiguation, i.e. the lack of high-quality sense-annotated data and describe our efforts aimed at constructing a novel entirely manually annotated parallel dataset in 10 European languages. For the purposes of the present p...
In this article, we will introduce two of the new parts of the new multi-part version of the Lexi... more In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicion´ario Houaiss da L´ıngua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.
In this presentation, we focus on how football terms are treated in three Academy Dictionaries – ... more In this presentation, we focus on how football terms are treated in three Academy Dictionaries – Portuguese, French, and Spanish – and will draw some assumptions about the lexicographical decisions that had been taken in the three languages.
This paper describes some experiments made while encoding the first complete dictionary of the Ac... more This paper describes some experiments made while encoding the first complete dictionary of the Academia das Ciências de Lisboa (DACL) in the context of TEI Lex-0, a community-based interchange format for lexical data aimed at facilitating the interoperability and reusability of lexical resources. Even though the original encoding of the DACL was based on TEI, we decided to switch to TEI Lex-0 because it allowed us to streamline our encoding. Our experiments show that even though TEI Lex-0 is stricter than TEI itself (allowing fewer elements and imposing certain constraints that are not present in plain TEI), it is fully capable of representing the complexities of the entry structure of the DACL. In the paper, we discuss the TEI Lex-0 encoding of the DACL, as well as the conversion methodology and the tools used for the automatic conversion from the original encoding. We are currently focusing on the macrostructural level, more precisely on the types of lexical units and on the writt...
The digital era has brought some challenges to lexicographers, but it has also brought new opport... more The digital era has brought some challenges to lexicographers, but it has also brought new opportunities as part of the rise of information technology and, more recently, the emergence of digital humanities. This paper provides a description of LeXmart, the framework that supports the digital development of the Portuguese Academy of Sciences Dictionary. LeXmart is a smart tool framework to support lexicographers' work that offers different types of tools, ranging from a structural editor to a set of validation tools. Given that the dictionary is stored in eXist-DB, LeXmart is developed on top of its ecosystem, using W3C standard languages, and offering default functionalities offered by eXist-DB, namely a RESTful API.
Aligning senses across resources and languages is a challenging task with beneficial applications... more Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.
In this article we describe the workflow implemented to convert a dictionary saved as a PDF file ... more In this article we describe the workflow implemented to convert a dictionary saved as a PDF file into an XML document and posterior importation into an XML aware database, and the process to edit, add and delete new entries. The conversion process was challenging given the format of the PDF file, and the fine grained detail of the XML schema that was used. For that, an iterative filtering approach was used. To store the dictionary we decided to use an XML aware database (eXist-DB), that stores each dictionary entry as a separate resource. It can be queried used a web interface developed using XQuery. The lexicographers can edit entries using the oXygen XML editor, reading and storing them directly in the database. In order to guarantee incremental backups, it was defined a mechanism to import the XML database into a GIT repository. Finally, a couple of programs were created in order to prepare regular reports on the dictionary revision process, as well as to backup it in a GIT repos...
Revista da Associação Portuguesa de Linguística, 2020
This paper presents the Digital Edition of the Vocabularies of the Academy of Sciences project, w... more This paper presents the Digital Edition of the Vocabularies of the Academy of Sciences project, which aims to digitise the spelling vocabularies of the Lisbon Academy of Sciences (ACL) in order to create a digital lexicographic corpus bringing together the printed versions of all these lexicographical reference works – the 1940, 1947, 1970, and finally the 2012 editions. The first stage started with the Vocabulário Ortográfico da Língua Portuguesa [Orthographic Vocabulary of the Portuguese Language] (VOLP-1940), our case study. After digitising this vocabulary, the work described here focuses on the linguistic annotation of VOLP-1940 using eXtensible Markup Language (XML), an annotation metalanguage, and following the annotation directives of the Text Encoding Initiative (TEI), more specifically the application of TEI Lex-0, a new TEI sub-format. We aim to highlight the need for rigorous linguistic data processing in the creation of new lexical resources to increase the quality of t...
Slovenščina 2.0: empirical, applied and interdisciplinary research, 2020
The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are per... more The modelling and encoding of polylexical units, i.e. recurrent sequences of lexemes that are perceived as independent lexical units, is a topic that has not been covered adequately and in sufficient depth by the Guidelines of the Text Encoding Initiative (TEI), a de facto standard for the digital representation of textual resources in the scholarly research community. In this paper, we use the Dictionary of the Portuguese Academy of Sciences as a case study for presenting our ongoing work on encoding polylexical units using TEI Lex-0, an initiative aimed at simplifying and streamlining the encoding of lexical data with TEI in order to improve interoperability. We introduce the notion of macro- and microstructural relevance to differentiate between polylexicals that serve as headwords for their own independent dictionary entries and those which appear inside entries for different headwords. We develop the notion of lexicographic transparency to distinguish between those units which ...
RILEX. Revista sobre investigaciones léxicas, 2019
La actual revolución digital traza nuevos caminos en el ámbito de la producción y elaboración de ... more La actual revolución digital traza nuevos caminos en el ámbito de la producción y elaboración de recursos lexicográficos, concretamente en los diccionarios de lengua general, que se encuentran actualmente adaptados a nuevas necesidades de la sociedad en general y a las de sus usuarios en particular, tanto en la forma que asumen como en el contenido. A la par del léxico general, estas obras registran, describen y definen léxico especializado de diferentes áreas del conocimiento. El número de unidades terminológicas que forman parte de la nomenclatura de estos recursos tiene tendencia a aumentar, dado el auge tecnológico, la evolución de la sociedad y los fenómenos de globalización, una vez que estas unidades constituyen fuentes privilegiadas de renovación y enriquecimiento lexicales de los sistemas lingüísticos. De este modo, las marcas temáticas que etiquetan el léxico especializado en diccionarios monolingües son objeto del estudio del presente trabajo, cuya finalidad es contribuir...
This paper reports on an ongoing task of monolingual word sense alignment in which a comparative ... more This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web technologies. The results obtained are useful for the discussion within the community.
Presentation of the project Edição Digital dos Vocabulários da Academia das Ciências de Lisboa (Digital Edition of the Vocabularies of the Lisbon Academy of Sciences): the VOLP 1940.
Travel Grant Reports Call I, 2018
Since I am the person in charge of the coordination of the new Portuguese Academy Dictionary, specifically as regards planning the macrostructure and microstructure of this dictionary, it is crucial that I learn how other structures are devised, such as that of the Diccionario de la lengua española (DLE), published by an analogous academy, the Real Academia Española.
This research is aimed at defining guidelines for the inclusion and description of terms in general language dictionaries in order to help lexicographers in this specific task. Combining lexicographical and terminological methods is definitely a plus for the planning of a dictionary’s macrostructure and microstructure, improving the organization and description of lexicographical articles as a whole.
During the Lexical Data Master Class (DARIAH, Berlin, 2018), I worked on the normalization of a pre-TEI export from the database within which the Academy of Sciences Portuguese Dictionary was encoded. A lot of procedural aspects were identified and documented to proceed with this project. The output comes with a full-fledged TEI header and a precise typology of dictionary entries. One of the specific working areas was the representation of collocations, which are now seen as real entries within the main ones. Another issue was to record the forms that have been impacted by the Portuguese spelling reform.
IX CONGRESO INTERNACIONAL DE LEXICOGRAFÍA HISPÁNICA: LEXICOGRAFÍA DEL ESPAÑOL. INTERNACIONALIZACIÓN E INTERCOMUNICACIÓN, 2021
The aim of this paper is to analyze the list of domains in the Diccionario de la lengua española (DLE) in order to rethink the theoretical and methodological assumptions of the lexicographic tradition regarding domain labeling. After describing and analyzing the labels that identify specialized lexicon and comparing them across the Iberian academy dictionaries, we will compare the DLE's list of labels with other classification systems such as EUROVOC, the UNESCO Thesaurus, and the WordNet Domains Hierarchy.
Starting from the different types of labels used in (meta)lexicography, and given the emergence of issues linked to the digital humanities, such as the interoperability of lexicographic resources, our research will focus on the technical or subject labels of the DLE. We will thus refer to 74 labels, among them generic labels (“mathematics”) with hierarchically subordinate labels (“geometry” or “statistics”); generic labels without hierarchical structure (“geology”); and even labels confined to the lexicographic tradition (“alchemy” or “heraldry”), while noting the absence of contemporary labels (“tourism”).
Our case study will focus on the domain of geology and its lexicographic treatment in the entries belonging to this field of knowledge. Once conceptually organized, this domain will serve as the starting point for proposing a methodology that combines lexicographic and terminological procedures and that will prove a useful tool for planning macrostructures and microstructures, that is, for organizing and describing lexicographic articles.
[The Universidad de La Laguna (Spain) is hosting this congress, which was scheduled for 17 to 19 June 2020 and has been postponed: http://eventos.ull.es/37251/detail/ix-congreso-internacional-de-lexicografia-hispanica_-lexicografia-del-espanol_-internacionalizacion.html]
TEI Conference, What is text, really? TEI and beyond, 2019
In this paper, we report on the encoding of the Portuguese Academy Dictionary using TEI Lex-0. We demonstrate how we applied this new baseline format for lexical data to mark up ‘special entries’ in the dictionary: part-of-speech homonyms (capital1, capital2, capital3), etymological homonyms (cota1, cota2), homographs (lobo1 /ó/, lobo2 /ô/), spelling variants (ouro, oiro), trademarks (donut), entries that have a different meaning in the plural (antepassados), and lexical variants (missanga, miçanga). Even though TEI Lex-0 reduces the number of TEI elements that can be used to describe entry-like objects from five (<entry>, <entryFree>, <superEntry>, <re> and <hom>) to only one (<entry>), our work shows that TEI Lex-0 is fully capable of representing the complexities of the entry structure of the Portuguese Academy Dictionary. Furthermore, we argue that this simplified array of elements can lead to more coherent and more legible encoding without sacrificing its semantic expressivity. In addition to justifying our concrete encoding choices, we will describe the process of converting our data from TEI to TEI Lex-0 and the documentation of the differences between our original TEI encoding and the TEI Lex-0 version. As of this writing, TEI Lex-0 is still a work in progress. This paper is therefore intended as both a contribution to and a commentary on the efforts of the TEI Lex-0 group.
ELEX Proceedings, 2019
The digital era has brought some challenges to lexicographers, but it has also brought new opportunities as part of the rise of Information Technology and, more recently, the emergence of Digital Humanities. This paper provides a description of LeXmart, the framework that supports the digital development of the Portuguese Academy of Sciences Dictionary (DACL). LeXmart is a smart tool framework to support lexicographers’ work that offers different types of tools, ranging from a structural editor to a set of validation tools.
Given that the dictionary is stored in eXist-DB, LeXmart is developed on top of its ecosystem, using W3C standard languages and exposing the default functionalities that eXist-DB provides, namely a RESTful API.
ELEXIS Observer Event, 2019
The production processes of lexicographic work are changing to adapt to the digital era. In responding to the resulting needs (users, interoperability purposes, data structure, consistency), standards have the advantage of facilitating interoperability.
The Space of Languages, 2016
This paper reports on the progress of the lexicographic work on updating the new Dicionário da Academia das Ciências de Lisboa, which requires rigorous methodological planning and the establishment of a number of working procedures.
A short presentation to Ilex (Real Academia Española)
Memórias da Academia das Ciências de Lisboa, 2018
The Instituto de Lexicologia e Lexicografia da Língua Portuguesa (IILLP) is preparing a new dictionary. With the aim of making the Dicionário da Língua Portuguesa Contemporânea (2001) available online and updating its content, the sine qua non condition established was the use and development of computational tools. To this end, a team was set up with academic coordination and the support of the Universidade do Minho, specifically the Natural Language Processing (NLP) team of the Department of Informatics, through a collaboration protocol signed between the two institutions.
The proposal to compile a new dictionary arose at the end of 2015, and the project has since undergone some changes to the initial plan, mainly owing to a lack of funding. This paper presents the results obtained so far: the conversion of the PDF of the printed edition to XML; the creation of a database; the development of computational tools and features that streamline lexicographic editing; the assurance of data validation and control; and the solutions adopted for managing the project itself.