Modelling Etymology in LMF/TEI: The 'Grande Dicionário Houaiss da Língua Portuguesa' Dictionary as a Use Case
Related papers
The Grande Dicionário Houaiss da Língua Portuguesa Dictionary as a Use Case
2020
UIDB/00749/2020 UIDP/00749/2020. In this article, we will introduce two of the new parts of the new multi-part version of the Lexical Markup Framework (LMF) ISO standard, namely Part 3 of the standard (ISO 24613-3), which deals with etymological and diachronic data, and Part 4 (ISO 24613-4), which consists of a TEI serialisation of all of the prior parts of the model. We will demonstrate the use of both standards by describing the LMF encoding of a small number of examples taken from a sample conversion of the reference Portuguese dictionary Grande Dicionário Houaiss da Língua Portuguesa, part of a broader experiment comprising the analysis of different, heterogeneously encoded, Portuguese lexical resources. We present the examples in the Unified Modelling Language (UML) and also in a couple of cases in TEI.
TEI Lex-0: a good fit for the encoding of the Portuguese Academy Dictionary?
2019
In this presentation, we report on the encoding of the Portuguese Academy Dictionary using TEI Lex-0. We demonstrate how we applied this new baseline format for lexical data to mark up 'special entries' in the dictionary: part-of-speech homonyms (antepassado1, antepassado2, antepassado3), etymological homonyms (cota1, cota2), homographs (lobo1 /ó/, lobo2 /ô/), spelling variants (ouro, oiro), trademarks (donut), entries that have a different meaning in the plural (antepassados), and lexical variants (missanga, miçanga).
Adapting the Unisyn Lexicon to Portuguese: Preliminary Issues in the Development of LUPo
2009
This paper presents some preliminary issues and proposed solutions in the development of an accent-independent pronunciation lexicon for Portuguese, known as the Portuguese Unisyn Lexicon (LUPo). LUPo's objectives are presented within the context of the Portal da Lingua Portuguesa knowledge base. Key considerations are addressed for encoding morphological boundaries, treating orthographic forms, and handling loan words. Here, it is argued that the knowledge-driven paradigm exemplified in the original English Unisyn Lexicon, along with the Portal da Lingua Portuguesa's relational structure and rich lexicographic content, presents a good foundation for establishing a tightly integrated and well-informed system.
Procura-PALavras (P-PAL): A Web-based interface for a new European Portuguese lexical database
Behavior Research Methods, 2018
In this article, we present Procura-PALavras (P-PAL), a Web-based interface for a new European Portuguese (EP) lexical database. Based on a contemporary printed corpus of over 227 million words, P-PAL provides a broad range of word attributes and statistics, including several measures of word frequency (e.g., raw counts, per-million word frequency, logarithmic Zipf scale), morpho-syntactic information (e.g., parts of speech [PoSs], grammatical gender and number, dominant PoS, and frequency and relative frequency of the dominant PoS), as well as several lexical and sublexical orthographic (e.g., number of letters; consonant-vowel orthographic structure; density and frequency of orthographic neighbors; orthographic Levenshtein distance; orthographic uniqueness point; orthographic syllabification; and trigram, bigram, and letter type and token frequencies), and phonological measures (e.g., pronunciation, number of phonemes, stress, density and frequency of phonological neighbors, trans...
TEI Lex-0 In Action: Improving the Encoding of the Dictionary of the Academia das Ciências de Lisboa
ELEX Proceedings, 2019
This paper describes some experiments made while encoding the first complete dictionary of the Academia das Ciências de Lisboa (DACL) in the context of TEI Lex-0, a community-based interchange format for lexical data aimed at facilitating the interoperability and reusability of lexical resources. Even though the original encoding of the DACL was based on TEI, we decided to switch to TEI Lex-0 because it allowed us to streamline our encoding. Our experiments show that even though TEI Lex-0 is stricter than TEI itself (allowing fewer elements and imposing certain constraints that are not present in plain TEI), it is fully capable of representing the complexities of the entry structure of the DACL. In the paper, we discuss the TEI Lex-0 encoding of the DACL, as well as the conversion methodology and the tools used for the automatic conversion from the original encoding. We are currently focusing on the macrostructural level, more precisely on the types of lexical units and on the written and spoken forms of the lemma, providing a set of modelling principles and representation forms of every type of entry in the DACL. This paper is part of ongoing work and a contribution to the efforts of the DARIAH-ERIC Lexical Resources working group.
This paper outlines the design principles and choices, as well as the ongoing development process, of the Common Orthographic Vocabulary of the Portuguese Language (VOC), a large-scale electronic lexical database which was adopted by the Community of Portuguese-Speaking Countries' (CPLP) Instituto Internacional da Língua Portuguesa to implement a spelling reform that is currently taking place. Given the different available resources and lexicographic traditions within the CPLP countries, a range of different solutions was adopted for different countries and integrated into a common development framework. Although the publication of lexicographic resources to implement spelling reforms has always been done for Portuguese, VOC represents a paradigm change, switching from idiosyncratic, closed-source, paper-format official resources to standardized, open, free, web-accessible and reusable ones. We start by outlining the context that justifies the resource development and its requirements, then focus on the description of the methodology, workflow and tools used, showing how a collaborative project in a common web-based platform and administration interface makes the creation of such a long-sought and ambitious project possible.
Representing Etymology in the LiLa Knowledge Base of Linguistic Resources for Latin
2020
In this paper we describe the process of inclusion of etymological information in a knowledge base of interoperable Latin linguistic resources developed in the context of the LiLa: Linking Latin project. Interoperability is obtained by applying the Linked Open Data principles. Particularly, an extensive collection of Latin lemmas is used to link the (distributed) resources. For the etymology, we rely on the Ontolex-lemon ontology and the lemonEty extension to model the information, while the source data are taken from a recent etymological dictionary of Latin. As a result, the collection of lemmas LiLa is built around now includes 1,465 Proto-Italic and 1,393 Proto-Indo-European reconstructed forms that are used to explain the history of 1,400 Latin words. We discuss the motivation, methodology and modeling strategies of the work, as well as its possible applications and potential future developments.
2020
UIDB/00749/2020 UIDP/00749/2020. This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries and even less so across different projects where different options may have been assumed in terms of structure and especially wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using the Semantic Web standards. The results obtained are useful for the discussion within the community.