Lenka Bajčetić - Academia.edu
Papers by Lenka Bajčetić
HAL (Le Centre pour la Communication Scientifique Directe), Jun 27, 2023
We report on work in progress dealing with the automated generation of pronunciation information for English multiword terms (MWTs) in Wiktionary, combining information available for their single components. We describe the issues we encountered, the building of an evaluation dataset, and our collaboration with the DBnary resource maintainer. Our approach shows potential for automatically adding morphosyntactic and semantic information to the components of such MWTs.
CERN European Organization for Nuclear Research - Zenodo, Jul 19, 2022
This paper describes an approach that utilizes Wiktionary data to create specialized lexical datasets, which can be used to enrich other lexical (semantic) resources or to generate datasets for evaluating or improving NLP tasks such as Word Sense Disambiguation, Word-in-Context challenges, or Sense Linking across lexicons and dictionaries. We have focused on Wiktionary data about pronunciation information in English, and grammatical number and grammatical gender in German.
CERN European Organization for Nuclear Research - Zenodo, May 23, 2022
This paper describes the current status of EDIE, the ELEXIS DIctionary Evaluation tool, which aims to evaluate the availability and usability of linked lexical resources and dictionaries accessible through the ELEXIS infrastructure.
CERN European Organization for Nuclear Research - Zenodo, Jun 29, 2022
This paper presents Edie: ELEXIS Dictionary Evaluator. Edie is designed to create profiles for lexicographic resources accessible through the ELEXIS platform. These profiles can be used to evaluate and compare lexicographic resources, and in particular they can be used to identify potential data that could be linked.
Digital Humanities Workshop
This paper aims to present the digitization process of a very important piece of Serbian intangible cultural heritage, Српске народне пословице и друге различне као оне у обичај узете ријечи (Engl. Serbian folk proverbs and other common expressions and phrases), compiled by Vuk Stefanović Karadžić during the first half of the 19th century. In the paper, we discuss the necessary steps in the digitization process, the challenges we had to deal with, and the solutions we came up with. The goal of this process is a fully digitized, user-friendly version of the Serbian folk proverbs that will also integrate easily and be compatible with other digitized resources and/or multi-dictionary portals.
CERN European Organization for Nuclear Research - Zenodo, Jul 5, 2021
In this paper we present ongoing work which aims to semi-automatically connect pronunciation information to lexical semantic resources which currently lack such information, with a focus on WordNet. This is particularly relevant for the case of heteronyms (homographs that have different meanings associated with different pronunciations), as this is a factor that implies a redesign and adaptation of the formal representation of the targeted lexical semantic resources: in the case of heteronyms it is not enough to just add a slot for pronunciation information to each WordNet entry. Also, numerous tools and resources rely on WordNet, so we hope that enriching WordNet with valuable pronunciation information can prove beneficial for many applications in the future. Our work consists of compiling a small gold standard dataset of heteronymous words, which contains short documents created for each WordNet sense, in total 136 senses matched with their pronunciation from Wiktionary. For the task of matching WordNet senses with their corresponding Wiktionary entries, we train several supervised classifiers which rely on various similarity metrics, and we explore whether these metrics can serve as useful features, as well as the quality of the different classifiers tested on our dataset. Finally, we explain in what way these results could be stored in OntoLex-Lemon and integrated into the Open English WordNet.
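The abstract above gives no implementation details; as a rough illustration of the sense-matching step it describes, a minimal similarity-based matcher might look like the following. This is a sketch with hypothetical data, using token-level Jaccard overlap in place of the paper's actual similarity metrics and supervised classifiers:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two definition texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_sense(wordnet_gloss: str, wiktionary_defs: list[str]) -> int:
    """Return the index of the Wiktionary definition most similar to the gloss."""
    scores = [jaccard(wordnet_gloss, d) for d in wiktionary_defs]
    return max(range(len(scores)), key=scores.__getitem__)

# "bass" is a heteronym: /beɪs/ (music) vs. /bæs/ (fish).
defs = [
    "a type of fish of the perch family",   # /bæs/
    "the lowest part in polyphonic music",  # /beɪs/
]
print(match_sense("the lowest adult male singing voice in music", defs))  # → 1
```

In a real pipeline the winning definition's pronunciation would then be attached to the WordNet sense; here the matcher only picks the most lexically similar definition.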
We present the current implementation state of our work on interlinking language data and linguistic information included in different types of Slovenian language resources. The types of resources we currently deal with are a lexical database (which also contains collocations and example sentences), a morphological lexicon, and the Slovene WordNet. We first transform the encoding of the original data into the OntoLex-Lemon model and map the different descriptors used in the original sources onto the LexInfo vocabulary. This harmonization step enables the interlinking of the various types of information included in the different resources, using relations defined in OntoLex-Lemon. As a result, we obtain a partial merging of the information that was originally distributed over different resources, leading to a cross-enrichment of those original data sources. A final goal of the presented work is to publish the linked and merged Slovene linguistic datasets in...
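As a loose illustration of the descriptor-harmonization step described above, one could map source-specific part-of-speech labels onto LexInfo values with a lookup table. This is a sketch under assumptions: the mapping entries are invented for illustration, and the LexInfo 2.0 namespace shown may differ from the version actually used in the project:

```python
LEXINFO = "http://www.lexinfo.net/ontology/2.0/lexinfo#"

# Hypothetical mapping from descriptors used in the Slovenian sources
# onto LexInfo values (the real mapping tables are resource-specific).
POS_MAP = {
    "samostalnik": LEXINFO + "noun",
    "glagol": LEXINFO + "verb",
    "pridevnik": LEXINFO + "adjective",
}

def harmonize(entry: dict) -> dict:
    """Replace a source-specific POS descriptor with its LexInfo URI."""
    out = dict(entry)
    out["partOfSpeech"] = POS_MAP[entry["pos"]]
    del out["pos"]
    return out

print(harmonize({"lemma": "hiša", "pos": "samostalnik"})["partOfSpeech"])
```

Once every resource describes its entries with the same LexInfo vocabulary, entries from the lexical database, the morphological lexicon, and the Slovene WordNet can be compared and linked directly.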
This paper describes our system for monolingual sense alignment across dictionaries. The task of monolingual word sense alignment is presented as a task of predicting the relationship between two senses. We present two solutions, one based on supervised machine learning and the other based on a pre-trained neural language model, specifically BERT. Our models perform competitively for binary classification, reporting high scores for almost all languages.
We describe ongoing work consisting in adding pronunciation information to wordnets, as such information can indicate specific senses of a word. Many wordnets associate with their senses only a lemma form and a part-of-speech tag. At the same time, we are aware that additional linguistic information can be useful for identifying a specific sense of a wordnet lemma when encountered in a corpus. While existing work already deals with the addition of grammatical number or grammatical gender information to wordnet lemmas, we are investigating the linking of wordnet lemmas to pronunciation information, thus adding a speech-related modality to wordnets.
This paper presents a model of contextual awareness implemented for the social communicative robot Leolani. Our model starts from the assumption that robots and humans need to establish a common ground about the world they share. This is not trivial, as robots make many errors and start with little knowledge. As such, the context in which communication takes place can both help and complicate the interaction: if the context is interpreted correctly it helps in disambiguating the signals, but if it is interpreted wrongly it may distort interpretation. We defined the surrounding world as a spatial context, the communication as a discourse context, and the interaction as a social context, all three of which are interconnected and have an impact on each other. We model the result of the interpretations as symbolic knowledge (RDF) in a triple store to reason over the result and to detect conflicts, uncertainty, and gaps. We explain how our model tries to combine the contexts and the signal interpretat...
People and robots make mistakes and should therefore recognize and communicate about their “imperfectness” when they collaborate. In previous work [3, 2], we described a female robot model Leolani (L) that supports open-domain learning through natural language communication, having a drive to learn new information and build social relationships. The absorbed knowledge consists of everything people tell her and the situations and objects she perceives. For this demo, we focus on the symbolic representation of the resulting knowledge. We describe how L can query and reason over her knowledge and experiences as well as access the Semantic Web. As such, we envision L to become a semantic agent which people could naturally interact with.
Text, Speech, and Dialogue, 2018
Our state of mind is based on experiences and what other people tell us. This may result in conflicting information, uncertainty, and alternative facts. We present a robot that models relativity of knowledge and perception within social interaction following principles of the theory of mind. We utilized vision and speech capabilities on a Pepper robot to build an interaction model that stores the interpretations of perceptions and conversations in combination with provenance on its sources. The robot learns directly from what people tell it, possibly in relation to its perception. We demonstrate how the robot's communication is driven by hunger to acquire more knowledge from and on people and objects, to resolve uncertainties and conflicts, and to share awareness of the perceived environment. Likewise, the robot can make reference to the world and its knowledge about the world and the encounters with people that yielded this knowledge.
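The interaction model described above stores interpretations together with provenance on their sources. A minimal stand-alone sketch of that idea (not the actual Leolani implementation, which uses an RDF triple store) could represent each claim as a triple plus its source and flag conflicting values:

```python
from collections import namedtuple

# A claim the robot stores: a subject-predicate-object triple
# plus provenance on who asserted it.
Claim = namedtuple("Claim", "subj pred obj source")

def conflicts(claims, functional_preds):
    """Find pairs of claims that assign different values to a functional
    predicate (one that should have a single value per subject)."""
    found = []
    for a in claims:
        for b in claims:
            if (a.subj == b.subj and a.pred == b.pred
                    and a.pred in functional_preds and a.obj < b.obj):
                found.append((a, b))
    return found

kb = [
    Claim("cat", "color", "black", source="Alice"),
    Claim("cat", "color", "white", source="Bob"),  # conflicting perception
    Claim("cat", "likes", "milk", source="Alice"),
]
for a, b in conflicts(kb, {"color"}):
    print(f"{a.source} says {a.obj}, {b.source} says {b.obj}")
```

Because provenance is kept with each claim, the robot can report who said what instead of silently overwriting one source with another, which is what drives its conflict-resolving dialogue.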
CERN European Organization for Nuclear Research - Zenodo, Jul 5, 2021
Sense linking is the task of inferring any potential relationships between senses stored in two dictionaries. This is a challenging task and in this paper we present our system that combines Natural Language Processing (NLP) and non-textual approaches to solve this task. We formalise linking as inferring links between pairs of senses as exact equivalents, partial equivalents (broader/narrower) or a looser relation or no relation between the two senses. This formulates the problem as a five-class classification for each pair of senses between the two dictionary entries. The work is limited to the case where the dictionaries are in the same language and thus we are only matching senses whose headword matches exactly; we call this task Monolingual Word Sense Alignment (MWSA). We have built tools for this task into an existing framework called Naisc and we describe the architecture of this system as part of the ELEXIS infrastructure, which covers all parts of the lexicographic process including dictionary drafting. Next, we look at methods of linking that rely on the text of the definitions to link, firstly looking at some basic methodologies and then implementing methods that use deep learning models such as BERT. We then look at methods that can exploit non-textual information about the senses in a meaningful way. Afterwards, we describe the challenge of inferring links holistically, taking into account that the links inferred by direct comparison of the definitions may lead to logical contradictions, e.g., multiple senses being equivalent to a single target sense. Finally, we document the creation of a test set for this MWSA task that covers 17 dictionary pairs in 15 languages and some results for our systems on this benchmark.
The combination of these tools provides a highly flexible implementation that can link senses between a wide variety of input dictionaries and we demonstrate how linking can be done as part of the ELEXIS toolchain.
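As a toy illustration of the five-class pairwise framing described above (not the actual Naisc models, which use supervised learning and BERT), the label space and the pairwise classification loop might be sketched as follows, with a hypothetical overlap-based classifier standing in for the real ones:

```python
from enum import Enum
from itertools import product

class Link(Enum):
    """The five link classes used in Monolingual Word Sense Alignment."""
    EXACT = "exact"
    BROADER = "broader"
    NARROWER = "narrower"
    RELATED = "related"
    NONE = "none"

def align(senses_a, senses_b, classify):
    """Classify every pair of senses for one shared headword."""
    return {(i, j): classify(a, b)
            for (i, a), (j, b) in product(enumerate(senses_a), enumerate(senses_b))}

# Toy classifier: definition overlap as a stand-in for the real models,
# which also use non-textual features and holistic constraints.
def toy_classify(a, b):
    ta, tb = set(a.split()), set(b.split())
    overlap = len(ta & tb) / max(len(ta | tb), 1)
    return Link.EXACT if overlap > 0.8 else Link.RELATED if overlap > 0.2 else Link.NONE

links = align(["a domesticated feline"],
              ["a domesticated feline", "a jazz musician"], toy_classify)
print(links[(0, 0)], links[(0, 1)])  # → Link.EXACT Link.NONE
```

The holistic step the abstract mentions would then post-process this pairwise output, for example rejecting assignments where several source senses come out EXACT-equivalent to one target sense.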
We describe a model for a robot that learns about the world and her companions through natural language communication. The model supports open-domain learning, where the robot has a drive to learn about new concepts, new friends, and new properties of friends and concept instances. The robot tries to fill gaps, resolve uncertainties, and resolve conflicts. The absorbed knowledge consists of everything people tell her, the situations and objects she perceives, and whatever she finds on the web. The results of her interactions and perceptions are kept in an RDF triple store to enable reasoning over her knowledge and experiences. The robot uses a theory of mind to keep track of who said what, when and where. Accumulating knowledge results in complex states to which the robot needs to respond. In this paper, we look into two specific aspects of such complex knowledge states: 1) reflecting on the status of the knowledge acquired through a new notion of thoughts and 2) defining the conte...
This paper describes ongoing work aiming at adding pronunciation information to lexical semantic resources, with a focus on open wordnets. Our goal is not only to add a new modality to those semantic networks, but also to mark heteronyms listed in them with the pronunciation information associated with their different meanings. This work could contribute in the longer term to the disambiguation of multi-modal resources, which combine text and speech.
We present an Event Factuality machine learning classification pipeline, which trains and tests on the FactBank corpus. We detail the preprocessing and feature extraction steps, and report on our implementation of an XGBoost and an SVM classifier, with the former scoring just over 78% accuracy.
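The abstract gives no feature details; as a hedged sketch of what the feature-extraction step of such a pipeline could look like, the fragment below derives a few lexical cues around an event token. The feature set and word lists are hypothetical, and in the pipeline described above such features would feed an XGBoost or SVM classifier:

```python
# Hypothetical lexical cues for event factuality: modal verbs and
# negation words near the event token often signal non-factual events.
MODALS = {"may", "might", "could", "should", "would"}
NEGATIONS = {"not", "never", "no", "n't"}

def extract_features(tokens: list[str], event_idx: int) -> dict:
    """Build a small feature dict for the event at tokens[event_idx],
    looking at a window of up to three preceding tokens."""
    window = [t.lower() for t in tokens[max(0, event_idx - 3):event_idx]]
    return {
        "has_modal": any(t in MODALS for t in window),
        "has_negation": any(t in NEGATIONS for t in window),
        "event_lemma": tokens[event_idx].lower(),
    }

feats = extract_features("The deal may not happen".split(), 4)
print(feats)  # → {'has_modal': True, 'has_negation': True, 'event_lemma': 'happen'}
```

A real FactBank pipeline would add syntactic and source-related features and proper lemmatization; the sketch only shows the shape of the token-to-feature mapping.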