Agnieszka Mykowiecka | Institute of Computer Science, Polish Academy of Sciences
Papers by Agnieszka Mykowiecka
ArXiv, 2021
In the paper, we test two different approaches to the unsupervised word sense disambiguation task for Polish. In both methods, we use neural language models to predict words similar to those being disambiguated and, on the basis of these words, we predict the partition of word senses in different ways. In the first method, we cluster selected similar words, while in the second, we cluster vectors representing their subsets. The evaluation was carried out on texts annotated with plWordNet senses and provided a relatively good result (F1=0.68 for all ambiguous words). The results are significantly better than those obtained for the neural model-based unsupervised method proposed in (Wawer and Mykowiecka, 2017) and are at the level of the supervised method presented there. The proposed method may be a way of solving the word sense disambiguation problem for languages which lack sense-annotated data.
In the paper, we address the problem of recognition of non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on identification of multi-word phrases that are general terms and discourse function expressions. We tested several methods based on domain corpora comparison and a method based on contexts of phrases identified in a large corpus of general language. We compared the results of the methods to manual annotation. The results show that the task is quite hard as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 at the halfway point of the tested list, was the context-based method using a modified contextual diversity coefficient.
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 2019
The paper addresses experiments to expand ad hoc ambiguous abbreviations in medical notes on the basis of morphologically annotated texts, without using additional domain resources. We work on Polish data but the described approaches can be used for other languages too. We test two methods to select candidates for word abbreviation expansions. The first one automatically selects all words in text which might be an expansion of an abbreviation according to the language rules. The second method uses clustering of abbreviation occurrences to select representative elements which are manually annotated to determine lists of potential expansions. We then train a classifier to assign expansions to abbreviations based on three training sets: one obtained automatically, one consisting of manual annotation, and the concatenation of the two. The results obtained for the manually annotated training data significantly outperform those for the automatically obtained training data. Adding the automatically obtained training data to the manually annotated data improves the results, in particular for less frequent abbreviations. In this context the proposed a priori data-driven selection of possible extensions turned out to be crucial.
Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
We present a new version of the terminology extraction tool TermoPL. This version not only allows the ranking of term candidates but also their semantic grouping. To ensure the results are precise, we use the WordNet lexical database for identifying semantic relations between words. The tool was designed primarily for Polish texts, but the current version is tagset-independent and can be adapted to process texts in other languages. The new semantic grouping feature has been fully implemented for Polish texts, but we plan to make it available for English texts as well.
Proceedings of the Workshop on Figurative Language Processing, 2018
The paper addresses the detection of figurative usage of words in English text. The chosen method was to use neural nets fed by pre-trained word embeddings. The obtained results show that simple solutions, based on word embeddings only, are comparable to complex solutions that use additional information from taggers or a psycholinguistic database. This approach can be easily applied to other languages, even less-studied ones, for which we only have raw texts available.
The paper addresses an experiment in detecting metaphorical usage of adjectives and nouns in Polish data. First, we describe the data developed for the experiment. The corpus consists of 1833 excerpts containing adjective-noun phrases which can have both metaphorical and literal senses. Annotators assign literal or metaphorical senses to all adjectives and nouns in the data. Then, we describe a method for literal/metaphorical sense classification. We use a Bi-LSTM neural network architecture and word embeddings at both token and character level. We examine the influence of adversarial training and perform analysis by part-of-speech. Our approach proved successful and an F1 score that exceeded 0.81 was achieved.
Computational Science – ICCS 2021, 2021
The purpose of this paper is to introduce the TermoPL tool created to extract terminology from domain corpora in Polish. The program extracts noun phrases, term candidates, with the help of a simple grammar that can be adapted to the user’s needs. It applies the C-value method to rank term candidates being either the longest identified nominal phrases or their nested subphrases. The method operates on simplified base forms in order to unify morphological variants of terms and to recognize their contexts. We support the recognition of nested terms by word connection strength which allows us to eliminate truncated phrases from the top part of the term list. The program has an option to convert simplified forms of phrases into correct phrases in the nominal case. TermoPL accepts as input morphologically annotated and disambiguated domain texts and creates a list of terms, the top part of which comprises domain terminology. It can also compare two candidate term lists using three different...
We propose a new algorithm for word sense disambiguation, exploiting data from a WordNet with many types of lexical relations, such as plWordNet for Polish. In this method, sense probabilities in context are approximated with a language model. To estimate the likelihood of a sense appearing amidst the word sequence, the token being disambiguated is substituted with words related lexically to the given sense or words appearing in its WordNet gloss. We test this approach on a set of sense-annotated Polish sentences with a number of neural language models. Our best setup achieves an accuracy of 55.12% (72.02% when first senses are excluded), up from 51.77% for an existing PageRank-based method. While not exceeding the first (often meaning most frequent) sense baseline in the standard case, this encourages further research on combining WordNet data with neural models.
The paper addresses the Polish version of SimLex-999 which we extended to contain not only measurement of similarity but also relatedness. The data was translated by three independent linguists; discrepancies in translation were resolved by a fourth person. The agreement rates between the translators were counted and an analysis of problems was performed. Then, pairs of words were rated by other annotators on a scale of 0–10 for similarity and relatedness of words. Finally, we compared the human annotations with the distributional semantics models of Polish based on lemmas and forms. We compared our work with the results reported for other languages.
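The substitution idea in the abstract above can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `pick_sense`, the substitute lists, and the toy scorer are illustrative stand-ins, not the authors' implementation. The scorer is passed in as a parameter because the abstract only says that a language model is used; taking the maximum over a sense's substitutes is likewise an assumption, since the abstract does not state how scores are aggregated.

```python
def pick_sense(tokens, index, substitutes_by_sense, score_sequence):
    """Pick the sense whose substitute words fit the context best.

    tokens: the sentence as a list of words
    index: position of the ambiguous token
    substitutes_by_sense: sense label -> words related to that sense
        (in the paper these come from WordNet relations or glosses)
    score_sequence: callable returning a higher-is-better score for a
        token list; a stand-in for a language model (assumption)
    """
    sense_scores = {}
    for sense, substitutes in substitutes_by_sense.items():
        candidates = [
            score_sequence(tokens[:index] + [word] + tokens[index + 1:])
            for word in substitutes
        ]
        if candidates:
            # Aggregating with max is an assumption, not the paper's rule.
            sense_scores[sense] = max(candidates)
    if not sense_scores:
        raise ValueError("no sense has any substitute words")
    best = max(sense_scores, key=sense_scores.get)
    return best, sense_scores


# Toy scorer (hypothetical): counts word pairs known to go together.
AFFINITY = {("money", "lender"), ("money", "institution"), ("river", "shore")}


def toy_score(sequence):
    return sum(1 for a in sequence for b in sequence if (a, b) in AFFINITY)


sentence = ["deposit", "money", "in", "the", "bank"]
substitutes = {
    "river": ["shore", "riverside"],
    "finance": ["lender", "institution"],
}
sense, scores = pick_sense(sentence, 4, substitutes, toy_score)
print(sense, scores)
```

In a realistic setting `toy_score` would be replaced by a sequence log-probability from a neural language model; the surrounding logic would stay the same.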
Lecture Notes in Computer Science, 2020
Is it true that patients with similar conditions get similar diagnoses? In this paper we present a natural language processing (NLP) method that can be used to validate this claim. We (1) introduce a method for representation of medical visits based on free-text descriptions recorded by doctors, (2) introduce a new method for segmentation of patients' visits, (3) present an application of the proposed method on a corpus of 100,000 medical visits and (4) show tools for interpretation and exploration of derived knowledge representation. With the proposed method we obtained stable and separated segments of visits which were positively validated against medical diagnoses. We show how the presented algorithm may be used to aid doctors in their practice.
Linking Theory and Practice of Digital Libraries, 2021
Cognitive Studies | Études cognitives, 2017
Testing word embeddings for Polish. Distributional Semantics postulates the representation of word meaning in the form of numeric vectors which represent words which occur in context in large text data. This paper addresses the problem of constructing such models for the Polish language. The paper compares the effectiveness of models based on lemmas and forms created with Continuous Bag of Words (CBOW) and skip-gram approaches based on different Polish corpora. For the purposes of this comparison, the results of two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition, are compared. The results show that it is not possible to identify one universal approach to vector creation applicable to various tasks. The most important feature is the quality and size of the data, but different strategy choices can also lead to significantly different results. Testowanie wektorowych reprezentacji dystrybucyjnych słów języka polskiego. Semantyka dystryb...
Computational terminology and filtering of terminological information, 2018
In our paper, we address the problem of recognition of irrelevant phrases in terminology lists obtained with an automatic term extraction tool. We focus on identification of multi-word phrases that are general terms or discourse expressions. We defined several methods based on comparison of domain corpora and a method based on contexts of phrases identified in a large corpus of general language. The methods were tested on Polish data. We used six domain corpora and one general corpus. Two test sets were prepared to evaluate the methods. The first one consisted of many presumably irrelevant phrases, as we selected phrases which occurred in at least three domain corpora. The second set mainly consisted of domain terms, as it was composed of the top-ranked phrases automatically extracted from the analyzed domain corpora. The results show that the task is quite hard as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase orde...
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, 2017
This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first one is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing the correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods on corpora with manual annotations of word senses from the Polish wordnet.
Terminology, 2015
Domain corpora are often not very voluminous and even important terms can occur in them not as isolated maximal phrases but only within more complex constructions. Appropriate recognition of nested terms can thus influence the content of the extracted candidate term list and its order. We propose a new method for identifying nested terms based on a combination of two aspects: grammatical correctness and normalised pointwise mutual information (NPMI) counted for all bigrams in a given corpus. NPMI is typically used for recognition of strong word connections, but in our solution we use it to recognise the weakest points to suggest the best place for division of a phrase into two parts. By creating, at most, two nested phrases in each step, we introduce a binary term structure. We test the impact of the proposed method applied, together with the C-value ranking method, to the automatic term recognition task performed on three corpora, two in Polish and one in English.
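As a rough illustration of the NPMI-based splitting described in the abstract above, the Python sketch below computes NPMI for every bigram in a tokenised corpus and divides a phrase at its weakest link. It is a simplification under stated assumptions: the function names `bigram_npmi` and `weakest_split` are hypothetical, the grammatical-correctness check that the paper combines with NPMI is omitted, probabilities are simple relative frequencies over the token count, and unseen bigrams are assigned -1.0 (the lower bound of NPMI) by choice. NPMI is taken in its standard form, pmi(x, y) / -log p(x, y).

```python
import math
from collections import Counter


def bigram_npmi(sentences):
    """Return NPMI for every adjacent word pair in a tokenised corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n = sum(unigrams.values())  # one normaliser keeps NPMI within [-1, 1]
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n
        pmi = math.log(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores


def weakest_split(phrase, scores):
    """Split a phrase (list of words) in two at its lowest-NPMI bigram."""
    if len(phrase) < 2:
        raise ValueError("need at least two words to split")
    # Unseen bigrams get -1.0, an assumption made for this sketch.
    links = [scores.get(pair, -1.0) for pair in zip(phrase, phrase[1:])]
    i = min(range(len(links)), key=links.__getitem__)
    return phrase[: i + 1], phrase[i + 1:]


corpus = [
    ["left", "heart", "ventricle"],
    ["heart", "ventricle"],
    ["heart", "ventricle"],
    ["left", "hand"],
]
scores = bigram_npmi(corpus)
print(weakest_split(["left", "heart", "ventricle"], scores))
```

Applying `weakest_split` recursively to each part would yield the kind of binary structure the abstract mentions, although the published method also filters candidate divisions by grammatical correctness.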
Lecture Notes in Computer Science, 2009
... of Computer Science, Polish Academy of Sciences JK Ordona 21, 01-237 Warsaw, Poland {Agnieszka.Mykowiecka,Malgorzata.Marciniak, Joanna.Rabiega}@ipipan ... pan [lex=filler] dojechać do alternatively you may go to Dworc[lex=~-]a[-lex=~] Centralnego i tam przesiąść się ...