Senja Pollak - Academia.edu
Papers by Senja Pollak
arXiv (Cornell University), Jan 31, 2021
Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document and serve in news portals to link articles of similar topics. In this work, we develop and evaluate our methods on four novel data sets covering less-represented, morphologically rich languages in the European news media industry (Croatian, Estonian, Latvian, and Russian). First, we evaluate two supervised neural transformer-based methods, Transformer-based Neural Tagger for Keyword Identification (TNT-KID) and Bidirectional Encoder Representations from Transformers (BERT) with an additional Bidirectional Long Short-Term Memory Conditional Random Fields (BiLSTM CRF) classification head, and compare them to a baseline unsupervised approach based on Term Frequency-Inverse Document Frequency (TF-IDF). Next, we show that by combining the keywords retrieved by both neural transformer-based methods and extending the final set of keywords with an unsupervised TF-IDF based technique, we can drastically improve the recall of the system, making it appropriate for use as a recommendation system in the media house environment.
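The combination step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tfidf_keywords` and `combine_keywords`, the whitespace tokenizer, and the `max_keywords` cutoff are all assumptions made for illustration, and the two neural taggers are represented only by their (hypothetical) output lists.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the tokens of one document by TF-IDF against a small corpus.

    Illustrative baseline only: whitespace tokenization, no stemming.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        tok: (count / total) * math.log(n_docs / df[tok])
        for tok, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
    return ranked[:top_k]


def combine_keywords(neural_a, neural_b, tfidf, max_keywords=10):
    """Union the outputs of two taggers, then pad with TF-IDF candidates.

    The inputs stand in for the outputs of the two neural methods; the
    order-preserving union and the cutoff are assumptions of this sketch.
    """
    combined = []
    for keyword in list(neural_a) + list(neural_b) + list(tfidf):
        if keyword not in combined:
            combined.append(keyword)
    return combined[:max_keywords]


docs = [
    "parliament passes budget law",
    "football team wins cup",
    "parliament debates football funding",
]
baseline = tfidf_keywords(docs, 0, top_k=2)
merged = combine_keywords(["budget"], ["budget", "parliament"], baseline)
print(merged)
```

Taking the union can only add candidates, which is why such a combination tends to raise recall at some cost in precision.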
Zenodo (CERN European Organization for Nuclear Research), Sep 16, 2020
This article describes initial work on the automatic classification of user-generated content in news media to support human moderators. We work with real-world data (comments posted by readers under online news articles) in two less-resourced European languages, Croatian and Estonian. We describe our dataset and our experiments in automatic classification using a range of models. The performance obtained is reasonable but not as good as might be expected given similar work on offensive language classification in other languages; we then investigate possible reasons in terms of the variability and reliability of the data and its annotation.
Zenodo (CERN European Organization for Nuclear Research), Mar 25, 2022
Forward-looking sentences (FLS) are often a subject of studies of financial texts. Detection of such sentences is usually performed with wordlists of inclusive and exclusive keywords that are used as indicators of the forward-looking nature of the sentences at hand. In this paper we describe our assessment of potential improvements to forward-looking sentence detection wordlists by combining them and by extending them with neighboring words in word-vector representations. Our current results indicate that simple combinations and straightforward extensions of wordlists with vector-space representation neighbors might not be suitable for FLS detection without further methodological improvements.
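Wordlist-based detection of the kind described in this abstract can be sketched as follows. The wordlists below are hypothetical placeholders, not the lists assessed in the paper, and the rule used (at least one inclusive keyword and no exclusive keyword) is one plausible reading of "inclusive and exclusive keywords", stated here as an assumption.

```python
import re

# Hypothetical wordlists for illustration only.
INCLUSIVE = {"will", "expect", "expects", "anticipate", "forecast", "plans"}
EXCLUSIVE = {"expected"}


def is_forward_looking(sentence, inclusive=INCLUSIVE, exclusive=EXCLUSIVE):
    """Flag a sentence that has an inclusive keyword and no exclusive one."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(tokens & inclusive) and not (tokens & exclusive)


def combine_wordlists(*wordlists):
    """Combine several wordlists by plain set union."""
    return set().union(*wordlists)


print(is_forward_looking("We expect revenue to grow next year."))
print(is_forward_looking("Revenue grew as expected last year."))
```

Extending a list with embedding neighbors would amount to passing a larger `inclusive` set; as the abstract notes, such extensions did not help without further methodological work.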
Communications in computer and information science, 2023
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 2022
Zenodo (CERN European Organization for Nuclear Research), Jan 15, 2022
The rapid growth of literature related to the COVID-19 pandemic results in a multitude of articles which cannot be manually labeled due to the lack of human resources. In this work we present a solution to the shared task titled LitCovid track Multi-label topic classification for COVID-19 literature annotation. Our proposed solution constructs classifiers for each class by using an autoML system for text named autoBOT. Although the proposed system performed sub-optimally in terms of recall, it offered better-than-baseline (macro) precision, indicating that automated representation learning is a promising approach to multilabel classification of COVID-19-related texts.
Lecture Notes in Computer Science, 2021
The COVID-19 pandemic triggered a wave of novel scientific literature that is impossible to inspect and study manually in a reasonable time frame. Current machine learning methods can project such a body of literature into a vector space where similar documents are located close to each other, offering an insightful exploration of scientific papers and other knowledge sources associated with COVID-19. However, to start searching, such texts need to be appropriately annotated, which is seldom the case due to the lack of human resources. In our system, the current body of COVID-19-related literature is annotated using unsupervised keyphrase extraction, facilitating the initial queries to the latent space containing the learned document embeddings (low-dimensional representations). The solution is accessible through a web server capable of interactive search, term ranking, and exploration of potentially interesting literature. We demonstrate the usefulness of the approach via case studies from the medicinal chemistry domain.
Lecture Notes in Computer Science, 2018
The aim of this work is to reproduce the approach to detecting semantic orientations in economic texts that was presented in the paper Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts by Malo et al. The approach employs the Linearized Phrase Structure model for sentence-level classification of short economic texts into a positive, negative or neutral category from an investor's perspective and yields state-of-the-art results. The proposed method employs both rule-based linguistic models and machine learning. Where possible we follow the same approach as described in the original paper, with some documented modifications. Our solution is simplified in at least two aspects, but its performance is comparable to the original and overall remains better than the reported results of other benchmark algorithms mentioned in the original paper. The differences between the two models and results are described in detail and lead to the conclusion that the original approach is to a large extent repeatable and that our simplified version does not overly sacrifice performance for generalizability.
ICCC, 2018
We describe a novel slogan generator that employs bisociation in combination with the selection of stylistic literary devices. Advertising slogans are a key marketing tool for every company, and a memorable slogan provides an advantage on the market. A good slogan is catchy and unique and projects the values of the company. To gain insight into the construction of such slogans, we first analyze a large corpus of advertising slogans in terms of alliteration, assonance, consonance and rhyme. Then we develop an approach for constructing slogans that contain these stylistic devices, which can help make the slogans easy to remember. At the same time, we use bisociation to imprint a unique message into the slogan by allowing the user to specify the original and bisociated domains from which the generator selects the words. These word sets are first expanded with the help of FastText embeddings and then used to fill in the empty slots in slogan skeletons generated from a database of existing slogans. We use a language model to increase the semantic cohesion of generated slogans and a relevance evaluation system to score the slogans by their connectedness to the selected domains. The evaluation of generated slogans for two companies shows that even though slogan generation is a hard problem, we can find some generated slogans that are suitable for use in production without any modification and a much larger number of slogans that are positively evaluated according to at least one criterion (e.g., humor, catchiness).
Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave, Dec 1, 2014
Depression is a mental illness that negatively affects a person's well-being and can, if left untreated, lead to serious consequences such as suicide. Therefore, it is important to recognize the signs of depression early. In the last decade, social media has become one of the most common places to express one's feelings. Hence, text processing and machine learning techniques can be applied to detect possible signs of depression. In this paper, we present our approaches to solving the shared task titled Detecting Signs of Depression from Social Media Text. We explore three different approaches to the challenge: fine-tuning a BERT model, leveraging AutoML for the construction of features and classifier selection, and exploring latent spaces derived from the combination of textual and knowledge-based representations. We ranked 9th out of 31 teams in the competition. Our best solution, based on knowledge graph and textual representations, was 4.9% behind the best model in terms of Macro F1, and only 1.9% behind in terms of Recall.
Companion Proceedings of the Web Conference 2021, Apr 19, 2021
Ontologies have been increasingly used for machine reasoning over the last few years. They can provide explanations of concepts or be used for concept classification if there exists a mapping from the desired labels to the relevant ontology. Another advantage of using ontologies is that they do not need a learning process, meaning that no training data or training time is needed before using them. This paper presents a practical use of an ontology for a classification problem from the financial domain. It first transforms a given ontology into a graph and proceeds with generalization with the aim of finding common semantic descriptions of the input sets of financial concepts. We present a solution to the shared task on Learning Semantic Similarities for the Financial Domain (FinSim-2 task). The task is to design a system that can automatically classify concepts from the financial domain into the most relevant hypernym concept in an external ontology, the Financial Industry Business Ontology. We propose a method that maps given concepts to the mentioned ontology and performs a graph search for the most relevant hypernyms. We also employ a word vectorization method and a machine learning classifier to supplement the method with a ranked list of labels for each concept.
Machine Learning, Apr 14, 2021
Learning from texts has been widely adopted throughout industry and science. While state-of-the-art neural language models have shown very promising results for text classification, they are expensive to (pre-)train and require large amounts of data and the tuning of hundreds of millions or more parameters. This paper explores how automatically evolved text representations can serve as a basis for an explainable, low-resource branch of models with competitive performance that are subject to automated hyperparameter tuning. We present autoBOT (automatic Bags-Of-Tokens), an autoML approach suitable for low-resource learning scenarios, where both the hardware and the amount of data required for training are limited. The proposed approach consists of an evolutionary algorithm that jointly optimizes various sparse representations of a given text (including word, subword, POS tag, keyword-based, knowledge graph-based and relational features) and two types of document embeddings (non-sparse representations). The key idea of autoBOT is that, instead of evolving at the learner level, evolution is conducted at the representation level. The proposed method offers competitive classification performance on fourteen real-world classification tasks when compared against a competitive autoML approach that evolves ensemble models, as well as state-of-the-art neural language models such as BERT and RoBERTa. Moreover, the approach is explainable, as the importance of the parts of the input space is part of the final solution yielded by the proposed optimization procedure, offering potential for meta-transfer learning.
arXiv (Cornell University), Jan 17, 2023
Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.
Database
The coronavirus disease 2019 (COVID-19) pandemic has been severely impacting global society since December 2019. The related findings such as vaccine and drug development have been reported in biomedical literature, at a rate of about 10 000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200 000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g. Diagnosis and Treatment) to the articles in LitCovid. The annotated topics have been widely used for navigating the COVID literature, rapidly locating articles of interest and other downstream studies. However, annotating the topics has been the bottleneck of manual curation. Despite the continuing advances in biomedical text-mining methods, few have been dedicated to topic annotation...
arXiv (Cornell University), Aug 15, 2022
Efficiently identifying keyphrases that represent a given document is a challenging task. In recent years, a plethora of keyword detection approaches have been proposed. These approaches can be based on statistical (frequency-based) properties of, e.g., tokens, on specialized neural language models, or on a graph-based structure derived from a given document. Graph-based methods can be among the most computationally efficient while maintaining retrieval performance. One of the main properties common to graph-based methods is their immediate conversion of the token space into graphs, followed by subsequent processing. In this paper, we explore a novel unsupervised approach which merges parts of a document in sequential form prior to construction of the token graph. Further, by leveraging personalized PageRank, which considers frequencies of such sub-phrases alongside token lengths during node ranking, we demonstrate state-of-the-art retrieval capabilities while being up to two orders of magnitude faster than current state-of-the-art unsupervised detectors such as YAKE and MultiPartiteRank. The proposed method's scalability was also demonstrated by computing keyphrases for a biomedical corpus comprised of 14 million documents in less than a minute.
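The ranking idea in this abstract (personalized PageRank over a token graph, with a personalization that reflects frequency and token length) can be sketched in a few lines. This is a simplified illustration, not the paper's method: it skips the sequential merging step, builds the graph from adjacent tokens only, and the weighting `frequency * length`, the function name `rank_tokens` and its defaults are assumptions.

```python
from collections import Counter, defaultdict


def rank_tokens(text, damping=0.85, iterations=50):
    """Rank tokens with personalized PageRank on an adjacency graph."""
    tokens = text.lower().split()
    if not tokens:
        return []
    freq = Counter(tokens)
    # Undirected graph linking tokens that appear next to each other.
    neighbors = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            neighbors[left].add(right)
            neighbors[right].add(left)
    nodes = sorted(freq)
    # Personalization vector: frequency times token length, normalized.
    weight = {tok: freq[tok] * len(tok) for tok in nodes}
    total = sum(weight.values())
    personal = {tok: weight[tok] / total for tok in nodes}
    rank = dict(personal)
    for _ in range(iterations):
        # Mass held by isolated tokens is redistributed via the
        # personalization vector so the scores keep summing to 1.
        dangling = sum(rank[tok] for tok in nodes if not neighbors[tok])
        new_rank = {}
        for tok in nodes:
            incoming = sum(rank[n] / len(neighbors[n]) for n in neighbors[tok])
            new_rank[tok] = (1 - damping) * personal[tok] + damping * (
                incoming + dangling * personal[tok]
            )
        rank = new_rank
    return sorted(rank.items(), key=lambda item: (-item[1], item[0]))


sample = "graph based keyphrase extraction uses a token graph"
for token, score in rank_tokens(sample)[:3]:
    print(token, round(score, 3))
```

Because the graph is built in a single pass over the token sequence and the iteration touches each edge a fixed number of times, the cost grows roughly linearly with document length, which is the general reason graph-based rankers of this kind can be fast.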