Extending Neural Keyword Extraction with TF-IDF tagset matching

TNT-KID: Transformer-based neural tagger for keyword identification

Natural Language Engineering

With growing amounts of available textual data, the development of algorithms capable of automatic analysis, categorization, and summarization of these data has become a necessity. In this research, we present a novel algorithm for keyword identification, that is, the extraction of single- or multiword phrases representing key aspects of a given document, called Transformer-Based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model overcomes deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction: it offers competitive and robust performance on a variety of datasets while requiring only a fraction of the manually labeled data required by the best-performing systems. This study also offers thorough error analysis with valuable insights into the inner workings of the model and an abl...

Transforming Term Extraction: Transformer-Based Approaches to Multilingual Term Extraction Across Domains

2021

Automated Term Extraction (ATE), even though well-investigated, continues to be a challenging task. Approaches conventionally extract terms on corpus or document level and the benefits of neural models still remain underexplored with very few exceptions. We introduce three transformer-based term extraction models operating on sentence level: a language model for token classification, one for sequence classification, and an innovative use of Neural Machine Translation (NMT), which learns to reduce sentences to terms. All three models are trained and tested on the dataset of the ATE challenge TermEval 2020 in English, French, and Dutch across four specialized domains. The two best performing approaches are also evaluated on the ACL RD-TEC 2.0 dataset. Our models outperform previous baselines, one of which is BERT-based, by a substantial margin, with the token-classifier language model performing best.

TransKP: Transformer based Key-Phrase Extraction

2020

Increased connectivity has led to a sharp rise in the creation and availability of structured and unstructured text content, with millions of new documents being generated every minute. Key-phrase extraction is the process of finding the most important words and phrases which best capture the overall meaning and topics of a text document. Common techniques follow supervised or unsupervised methods for extractive or abstractive key-phrase extraction, but struggle to perform well and generalize to different datasets. In this paper, we follow a supervised, extractive approach and model the key-phrase extraction problem as a sequence labeling task. We utilize the power of transformers on sequential tasks and explore the effect of initializing the embedding layer of the model with pre-trained weights. We test our model on different standard key-phrase extraction datasets and our results significantly outperform all baselines as well as state-of-the-art scores on all the datasets.
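To illustrate what modeling key-phrase extraction as sequence labeling involves, here is a minimal sketch of the decoding step, in which per-token labels are turned back into phrases. The B/I/O tag scheme and the function name `decode_bio` are assumptions for illustration; the abstract does not say which labeling scheme the authors use.

```python
from typing import List


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect key-phrases from per-token tags (assumed scheme: B, I, O)."""
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; close any phrase that was still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the currently open phrase.
            current.append(token)
        else:
            # "O", or a stray "I" with no open phrase: close and reset.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases
```

In such a setup, a tagger would predict one label per token and a decoder like this one would produce the final key-phrase list.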

The Recent Advances in Automatic Term Extraction: A survey

arXiv (Cornell University), 2023

Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.

Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora

2019

We propose a new tagger for the Czech language and particularly for the tagset used for annotation of corpora of the TenTen family. The tagger is based on neural networks with pretrained word embeddings. We selected the newest Czech Web corpus of the TenTen family as training data, but we removed sentences with phenomena that were often annotated incorrectly. We let the tagger learn the annotation of these phenomena on its own. We also experimented with the recognition of multi-word expressions since this information can support the correct tagging. We evaluated the tagger on 6,950 sentences (84,023 tokens) from the cstenten17 corpus and achieved 75.25% accuracy when compared by tags. When compared by attributes, we achieved 91.62% accuracy; the accuracy of POS tag prediction is 96.5%.

Keyword Extraction Performance Analysis

2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)

This paper presents a survey-cum-evaluation of methods for the comprehensive comparison of the task of keyword extraction using datasets of various sizes, forms, and genres. We use four different datasets, which include Amazon product data-Automotive, SemEval 2010, TMDB and Stack Exchange. Moreover, a subset of 100 Amazon product reviews is annotated and utilized for evaluation in this paper, to our knowledge, for the first time. Datasets are evaluated by five Natural Language Processing approaches (3 unsupervised and 2 supervised), which include TF-IDF, RAKE, TextRank, LDA and Shallow Neural Network. We use a tenfold cross-validation scheme and evaluate the performance of the aforementioned approaches using recall, precision and F-score. Our analysis and results provide guidelines on the proper approaches to use for different types of datasets. Furthermore, our results indicate that certain approaches achieve improved performance with certain datasets due to inherent characteristics of the data.
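As a reminder of how precision, recall and F-score are typically computed for keyword extraction, the sketch below compares a predicted keyword set with a gold set using exact matching. The function name `precision_recall_f1` and the exact-match criterion are assumptions; the abstract does not state how matches are counted (for example, whether stemming is applied).

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str],
                        gold: Iterable[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two keyword collections."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under cross-validation, these scores would usually be computed per fold and then averaged.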

Ensembling Transformers for Cross-domain Automatic Term Extraction

From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries

Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformer-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multiword terms, we also experiment with ensembles of mono- and multilingual models by taking the intersection or union of the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from the related work leveraging multilingual models, regarding all the languages except Dutch and French if the term extraction task excludes the extraction of named entity terms. Furthermore, by combining the outputs of the two best performing models, we achieve significant improvements.
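The ensembling by intersection or union of term output sets can be sketched in a few lines. The function name `ensemble_terms` and its `mode` parameter are hypothetical; the sketch only shows the set operations and none of the models themselves.

```python
from typing import Iterable, List, Set


def ensemble_terms(outputs: List[Iterable[str]], mode: str = "union") -> Set[str]:
    """Combine the term sets produced by several extractors."""
    term_sets = [set(terms) for terms in outputs]
    if not term_sets:
        return set()
    if mode == "union":
        # Favors recall: keep a term if any model extracted it.
        return set().union(*term_sets)
    if mode == "intersection":
        # Favors precision: keep a term only if every model extracted it.
        return set.intersection(*term_sets)
    raise ValueError(f"unknown mode: {mode!r}")
```

The choice between the two modes is a trade-off: union tends to raise recall at the cost of precision, and intersection does the opposite.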

Cross-Lingual Information to the Rescue in Keyword Extraction

Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014

We introduce a method that extracts keywords in one language with the help of another. In our approach, we bridge and fuse conventionally irrelevant word statistics across languages. The method involves estimating preferences for keywords w.r.t. domain topics and generating cross-lingual bridges for word statistics integration. At run-time, we transform parallel articles into word graphs, build cross-lingual edges, and exploit PageRank with word keyness information for keyword extraction. We present the system, BiKEA, that applies the method to keyword analysis. Experiments show that keyword extraction benefits from PageRank, globally learned keyword preferences, and cross-lingual word statistics interaction which respects language diversity.
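One plausible reading of "PageRank with word keyness information" is a PageRank whose teleport distribution is biased by a per-word keyness score. The sketch below implements that generic idea on an undirected word graph. It is not BiKEA's actual algorithm: the function name `biased_pagerank`, the damping value, and the way keyness enters the computation are all assumptions, and every word that appears in an edge is assumed to have a keyness entry.

```python
from typing import Dict, Iterable, Tuple


def biased_pagerank(edges: Iterable[Tuple[str, str]],
                    keyness: Dict[str, float],
                    damping: float = 0.85,
                    iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank with a keyness-weighted teleport vector."""
    nodes = sorted(keyness)
    total = sum(keyness.values())
    prior = {n: keyness[n] / total for n in nodes}
    neighbours = {n: set() for n in nodes}
    for a, b in edges:  # undirected co-occurrence (or cross-lingual) edges
        neighbours[a].add(b)
        neighbours[b].add(a)
    rank = dict(prior)
    for _ in range(iterations):
        # Mass held by isolated words is redistributed according to the prior.
        dangling = sum(rank[n] for n in nodes if not neighbours[n])
        rank = {
            n: (1 - damping) * prior[n]
            + damping * (sum(rank[m] / len(neighbours[m]) for m in neighbours[n])
                         + dangling * prior[n])
            for n in nodes
        }
    return rank
```

Words with the highest resulting scores would then be taken as keyword candidates; cross-lingual edges could be represented as ordinary edges between words of the two languages.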

Extracting and Analyzing Features in Natural Language Processing for Deep Learning with English Language

International Journal of Research Publication and Reviews, 2023

Natural Language Processing (NLP) is a field of study that develops software capable of interpreting human speech for mechanical use. Words are the building blocks of advanced grammatical and semantic analysis, and word segmentation is often the first order of business for natural language processing. This paper introduces the feature extraction method of deep learning and applies the ideas of deep learning to multi-modal feature extraction in order to address the practical problem of huge structural differences between different data modalities in a multi-modal environment. In this study, we present a neural network that can process information from several sources at once. Each mode is represented by a separate multilayer sub-neural network structure. Its purpose is to transform features from one mode to another. In order to solve the issues of current word segmentation techniques not being able to ensure long-term reliance on text semantics and lengthy training prediction time, a hybrid network English word segmentation processing approach is presented. This approach uses the BI-GRU (Bidirectional Gated Recurrent Unit) to segment English words and the CRF (Conditional Random Field) model to sequentially annotate sentences, which eliminates the long-distance dependency of text semantics and reduces the time needed to train the network and predict its performance. Compared to the BI-LSTM-CRF (Bidirectional-Long Short Term Memory-Conditional Random Field) model, the experimental results reveal that this technique achieves equivalent processing effects on word segmentation, while also boosting processing efficiency by a factor.

A novel, Language-Independent Keyword Extraction method

2013

Obtaining the most representative set of words in a document is a very significant task, since it allows characterizing the document and simplifies search and classification activities. This paper presents a novel method, called LIKE, that automatically extracts keywords from a document regardless of the language used in it. To do so, it uses a three-stage process: the first stage identifies the most representative terms, the second stage builds a numeric representation that is appropriate for those terms, and the third one uses a feed-forward neural network to obtain a predictive model. To measure the efficacy of the LIKE method, the articles published by the Workshop of Computer Science Researchers (WICC) in the last 14 years (1999-2012) were used. The results obtained show that LIKE is better than the KEA method, which is one of the most widely mentioned solutions in the literature on this topic.