A Domain Independent Double Layered Approach to Keyphrase Generation
Related papers
Associating meaningful keyphrases to documents and web pages is an activity that can greatly increase the accuracy of Information Retrieval and Personalization systems, but the growing amount of text data available is too large for an extensive manual annotation. On the other hand, automatic keyphrase generation, a complex task involving Natural Language Processing and Knowledge Engineering, can significantly support this activity. Several different strategies have been proposed over the years, but most of them require extensive training data, which are not always available, suffer high ambiguity and differences in writing style, are highly domain-specific, and often rely on a well-structured knowledge that is very hard to acquire and encode. In order to overcome these limitations, we propose in this paper an innovative unsupervised and domain-independent approach that combines keyphrase extraction and keyphrase inference based on loosely structured, collaborative knowledge such...
Keyphrase Generation: A Multi-Aspect Survey
Proceedings of FRUCT 2019, the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, 2019
Extractive keyphrase generation research has been around since the nineties, but the more advanced abstractive approach based on the encoder-decoder framework and sequence-to-sequence learning has been explored only recently. In fact, more than a dozen abstractive methods have been proposed in the last three years, producing meaningful keyphrases and achieving state-of-the-art scores. In this survey, we examine various aspects of the extractive keyphrase generation methods and focus mostly on the more recent abstractive methods that are based on neural networks. We pay particular attention to the mechanisms that have driven the refinement of the latter. A huge collection of scientific article metadata and the corresponding keyphrases is created and released for the research community. We also present various keyphrase generation and text summarization research patterns and trends of the last two decades.
Unsupervised Open-domain Keyphrase Generation
arXiv (Cornell University), 2023
In this work, we study the problem of unsupervised open-domain keyphrase generation, where the objective is a keyphrase generation model that can be built without using human-labeled data and can perform consistently across domains. To solve this problem, we propose a seq2seq model that consists of two modules, namely a phraseness module and an informativeness module, both of which can be built in an unsupervised and open-domain fashion. The phraseness module generates phrases, while the informativeness module guides the generation towards those that represent the core concepts of the text. We thoroughly evaluate our proposed method using eight benchmark datasets from different domains. Results on in-domain datasets show that our approach achieves state-of-the-art results compared with existing unsupervised models, and overall narrows the gap between supervised and unsupervised methods down to about 16%. Furthermore, we demonstrate that our model performs consistently across domains, as it overall surpasses the baselines on out-of-domain datasets.
KEA: Practical automatic keyphrase extraction
1999
Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.
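The pipeline this abstract outlines (lexical candidate identification, per-candidate feature values, a learned model that scores candidates) can be illustrated with a minimal sketch. It is not Kea's actual implementation: the abstract does not name the features or the learner, so the two features used here (TF×IDF and relative position of first occurrence), the small Naive Bayes scorer, the stopword list, and the discretization thresholds are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Kea's list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "are", "on", "with", "from"}

def candidates(text, max_len=3):
    """Lexical step: n-grams (up to max_len words) that neither start nor end
    with a stopword. Returns ({phrase: (count, first_token_index)}, n_tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = {}
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            count, first = found.get(phrase, (0, i))
            found[phrase] = (count + 1, first)
    return found, len(tokens)

def document_frequencies(texts):
    """Number of documents in which each candidate phrase occurs."""
    df = Counter()
    for text in texts:
        df.update(candidates(text)[0].keys())
    return df

def features(text, doc_freq, n_docs):
    """Feature step: (TF x IDF, relative first-occurrence position) per candidate."""
    found, n_tokens = candidates(text)
    feats = {}
    for phrase, (count, first) in found.items():
        tf = count / n_tokens
        idf = math.log((n_docs + 1) / (doc_freq.get(phrase, 0) + 1))
        feats[phrase] = (tf * idf, first / n_tokens)
    return feats

def discretize(tfidf, first_pos):
    # Hypothetical thresholds, chosen only to keep the example small.
    return (tfidf > 0.02, first_pos < 0.25)

def train(labelled_docs, doc_freq, n_docs):
    """Learning step: count discretized feature values for keyphrase and
    non-keyphrase candidates in documents with known keyphrases."""
    counts = {True: [Counter(), Counter()], False: [Counter(), Counter()]}
    totals = Counter()
    for text, gold in labelled_docs:
        for phrase, (tfidf, pos) in features(text, doc_freq, n_docs).items():
            label = phrase in gold
            totals[label] += 1
            for k, value in enumerate(discretize(tfidf, pos)):
                counts[label][k][value] += 1
    return counts, totals

def score(model, tfidf, pos):
    """Naive Bayes posterior that a candidate is a keyphrase (Laplace-smoothed)."""
    counts, totals = model

    def joint(label):
        p = (totals[label] + 1) / (sum(totals.values()) + 2)
        for k, value in enumerate(discretize(tfidf, pos)):
            p *= (counts[label][k][value] + 1) / (totals[label] + 2)
        return p

    yes, no = joint(True), joint(False)
    return yes / (yes + no)

def extract(model, text, doc_freq, n_docs, top_k=5):
    """Prediction step: rank the candidates of a new document by model score."""
    feats = features(text, doc_freq, n_docs)
    ranked = sorted(feats, key=lambda p: score(model, *feats[p]), reverse=True)
    return ranked[:top_k]
```

A typical use would build `doc_freq` from a training corpus with `document_frequencies`, call `train` on documents paired with their author-assigned keyphrases, and then call `extract` on unseen text. Evaluation in the spirit of the abstract would count how many of the returned phrases match the author-assigned ones.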
A Semantic Metadata Generator for Web Pages Based on Keyphrase Extraction
The annotation of documents and web pages with semantic metadata is an activity that can greatly increase the accuracy of Information Retrieval and Personalization systems, but the growing amount of text data available is too large for an extensive manual process. On the other hand, automatic keyphrase generation and wikification can significantly support this activity. In this demonstration we present a system that automatically extracts keyphrases, identifies candidate DBpedia entities, and returns as output a set of RDF triples compliant with the Opengraph and the Schema.org vocabularies.
KD Strikes Back: from Keyphrases to Labelled Domains Using External Knowledge Sources
Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), 2016
This paper presents L-KD, a tool that relies on available linguistic and knowledge resources to perform keyphrase clustering and labelling. The aim of L-KD is to help find and trace themes in English and Italian text data, represented by groups of keyphrases and associated domains. We perform an evaluation of the top-ranked domains using the 20 Newsgroups dataset, and we show that 8 domains out of 10 match manually assigned labels. This confirms the good accuracy of this approach, which does not require supervision.
Domain-specific keyphrase extraction
1999
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited.
A Supervised Learning Approach for Automatic Keyphrase Extraction
Keyphrases, also referred to as keywords, represent semantic metadata and play an important role in capturing the main theme of a large text data collection. Although authors provide a list of about five to ten keywords in scientific publications that are used to map them to their respective domains, the exponential growth of non-scientific documents, either on the World Wide Web or in textual databases, calls for an automatic mechanism to identify the keyphrases embedded within them. In this paper, we propose the design of a lightweight machine learning approach to identify feasible keyphrases in text documents. The proposed method mines various lexical and semantic features from texts to learn a classification model. The efficacy of the proposed system is established through experimentation on datasets from three different domains.
A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language
Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, 2014
Associating meaningful keyphrases to text documents and Web pages is an activity that can significantly increase the accuracy of Information Retrieval, Personalization and Recommender systems, but the growing amount of text data available is too large for an extensive manual annotation. On the other hand, automatic keyphrase generation can significantly support this activity. This task is already performed with satisfactory results by several systems proposed in the literature; however, most of them focus solely on the English language, which represents more than 50% of Web contents. Only a few other languages have been investigated, and Italian, despite being the ninth most used language on the Web, is not among them. In order to overcome this shortage, we propose a novel multi-language, unsupervised, knowledge-based approach towards keyphrase generation. To support our claims, we developed DIKpE-G, a prototype system which integrates several kinds of knowledge for selecting and evaluating meaningful keyphrases, ranging from linguistic to statistical, meta/structural, social, and ontological knowledge. DIKpE-G performs well over English and Italian texts.
Keyphrase Generation: A Text Summarization Struggle
Proceedings of 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, USA, 2019
Authors’ keyphrases assigned to scientific articles are essential for recognizing content and topic aspects. Most of the proposed supervised and unsupervised methods for keyphrase generation are unable to produce terms that are valuable but do not appear in the text. In this paper, we explore the possibility of considering the keyphrase string as an abstractive summary of the title and the abstract. First, we collect, process and release a large dataset of scientific paper metadata that contains 2.2 million records. Then we experiment with popular text summarization neural architectures. Despite using advanced deep learning models, large quantities of data and many days of computation, our systematic evaluation on four test datasets reveals that the explored text summarization methods could not produce better keyphrases than the simpler unsupervised methods, or the existing supervised ones.