Nirmala Pudota - Academia.edu

Papers by Nirmala Pudota

Exploration of Open Large Language Models for eDiscovery

Generative AI Text Classification using Ensemble LLM Approaches

arXiv (Cornell University), Sep 13, 2023

Large Language Models (LLMs) have shown impressive performance across a variety of Artificial Intelligence (AI) and natural language processing tasks, such as content creation and report generation. However, unregulated malign application of these models can create undesirable consequences, such as the generation of fake news and plagiarism. As a result, accurate detection of AI-generated language can be crucial for the responsible usage of LLMs. In this work, we explore 1) whether a given body of text is AI-generated or written by a human, and 2) attribution of a specific language model in generating a body of text. Texts in both English and Spanish are considered. The datasets used in this study are provided as part of the Automated Text Identification (AuTexTification) shared task. For each of the research objectives stated above, we propose an ensemble neural model that generates probabilities from different pre-trained LLMs, which are then used as features for a Traditional Machine Learning (TML) classifier. For the first task of distinguishing between AI- and human-generated text, our model ranked in fifth and thirteenth place (with macro F1 scores of 0.733 and 0.649) for English and Spanish texts, respectively. For the second task on model attribution, our model ranked in first place with macro F1 scores of 0.625 and 0.653 for English and Spanish texts, respectively.
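
As a concrete reading of the ensemble described above, the sketch below stacks class probabilities from several pre-trained transformer classifiers as features for a traditional meta-classifier. The model names, the logistic-regression meta-classifier, and the helper function are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch (not the authors' exact setup): class probabilities
# from several pre-trained transformer classifiers are concatenated and fed
# to a traditional ML meta-classifier. In practice each constituent model
# would first be fine-tuned on the AuTexTification training data.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.linear_model import LogisticRegression

MODEL_NAMES = ["roberta-base", "distilbert-base-uncased"]  # hypothetical choices

def probability_features(texts, model_names=MODEL_NAMES):
    """Concatenate each model's class probabilities into one feature vector."""
    features = []
    for name in model_names:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**enc).logits, dim=-1)
        features.append(probs.numpy())
    return np.hstack(features)  # shape: (n_texts, n_models * n_classes)

# Hypothetical usage, with train_texts/train_labels from the shared-task data:
# meta = LogisticRegression().fit(probability_features(train_texts), train_labels)
# predictions = meta.predict(probability_features(test_texts))
```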

A Simple yet Efficient Ensemble Approach for AI-generated Text Detection

arXiv (Cornell University), Nov 5, 2023

Recent Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across a wide range of styles and genres. However, such capabilities are prone to potential abuse, such as fake news generation, spam email creation, and misuse in academic assignments. Hence, it is essential to build automated approaches capable of distinguishing between artificially generated and human-authored text. In this paper, we propose a simple yet efficient solution to this problem by ensembling predictions from multiple constituent LLMs. Compared to previous state-of-the-art approaches, which are perplexity-based or use ensembles with a large number of LLMs, our condensed ensembling approach uses only two constituent LLMs to achieve comparable performance. Experiments conducted on four benchmark datasets for generative text classification show performance improvements in the range of 0.5 to 100% compared to previous state-of-the-art approaches. We also study the influence that training data from individual LLMs has on model performance. We found that substituting commercially restrictive Generative Pre-trained Transformer (GPT) data with data generated from other open language models such as Falcon, Large Language Model Meta AI (LLaMA2), and Mosaic Pretrained Transformers (MPT) is a feasible alternative when developing generative text detectors. Furthermore, to demonstrate zero-shot generalization, we experimented with an English essays dataset, and the results suggest that our ensembling approach can handle new data effectively.
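
The condensed two-model ensemble can be pictured as a simple combination of two detectors' probability outputs; the soft-voting rule below is an assumed illustration, not necessarily the exact combination used in the paper.

```python
# Hypothetical soft-voting combination of exactly two constituent detectors:
# average their class probabilities and take the argmax.
import numpy as np

def ensemble_predict(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """probs_*: shape (n_texts, 2), columns = P(human), P(AI)."""
    return ((probs_a + probs_b) / 2.0).argmax(axis=1)  # 0 = human, 1 = AI
```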

Leveraging deep survival models to predict quality of care risk in diverse hospital readmissions

Scientific Reports

Hospital readmission rates are reportedly high and have placed a huge financial burden on health care systems in many countries. The readmission rate is viewed as an important indicator of health care providers' quality of care. We examine the use of machine learning-based survival analysis to assess quality-of-care risk in hospital readmissions. This study applies various survival models to explore the risk of hospital readmission given patient demographics and their respective hospital discharges extracted from a health care claims dataset. We explore advanced feature representation techniques such as BioBERT and Node2Vec to encode high-dimensional diagnosis code features. To our knowledge, this study is the first to apply deep learning-based survival-analysis models for predicting hospital readmission risk agnostic of specific medical conditions and of a fixed readmission window. We found that modeling the time from discharge date to readmission date as a Weibull distribution as in the SparseDeepWei...
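
For intuition about the Weibull formulation mentioned above: under a Weibull model of time-to-readmission, the survival function is S(t) = exp(-(t/λ)^k), so the probability of readmission within a window of t days has a closed form. The toy sketch below uses made-up parameters and is not the paper's model.

```python
# Toy illustration of Weibull time-to-readmission (parameters are made up):
# S(t) = exp(-(t/scale)**shape), so P(readmission by day t) = 1 - S(t).
import math

def readmission_risk(t_days: float, shape: float, scale: float) -> float:
    """Probability of readmission within t_days under Weibull(shape, scale)."""
    return 1.0 - math.exp(-((t_days / scale) ** shape))

# A deep survival model would predict shape/scale from the encoded
# diagnosis-code features; here we just plug in example values.
print(readmission_risk(30, shape=1.2, scale=90.0))  # 30-day risk
```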

An Ensemble-Based Approach for Generative Language Model Attribution

Lecture Notes in Computer Science, Dec 31, 2022

Toward Semantic Digital Libraries: Exploiting Web 2.0 and Semantic Services in Cultural Heritage

Journal of Digital Information, Dec 21, 2009

Developing and maintaining a digital library requires substantial investments that are not simply a matter of technological decisions, but also include organizational issues (user roles, workflows, types of contents, etc.). These issues are often handled by approaches based ...

Handbook of research on Web 2.0, 3.0, and X.0: technologies, business, and social applications

Choice Reviews Online

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark. Library of Congress Cataloging-in-Publication Data: Handbook of research on Web 2.0, 3.0, and X.0: technologies, business, and social applications / San Murugesan, editor. p. cm. Includes bibliographical references and index. Summary: "This book provides a comprehensive reference source on next generation Web technologies and their applications" (provided by publisher).

Toward Semantic Digital Libraries: Exploiting Semantic Web in Cultural Heritage

Developing and maintaining a digital library requires substantial investments that are not simply a matter of technological decisions but also involve organizational aspects (which user roles are involved in content production, which workflows are needed, etc.). Moreover, starting a digital library initiative requires tackling several issues, such as the introduction of new user roles, workflows, and types of content. These issues are often handled by approaches based on a physical perspective, which treats the stored information either in terms of data formats or of the physical space needed to archive it. Such perspectives almost completely ignore the semantic aspects of the digital contents. In this paper we address this semantic perspective. More specifically, we propose a service-oriented architecture that explicitly includes a semantic layer providing primitive services to the applications built on top of the digital library. As part of this layer, a specific component is described: the PIRATES framework. This module assists end users in completing several tasks concerning the retrieval of the content most relevant to a description of their information needs (a search query, a user profile, etc.). Techniques of user modeling, adaptive personalization, and knowledge representation are exploited to build the PIRATES services in order to fill the gap between traditional and semantic digital libraries.

Towards Bridging the Gap between Personalization and Information Extraction

IRCDL, 2008

In this paper we propose to integrate Information Extraction and Adaptive Personalization in order to enhance information access and the Web search experience. We describe the PIE (Personalized Information Extraction) architecture, which exploits zz-structures for organizing information and user profiles for capturing personal user interests in digital libraries. We apply our model to the Bibliomed system in order to extend its functionality.

A New Machine Learning Based Approach for Sentiment Classification of Italian documents

IRCDL, 2008

Several sites allow users to publish personal reviews of products and services available on the market. In this paper, we consider the problem of applying classification techniques to identify, in terms of positive or negative degree, the overall opinion polarity expressed in these documents, which are written in natural language. In particular, we are interested in evaluating the performance obtained by applying machine learning techniques, based on n-gram selection and originally developed for English, to documents written in Italian. In order to obtain results comparable to those presented in the literature for English, we use the same evaluation procedure applied in the majority of works in this field. We have developed a specific framework for experimentation in Italian. The research is ongoing, and we present some preliminary results, a comparison with results reported in the literature, and an overview of our future work.
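
A minimal n-gram pipeline of the kind evaluated in the paper might look like the sketch below; the scikit-learn vectorizer settings and the linear SVM are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal n-gram sentiment classifier for Italian reviews (illustrative:
# the paper's exact features and learner may differ).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reviews = ["Prodotto eccellente, lo consiglio.",    # positive
           "Pessimo servizio, sono molto deluso."]  # negative
labels = [1, 0]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # unigram + bigram presence
    LinearSVC(),
)
clf.fit(reviews, labels)
print(clf.predict(["Servizio eccellente, lo consiglio."]))
```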

Organizers

The Social Web can be defined as a network similar to today's World Wide Web, linking people, organizations, and concepts rather than documents. The main principle behind the Social Web is to harness the collective wisdom of communities of users. Over the last few years, we have observed the growth of several Social Web technologies: social tagging, social networking, social search, social navigation, and collaborative sharing and publishing. These technologies have been implemented in social systems such as Facebook (social networking), LiveJournal (blogging), and Del.icio.us (social bookmarking). They are all characterized by user-contributed information and knowledge in the form of user-created content and user feedback. The growth of social systems and the abundance of user-created information highlight the importance of adaptation and personalization. Collective information distilled by social technologies is an excellent source for adaptation reasoning. While a set of cl...

A General Framework for Personalized Text Classification and Annotation

The tremendous volume of digital content available today on the Web and the rapid spread of Web 2.0 sites, blogs, and forums have exacerbated the classical information overload problem. Moreover, they have made even harder the challenge of finding new content appropriate to individual needs. In order to alleviate these issues, new approaches and tools are needed to provide personalized content recommendation and classification schemata. This paper presents the PIRATES framework: a Personalized Intelligent tag Recommendation and Annotation TEStbed for text-based content retrieval and classification. Using an integrated set of tools, this framework lets users experiment with, customize, and personalize the way they retrieve, filter, and organize the large amount of information available on the Web. Furthermore, the PIRATES framework undertakes a novel approach that automates typical manual tasks such as content annotation and tagging, by means of personalized tag recommendations and ot...

Accessing, Analyzing, and Extracting Information from User Generated Contents

Technologies, Business, and Social Applications, 2010

The concepts of the participative Web, mass collaboration, and collective intelligence grow out of a set of Web methodologies and technologies that improve interaction with users in the development, rating, and distribution of user-generated content (UGC). UGC is one of the cornerstones of Web 2.0 and the core concept of several different kinds of applications. UGC suggests new value chains and business models, and it opens innovative social, cultural, and economic opportunities and impacts. However, several open issues concerning the semantic understanding and management of digital information available on the Web, such as information overload, heterogeneity of the available content, and effectiveness of retrieval, remain unsolved. The research experiences we present in this chapter, described in the literature or achieved in our research laboratory, are aimed at reducing the gap between users and information understanding, by means of collaborative and cognitive filtering, sentiment analysis, i...

A Keyphrase-Based Paper Recommender System

Communications in Computer and Information Science, 2011

Current digital libraries suffer from the information overload problem, which prevents effective access to knowledge. This is particularly true for scientific digital libraries, where a growing number of scientific articles can be explored by users with different needs, backgrounds, and interests. Recommender systems can tackle this limitation by filtering resources according to specific user needs. This paper introduces a content-based recommendation approach for enhancing access to scientific digital libraries in which a keyphrase extraction module is used to produce a rich description of both the content of papers and user interests.
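
One plausible instantiation of such keyphrase-based matching is to score papers by the similarity between their extracted keyphrases and the keyphrases in a user profile; the TF-IDF/cosine scoring below is an assumption for illustration, and the paper identifiers and phrases are made up.

```python
# Illustrative keyphrase-based matching (the scoring choice is assumed):
# rank papers by cosine similarity between their keyphrases and the user's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paper_keyphrases = {
    "paper1": "keyphrase extraction digital libraries recommender systems",
    "paper2": "survival analysis hospital readmission risk",
}
user_interests = "digital libraries keyphrase extraction personalization"

vec = TfidfVectorizer()
matrix = vec.fit_transform(list(paper_keyphrases.values()) + [user_interests])
n = len(paper_keyphrases)
scores = cosine_similarity(matrix[n], matrix[:n]).ravel()
for paper_id, score in sorted(zip(paper_keyphrases, scores), key=lambda x: -x[1]):
    print(paper_id, round(score, 3))  # papers ordered by interest match
```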

A New Domain Independent Keyphrase Extraction System

Communications in Computer and Information Science, 2010

In this paper we present a keyphrase extraction system that can extract potential phrases from a single document in an unsupervised, domain-independent way. We extract word n-grams from the input document and incorporate linguistic knowledge (i.e., part-of-speech tags) and statistical information (i.e., frequency, position, lifespan) about each n-gram in defining candidate phrases and their respective feature sets. The proposed approach can be applied to any document; however, in order to assess the effectiveness of the system for digital libraries, we carried out the evaluation on a set of scientific documents and compared our results with current keyphrase extraction systems.
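
A simplified version of the candidate-identification stage described above might combine POS-filtered n-grams with frequency and first-position statistics, as in the sketch below; NLTK and the specific noun/adjective filter are assumptions, not the system's actual implementation.

```python
# Simplified candidate identification (illustrative, not the actual system):
# keep word n-grams whose tokens are all nouns or adjectives, then rank by
# frequency and by how early the phrase first appears.
from collections import Counter
import nltk  # assumes the 'punkt' and POS-tagger data packages are downloaded

def candidate_phrases(text: str, max_n: int = 3):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    freq, first_pos = Counter(), {}
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            if all(tag.startswith(("NN", "JJ")) for _, tag in gram):
                phrase = " ".join(tok.lower() for tok, _ in gram)
                freq[phrase] += 1
                first_pos.setdefault(phrase, i)
    # frequent phrases that appear early rank first
    return sorted(freq, key=lambda p: (-freq[p], first_pos[p]))

print(candidate_phrases("Keyphrase extraction finds key phrases in documents.")[:5])
```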

Automatic keyphrase extraction and ontology mining for content-based tag recommendation

International Journal of Intelligent Systems, 2010

Collaborative tagging represents a potential way for the Web to organize and share information and to heighten the capabilities of existing search engines. However, because of the lack of automatic methodologies for generating tags and supporting the tagging activity, many resources on the Web are deficient in tag information, and recommending appropriate tags is both a current open issue and an exciting challenge. This paper approaches the problem by applying a combined set of techniques and tools (using tags, domain ontologies, and keyphrase extraction methods) to generate tags automatically. The proposed approach is implemented in the PIRATES (Personalized Intelligent tag Recommender and Annotator TEStbed) framework, a prototype system for personalized content retrieval, annotation, and classification. A case study application is developed using a domain ontology for software engineering.
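
The combination of keyphrase extraction with a domain ontology can be read as: recommend as tags those extracted keyphrases that match ontology concept labels. The toy ontology and the exact-match rule below are illustrative, not the PIRATES implementation.

```python
# Toy sketch of ontology-backed tag recommendation (not the PIRATES code):
# extracted keyphrases that name concepts in a small software-engineering
# ontology are promoted to tags, annotated with their concept category.
ONTOLOGY = {  # hypothetical concept labels -> categories
    "unit testing": "Testing",
    "design pattern": "Design",
    "code review": "Quality Assurance",
}

def recommend_tags(keyphrases, ontology=ONTOLOGY):
    """Keep keyphrases that match ontology concepts, with their category."""
    return [(kp, ontology[kp]) for kp in keyphrases if kp in ontology]

print(recommend_tags(["design pattern", "deadline", "unit testing"]))
```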

A New Machine Learning Based Approach for Sentiment Classification of Italian documents

celfi.unimc.it

Several sites allow users to publish personal reviews of products and services available on the market. In this paper, we consider the problem of applying classification techniques to identify, in terms of positive or negative degree, the overall opinion polarity expressed ...

Towards bridging the gap between personalization and information extraction

IRCDL

The explosive growth and popularity of the World Wide Web has resulted in a massive number of information sources on the Internet, creating a scenario where the answers to users' information needs are available online somewhere, in some format; but in ...

Toward Semantic Digital Libraries: Exploiting Web 2.0 and Semantic Services in Cultural Heritage

Journal of Digital Information, 2009

Developing and maintaining a digital library requires substantial investments that are not simply a matter of technological decisions but also involve organizational aspects (which user roles are involved in content production, which workflows are needed, etc.). Moreover, starting a digital library initiative requires tackling several issues, such as the introduction of new user roles, workflows, and types of content. These issues are often handled by approaches based on a physical perspective, which treats the stored information either in terms of data formats or of the physical space needed to archive it. Such perspectives almost completely ignore the semantic aspects of the digital contents. In this paper we address this semantic perspective. More specifically, we propose a service-oriented architecture that explicitly includes a semantic layer providing primitive services to the applications built on top of the digital library. As part of this layer, a specific component is described: the PIRATES framework. This module assists end users in completing several tasks concerning the retrieval of the content most relevant to a description of their information needs (a search query, a user profile, etc.). Techniques of user modeling, adaptive personalization, and knowledge representation are exploited to build the PIRATES services in order to fill the gap between traditional and semantic digital libraries.
