Paolo Rosso | Universitat Politècnica de València
Papers by Paolo Rosso
Lecture Notes in Computer Science, 2007
Among the various document clustering algorithms proposed so far, the most useful are those that automatically reveal the number of clusters and assign each target document to exactly one cluster. However, in many real situations there is no exact boundary between different clusters. In this work, we introduce a fuzzy version of the MajorClust algorithm. The proposed clustering method assigns documents to more than one category by taking into account a membership function for both the edges and the nodes of the corresponding underlying graph. Thus, the clustering problem is formulated in terms of weighted fuzzy graphs. The fuzzy approach makes it possible to reduce some negative effects that appear when clustering large corpora with noisy data.
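A minimal Python sketch of the relabeling step that underlies MajorClust, with a fuzzy membership read-out derived from edge weights; the graph representation and the membership definition below are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

def majorclust_fuzzy(weights, max_iter=100):
    """Toy MajorClust with a fuzzy membership read-out.

    weights: dict mapping (u, v) node pairs to similarity weights
    (an undirected weighted graph). Returns hard labels plus, for
    each node, the normalized edge-weight mass toward each cluster.
    """
    # Build an adjacency view of the undirected graph.
    adj = defaultdict(dict)
    for (u, v), w in weights.items():
        adj[u][v] = w
        adj[v][u] = w

    # Start with every node in its own cluster.
    label = {n: i for i, n in enumerate(adj)}

    for _ in range(max_iter):
        changed = False
        for n in adj:
            # Sum edge weight toward each neighboring cluster.
            mass = defaultdict(float)
            for m, w in adj[n].items():
                mass[label[m]] += w
            if mass:
                best = max(mass, key=mass.get)
                if best != label[n]:
                    label[n] = best
                    changed = True
        if not changed:
            break

    # Fuzzy memberships: fraction of a node's edge mass per cluster,
    # so a document can belong to several clusters to different degrees.
    membership = {}
    for n in adj:
        mass = defaultdict(float)
        for m, w in adj[n].items():
            mass[label[m]] += w
        total = sum(mass.values()) or 1.0
        membership[n] = {c: w / total for c, w in mass.items()}
    return label, membership
```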
Lecture Notes in Computer Science, 2011
Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism: modifications made in bad faith, such as the introduction of spam and other inappropriate content.
Proceedings of the 4th ACM workshop on Geographical information retrieval - GIR '07, 2007
In this paper we compare two methods for the automatic identification of geographical articles in encyclopedic resources such as Wikipedia: a WordNet-based method that uses a set of keywords related to geographical places, and a multinomial Naïve Bayes classifier trained over a randomly selected subset of the English Wikipedia. This task can be framed within the broader task of Named Entity classification, a well-known problem in the field of Natural Language Processing. The experiments were carried out considering both the full text of the articles and only the definition of the entity described in each article. The results show that the information contained in the page templates and the category labels is more useful than the text of the articles.
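As an illustration of the second method, a minimal multinomial Naïve Bayes text classifier with scikit-learn; the example texts and labels are placeholders for the Wikipedia sample used in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data standing in for the Wikipedia sample.
texts = ["Paris is the capital and largest city of France ...",
         "A binary tree is a tree data structure in which ..."]
labels = ["geo", "non-geo"]

# Bag-of-words counts feeding a multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["The Danube flows through ten countries"]))
```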
Lecture Notes in Computer Science, 2006
This paper describes how we used the WordNet ontology for the GeoCLEF 2005 English monolingual task. We tested both a query expansion method, based on the expansion of geographical terms by means of WordNet synonyms and meronyms, and a method based on the expansion of index terms, which exploits WordNet synonyms and holonyms. The results show that the query expansion method was not suitable for the GeoCLEF track, while WordNet could be used more effectively during the indexing phase.
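A sketch of the query-side expansion using NLTK's WordNet interface (assuming the wordnet corpus is installed); the paper's actual expansion rules may differ.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_geo_term(term):
    """Expand a geographical query term with WordNet synonyms and meronyms."""
    expansion = set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        # Synonyms: other lemma names of the same synset.
        expansion.update(l.replace('_', ' ') for l in synset.lemma_names())
        # Meronyms of a place are the places it contains.
        for mero in synset.part_meronyms() + synset.member_meronyms():
            expansion.update(l.replace('_', ' ') for l in mero.lemma_names())
    expansion.discard(term)
    return sorted(expansion)

# e.g. the countries WordNet lists as parts of Europe
print(expand_geo_term('europe'))
```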
Lecture Notes in Computer Science, 2007
This paper presents an indexing technique based on WordNet synonyms and holonyms, developed for the Geographical Information Retrieval task. It may help in finding implicit geographical information contained in texts, particularly when the indication of the containing geographical entity is omitted. Our experiments were carried out with the Lucene search engine over the GeoCLEF 2006 set of topics. The results show that expansion can improve recall in some cases, although a specific ranking function is needed to obtain better results in terms of precision.
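A sketch of the index-term expansion step using NLTK's WordNet interface; the paper builds the index with Lucene, which is not shown here, and the helper below is an illustrative assumption.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_index_terms(tokens):
    """Return extra index terms: WordNet synonyms and holonyms of each token.

    Holonyms make the containing entity explicit, so a document that
    mentions only a city can also be retrieved via its country.
    """
    extra = set()
    for tok in tokens:
        for synset in wn.synsets(tok, pos=wn.NOUN):
            extra.update(l.replace('_', ' ') for l in synset.lemma_names())
            for holo in synset.part_holonyms() + synset.member_holonyms():
                extra.update(l.replace('_', ' ') for l in holo.lemma_names())
    return extra - set(tokens)

print(expand_index_terms(['valencia']))  # containing entities become index terms
```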
Lecture Notes in Computer Science, 2007
This paper presents a simple approach to the Wikipedia Question Answering pilot task at CLEF 2006. The approach ranks the snippets retrieved by the Lucene search engine using a similarity measure based on bags of words extracted from both the snippets and the Wikipedia articles. We participated in the monolingual English and Spanish tasks and obtained the best results in the Spanish one.
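A minimal sketch of bag-of-words snippet ranking; the paper only states that the similarity is based on bags of words, so the Jaccard overlap used below is an assumption.

```python
def bow_similarity(snippet, article):
    """Jaccard overlap between two bags of words, as a crude relevance signal."""
    a, b = set(snippet.lower().split()), set(article.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_snippets(snippets, article):
    """Order retrieved snippets by their word overlap with the article."""
    return sorted(snippets, key=lambda s: bow_similarity(s, article), reverse=True)
```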
Lecture Notes in Computer Science, 2012
Question Answering is an Information Retrieval task where the query is posed in natural language and the expected result is a concise answer. Voice-activated Question Answering systems represent an interesting application in which the question is formulated by speech. In these systems, an Automatic Speech Recognition module can be used to transcribe the question; recognition errors may thus be introduced, with a significant effect on the answer retrieval process. In this work we study the relationship between some features of misrecognized words and the retrieval results. The features considered are the redundancy of a word in the result set and its inverse document frequency calculated over the collection. The results show that the redundancy of a word may be an important clue as to whether an error in it will degrade the retrieval results, at least if a closed model is used for speech recognition.
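The two features can be computed as below; this is a minimal sketch, with whitespace tokenization as a simplifying assumption.

```python
import math

def idf(word, collection):
    """Inverse document frequency of a word over the whole collection."""
    df = sum(1 for doc in collection if word in doc.lower().split())
    return math.log(len(collection) / df) if df else 0.0

def redundancy(word, result_set):
    """Number of retrieved passages containing the word: high redundancy
    suggests a misrecognition of that word is less likely to hurt retrieval."""
    return sum(1 for passage in result_set if word in passage.lower().split())
```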
Lecture Notes in Computer Science, 2009
This paper describes our approach to the Question Answering - Word Sense Disambiguation task, which consists of carrying out Question Answering over a disambiguated document collection. In our approach, the disambiguated documents are used to improve the accuracy of the retrieval phase. To this end, we added a WordNet-expanded index to the document collection; the expanded index contains synonyms, hypernyms and holonyms of the words already in the documents. Question words are searched for in both the expanded WordNet index and the default index. The results show that the system that exploited disambiguation obtained better precision than the non-WSD one.
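A toy sketch of searching both indexes, with plain dictionaries standing in for the inverted indexes; the relative weighting of expanded matches is an assumption, not the paper's setting.

```python
def search_dual_index(query_terms, default_index, expanded_index):
    """Score documents by hits in both the default index and the
    WordNet-expanded index (the weights here are illustrative).

    Each index maps a term to the ids of the documents containing it.
    """
    scores = {}
    for term in query_terms:
        for doc in default_index.get(term, ()):
            scores[doc] = scores.get(doc, 0.0) + 1.0
        for doc in expanded_index.get(term, ()):
            # Matches via expansion terms count for less than direct matches.
            scores[doc] = scores.get(doc, 0.0) + 0.5
    return sorted(scores, key=scores.get, reverse=True)
```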
Lecture Notes in Computer Science, 2006
The disambiguation of verbs is usually considered more difficult than that of other part-of-speech categories. This is due both to the high polysemy of verbs compared with the other categories and to the lack of lexical resources providing relations between verbs and nouns. One such resource is WordNet, which provides plenty of information and relationships for nouns but is less comprehensive with respect to verbs. In this paper we focus on the disambiguation of verbs by means of Support Vector Machines and WordNet-extracted features based on the hypernyms of context nouns.
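A minimal sketch of the feature extraction and classifier, assuming NLTK's WordNet interface and scikit-learn; the first-sense heuristic, the sense labels and the training examples are illustrative assumptions.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def hypernym_features(context_nouns):
    """Binary features: WordNet hypernyms of the nouns around the verb."""
    feats = {}
    for noun in context_nouns:
        for synset in wn.synsets(noun, pos=wn.NOUN)[:1]:  # first-sense heuristic
            for hyper in synset.hypernyms():
                feats[hyper.name()] = 1
    return feats

# Toy training data: context nouns -> hypothetical verb sense labels.
X = [hypernym_features(["bank", "money"]), hypernym_features(["river", "boat"])]
y = ["draw#withdraw", "draw#pull"]

clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([hypernym_features(["cash", "account"])]))
```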
Lecture Notes in Computer Science, 2013
Lecture Notes in Computer Science, 2007
In this paper we present some results obtained in humour classification over a corpus of Italian quotations manually extracted and tagged from the Wikiquote project. The experiments were carried out using both a multinomial Naïve Bayes classifier and a Support Vector Machine (SVM). The considered features range from single words to n-grams and sentence length. The results show that it is possible to identify the funny quotes even with the simplest features (bag of words); the Bayesian classifier performed better than the SVM. However, the corpus is too small to support definitive assertions.
Lecture Notes in Computer Science, 2009
We present a method that uses GeoWordNet for Geographical Information Retrieval. During the indexing phase, all places are disambiguated and assigned their coordinates on the world map. Documents are first retrieved by means of a term-based search method and then re-ranked according to the geographical information. The results show that map-based re-ranking improves on the results obtained by the base system, which relies only on textual information.
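A sketch of map-based re-ranking using a haversine distance; the score-blending scheme and its alpha weight are assumptions, not the paper's ranking function.

```python
import math

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def rerank(results, query_point, alpha=0.5):
    """Blend the text score with geographic closeness to the query footprint.

    results: dicts with 'text_score' and 'coords' (lat, lon) per document.
    """
    def blended(r):
        geo = 1.0 / (1.0 + haversine_km(r["coords"], query_point))
        return alpha * r["text_score"] + (1 - alpha) * geo
    return sorted(results, key=blended, reverse=True)
```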
Lecture Notes in Computer Science, 2005
@misc{Alex_clusteranalysis, author = {Mikhail Alex and Emilio Sanchis and Paolo Rosso}, title = {Cluster Analysis of Railway Directory Inquire Dialogs}, year = {}}
Proceeding of the 2nd international workshop on Geographic information retrieval - GIR '08, 2008
Toponym Disambiguation, i.e., the task of assigning place names their correct referents in the world, is receiving increasing attention from researchers. Many methods have been proposed to date, making use of different resources, techniques and sense inventories. Unfortunately, a gold standard for the evaluation of those methods is not yet available; therefore, it is difficult to verify their performance. Recently, a georeferenced version of WordNet has been developed, a resource that can be used to compare methods based on geographical data with methods that use textual information. In this paper we carry out a comparison between two such methods. The results show that the knowledge-based method allowed us to obtain better results with a smaller context size; on the other hand, we observed that the map-based method needs a large context to obtain good accuracy.
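A toy sketch of the map-based strategy, which picks the candidate referent closest on average to the unambiguous toponyms already found in the context; this is one common formulation, not necessarily the exact method compared in the paper.

```python
import math

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def disambiguate(candidates, context_points):
    """Pick the candidate (lat, lon) minimizing the mean distance to the
    unambiguous toponyms in the surrounding context."""
    def avg_dist(c):
        return sum(haversine_km(c, p) for p in context_points) / len(context_points)
    return min(candidates, key=avg_dist)

# "Cambridge": UK vs. Massachusetts, with a UK-centered context.
print(disambiguate([(52.2, 0.12), (42.37, -71.11)], [(51.5, -0.13)]))
```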
This report describes our approach to the Robust - Word Sense Disambiguation task. We applied the same index expansion technique used in 2008 for the Question Answering WSD task, with the addition of pseudo (blind) relevance feedback. In our approach, a WordNet-expanded index is generated from the disambiguated document collection; this index contains synonyms, hypernyms and holonyms of the disambiguated words contained in the documents. Query words are searched for in both the expanded WordNet index and the default index. The results show that the use of the extended index did not prove useful, yielding 14-16% lower MAP with respect to the base system.
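A minimal sketch of pseudo (blind) relevance feedback; the cut-offs below are illustrative, not the values used in the experiments.

```python
from collections import Counter

def blind_feedback(query_terms, ranked_docs, top_docs=5, top_terms=10):
    """Pseudo relevance feedback: assume the top-ranked documents are
    relevant and add their most frequent new terms to the query."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(w for w in doc.lower().split() if w not in query_terms)
    return list(query_terms) + [w for w, _ in counts.most_common(top_terms)]
```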
This paper investigates the effectiveness of using the redundancy of the web for solving the Word Sense Disambiguation task. The web-based algorithm looks for adjective-noun pairs on the web in order to disambiguate an English noun. Preliminary results show that a better precision than the baseline is obtained, but with low recall. Moreover, the web seems to be more effective than WordNet Domains when the two are integrated rather than used stand-alone.
Recently, the relation between the entropy of words (a new measure from Information Theory introduced by Montemurro in 2001) and the role of words in literary texts, as well as the capacity of entropy for clustering words, has been shown. Our final goal is to ...
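A minimal sketch of a part-wise word entropy in the spirit of Montemurro's measure; the exact normalization and the randomized baseline of the original measure are not reproduced here.

```python
import math

def word_entropy(word, parts):
    """Entropy of a word's distribution over equal-sized parts of a text.

    Low entropy: the word clusters in few sections (topical role).
    High entropy: the word spreads evenly (function-word-like role).
    """
    freqs = [part.lower().split().count(word) for part in parts]
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

# Usage: split a text into, say, 8 equal parts and compare words.
text_parts = ["the whale appeared near the ship"] * 4 + ["the crew slept below"] * 4
print(word_entropy("whale", text_parts), word_entropy("the", text_parts))
```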