Alberto Barron - Academia.edu
Papers by Alberto Barron
Lecture Notes in Computer Science, 2018
We present an overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims, with focus on Task 1: Check-Worthiness. The task asks to predict which claims in a political debate should be prioritized for fact-checking. In particular, given a debate or a political speech, the goal was to produce a ranked list of its sentences based on their worthiness for fact-checking. We offered the task in both English and Arabic, based on debates from the 2016 US Presidential Campaign, as well as on some speeches during and after the campaign. A total of 30 teams registered to participate in the Lab, and seven teams actually submitted systems for Task 1. The most successful approaches used by the participants relied on recurrent and multi-layer neural networks, as well as on combinations of distributional representations, on matching claims' vocabulary against lexicons, and on measures of syntactic dependency. The best systems achieved mean average precision of 0.18 and 0.15 on the English and Arabic test datasets, respectively. This leaves large room for further improvement, and thus we release all datasets and the scoring scripts, which should enable further research in check-worthiness estimation.
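The evaluation measure used above, mean average precision (MAP), can be sketched as follows. The toy relevance labels are invented for illustration, and this is a minimal reimplementation of the metric, not the Lab's official scoring script.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of binary labels
    (1 = check-worthy, 0 = not), in ranked order."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this cut-off
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP: mean of average precision over all debates/speeches."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Toy example: two debates, sentences ranked by a hypothetical system.
print(round(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]), 3))  # → 0.667
```

A MAP of 0.18, as reported for the best English system, thus means check-worthy sentences tend to sit well below the top of the produced rankings.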
Lecture Notes in Computer Science, 2013
The development of models for automatic detection of text re-use and plagiarism across languages has received increasing attention in recent years. However, the lack of an evaluation framework composed of annotated datasets has caused these efforts to remain isolated. In this paper we present the CL!TR 2011 corpus, the first manually created corpus for the analysis of cross-language text re-use between English and Hindi. The corpus was used during the Cross-Language !ndian Text Re-Use Detection Competition. Here we give an overview of the approaches applied by the contestants and evaluate their quality at detecting a re-used text together with its source.
Natural Language Processing and Information Systems, 2011
The Internet has made huge amounts of information available, including source code. Source code repositories and, in general, programming-related websites facilitate its reuse. In this work, we propose a simple approach to the detection of cross-language source code reuse, a scarcely investigated problem. Our preliminary experiments, based on character n-gram comparison, show that considering different sections of the code (i.e., comments, code, reserved words, etc.) leads to different results. For the three programming languages considered, C++, Java, and Python, the best result is obtained when comments are discarded and the entire source code is considered.
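A character n-gram comparison of the kind described can be sketched as below. The trigram length, the cosine measure, and the code snippets are illustrative choices, not necessarily the authors' exact configuration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Counter of overlapping character n-grams of the given text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy cross-language pair: the same routine in C++ and Python.
cpp = "int sum(int a, int b) { return a + b; }"
py = "def sum(a, b): return a + b"
print(round(cosine(char_ngrams(cpp), char_ngrams(py)), 2))
```

Shared substrings such as identifier names and `return a + b` survive the change of programming language, which is why character n-grams pick up cross-language reuse at all.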
Lecture Notes in Computer Science, 2010
The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged commission, plagiarism, has become an important problem in Information Retrieval. This task requires the exhaustive comparison of texts in order to determine how similar they are. However, such comparison is infeasible when the number of documents is too high. Therefore, we have designed a model for the pre-selection of closely related documents, on which the exhaustive comparison can be performed afterwards. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of the texts, substituting each word by its length. This provides three important advantages: (i) as the alphabet of the documents is reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing a more flexible and fast comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly decrease the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
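The word-length encoding above can be sketched as follows. Capping lengths at 9 is my assumption for keeping the alphabet at nine symbols; the sample sentence is invented, and the paper's exact preprocessing may differ.

```python
def length_encode(text, max_len=9):
    """Replace each word by the digit of its length, capped at 9 so the
    document alphabet contains at most nine symbols."""
    return "".join(str(min(len(w), max_len)) for w in text.split())

def length_ngrams(encoded, n=3):
    """Set of overlapping n-grams over the length-encoded string."""
    return {encoded[i:i + n] for i in range(len(encoded) - n + 1)}

doc = "the automatic detection of shared content in written documents"
enc = length_encode(doc)
print(enc)  # → 399267279  (one digit per word)
print(sorted(length_ngrams(enc, 3)))
```

Comparing the small digit-trigram sets of two documents is far cheaper than comparing their word n-grams, which is the point of the pre-selection step: cheap filtering first, exhaustive comparison only on the survivors.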
ACM SIGIR Forum, 2011
The Fourth International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN10) was held in conjunction with the 2010 Conference on Multilingual and Multimodal Information Access Evaluation (CLEF-10) in Padua, Italy. The workshop was organized as a competition covering two tasks: plagiarism detection and Wikipedia vandalism detection. This report gives a short overview of the plagiarism detection task. Detailed analyses of both tasks have been published as CLEF Notebook Papers [3, 6], which can be downloaded at www.webis.de/publications.
Lecture Notes in Computer Science, 2009
When automatic plagiarism detection is carried out against a reference corpus, a suspicious text is compared to a set of original documents in order to relate the plagiarised text fragments to their potential sources. One of the biggest difficulties in this task is to locate plagiarised fragments that have been modified (by rewording, insertion, or deletion, for example) with respect to the source text. The definition of proper text chunks as comparison units between the suspicious and original texts is crucial for the success of this kind of application. Our experiments with the METER corpus show that the best results are obtained when comparing low-level word n-grams (n = {2, 3}).
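A word n-gram comparison of this kind is often expressed as a containment measure; the sketch below uses bigrams and invented sentences, and the paper's exact measure may differ.

```python
def word_ngrams(text, n):
    """Set of word-level n-grams of the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspicious, source, n=2):
    """Share of the suspicious text's n-grams that also occur in the source."""
    s = word_ngrams(suspicious, n)
    return len(s & word_ngrams(source, n)) / len(s) if s else 0.0

src = "the minister announced the new policy on tuesday"
susp = "the minister announced a brand new policy on tuesday"
print(containment(susp, src, n=2))  # → 0.625
```

Short n-grams (n = 2 or 3) survive local edits such as the inserted "a brand" above, which is consistent with the finding that low-level word n-grams work best on reworded fragments.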
Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, 2012
This work addresses the issue of cross-language high-similarity and near-duplicate search, where, for a given document, a highly similar one is to be identified in a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different natures and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models, and we find that, though the proposed model is very generic, it produces competitive results and is notably stable and consistent across the corpora.
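A concept-based similarity of this kind can be sketched as below. The tiny bilingual lexicon and the concept IDs are invented for illustration (Eurovoc is far larger), and the Jaccard measure is an assumption, not necessarily the model's exact formula.

```python
# Hypothetical mini-lexicon mapping terms in either language to shared concept IDs.
CONCEPTS = {
    "agriculture": "c_0711", "landwirtschaft": "c_0711",
    "employment": "c_4406", "beschäftigung": "c_4406",
    "trade": "c_2006", "handel": "c_2006",
}

def concept_set(text):
    """Map a document to the set of concepts its terms trigger."""
    return {CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS}

def concept_similarity(doc_a, doc_b):
    """Jaccard similarity over shared concept IDs; language-independent."""
    a, b = concept_set(doc_a), concept_set(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(concept_similarity("trade and employment", "Handel und Beschäftigung"))  # → 1.0
```

Because documents are reduced to small sets of language-independent concept IDs, the comparison needs no translation step and very little memory, which matches the model's stated lightness.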
Language Resources and Evaluation, 2010
Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections of the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (i) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences from monolingual plagiarism detection; (ii) state-of-the-art solutions for two important subtasks are reviewed; (iii) retrieval models for the assessment of cross-language similarity are surveyed; and (iv) the three models CL-CNG, CL-ESA, and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all six languages English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice for ranking and comparing texts across languages if they are syntactically related. CL-ESA almost matches the performance of CL-CNG, but on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.
Journal of Algorithms, 2009
The existence of huge volumes of documents written in multiple languages on the Internet has led to the investigation of novel approaches for dealing with information of this kind. We propose a statistical approach to tackle cross-lingual natural language tasks. In particular, we apply IBM alignment model 1 with the aim of obtaining a statistical bilingual dictionary, which may further be used to approximate the relatedness probability of two given documents (written in different languages). The experimental results successfully obtained in three different tasks, text classification, information retrieval, and plagiarism analysis, highlight the benefit of the presented statistical approach.
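The EM estimation behind IBM Model 1 can be sketched as follows; the toy Spanish-English sentence pairs are invented for illustration, and a real bilingual dictionary would of course be trained on a large parallel corpus.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """EM estimation of IBM Model 1 translation probabilities t(f | e)
    from sentence-aligned pairs (source_words, target_words)."""
    src_vocab = {w for src, _ in pairs for w in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per source word
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)  # normalization over alignments
                for e in src:
                    c = t[(f, e)] / z            # expected fractional count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():          # M-step: re-estimate t
            t[(f, e)] = c / total[e]
    return t

# Toy parallel corpus (English source, Spanish target).
pairs = [
    (["the", "house"], ["la", "casa"]),
    (["the", "book"], ["el", "libro"]),
    (["a", "house"], ["una", "casa"]),
]
t = ibm_model1(pairs)
print(round(t[("casa", "house")], 2))
```

After a few iterations, "casa" captures most of the probability mass of "house", because "the" also co-occurs with "libro" and "el" and so its mass is pulled elsewhere; the resulting t(f | e) table is exactly the kind of statistical bilingual dictionary the abstract describes.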
Computational Linguistics, 2013
Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little attention has been paid to its analysis in the framework of automatic plagiarism detection. Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism, paying special attention to which paraphrase phenomena underlie acts of plagiarism and which of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10 corpus for automatic plagiarism detection. The results of the Second International Competition on Plagiarism Detection were analyzed in the light of this annotation. The presented experiments show that (i) more complex paraphrase phenomena and a high density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical su...
Automatic plagiarism detection against a reference corpus compares a suspicious text to a set of documents in order to relate the plagiarised fragments to their potential sources. The suspicious and source documents can be written either in the same language (monolingual) or in different languages (cross-lingual). In the context of the Ph.D., our work has focused on both monolingual and cross-lingual plagiarism detection. The monolingual approach is based on a search-space reduction process followed by an exhaustive word n-gram comparison. Surprisingly, it seems that the application of the reduction process has not been explored for this task before. The cross-lingual approach is based on the well-known IBM-1 alignment model. Having a competition on these topics will make our work available to the Spanish scientific community interested in plagiarism detection.
Proceedings of the 23rd …, Aug 23, 2010
Plagiarism, the unacknowledged reuse of text, does not end at language boundaries. Cross-language plagiarism occurs when a text is translated from a fragment written in a different language and no proper citation is provided. Regardless of the change of language, the contents and, in particular, the ideas remain the same. Whereas different methods for the detection of monolingual plagiarism have been developed, less attention has been paid to the cross-language case.
Detección de reuso de código fuente entre lenguajes de programación con base en la frecuencia de términos (Detection of source code reuse across programming languages based on term frequency). Enrique Flores, Alberto Barrón-Cedeño, Paolo Rosso, and Lidia Moreno. Universidad Politécnica de Valencia, Dpto. de Sistemas Informáticos y Computación, Camino de Vera ...