Ahmet Aker | The University of Sheffield
Papers by Ahmet Aker
lrec-conf.org
Emina Kurtić, Bill Wells, Guy J. Brown, Timothy Kempton, Ahmet Aker (Department of Computer Science; Department of Human Communication ...). ... large data sets of conversational phenomena can offer an integrated picture of the use of language and non-verbal cues in ...
Lecture Notes in Computer Science, 2010
This paper presents two different approaches to automatic captioning of geo-tagged images by summarizing multiple web documents that contain information related to an image's location: a graph-based and a statistical approach. The graph-based method uses text cohesion techniques to identify information relevant to a location. The statistical technique relies on frequency counts of words or noun phrases to identify pieces of information relevant to a location. Our results show that summaries generated using these two approaches indeed lead to higher ROUGE scores than the n-gram language models reported in previous work.
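A minimal sketch of the frequency-counting idea behind the statistical approach, assuming a simple regex tokenizer and sentence-level averaging; it is not the system evaluated in the paper:

```python
from collections import Counter
import re

def frequency_based_summary(documents, max_sentences=3):
    """Score sentences by the corpus frequency of their words and keep the
    top-scoring ones (a rough sketch of the frequency-counting approach)."""
    sentences = [s.strip() for doc in documents
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    words = [w.lower() for s in sentences for w in re.findall(r"[a-zA-Z']+", s)]
    freq = Counter(words)

    def score(sentence):
        tokens = [w.lower() for w in re.findall(r"[a-zA-Z']+", sentence)]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    return sorted(sentences, key=score, reverse=True)[:max_sentences]

docs = [
    "Edinburgh Castle dominates the skyline of Edinburgh. The castle sits on Castle Rock.",
    "The castle has been besieged many times. Visitors reach the castle via the Royal Mile.",
]
print(frequency_based_summary(docs, max_sentences=2))
```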
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012
Images with geo-tagging information are increasingly available on the Web. However, such images need to be annotated with additional textual information if they are to be retrievable, since users do not search by geo-coordinates. We propose to automatically generate such textual information by (1) generating toponyms from the geo-tagging information, (2) retrieving Web documents using the toponyms as queries, and (3) summarizing the retrieved documents. The summaries are then used to index the images. In this paper we investigate how various summarization techniques affect image retrieval performance and show that significant improvements can be obtained when using the summaries for indexing.
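A toy sketch of the proposed pipeline under heavy assumptions: the gazetteer lookup, web retrieval and summarizer below are stand-in stubs, and the index is a plain in-memory inverted index rather than a real retrieval engine:

```python
from collections import defaultdict

def toponyms_from_geotag(lat, lon):
    # Hypothetical gazetteer lookup; a real system would query a geographic service.
    gazetteer = {(55.948, -3.199): ["Edinburgh Castle", "Edinburgh", "Scotland"]}
    return gazetteer.get((lat, lon), [])

def retrieve_documents(query):
    # Stub standing in for a Web search using the toponym as query.
    return ["Edinburgh Castle is a historic fortress on Castle Rock in Edinburgh, Scotland."]

def summarize(documents, max_words=20):
    # Stub summarizer: first sentence of the first document, truncated.
    first = documents[0].split(". ")[0]
    return " ".join(first.split()[:max_words])

def index_images(images):
    """Build a toy inverted index from summary terms to image ids."""
    index = defaultdict(set)
    for image_id, (lat, lon) in images.items():
        for toponym in toponyms_from_geotag(lat, lon):
            summary = summarize(retrieve_documents(toponym))
            for term in summary.lower().split():
                index[term].add(image_id)
    return index

index = index_images({"img_001": (55.948, -3.199)})
print(sorted(index["fortress"]))
```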
aclweb.org
This paper proposes a method for automatically generating summaries that take into account the information in which users may be interested. Our approach relies on existing model summaries from tourist sites and captures from them the type of information humans use to describe places around the world. Relational patterns are first extracted and categorized by the type of information they encode. Then, we apply them to the collection of input documents to automatically extract the most relevant sentences and build the summaries. To evaluate the performance of our approach, we conduct two types of evaluation: we use ROUGE to assess the information contained in our summaries against existing human-written summaries, and we carry out a human readability evaluation. Our results indicate that our approach achieves high performance in both the ROUGE and the manual evaluation.
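The relational patterns in the paper are learned from model summaries and categorized by the type of information they encode; the hand-written regular expressions below are only an illustration of how such patterns can be applied to pick out relevant sentences:

```python
import re

# Illustrative relational patterns keyed by the type of information they capture;
# the real pattern inventory is derived from model summaries, not hand-written.
PATTERNS = {
    "location":   re.compile(r"\bis (?:located|situated) (?:in|on|near)\b", re.I),
    "foundation": re.compile(r"\bwas (?:built|founded|established) in \d{3,4}\b", re.I),
    "visiting":   re.compile(r"\bis open to (?:the public|visitors)\b", re.I),
}

def extract_relevant_sentences(documents):
    """Return sentences that match at least one relational pattern, grouped by type."""
    hits = {label: [] for label in PATTERNS}
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            for label, pattern in PATTERNS.items():
                if pattern.search(sentence):
                    hits[label].append(sentence.strip())
    return hits

docs = ["Bodiam Castle was built in 1385. It is located in East Sussex and is open to the public."]
print(extract_relevant_sentences(docs))
```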
Proceedings of LREC 2012, May 1, 2012
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages, parallel corpora are not readily available. To overcome this problem, previous work has recognized the potential of using comparable corpora as training data. The process of obtaining such data usually involves (1) downloading a separate list of documents for each language, (2) matching the documents between two languages, usually by comparing the document contents, and ...
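A rough sketch of the document-matching step using TF-IDF cosine similarity; real comparable-corpus alignment works across languages (typically after translating one side or mapping through a bilingual dictionary), whereas this toy assumes both sides share a vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_documents(source_docs, target_docs, threshold=0.2):
    """Pair each source document with its most similar target document
    by cosine similarity over TF-IDF vectors."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(source_docs + target_docs)
    src, tgt = matrix[: len(source_docs)], matrix[len(source_docs):]
    sims = cosine_similarity(src, tgt)
    pairs = []
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] >= threshold:
            pairs.append((i, int(j), float(row[j])))
    return pairs

src = ["Edinburgh Castle is a fortress in Scotland."]
tgt = ["A fortress in Scotland, Edinburgh Castle overlooks the city.", "Recipes for shortbread."]
print(match_documents(src, tgt))
```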
Proceedings of LREC, 2012
The emergence of crowdsourcing as a commonly used approach to collecting vast quantities of human assessments on a variety of tasks represents nothing less than a paradigm shift. This is particularly true in academic research, where it has suddenly become possible to collect (high-quality) annotations rapidly without the need for an expert. In this paper we investigate factors which can influence the quality of the results obtained through Amazon's Mechanical Turk crowdsourcing platform. We investigated the impact of different ...
This paper presents a detailed analysis of the use of crowdsourcing services for the text summarization task in the tourist domain. In particular, our aim is to retrieve relevant information about a place or an object pictured in an image in order to provide a short summary that will be of great help to a tourist. For tackling this task, we proposed a broad set of experiments using crowdsourcing services that could serve as a reference for others who also want to rely on crowdsourcing. From the analysis carried out through our experimental setup and the results obtained, we conclude that although crowdsourcing services were not good for simply gathering gold-standard summaries (i.e., from the results obtained for experiments 1, 2 and 4), the encouraging results obtained in the third and sixth experiments lead us to believe that they can be successfully employed for finding patterns in how humans behave when generating summaries, and for validating and checking other tasks. Furthermore, this analysis serves as a guideline for the types of experiments that might or might not work when using crowdsourcing in the context of text summarization.
In this article, we investigate what sorts of information humans request about geographical objects of the same type. For example, Edinburgh Castle and Bodiam Castle are two objects of the same type: "castle." The question is whether specific information is requested for the object type "castle" and how this information differs for objects of other types (e.g., church, museum, or lake).
Product and service reviews are abundantly available online, but selecting relevant information from them is time consuming. Starlet solves this problem by extracting multi-document summaries that consider aspect rating distributions and language modeling.
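A toy scoring function in the spirit of combining a language model with aspect rating distributions; it is not the Starlet algorithm itself, and the aspect shares, smoothing and candidate sentences are all assumptions:

```python
import math
from collections import Counter

def sentence_score(sentence, unigram_counts, total_words, aspect_terms, aspect_ratings):
    """Toy score: average unigram log-probability under the review collection,
    plus a bonus for mentioning frequently rated aspects."""
    tokens = sentence.lower().split()
    vocab = len(unigram_counts)
    lm = sum(math.log((unigram_counts[t] + 1) / (total_words + vocab)) for t in tokens)
    lm /= len(tokens) or 1
    coverage = sum(aspect_ratings.get(a, 0.0) for a in aspect_terms if a in tokens)
    return lm + coverage

reviews = "the room was clean . the staff was friendly . the room was small"
counts = Counter(reviews.split())
total = sum(counts.values())
ratings = {"room": 0.6, "staff": 0.4}   # assumed share of reviews rating each aspect
candidates = ["the room was clean", "breakfast not included"]
print(max(candidates, key=lambda s: sentence_score(s, counts, total, ["room", "staff"], ratings)))
```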
In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EuroVoc thesaurus. We test our approach on a held-out test set from EuroVoc and report precision, recall and F-measure for 20 European language pairs. The performance of our classifier reaches the 100% precision level for many language pairs. We also perform a manual evaluation of bilingual terms extracted from English-German term-tagged comparable corpora. This manual evaluation showed that 60-83% of the generated term pairs are exact translations and over 90% are exact or partial translations.
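A minimal sketch of treating term-pair extraction as binary classification with an SVM; the two features shown (character similarity and length ratio) are illustrative stand-ins rather than the paper's feature set, and the tiny hand-made training set stands in for EuroVoc term pairs:

```python
from difflib import SequenceMatcher
from sklearn.svm import SVC

def features(source_term, target_term):
    """Illustrative features only: character similarity and length ratio."""
    sim = SequenceMatcher(None, source_term.lower(), target_term.lower()).ratio()
    len_ratio = min(len(source_term), len(target_term)) / max(len(source_term), len(target_term))
    return [sim, len_ratio]

# Tiny training set standing in for EuroVoc term pairs (label 1 = translation pair).
train_pairs = [("information", "Information", 1), ("parliament", "Parlament", 1),
               ("information", "Fischerei", 0), ("parliament", "Umwelt", 0)]
X = [features(s, t) for s, t, _ in train_pairs]
y = [label for _, _, label in train_pairs]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([features("environment", "Umwelt"), features("environment", "Parlament")]))
```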
Different people may describe the same object in different ways, and at varied levels of granularity ("poodle", "dog", "pet" or "animal"?). In this paper, we propose the idea of 'granularity-aware' groupings, where semantically related concepts are grouped across different levels of granularity to capture the variation in how different people describe the same image content. The idea is demonstrated in the task of automatic image annotation, where these semantic groupings are used to alter the results of image annotation in a manner that affords different insights from its initial, category-independent rankings. The semantic groupings are also incorporated during evaluation against image descriptions written by humans. Our experiments show that semantic groupings result in image annotations that are more informative and flexible than those without groupings, although being too flexible may result in image annotations that are less informative.
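A small sketch of collapsing a ranked annotation list into granularity-aware groups; the hand-specified group table below is an assumption standing in for the semantic groupings derived in the paper:

```python
# Hand-specified granularity groups standing in for groupings derived from a
# concept hierarchy; labels not listed keep their own group.
GROUPS = {
    "poodle": "dog", "labrador": "dog", "dog": "dog", "cat": "cat",
    "pet": "animal", "animal": "animal",
}

def group_annotations(ranked_labels):
    """Collapse a ranked list of labels into granularity-aware groups,
    keeping each group at the rank of its best label."""
    seen, grouped = set(), []
    for label in ranked_labels:
        group = GROUPS.get(label, label)
        if group not in seen:
            seen.add(group)
            grouped.append(group)
    return grouped

print(group_annotations(["poodle", "dog", "grass", "pet", "cat"]))
# -> ['dog', 'grass', 'animal', 'cat']
```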
In this paper we describe and evaluate an approach to linking readers' comments to online news articles. For each comment that is linked, we also determine whether the commenter agrees, disagrees or stays neutral with respect to what is stated in the article, as well as what the commenter's sentiment towards the article is. We use similarity features to link comments to relevant article segments and Support Vector Regression models for assigning argument structure and sentiment. Our results are compared to competing systems that took part in the MultiLing OnForumS 2015 shared task, where we achieved the best linking scores for English and the second best for Italian.
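A compact sketch of the two steps, linking by cosine similarity over TF-IDF vectors and scoring sentiment with Support Vector Regression; the training data and the single-feature SVR are illustrative assumptions, not the models used in the shared-task system:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVR

segments = ["The council approved the new cycling lanes.",
            "Local businesses fear the roadworks will hurt trade."]
comment = "Great news, the cycling lanes are long overdue."

# Step 1: link the comment to its most similar article segment.
vec = TfidfVectorizer().fit(segments + [comment])
sims = cosine_similarity(vec.transform([comment]), vec.transform(segments))[0]
linked = int(np.argmax(sims))
print("linked segment:", linked, "similarity:", round(float(sims[linked]), 3))

# Step 2: toy SVR for sentiment in [-1, 1]; real models use annotated comments
# and far richer features than a single similarity value.
X_train = np.array([[0.9], [0.1], [0.5]])   # assumed feature values
y_train = np.array([0.8, -0.6, 0.1])        # assumed sentiment targets
sent_model = SVR(kernel="rbf").fit(X_train, y_train)
print("predicted sentiment:", float(sent_model.predict([[sims[linked]]])[0]))
```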
Online commenting on news articles provides a communication channel between media professionals and readers, offering a crucial tool for opinion exchange and freedom of expression. Currently, comments are detached from the news article and thus removed from the context that they were written for. In this work, we propose a method to connect readers' comments to the news article segments they refer to. We use similarity features to link comments to relevant article segments and evaluate both word-based and term-based vector spaces. Our results are comparable to state-of-the-art topic modeling techniques when used for linking tasks. We demonstrate that the representation of article segments and comments matters for linking accuracy, since we achieve better performance when similarity features are computed using similarity between terms rather than words.
Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which negatively affects the quality of the output of tools that rely on them. In this work, we present three different methods for cleaning noise from automatically generated bilingual dictionaries: an LLR-based, a pivot-based and a transliteration-based approach. We have applied these approaches to the GIZA++ dictionaries covering the official EU languages in order to remove noise. Our evaluation showed that all methods help to reduce noise. However, the best performance is achieved using the transliteration-based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones) free for download. We also provide the cleaning tools and scripts for free download.
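A crude stand-in for the transliteration-based filter, assuming both sides of each entry are already in Latin script so that plain string similarity can play the role of transliteration similarity; the threshold and example entries are assumptions:

```python
from difflib import SequenceMatcher

def clean_dictionary(entries, threshold=0.7):
    """Keep only entries whose source and target are string-similar, a crude
    stand-in for transliteration-based filtering of noisy dictionary entries."""
    kept = []
    for source, target in entries:
        similarity = SequenceMatcher(None, source.lower(), target.lower()).ratio()
        if similarity >= threshold:
            kept.append((source, target, round(similarity, 2)))
    return kept

noisy = [("parliament", "parlament"),   # plausible cognate pair
         ("parliament", "donnerstag"),  # noise from misalignment
         ("telephone", "telefon")]
print(clean_dictionary(noisy))
```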
In this paper we investigate a number of questions relating to the identification of the domain of a term by domain classification of the document in which the term occurs. We propose and evaluate a straightforward method for domain classification of documents in 24 languages that exploits a multilingual thesaurus and Wikipedia. We investigate and provide quantitative results about the extent to which humans agree about the domain classification of documents and terms, as well as the extent to which terms are likely to "inherit" the domain of their parent document.
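A minimal sketch of domain classification by term-list overlap; the tiny domain term lists are assumptions standing in for EuroVoc descriptors and Wikipedia-derived terms, not the resources used in the paper:

```python
from collections import Counter
import re

# Tiny domain term lists standing in for EuroVoc descriptors and Wikipedia category terms.
DOMAIN_TERMS = {
    "agriculture": {"crop", "farm", "harvest", "livestock"},
    "finance":     {"bank", "loan", "interest", "budget"},
    "environment": {"emission", "pollution", "climate", "habitat"},
}

def classify_domain(document):
    """Assign the domain whose term list overlaps most with the document's tokens."""
    tokens = Counter(re.findall(r"[a-z]+", document.lower()))
    scores = {domain: sum(tokens[t] for t in terms) for domain, terms in DOMAIN_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

doc = "The new budget raises interest rates and cuts the farm subsidy programme."
print(classify_domain(doc))   # overlaps most with the finance term list
```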
Terminology extraction resources are needed for a wide range of human language technology applications, including knowledge management, information extraction, semantic search, cross-language information retrieval and automatic and assisted translation. We report a low-cost method for creating terminology extraction resources for 21 non-English EU languages. Using parallel corpora and a projection method, we create a General POS Tagger for these languages. We also investigate the use of EuroVoc terms and Wikipedia to automatically create a term grammar for each language. Our results show that these automatically generated resources can assist the term extraction process, achieving similar performance to manually generated resources. All POS tagger and term grammar resources resulting from this work are freely available for download.
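A minimal sketch of the projection step behind the POS tagger, assuming word alignments are already given; the real pipeline runs this over full parallel corpora and then trains a tagger on the projected annotations:

```python
def project_pos_tags(source_tags, alignment, target_length):
    """Project POS tags from a tagged source sentence to its target translation
    via word alignments; unaligned target words keep a placeholder tag."""
    target_tags = ["UNK"] * target_length
    for src_index, tgt_index in alignment:
        target_tags[tgt_index] = source_tags[src_index]
    return target_tags

# English: "the green house"  ->  German: "das grüne Haus"
source_tags = ["DET", "ADJ", "NOUN"]
alignment = [(0, 0), (1, 1), (2, 2)]   # (source position, target position)
print(project_pos_tags(source_tags, alignment, target_length=3))
# -> ['DET', 'ADJ', 'NOUN']
```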
The availability of parallel corpora (i.e., documents that are exact translations of each other in different languages) has been a key factor in the development of Machine Translation (MT) technology, in particular data-driven (statistical and example-based) MT. Parallel corpora, however, are not always easily available. This is the case for certain under-resourced languages or under-resourced domains. Nevertheless, the richness of the Web can allow the discovery of pieces of text in different languages that could be used to ...
This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres and with a similar level of comparability in order to use them as a starting point in defining criteria and metrics of comparability. These criteria and metrics will be applied to comparable texts to determine their suitability for use in Statistical Machine Translation, particularly in the case where translation is performed from or into under-resourced languages for which substantial parallel corpora are unavailable. The size of the collected corpora is about 1 million words for each under-resourced language.
This paper reports an initial study that aims to assess the viability of a state-of-the-art multi-document summarizer for automatic captioning of geo-referenced images. The automatic captioning procedure requires summarizing multiple web documents that contain information related to the images' locations. We use SUMMA (Saggion and Gaizauskas, 2005) to generate generic and query-based multi-document summaries and evaluate them using the ROUGE evaluation metrics (Lin, 2004) relative to human-generated summaries. Results show that, even though query-based summaries perform better than generic ones, they still do not select the information that human participants do. In particular, the areas of interest that human summaries cover (history, travel information, etc.) are not contained in the query-based summaries. For our future work in automatic image captioning, this result suggests that developing the query-based summarizer further and biasing it to account for user-specific requirements will prove worthwhile.
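The study evaluates summaries with Lin's (2004) ROUGE toolkit; the snippet below shows equivalent recall-oriented scoring with the rouge-score reimplementation (pip install rouge-score), using invented example summaries rather than data from the paper:

```python
from rouge_score import rouge_scorer

human_summary = ("Edinburgh Castle is a historic fortress which dominates the skyline "
                 "of Edinburgh from its position on Castle Rock.")
system_summary = "Edinburgh Castle is a fortress on Castle Rock in the city of Edinburgh."

# Compare a system summary against a human reference with ROUGE-1, ROUGE-2 and ROUGE-L.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(human_summary, system_summary)
for metric, result in scores.items():
    print(metric, round(result.recall, 3), round(result.fmeasure, 3))
```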