Simon Overell | Imperial College London
Papers by Simon Overell
This paper presents the work of the MMIS group at ImageCLEF 2008. The results for three tasks are presented: Visual Concept Detection Task (VCDT), ImageCLEFphoto and ImageCLEFwiki. We combine image annotations, CBIR, textual relevance and a geographic filter using our generic data fusion method. We also compare methods for BRF and clustering. Our top performing method in the VCDT enhances supervised learning by modifying probabilities based on a matrix that records how terms appear together. Although it ranked in the top quartile of submitted runs, the enhancement did not provide a statistically significant improvement. In the ImageCLEFphoto task we demonstrate that evidence from image retrieval can contribute to retrieval; however, we have yet to find a way of combining text and image evidence that improves on the baseline. Due to the relative performance of the different sources of evidence in ImageCLEFwiki and our failure to improve over a baseline we conclude ...
In this paper we describe our Geographic Information Retrieval experiments with Forostar, our GIR application, on the GeoCLEF 2007 corpus and query set. We compare results from two orthogonal retrieval methods, text with no geographic entities and geographic entities only, against standard text retrieval and combined text and geographic relevance methods. The text and named entity analysis and retrieval methods of Forostar are described in detail. We also detail our placename disambiguation and geographic relevance ranking methods. The paper concludes with an analysis of our results, including significance testing, where we show our baseline method to be, in fact, the best. Finally we identify weaknesses in our approach and ways in which the system could be optimised and improved.
In this paper we describe the development of a geographic co-occurrence model and how it can be applied to geographic information retrieval. The model consists of mining co-occurrences of placenames from Wikipedia, and then mapping these placenames to locations in the Getty Thesaurus of Geographical Names. We begin by quantifying the accuracy of our model and compute theoretical bounds for the accuracy achievable when applied to placename disambiguation in free text. We conclude with a discussion of the improvement such a model could provide ...
In this paper we present our Geographic Information Retrieval System, Forostar, and the results of three experiments. We compare two data fusion methods, and show that a simple geographic filter outperforms a penalty-based system. We compare context-based disambiguation to a default gazetteer and show no significant difference. Finally we compare a unique geographic index to an ambiguous geographic index. The ambiguous index outperformed all other methods and was statistically significantly better than the baseline. ... Publication Name: Working ...
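The difference between the two fusion styles named in this abstract can be illustrated with a minimal sketch. It is not Forostar's implementation: the function names, the document scores and the penalty factor of 0.5 are all assumptions chosen for illustration. A filter discards documents that do not match the query's geography, while a penalty keeps them but scales their text score down.

```python
from typing import Dict, List, Set


def geographic_filter(text_scores: Dict[str, float], geo_matches: Set[str]) -> Dict[str, float]:
    """Keep only documents that match the query geography; text scores are unchanged."""
    return {doc: score for doc, score in text_scores.items() if doc in geo_matches}


def geographic_penalty(
    text_scores: Dict[str, float], geo_matches: Set[str], penalty: float = 0.5
) -> Dict[str, float]:
    """Keep every document, but scale down those outside the query geography.

    The penalty factor is a hypothetical value, not one taken from the paper.
    """
    return {
        doc: score if doc in geo_matches else score * penalty
        for doc, score in text_scores.items()
    }


def ranked(scores: Dict[str, float]) -> List[str]:
    """Order document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical text-retrieval scores and the set of geographically matching documents.
text_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.3}
geo_matches = {"d2", "d3"}

filtered = ranked(geographic_filter(text_scores, geo_matches))
penalised = ranked(geographic_penalty(text_scores, geo_matches))
```

With these made-up scores the filter drops d1 entirely, whereas the penalty leaves it in the ranking between d2 and d3.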
A method of creating a word game comprising receiving a seed value from a browser, obtaining from a media database a plurality of words associated with the seed value, creating a word game from at least a subset of the obtained plurality of words, integrating the word game into a browser interpretable document, and returning the browser interpretable document to the browser. Some embodiments further comprise incorporating into the browser interpretable document an advertisement associated with the seed value and/or ...
Department of Computing, Jul 1, 2009
This thesis aims to augment the Geographic Information Retrieval process with information extracted from world knowledge. This aim is approached from three directions: classifying world knowledge, disambiguating placenames and modelling users. Geographic information is becoming ubiquitous across the Internet, with a significant proportion of web documents and web searches containing geographic entities, and the proliferation of Internet-enabled mobile devices. Traditional information retrieval treats these geographic ...
An improved system and method for classifying tags of content using a hyperlinked corpus of classified web pages is provided. An anchor text index may be searched to find anchor texts that may match text of the tag, documents referenced by the matching anchor texts may be found, and the documents referenced by the matching anchor texts may be grouped to disambiguate multiple classifications that result from matching the anchor texts with the categories of the reference documents. To resolve ambiguity between multiple ...
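The lookup chain described here (tag, then matching anchor texts, then referenced documents, then categories) can be sketched as follows. The abstract is truncated before it says how ambiguity is resolved, so the majority vote below is purely an assumption, as are the index contents and the names `ANCHOR_INDEX`, `DOC_CATEGORY` and `classify_tag`.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical anchor text index: anchor text -> documents that the anchors link to.
ANCHOR_INDEX: Dict[str, List[str]] = {
    "jaguar": ["doc_cat_1", "doc_car", "doc_cat_2"],
}

# Hypothetical categories of the referenced (already classified) documents.
DOC_CATEGORY: Dict[str, str] = {
    "doc_cat_1": "animal",
    "doc_cat_2": "animal",
    "doc_car": "vehicle",
}


def classify_tag(tag: str) -> Optional[str]:
    """Classify a tag via the categories of documents its matching anchor texts reference.

    Assumption: ties between categories are not treated specially, and the
    most frequent category wins. The source does not state this rule.
    """
    docs = ANCHOR_INDEX.get(tag.lower(), [])
    votes = Counter(DOC_CATEGORY[doc] for doc in docs if doc in DOC_CATEGORY)
    if not votes:
        return None  # no matching anchor text, so no classification
    return votes.most_common(1)[0][0]


jaguar_class = classify_tag("Jaguar")
unknown_class = classify_tag("zzz")
```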
In this paper we test the hypothesis: "Given a piece of text describing an object or concept, our combined disambiguation method can disambiguate whether it is a place and ground it to a Getty Thesaurus of Geographical Names unique identifier with significantly more accuracy than naive methods." We demonstrate a carefully engineered rule-based place name disambiguation system and give Wikipedia as a worked example with hand-generated ground truth and benchmark tests. This paper outlines our plans to apply the co-occurrence models generated with Wikipedia to solve the problem of disambiguating place names in text using supervised learning techniques.
Image retrieval enables the user to search a database for visually similar images. In these scenarios, the user submits an example that is compared to the images in the database by their low-level characteristics such as colour, texture and shape. While visual similarity is essential for a vast number of applications, there are cases where a user needs to search for semantically similar images. For example, the user might want to find all images depicting bears on a river. This might be quite difficult using only low-level features, but using concept detectors for "bear" and "river" will produce results that are semantically closer to what the user requested. Following this idea, this paper studies a novel paradigm: query by semantic multimedia example. In this setting the user's query is processed at a semantic level: a vector of concept probabilities is inferred for each image and a similarity metric computes the distance between the concept vector of the q...
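As a rough illustration of ranking by concept vectors, the sketch below compares a query image's concept probabilities with those of database images. The abstract does not say which similarity metric the paper uses, so cosine similarity is an assumption here, and the concept order, file names and probabilities are invented.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two concept-probability vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no semantic evidence
    return dot / (norm_a * norm_b)


def rank_by_semantic_example(
    query_vec: Sequence[float], database: Dict[str, Sequence[float]]
) -> List[str]:
    """Return image names ordered from most to least semantically similar."""
    return sorted(
        database,
        key=lambda name: cosine_similarity(query_vec, database[name]),
        reverse=True,
    )


# Hypothetical detector outputs; vector positions are (bear, river, car).
database = {
    "bear_in_river.jpg": [0.9, 0.8, 0.05],
    "city_street.jpg": [0.05, 0.1, 0.9],
    "river_only.jpg": [0.1, 0.9, 0.05],
}
query = [0.8, 0.9, 0.1]  # concept vector inferred for the user's example image

ranking = rank_by_semantic_example(query, database)
```

In this toy data the image with both a bear and a river ranks first and the street scene last, even though no low-level visual feature is compared.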
In this paper we describe the geographic information retrieval system developed by the Multimedia & Information Systems team for GeoCLEF 2006 and the results achieved. We detail our methods for generating and applying co-occurrence models for the purpose of place name disambiguation, our use of named entity recognition tools and text indexing applications. The presented system is split into two stages: a batch text & geographic indexer and a real-time query engine. The query engine takes manually crafted queries where the text component is separated from the geographic component. Two monolingual runs were submitted for the GeoCLEF evaluation: the first was constructed from the title and description, and the second also included the narrative. We explain in detail our use of co-occurrence models for place name disambiguation using a model generated from Wikipedia. The paper concludes with a full description of future work and ways in which the system could be optimised.
This year, ImageCLEF2007 data provided multiple sources of evidence that can be explored in many different ways. In this paper we describe an information retrieval framework that combines image, text and geographic data. Text analysis implements the vector space model based on non-geographic terms. Geographic analysis implements a placename disambiguation method and placenames are indexed by their Getty TGN Unique Id. Image analysis implements a query by semantic example model. The paper concludes with an analysis of our results. Finally we identify the weaknesses in our approach and ways in which the system could be optimised and improved.
Proceedings of the 4th ACM workshop on …, Nov 9, 2007
In this paper we describe the development of a geographic co-occurrence model and how it can be applied to geographic information retrieval. The model consists of mining co-occurrences of placenames from Wikipedia, and then mapping these placenames to locations in the Getty Thesaurus of Geographical Names. We begin by quantifying the accuracy of our model and compute theoretical bounds for the accuracy achievable when applied to placename disambiguation in free text. We conclude with a discussion of the ...
Wikipedia is the largest encyclopedia mankind has ever known. It contains over 10 million articles across 250 languages and is now the 9th most visited site on the Internet. Wikipedia has led the way for user-generated-content sites such as Flickr and YouTube. In this talk, Simon will present his work on mining location and temporal references from Wikipedia, and will show that despite its best efforts at neutrality, Wikipedia still reflects the cultural biases of its contributors. By analysing different language versions of Wikipedia we can show how different locations and events have significance to different peoples. The talk will conclude with a summary of the applications of the work to Information Retrieval, Computer Science and beyond.
The motivation behind developing such a tool is to improve performance on Geographic Information Retrieval problems such as placename disambiguation (if “Sheffield” appears in text, which Sheffield is it?) and geographic relevance (if “Sheffield” appears in a query, are “Yorkshire”, “Manchester” or “Derby” relevant?). The talk will cover the development of a geographic co-occurrence model mined from Wikipedia and similar user-generated content. The co-occurrence model is similar to a language model but contains only geographic entities. The accuracy and clarity of the co-occurrence model are also quantified. The talk will begin with a description of how Wikipedia can be mined for named-entity associations and of the area of Geographic Information Retrieval, followed by details of the co-occurrence model and its application. The talk will conclude with future directions and the application of the described techniques to the CLEF corpora.
My presentation will cover the evaluation of large-scale co-occurrence models for disambiguation. The data for the models is mined from Wikipedia and applied to the GeoCLEF corpus. The mining and application parts of the system are entirely independent to avoid bias. The specific problem I am applying co-occurrence models to is place name disambiguation (for example, when “London” is referred to in text, is it “London, UK” or “London, Ontario”?). The motivation behind this problem is to make un-annotated data machine readable and allow users to query and browse data geographically. With the recent introduction of the geographic track to the Cross Language Evaluation Forum there is now a standardised way to test Geographic Information Systems.
I have evaluated three approaches to applying co-occurrence to place name disambiguation:
1. Assign a co-occurrence index to place triplets.
2. Infer co-occurrence classifiers from the ground truth.
3. Represent the places occurring in the training data as vectors in a high dimensional space.
The talk will begin with a description of place name disambiguation techniques and the use of Wikipedia as a corpus. This will be followed by a description of my probabilistic models, using first and higher orders of co-occurrence. The talk will conclude with my intended future work: expansion beyond just place names to looking at all named entities.
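To make the idea of first-order co-occurrence concrete, here is a minimal sketch of disambiguating “London” from the other placenames in the same text. It is not one of the three approaches evaluated in the talk: the scoring rule (sum of co-occurrence counts), the counts themselves and the names `COOCCURRENCE`, `CANDIDATES` and `disambiguate` are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical counts of how often a candidate location co-occurs with another
# placename in the mined corpus: (candidate location, context placename) -> count.
COOCCURRENCE = Counter({
    ("London, UK", "Thames"): 120,
    ("London, UK", "Ontario"): 3,
    ("London, Ontario", "Ontario"): 80,
    ("London, Ontario", "Thames"): 10,
})

# Hypothetical gazetteer: candidates are listed with the default sense first.
CANDIDATES = {"London": ["London, UK", "London, Ontario"]}


def disambiguate(placename: str, context: List[str]) -> Optional[str]:
    """Pick the candidate location with the most co-occurrence evidence in context.

    With no evidence at all, max() returns the first candidate, so the result
    falls back to the default sense.
    """
    candidates = CANDIDATES.get(placename, [])
    if not candidates:
        return None  # placename not in the gazetteer
    scores = {
        cand: sum(COOCCURRENCE[(cand, other)] for other in context)
        for cand in candidates
    }
    return max(candidates, key=lambda cand: scores[cand])


in_canada = disambiguate("London", ["Ontario", "Toronto"])
in_uk = disambiguate("London", ["Thames"])
default_sense = disambiguate("London", [])
```

Higher orders of co-occurrence would extend this by also counting placenames that co-occur with the context placenames, rather than only direct pairs.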