Udo Kruschwitz | University of Essex
Books by Udo Kruschwitz
Collections of digital documents can nowadays be found everywhere in institutions, universities or companies. Examples are Web sites or intranets. But searching them for information can still be painful. Searches often return either large numbers of matches or no suitable matches at all.
Such document collections can vary a lot in size and how much structure they carry. What they have in common is that they typically do have some structure and that they cover a limited range of topics. The second point is significantly different from documents on the Web in general.
The type of search system that we propose in this book can suggest ways of refining or relaxing the query to assist a user in the search process. In order to suggest sensible query modifications we would need to know what the documents are about. Explicit knowledge about the document collection encoded in some electronic form is what we need. However, typically such knowledge is not available.
This book describes how that knowledge can be constructed automatically.
This book
* demonstrates how document markup structure can be used to construct domain models for collections of partially structured documents
* shows how such knowledge can be utilized when searching the document collections
* presents two implemented search systems which demonstrate the usefulness of this approach.
Papers by Udo Kruschwitz
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
Search engines have become much more interactive in recent years, which has triggered a lot of work in automatically acquiring knowledge structures that can assist a user in navigating through a document collection. Query log analysis has emerged as one of the most promising research areas for deriving such structures automatically. We explore a biologically inspired model based on ant colony optimisation applied to query logs as an adaptive learning process that addresses the problem of deriving query suggestions.
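The adaptive learning idea in this abstract can be illustrated with a minimal sketch. It is a generic ant-colony-style update, not the model from the paper: the function names, the `deposit` and `evaporation` parameters, and the sample sessions are all assumptions made for illustration. Each batch of logged sessions first evaporates existing trails and then reinforces the query-to-refinement pairs that users actually followed.

```python
from collections import defaultdict


def update_pheromone(pheromone, sessions, deposit=1.0, evaporation=0.1):
    """One learning step: evaporate all trails, then reinforce observed ones.

    `deposit` and `evaporation` are hypothetical parameters for this sketch.
    """
    for edge in list(pheromone):
        pheromone[edge] *= (1.0 - evaporation)
    for session in sessions:
        # Each pair of consecutive queries in a session is treated as a trail.
        for first, second in zip(session, session[1:]):
            pheromone[(first, second)] += deposit
    return pheromone


def suggestions(pheromone, query, k=3):
    """Return up to k refinements of `query`, strongest trail first."""
    ranked = sorted(
        ((weight, dst) for (src, dst), weight in pheromone.items() if src == query),
        reverse=True,
    )
    return [dst for _, dst in ranked[:k]]


# Hypothetical query log, split into two batches (e.g. two days).
pheromone = defaultdict(float)
day1 = [["timetable", "exam timetable"], ["timetable", "bus timetable"]]
day2 = [["timetable", "exam timetable"]]
update_pheromone(pheromone, day1)
update_pheromone(pheromone, day2)
print(suggestions(pheromone, "timetable"))
```

Because trails that are not reinforced decay with every batch, suggestions in this sketch drift towards refinements that users keep choosing.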
The Ypa project (De Roeck et al., 1998) is building a system to make the information in classified directories more accessible. BT's Yellow Pages provides an example of a classified database for which this work would be useful. Accessibility in this context means allowing users (or call center operators) to query the Yellow Pages system using Natural Language queries. For this to be possible there must be some method in the Ypa for converting these queries into some form which allows the database to be queried.
The automatic acquisition of usable domain knowledge is a challenging issue. Such knowledge can be employed to assist a user in searching a document collection. This can be done by suggesting query modification options based on the knowledge uncovered by analyzing the document collection. We acquire such knowledge by simply exploiting the documents' markup structure. This gives us a domain model tailored to the particular collection. But how good is such a model?
Search in intranets and other collections of electronically available documents is the focus of this research prototype. We will demonstrate UKSearch, a search system that incorporates term hierarchies derived from markup structure. The term hierarchies, which form our domain model, are automatically constructed for the entire document collection and then applied to assist a user in the search process. Details can be found in [1].
Abstract. This paper summarizes the scientific work presented at the 32nd European Conference on Information Retrieval. It demonstrates that information retrieval (IR) as a research area continues to thrive, with progress being made in three complementary sub-fields: IR theory and formal methods together with indexing and query representation issues, Web IR as a primary application area, and research into evaluation methods and metrics.
Abstract. Named entities (NEs) are textual references via proper names, such as people's names, company names, places and so on. The importance of NEs has been observed in intranet search engines, including university web sites. In this paper, a mechanism is built specifically to recognize the three types of named entity that are frequently referenced in the University of Essex domain: names, course codes, and room numbers.
A method and apparatus for generating an index entry for a record in a semi-structured database involves analysing each field to identify an entry within each field and to identify a sequence of characters having a format corresponding to a predetermined format. Thereafter, the method and apparatus operate to generate an index entry for the identified entry, and for at least one field, define any characters not identified as an entry as a free text entry.
Abstract. Attachment prediction is the task of automatically identifying email messages that should contain an attachment. This can be useful to tackle the problem of sending out emails but forgetting to include the relevant attachment (something that happens all too often). A common Information Retrieval (IR) approach in analyzing documents such as emails is to treat the entire document as a bag of words. Here we propose a finer-grained analysis to address the problem.
Abstract. This paper explores the use of implicit user feedback in adapting the underlying domain model of an intranet search system. The domain model, a Formal Concept Analysis (FCA) lattice, is used as an interactive interface to allow user exploration of the context of an intranet query. Implicit user feedback is harnessed here to surmount the difficulty of achieving optimum document descriptors, essential for a browsable lattice.
Abstract. Modern Web search engines access large parts of the publicly indexable Web. Relevant sites can be found easily thanks to advanced techniques such as Google's PageRank algorithm. However, a common problem remains the large number of matching documents being returned even for fairly specific queries. The same problem can be observed in domains that are more limited like intranets or local Web sites.
Recently, web collaboration (also known as crowd sourcing) has started to emerge as a viable alternative for building the large resources that are needed to build and evaluate NLP systems. In this spirit, the Anawiki project (http://anawiki.essex.ac.uk/) [8] aimed at experimenting with Web collaboration and human computation as a solution to the problem of creating large-scale linguistically annotated corpora.
The increased availability of large amounts of data about user search behaviour in search engines has triggered a lot of research in recent years. This includes developing machine learning methods to build knowledge structures that could be exploited for a number of tasks such as query recommendation. Query flow graphs are a successful example of these structures; they are generated from the sequence of queries typed in by a user in a search session.
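As a rough illustration of the structure described in this abstract, the sketch below builds a simple query flow graph from search sessions and uses it for query recommendation. It assumes each session is an ordered list of query strings and weights an edge by how often one query directly follows another; the function names and sample sessions are hypothetical, and the weighting used in the paper may differ.

```python
from collections import Counter


def build_query_flow_graph(sessions):
    """Count how often one query directly follows another within a session."""
    edge_counts = Counter()
    for session in sessions:
        for current, following in zip(session, session[1:]):
            if current != following:  # ignore immediate repeats of the same query
                edge_counts[(current, following)] += 1
    return edge_counts


def suggest(edge_counts, query, k=3):
    """Return up to k follow-up queries for `query`, most frequent first."""
    followers = Counter()
    for (src, dst), count in edge_counts.items():
        if src == query:
            followers[dst] += count
    return [q for q, _ in followers.most_common(k)]


# Hypothetical sessions, each an ordered list of queries from one user.
sessions = [
    ["library", "library opening hours"],
    ["library", "library opening hours", "library map"],
    ["library", "library map"],
    ["exam timetable", "exam results"],
]
graph = build_query_flow_graph(sessions)
print(suggest(graph, "library"))  # ['library opening hours', 'library map']
```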
Access to lexical databases is often restricted to SQL-like queries. However, the field structure of the entries recorded in the TELEX database suggests implementing "more intelligent" selection options. The "more-specific relation" ("spezieller-Relation") presented here allows more intuitive work with the database entries and at the same time forms the basis for defining an intersection operation over two entries.
In this paper we explore clustering for multi-document Arabic summarisation. For our evaluation we use an Arabic version of the DUC-2002 dataset that we previously generated using Google Translate. We explore how clustering (at the sentence level) can be applied to multi-document summarisation as well as for redundancy elimination within this process. We use different parameter settings including the cluster size and the selection model applied in the extractive summarisation process.
Abstract: Since the publication of the Government White Paper 'Valuing People: a new strategy for learning disability for the 21st century', the responsibility for providing health care for people with learning disabilities has shifted rapidly to primary care. However, people with learning disabilities are supported by a disparate group of providers, from health care through local authorities to the voluntary sector, with resultant difficulties in providing seamless care.