Information Retrieval System Research Papers
A common class of existing information retrieval systems provides access to abstracts. For example, Stanford University, through its FOLIO system, provides access to the INSPEC database of abstracts of the literature on physics, computer science, electrical engineering, etc. In this paper this database is studied using a trace-driven simulation. We focus on physical index design, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Database scaling is explored in two ways. One way assumes an "optimal" configuration for a single host and then linearly scales the database by duplicating the host architecture as needed. The second way determines the optimal number of hosts given a fixed database size.
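As a rough illustration of the inverted-index caching the simulation studies, here is a minimal sketch (the toy index and cache capacity are invented for illustration; nothing here comes from FOLIO or INSPEC): an LRU cache sits in front of posting-list lookups, standing in for the memory/disk trade-off the paper measures.

```python
from collections import OrderedDict

class PostingCache:
    """LRU cache for posting lists, a stand-in for inverted-index
    caching (capacity counted in cached lists, not bytes)."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, term, index):
        if term in self.cache:
            self.cache.move_to_end(term)      # mark as recently used
            return self.cache[term]
        postings = index.get(term, [])        # simulated disk fetch
        self.cache[term] = postings
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return postings

# Toy index: term -> sorted document ids.
index = {"physics": [1, 4, 9], "computer": [2, 4], "engineering": [4, 7]}
cache = PostingCache(capacity=2)
for term in ["physics", "computer", "physics", "engineering"]:
    print(term, cache.get(term, index))
```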
For more than 20 years, many countries have been trying to set up a standardized, centralized medical record (at the regional or national level). Most have not reached this goal, essentially due to two main difficulties: patient identification and the standardization of medical records. We propose here the non-centralized management of medical records, relying on a specific procedure that gives the patient access to his distributed medical data wherever it is located. The originality of this procedure relies on new advances in technology, which make it possible to envisage access to medical records anywhere and anytime, thanks to Grid and watermarking methodologies. Of course, all existing standardised information could be more easily centralised. As a consequence, a mixed system (decentralised for unstructured data and centralised for already structured data) could be proposed.
only be answered by combining information from various articles. In this paper, a new algorithm is proposed for finding associations between related concepts present in literature. To this end, concepts are mapped to a multi-dimensional space by a Hebbian type of learning algorithm using co-occurrence data as input. The resulting concept space allows exploration of the neighborhood of a concept and finding potentially novel relationships between concepts. The obtained information retrieval system is useful for finding literature supporting hypotheses and for discovering hitherto unknown relationships between concepts. Tests on artificial data show the potential of the proposed methodology. In addition, preliminary tests on a set of Medline abstracts yield promising results.
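The abstract does not spell out its exact update rule, but a generic Hebbian-style co-occurrence embedding can be sketched as follows (the data, learning rate, dimensionality, and update are illustrative assumptions, not the paper's algorithm): vectors of co-occurring concepts are repeatedly pulled toward each other, so related concepts end up in the same neighbourhood of the space.

```python
import numpy as np

rng = np.random.default_rng(0)

def hebbian_embed(n_concepts, cooccurrences, dim=2, lr=0.1, epochs=50):
    """Map concepts to a low-dimensional space so that concepts that
    co-occur end up close together (a generic Hebbian-style rule)."""
    vecs = rng.normal(size=(n_concepts, dim))
    for _ in range(epochs):
        for i, j in cooccurrences:
            # Move each vector a little toward its co-occurring partner.
            vecs[i] += lr * (vecs[j] - vecs[i])
            vecs[j] += lr * (vecs[i] - vecs[j])
    return vecs

# Concepts 0-1 and 2-3 co-occur; expect two tight neighbourhoods.
vecs = hebbian_embed(4, [(0, 1), (2, 3)] * 10)
print(vecs)
```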
Information retrieval technology has been central to the success of the Web. For semantic web documents or annotations to have an impact, they will have to be compatible with Web based indexing and retrieval technology. We discuss some of the underlying problems and issues central to extending information retrieval systems to handle annotations in semantic web languages. We also describe three prototype systems that we have implemented to explore these ideas.
In this paper, we describe a knowledge management framework that addresses the needs of multimedia analysis projects and provides a basis for information retrieval systems. The framework uses Semantic Web technologies to provide a shared knowledge environment, and active Knowledge Machines, wrapping multimedia processing tools, to exploit and/or export knowledge to this environment. This framework is able to handle a wide range of use cases, from an enhanced workspace for researchers to end-user information access. As an illustration of how the proposed framework can be used, we present a case study of music analysis.
This paper provides an overview of the ePaper project. The project aims to provide an end-to-end solution for the future mobile personalized newspaper. The ePaper aggregates content (i.e., news items) from various news providers and delivers personalized newspapers on dedicated mobile, electronic newspaper-like devices. The ePaper can provide each subscribed user with a personalized newspaper, according to the user's preferences, as well as a "standard edition" of a selected newspaper. The layout of the newspaper is adapted to the device's specifications and the user's preferences. The ePaper is expected to change the reading experience of newspapers and magazines, coupling an innovative paper-like display with novel personalization algorithms, an intuitive interface, and new methods of adapting content to the device.
The Research Triangle Park (RTP) Particulate Matter (PM) Panel Study represented a 1-year investigation of personal, residential and ambient PM mass concentrations across distances as large as 70 km in central North Carolina. One of the primary goals of this effort was to estimate ambient PM2.5 contributions to personal and indoor residential PM mass concentrations. Analyses indicated that data from the two distinct non-smoking subject populations totaling 38 individuals and 37 residences could be pooled. This resulted in nearly 800 data points for each variable. A total of 55 measurements believed to have been potentially influenced by personal or residential exposure to passive environmental tobacco smoke were not included in the analysis database. Variables to be examined included C_ig (concentration of indoor-generated PM), E_ig (personal exposure to indoor-generated PM), F_inf (ambient PM infiltration factor), and F_pex (personal exposure to PM of ambient origin factor). Daily air exchange rates (AER) were measured and statistical modeling to derive estimates of particle penetration (P) and particle deposition (k) factors was performed. Seasonality, cohort grouping, participant, or combinations of these variables were determined not to be significant influences in estimating group infiltration factors. The mean (±std) mixed model slope estimates were AER = 0.72±0.63, P = 0.72±0.21, k = 0.42±0.19, and F_inf = 0.45±0.21. These variables were then used in a number of mixed effects models having varying features of single, random or fixed intercepts and/or slopes to determine the most appropriate means of estimating ambient source contributions to personal and residential settings. The mixed model slope for F_pex (±SE) was 0.47±0.07 using the model with the highest degree of fit.
This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n-record file are: insertion, O(log n); deletion of the root, O(n^((k-1)/k)); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n). Search algorithms are given for partial match queries with t keys specified [proven maximum running time of O(n^((k-t)/k))] and for nearest neighbor queries [empirically observed average running time of O(log n)]. These performances far surpass the b...
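A minimal k-d tree sketch with insertion and branch-and-bound nearest-neighbour search, matching the structure described above (an independent re-implementation for illustration, not the paper's code):

```python
import math

class Node:
    def __init__(self, point):
        self.point, self.left, self.right = point, None, None

def insert(root, point, depth=0, k=2):
    """Insert by cycling through the k coordinates level by level."""
    if root is None:
        return Node(point)
    axis = depth % k
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1, k)
    else:
        root.right = insert(root.right, point, depth + 1, k)
    return root

def nearest(root, target, depth=0, k=2, best=None):
    """Descend toward the target, then check the far subtree only if
    the splitting plane is closer than the best match found so far."""
    if root is None:
        return best
    if best is None or math.dist(root.point, target) < math.dist(best, target):
        best = root.point
    axis = depth % k
    near, far = ((root.left, root.right) if target[axis] < root.point[axis]
                 else (root.right, root.left))
    best = nearest(near, target, depth + 1, k, best)
    if abs(target[axis] - root.point[axis]) < math.dist(best, target):
        best = nearest(far, target, depth + 1, k, best)
    return best

root = None
for p in [(3, 6), (17, 15), (13, 15), (6, 12), (9, 1), (2, 7), (10, 19)]:
    root = insert(root, p)
print(nearest(root, (9, 2)))   # -> (9, 1)
```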
This paper introduces an interactive video system and its architecture where several systems cooperate to manage the services of interactive video. Each system is specialized according to the data it handles and the functionality it performs. A system can be a database (for billing purposes) or just a video store system (to store the video data) lacking the typical features of a database or an information retrieval system to support indexing and querying of video data. Because quality of service is an important requirement for whole ...
An approach to managing the architecture of large software systems is presented. Dependencies are extracted from the code by a conventional static analysis, and shown in a tabular form known as the 'Dependency Structure Matrix' (DSM). A variety of algorithms are available to help organize the matrix in a form that reflects the architecture and highlights patterns and problematic dependencies. A hierarchical structure obtained in part by such algorithms, and in part by input from the user, then becomes the basis for 'design rules' that capture the architect's intent about which dependencies are acceptable. The design rules are applied repeatedly as the system evolves, to identify violations, and keep the code and its architecture in conformance with one another. The analysis has been implemented in a tool called LDM which has been applied in several commercial projects; in this paper, a case study application to Haystack, an information retrieval system, is described.
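A small sketch of the DSM idea, with invented module names and an invented layering rule (not the LDM tool's actual rules or the Haystack code base): extracted dependencies become a Boolean matrix, and a design rule is checked against the raw dependency list.

```python
# Dependencies as (source, target) pairs, e.g. extracted by static analysis.
deps = {("ui", "core"), ("core", "store"), ("store", "core"), ("ui", "store")}
modules = ["ui", "core", "store"]

# Dependency Structure Matrix: row depends on column.
dsm = [[int((r, c) in deps) for c in modules] for r in modules]
for name, row in zip(modules, dsm):
    print(f"{name:>6} {row}")

# Design rule (illustrative): lower layers must not depend on higher ones.
layers = {"ui": 2, "core": 1, "store": 0}
violations = [(s, t) for s, t in deps if layers[s] < layers[t]]
print("violations:", violations)   # ('store', 'core') breaks the layering
```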
An algorithm, one that is economical and fast, for generating the convex polytope of a set S of points lying in an n-dimensional Euclidean space E^n is described. In the existing brute-force method for determining the convex hull of a set of points lying in a two-dimensional space, one computes all possible straight lines joining each pair of points of S and tests whether the lines bound the given set S. This method can easily be generalized for computing the convex hull of a set S ⊂ E^n, n > 2. However, it turns out that this approach is not feasible, due to excessive computer run time, for a set of points lying in E^n when n > 3. The algorithm described in this paper avoids all the unnecessary calculations, and the convex polytope of a set S ⊂ E^n is generated by systematically computing the faces from the edges of the desired convex polytope. A numerical comparison indicates that this new approach is far superior to the existing brute force technique.
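For the two-dimensional case, the brute-force test described above can be sketched as follows (an O(n³) illustration of the baseline, not the paper's improved algorithm): a segment is a hull edge exactly when all other points lie on one side of it.

```python
from itertools import combinations

def brute_force_hull(points):
    """A segment (p, q) is a hull edge iff all remaining points lie on
    one side of the line through p and q (cross-product sign test)."""
    hull_edges = []
    for p, q in combinations(points, 2):
        sides = [(q[0]-p[0]) * (r[1]-p[1]) - (q[1]-p[1]) * (r[0]-p[0])
                 for r in points if r not in (p, q)]
        if all(s >= 0 for s in sides) or all(s <= 0 for s in sides):
            hull_edges.append((p, q))
    return hull_edges

pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
print(brute_force_hull(pts))   # the four sides of the square
```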
In this article, we propose to exploit semantic links between concepts to improve information retrieval. A general-language electronic thesaurus is used to reformulate user queries through a "cautious expansion" process upstream of a search engine. This process, transparent to the user, first exploits the notion of multiterm concepts to disambiguate the query words. It then relies on the semantic relations between concepts to broaden the query. Together, this leads to a significant improvement in the relevance of the answers returned by the engine. The technique was evaluated using the Mercure engine developed at IRIT, WordNet as the lexical database, and CLEF 2001 as the test collection.
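A minimal sketch of cautious expansion using WordNet through NLTK (requires the wordnet corpus via `nltk.download('wordnet')`). The paper's pipeline with Mercure and multiterm disambiguation is richer than this; the restriction to the first sense and two synonyms below is only an illustrative guard against topic drift.

```python
from nltk.corpus import wordnet as wn

def cautious_expand(query, max_terms=2):
    """Add at most `max_terms` synonyms per word, and only from the
    first (most frequent) sense, to limit topic drift."""
    expanded = []
    for word in query.split():
        expanded.append(word)
        senses = wn.synsets(word)
        if senses:
            synonyms = [l.replace("_", " ") for l in senses[0].lemma_names()
                        if l.lower() != word.lower()]
            expanded.extend(synonyms[:max_terms])
    return " ".join(expanded)

print(cautious_expand("car insurance"))
```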
Advances in the media and entertainment industries, for example streaming audio and digital TV, present new challenges for managing large audio-visual collections. Efficient and effective retrieval from large content collections forms an important component of the business models for content holders, and this is driving a need for research in audio-visual search and retrieval. Current content management systems support retrieval using low-level features, such as motion, colour, texture, beat and loudness. However, low-level features often have little meaning for the human users of these systems, who much prefer to identify content using high-level semantic descriptions or concepts. This creates a gap between the system and the user that must be bridged for these systems to be used effectively. The research presented in this paper describes our approach to bridging this gap in a specific content domain, sports video. Our approach is based on a number of automatic techniques for feature detection used in combination with heuristic rules determined through manual observations of sports footage. This has led to a set of models for interesting sporting events (goal segments) that have been implemented as part of an information retrieval system. The paper also presents results comparing the output of the system against manually identified goals.
Text Information Retrieval (TIR) is considered the heart of many applications, such as Document Management Systems (DMS). TIR used for a DMS requires different data structure techniques than those used in a search engine. A search engine requires special hardware (supercomputers with large memory) to run information retrieval algorithms. In this paper, a new approach is developed to make it easy
The information world is rich in documents in different formats and applications, such as databases, digital libraries, and the Web. Text classification is used to aid the search functionality offered by search engines and information retrieval systems in dealing with the large number of documents on the web. Much research conducted within the field of text classification has been applied to English, Dutch, Chinese, and other languages, whereas less has been applied to Arabic. This paper addresses the automatic classification of Arabic text documents. It applies text classification to Arabic-language text documents using stemming as part of the preprocessing steps. Results showed that, when text classification was applied without stemming, the support vector machine (SVM) classifier achieved the highest classification accuracy on the two test modes, 87.79% and 88.54%. On the other hand, stemming negatively affected the accuracy: the SVM accuracy on the two test modes dropped to 84.49% and 86.35%.
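A minimal sketch of an SVM text-classification pipeline of the kind evaluated here, using scikit-learn; the four documents, the labels, and the absence of a stemmer are placeholders, not the paper's corpus or setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy Arabic documents with two invented classes.
docs = ["الاقتصاد ينمو بسرعة", "الفريق فاز بالمباراة",
        "البنك يرفع الفائدة", "اللاعب سجل هدفا"]
labels = ["economy", "sport", "economy", "sport"]

# TF-IDF features feeding a linear SVM, with no stemming applied.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["البنك خفض الفائدة"]))   # likely 'economy', by shared vocabulary
```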
Retrieval of relevant documents from a collection is a tedious task. As genetic algorithms (GA) are robust and efficient search and optimization techniques, they can be used to search the huge document search space. In this paper, a general framework for an information retrieval system is discussed. The applicability of genetic algorithms in the field of information retrieval is also
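A toy genetic algorithm for IR can be sketched as follows: individuals are query term weights, and fitness is precision among the top-ranked documents. Every document, operator, and parameter below is invented for illustration; the paper's framework is not reproduced.

```python
import random
random.seed(0)

DOCS = [{"ir": 1, "genetic": 0}, {"ir": 1, "genetic": 1}, {"ir": 0, "genetic": 1}]
RELEVANT = {1, 2}          # indices of the known relevant documents
TERMS = ["ir", "genetic"]

def fitness(weights):
    scores = [sum(w * d[t] for w, t in zip(weights, TERMS)) for d in DOCS]
    ranked = sorted(range(len(DOCS)), key=lambda i: -scores[i])
    # Precision at 2: fraction of the top two documents that are relevant.
    return len(set(ranked[:2]) & RELEVANT) / 2

pop = [[random.random() for _ in TERMS] for _ in range(10)]
for _ in range(20):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:5]                                  # elitist selection
    children = []
    for _ in range(5):
        a, b = random.sample(parents, 2)
        child = [(x + y) / 2 for x, y in zip(a, b)]    # averaging crossover
        if random.random() < 0.3:                      # random-reset mutation
            child[random.randrange(len(child))] = random.random()
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print(dict(zip(TERMS, best)), fitness(best))
```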
As we seek both to improve public school education in high technology areas and to link libraries and classrooms on the "information superhighway," we need to understand more about children's information searching abilities. We present results of four experiments conducted on four versions of the Science Library Catalog (SLC), a Dewey decimal-based hierarchical browsing system implemented in HyperCard without a keyboard. The experiments were conducted over a 3-year period at three sites, with four databases, and with comparisons to two different keyword online catalogs. Subjects were ethnically and culturally diverse children aged 9 through 12, with 32 to 34 children participating in each experiment. Children were provided explicit instruction and reference materials for the keyword systems but not for the SLC. The number of search topics matched was comparable across all systems and all experiments; search times were comparable, though they varied among the four SLC versions and between the two keyword online public access catalogs (OPACs). The SLC overall was robust to differences in age, sex, and computer experience. One of the keyword OPACs was subject to minor effects of age and computer experience; the other was not. We found relationships between search topic and system structure, such that the most difficult topics on the SLC were those hard to locate in the hierarchy, and those most difficult on the keyword OPACs were hard to spell or required children to generate their own search terms. The SLC approach overcomes problems with several searching features that are difficult for children in typical keyword OPAC systems: typing skills, spelling, vocabulary, and Boolean logic. Results have general implications for the design of information retrieval systems for children.
Most indexing models are based on simple independent words, also known as keywords. This approach takes account neither of the context nor of the relations between words; therefore, the precision of the system is limited. In this article, we present a structured indexing model based on noun phrases to increase the precision of an Information Retrieval System (IRS). In this model, we used a grammatical parser to extract and structure noun phrases, determining the various roles of the words of a noun phrase and their syntactic relations. We represent the set of index terms of a query as a Bayesian network, which enables us to calculate the matching function between a query and a document. We carried out experiments to test this model; the positive results obtained encourage us to continue in this direction.
This paper proposes the use of two information retrieval system models: the Boolean information retrieval model and the extended Boolean (fuzzy) information retrieval model. These models differ in using Boolean queries or fuzzy weighted queries. It also proposes a way of optimizing the user query for the two models using genetic programming and fuzzy logic, and proposes using a larger set of Boolean operators (AND, OR, XOR, OF, and NOT) instead of the standard Boolean operators (AND, OR, and NOT), with weights for Boolean operators and for terms in the fuzzy model.
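A minimal sketch of extended-Boolean (fuzzy) query evaluation with the classic min/max/complement operators; the paper's weighted operators and its OF operator go beyond this.

```python
# Term weights in [0, 1] combined with fuzzy-set operators.
def AND(*xs): return min(xs)
def OR(*xs):  return max(xs)
def NOT(x):   return 1.0 - x

# Per-document fuzzy term weights (e.g. normalised tf-idf).
doc = {"retrieval": 0.9, "fuzzy": 0.4, "boolean": 0.7}

# Query: retrieval AND (fuzzy OR boolean) AND NOT noise
score = AND(doc["retrieval"],
            OR(doc["fuzzy"], doc["boolean"]),
            NOT(doc.get("noise", 0.0)))
print(score)   # min(0.9, max(0.4, 0.7), 1.0) = 0.7
```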
The World Wide Web consists of more than 50 billion pages online. It is highly dynamic [6], i.e., the web continuously introduces new capabilities and attracts many people. Due to this explosion in size, an effective information retrieval system or search engine is needed to access the information. In this paper we propose the EPOW (Effective Performance of WebCrawler) architecture. It is a software agent whose main objective is to minimize the overload on a user of locating needed information. We have designed the web crawler with the parallelization policy in mind. Since our EPOW crawler is a highly optimized system, it can download a large number of pages per second while being robust against crashes. We also propose to use data structure concepts in the implementation of the scheduler and a circular queue to improve the performance of our web crawler.
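A minimal sketch of a parallelized crawler in the spirit described: worker threads share a frontier queue managed by a scheduler. Fetching is stubbed out, and none of EPOW's actual policies are reproduced.

```python
import queue, threading

frontier = queue.Queue()          # the scheduler's work queue
seen = set()
seen_lock = threading.Lock()

def fetch(url):
    """Stub: a real crawler would download the page and extract links."""
    return [f"{url}/child{i}" for i in range(2)] if url.count("/") < 3 else []

def worker():
    while True:
        try:
            url = frontier.get(timeout=1)
        except queue.Empty:
            return                # frontier drained: worker exits
        for link in fetch(url):
            with seen_lock:
                if link not in seen:
                    seen.add(link)
                    frontier.put(link)
        frontier.task_done()

frontier.put("http://example.org")
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(seen), "pages discovered")
```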
An alternative way to tackle Information Retrieval, called Passage Retrieval, considers text fragments independently rather than assessing the global relevance of documents. In such a context, the fact that relevant information is surrounded by parts of text deviating from the interesting topic does not penalize the document. In this paper, we propose to study the impact of considering these text fragments on a document clustering process. The use of clustering in the field of Information Retrieval is mainly supported by the cluster hypothesis, which states that relevant documents tend to be more similar to each other than to non-relevant documents, and hence a clustering process is likely to gather them. Previous experiments have shown that clustering the first documents retrieved in response to a user's query allows Information Retrieval systems to improve their effectiveness. In the clustering process used in these studies, documents were considered globally. Nevertheless, the assumption that a document can refer to more than one topic/concept may also have an impact on the document clustering process. Considering passages of the retrieved documents separately may allow the creation of clusters more representative of the addressed topics. Different approaches have been assessed, and results show that using text fragments in the clustering process may turn out to be actually relevant.
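A minimal sketch of passage-level clustering with scikit-learn (fixed-size word windows, TF-IDF, k-means); the window size, cluster count, and toy documents are illustrative assumptions, not the paper's setup.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def passages(doc, size=30):
    """Split a document into fixed-size word windows."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

docs = [
    "neural ranking models improve retrieval " * 10 + "football season results " * 10,
    "the championship match ended in a draw " * 10,
]

# Cluster the passages, not the whole documents.
frags = [(d, p) for d, doc in enumerate(docs) for p in passages(doc)]
frag_texts = [p for _, p in frags]
vecs = TfidfVectorizer().fit_transform(frag_texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs)

# A document joins every cluster one of its passages falls in.
print(list(zip([d for d, _ in frags], labels)))
```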
This project proposed a system design for retrieving Quranic texts and any knowledge derived from or citing al-Quran. The objectives were to survey websites offering access to Quranic texts, looking at their structure and linkages, and to propose a system design for retrieving Quranic texts. A total of 125 websites offering access to Quranic texts were examined. Findings revealed that the websites offer texts and translation, recitation, excerpts of exegesis, and links to other websites consisting of news, events, and related topics. A standard structure was not implemented by these websites. The proposed system design focuses on texts, translation, recitation, exegesis, al-Hadith, its topics and themes such as stories of the prophets and places mentioned in al-Quran, and a search feature.
In this research we investigate the effect of search engine brand on the evaluation of searching performance. Our research is motivated by the large amount of search traffic directed to a handful of Web search engines, even though many have similar interfaces and performance. We conducted a laboratory experiment with 32 participants using a 4 × 2 factorial design confounded in four blocks to measure the effect of four search engine brands (Google, MSN, Yahoo!, and a locally developed search engine) while controlling for the quality and presentation of search engine results. We found brand indeed played a role in the searching process. Brand effect varied in different domains. Users seemed to place a high degree of trust in major search engine brands; however, they were more engaged in the searching process when using lesser-known search engines. It appears that branding affects overall Web search at four stages: (a) search engine selection, (b) search engine results page evaluation, (c) individual link evaluation, and (d) evaluation of the landing page. We discuss the implications for search engine marketing and the design of empirical studies measuring search engine performance.
An experimental comparison of a large number of different image descriptors for content-based image retrieval is presented. Many of the papers describing new techniques and descriptors for content-based image retrieval describe their newly proposed methods as most appropriate without giving an in-depth comparison with all methods that were proposed earlier. In this paper, we first give an overview of a large variety of features for content-based image retrieval and compare them quantitatively on four different tasks: stock photo retrieval, personal photo collection retrieval, building retrieval, and medical image retrieval. For the experiments, five different, publicly available image databases are used and the retrieval performance of the features is analysed in detail. This allows for a direct comparison of all features considered in this work and furthermore will allow a comparison of newly proposed features to these in the future. Additionally, the correlation of the features is analysed, which opens the way for a simple and intuitive method to find an initial set of suitable features for a new task. The article concludes with recommendations as to which features perform well for which type of data. Interestingly, the often used, but very simple, colour histogram performs well in the comparison and thus can be recommended as a simple baseline for many applications.
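Since the colour histogram is recommended as a simple baseline, here is a minimal numpy sketch of it (per-channel histograms compared with L1 distance; real CBIR systems typically use joint histograms and more careful metrics, and the random images below are placeholders).

```python
import numpy as np

def colour_histogram(img, bins=8):
    """img: H x W x 3 uint8 array; returns a normalised histogram."""
    hist = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    h = np.concatenate(hist).astype(float)
    return h / h.sum()

rng = np.random.default_rng(0)
query = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
collection = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(5)]

# Rank the collection by L1 distance between histograms.
q = colour_histogram(query)
dists = [np.abs(q - colour_histogram(img)).sum() for img in collection]
print("best match:", int(np.argmin(dists)))
```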
CDS/ISIS is an Integrated Storage and Information retrieval System of the United Nations Educational, Scientific and Cultural Organization (UNESCO), which is widely used for managing bibliographical references, ensuring high-quality content. The main purpose of this paper is to present the work recently carried out by the Food and Agriculture Organization of the United Nations (FAO), in collaboration with the Associazione per la documentazione le biblioteche e gli archivi (DBA) in Italy, to make Web CDS/ISIS based applications compliant with the OAI-PMH. After a brief evaluation of some of the existing solutions, the paper describes the methodology chosen and proposes an open source, easily parametrizable plugin tool, which can be adapted to expose metadata from a general-structure CDS/ISIS database using the OAI-PMH protocol. It concludes by expressing the importance and implications of this work for the whole CDS/ISIS community and specifically for the participating centres from the AGRIS network. In addition, this work assures that semantically rich metadata for agricultural science and research publications based on the "AGRIS Application Profile" can be handled by the OAI protocol.
Latent semantic indexing is a variant of the vector space method in which a low-rank approximation to the vector space representation of the database is used. The main idea of the latent semantic indexing model is to map each document and query vector into a lower-dimensional space associated with concepts. This is done by mapping the index term vectors into that lower-dimensional space. The claim is that retrieval in the reduced space may be better than retrieval in the space of index terms. In this paper, in addition to the vector space method, latent semantic indexing is used to build a Malay-language information retrieval system. Keywords: Malay information retrieval, latent semantic indexing, information retrieval, vector space method.
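A minimal LSI sketch with numpy: a truncated SVD of a toy term-document matrix, then folding the query into the concept space. The matrix is invented for illustration; the paper's Malay collection is not used.

```python
import numpy as np

# Toy term-document count matrix (4 terms x 3 documents).
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent "concepts"
Uk, sk = U[:, :k], s[:k]                # rank-k truncation

docs = Vt[:k].T                         # each row: a document in concept space
query = np.array([1, 1, 0, 0], float)   # query uses terms 0 and 1
q = (query @ Uk) / sk                   # fold the query into concept space

sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(sims)                             # cosine similarity per document
```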
The constant improvement of both hardware and software related to mobile computing is enhancing the capabilities of mobile devices. Present-day mobile phones can run rich stand-alone applications as well as distributed client-server applications that access information via a web gateway. This changed environment brings new opportunities as well as constraints for mobile application developers. A move towards open source software offers several advantages for application developers and operating system vendors. The objective of this paper is to demonstrate how voice-enabled mobile applications can be deployed economically using only open source software to access information from the Web. Swar-Suchak is a voice-enabled mobile application for information retrieval in multiple languages. We describe two applications running on Swar-Suchak using the open source Android platform. By linking a mobile phone to a voice gateway, built with open source software, we are able to develop voice-enabled web applications which are accessible ubiquitously by anyone, anytime.
The construction and maintenance of a medical thesaurus is a non-trivial task, due to the inherent complexity of a proper medical terminology. We present a methodology for transaction-based anomaly detection in the process of thesaurus maintenance. Our experiences are based on lexicographic work with the MorphoSaurus lexicons, which are the basis for a mono- and cross-lingual biomedical information retrieval system. Any "edit" or "delete" actions within these lexicons that undo an action defined earlier were defined as anomalous. We identify four types of such anomalies. We also analyzed to what extent the anomalous lexicon entries had been detected by an alternative, corpus-based approach.
Due to the wide usage of mobile phones, many software developers have made use of the device as the platform for their applications. This move is especially crucial when the applications are to be made available anytime, anywhere. In this paper, the development of a mobile application for administering the final examination exercise of a Malaysian private university using mobile phones is presented.
Welcome to the first Twente Data Management Workshop (TDM). We have set ourselves two goals for the workshop series:
This article provides an overview of recent developments relating to the application of thesauri in information organisation and retrieval on the World Wide Web. It describes some recent thesaurus projects undertaken to facilitate resource description and discovery and access to wide-ranging information resources on the Internet. Types of thesauri available on the Web, thesauri integrated in databases and information retrieval systems, and multiple-thesaurus systems for cross-database searching are also discussed. Collective efforts and events in addressing the standardisation and novel applications of thesauri are briefly reviewed.
Automatic summarization has been proposed to help manage the results of biomedical information retrieval systems. Semantic MEDLINE, for example, summarizes semantic predications representing assertions in MEDLINE citations. Results are presented as a graph which maintains links to the original citations. Graphs summarizing more than 500 citations are hard to read and navigate, however. We exploit graph theory for focusing these large graphs. The method is based on degree centrality, which measures connectedness in a graph. Four categories of clinical concepts related to treatment of disease were identified and presented as a summary of input text. A baseline was created using term frequency of occurrence. The system was evaluated on summaries for treatment of five diseases compared to a reference standard produced manually by two physicians. The results showed that recall for system results was 72%, precision was 73%, and F-score was 0.72. The system F-score was considerably higher than that for the baseline (0.47).
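A minimal sketch of degree-centrality filtering on a toy predication graph (the predications are invented, and Semantic MEDLINE's actual condensation method is richer than this): predications are kept only when both their subject and object are well connected.

```python
from collections import Counter

predications = [("aspirin", "TREATS", "headache"),
                ("aspirin", "TREATS", "fever"),
                ("ibuprofen", "TREATS", "headache"),
                ("aspirin", "INTERACTS_WITH", "warfarin")]

# Degree centrality: how many predications each concept participates in.
degree = Counter()
for subj, _, obj in predications:
    degree[subj] += 1
    degree[obj] += 1

# Keep only predications whose subject and object are well connected.
hubs = {n for n, d in degree.items() if d >= 2}
summary = [p for p in predications if p[0] in hubs and p[2] in hubs]
print(summary)   # -> [('aspirin', 'TREATS', 'headache')]
```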
Paweł Kowalski, Marcin Fastyn, Jakub Banasiak (Instytut Slawistyki Polskiej Akademii Nauk). The use of bibliographic resources by cultural institutions, or their integration (the example of the iSybislaw system). Keywords: bibliographic database; information retrieval system; cultural institution; iSybislaw. No. 9 (11) / 2018, pp. 95-108. Paweł Kowalski holds a doctorate in the humanities in linguistics (Slavic philology) and is an assistant professor at the Institute of Slavic Studies of the Polish Academy of Sciences. He works on South Slavic languages, word formation, linguistic terminology, minority languages, and information retrieval languages.
Users need a new class of information retrieval systems to help them effectively utilize the increasingly vast selection of networked information resources becoming available on the Internet. These systems, usually called Network Information Discovery and Retrieval (NIDR) systems, must operate in a highly demanding, very large-scale distributed environment that encompasses huge numbers of autonomously managed and extremely heterogeneous resources. The design of successful NIDR systems demands a synthesis of technologies and practices from computer science, computer-communications networking, information science, librarianship, and information management. This paper discusses the range of potential functional requirements for information resource discovery and selection, issues involved in describing and classifying network resources to support discovery and selection processes, and architectural frameworks for collecting and managing the information bases involved. It also includes a survey and analysis of selected operational prototypes and production systems.
Conceptualizing the bibliographic record as text implies that it needs to be treated as such in order to fully exploit its function in information retrieval activities, which affects how access to works can be achieved. A theoretical framework is outlined, including methodological consequences in terms of how to go about teaching students of knowledge organization and users of information retrieval systems the literate activity of using the bibliographic record as a text. For knowledge organization research, this implies that providing access to texts and the works they embody is not a technical matter, but rather a literate issue.
We are working on the design of a data warehouse of documentary resources in an educational setting that integrates user modeling. In describing resources with a view to their reuse in training paths, we discuss the difficulties encountered and formulate proposals to fill gaps in the existing standards and to make certain descriptions more operational. Modeling the actors on the one hand and the document types on the other makes it possible to draw correlations in order to improve responses. Relating actors and documents is made possible by the warehouse's metadata and the meta-modeling of the data warehouse. We also develop the metadata specific to the data warehouse, which define the structural and accessibility metadata specific to the steering system. In order to best develop our contribution to the strategic information system, the meta-modeling of the data warehouse makes it possible to draw up a master plan for the construction of the data warehouse.
Motivation: As the sizes of three-dimensional (3D) protein structure databases are growing rapidly nowadays, exhaustive database searching, in which a 3D query structure is compared to each and every structure in the database, becomes inefficient. We propose a rapid 3D protein structure retrieval system named 'ProtDex2', in which we adopt the techniques used in information retrieval systems in order to perform rapid database searching without having access to every 3D structure in the database. The retrieval process is based on the inverted-file index constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the 3D protein structures in the database. ProtDex2 is a significant improvement, both in terms of speed and accuracy, upon its predecessor system, ProtDex.
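A minimal sketch of an inverted-file index over discretised feature keys, loosely in the spirit of ProtDex2's indexing of SSE-pair features (the keys and the voting score below are illustrative assumptions, not the system's actual feature vectors or ranking function).

```python
from collections import defaultdict

index = defaultdict(set)

def add(structure_id, features):
    """Index a structure under each of its discrete feature keys."""
    for f in features:
        index[f].add(structure_id)

# Each structure is described by discrete SSE-pair feature keys.
add("1abc", ["HH:close", "HE:far", "EE:close"])
add("2xyz", ["HH:close", "EE:far"])

def search(query_features):
    """Rank structures by the number of shared feature keys, so the
    query never has to be compared against every structure directly."""
    votes = defaultdict(int)
    for f in query_features:
        for sid in index.get(f, ()):
            votes[sid] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

print(search(["HH:close", "EE:close"]))   # -> [('1abc', 2), ('2xyz', 1)]
```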