Reduction in Searching Time of Inverted Index Using Bloom Filter
Related papers
Enhance Inverted Index Using in Information Retrieval
This paper proposes a method for the first step in information retrieval (IR): preparing the document set (preprocessing). In information retrieval systems, tokenization is an integral part whose prime objective is to identify the tokens and their counts. This paper presents an effective tokenization approach based on a proposed new method called enhance inverted index (EII). The results show the efficiency and effectiveness of the proposed algorithm. Tokenization of documents helps to satisfy the user's information need more precisely and sharply reduces the search space. Pre-processing of the input documents is an integral part of tokenization: it generates the respective tokens of each document, and on the basis of these tokens probabilistic IR generates its scoring and gives a reduced search space. The comparative analysis is based on two parameters: the time to reduce the search space and the pre-processing time. Keywords: information retrieval (IR), enhance inverted index (EII). INTRODUCTION The amount of operational data has been increasing exponentially over the past few decades, and the expectations of data users are changing proportionally as well. Data users expect deeper, more exact, and more detailed results. Retrieval of relevant results is always affected by the pattern in which they are stored and indexed. Various techniques have been designed to index documents based on the tokens identified within them, including new techniques that use an inverted index [1]. Information retrieval (IR) handles the representation, storage, and organization of, and access to, information items. In IR, one of the main problems is to determine which documents are relevant to the user's needs and which are not. In practice, this problem is usually treated as a ranking problem, which aims to order documents according to the degree of relevance (matching) between each document and the user's query [1] [2] [3].
The general structure of an information retrieval system is shown in Figure (1).
Building an Inverted Index at the DBMS Layer for Fast Full Text Search
2017
To make keyword and full-text searches accurate and fast, it is recommended to index the words in the corpus. One way to do this is to use an inverted index, which maintains in a structured form the occurrences of words in a set of documents. A stemming algorithm can be used to minimize the number of indexed words, so that only the root word is kept for each term. This paper presents how to build an inverted index for documents stored in MongoDB and Oracle databases. Different approaches are presented in order to compare them and determine which one has the best performance. These approaches take advantage of the frameworks and tools provided by the database systems to build the index: the MapReduce framework for MongoDB and Pipelined Table Functions for Oracle.
A parallel computational approach for similarity search using Bloom filters
Computational Intelligence, 2018
Finding similar items in a large and unstructured dataset is a challenging task in many applications of data science, such as searching, indexing, and retrieval. With the increasing data volume and demand for real-time responses, similarity search has gained much consideration. In this paper, a parallel computational approach for similarity search using Bloom filters (PCASSB) is proposed, which uses a Bloom filter to represent the features of each document and to compare them with the user's query. Query features are stored in an integer query array (IQA), an array of integers. PCASSB, an approximate similarity search technique, has been implemented on a graphics processing unit with compute unified device architecture as the programming platform. To compute the similarity score between the query and the reference dataset, the Dice coefficient has been used as a baseline method. The accuracy of the results generated by PCASSB is compared with the baseline method and other state-of-the-art methods. The experimental results show that the proposed technique is quite effective in processing a large number of text documents, as it takes less computational time.
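The idea of comparing Bloom filter representations against a Dice coefficient baseline can be sketched in a few lines. This is only a minimal CPU illustration, not the PCASSB implementation: the class `BloomFilter`, the helpers `dice_exact`, `dice_bloom` and `to_filter`, the filter size, and the SHA-256 based hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def dice_exact(a, b):
    """Dice coefficient on plain sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def dice_bloom(f1, f2):
    """Approximate Dice coefficient computed on the set bits of two filters."""
    ones1 = bin(f1.bits).count("1")
    ones2 = bin(f2.bits).count("1")
    if ones1 + ones2 == 0:
        return 1.0
    common = bin(f1.bits & f2.bits).count("1")
    return 2 * common / (ones1 + ones2)


def to_filter(features):
    """Insert every feature of a document or query into a fresh filter."""
    bf = BloomFilter()
    for feature in features:
        bf.add(feature)
    return bf


doc_features = {"bloom", "filter", "index", "search"}
query_features = {"bloom", "filter", "query"}
doc_bf, query_bf = to_filter(doc_features), to_filter(query_features)

exact = dice_exact(doc_features, query_features)  # 2 * 2 / (4 + 3)
approx = dice_bloom(doc_bf, query_bf)             # bit-level approximation
```

The bit-level score is only an estimate of the exact one, because different features can set the same bit; a larger filter reduces that effect at the cost of memory.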
Inverted indexes: Types and techniques
2011
There has been a substantial amount of research on high-performance inverted indexes because most web and search engines use an inverted index to execute queries. Documents are normally stored as lists of words, but inverted indexes invert this by storing for each word the list of documents that the word appears in, hence the name "inverted index". This paper presents the crucial research findings on inverted indexes, their types and techniques.
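The inversion described here, from "document to words" to "word to documents", can be illustrated with a short sketch. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are assumptions for the example only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization: lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


documents = {
    1: "inverted index for search",
    2: "bloom filter for search",
    3: "index compression",
}
index = build_inverted_index(documents)
# index["search"] now lists the documents that contain the word "search".
```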
Inverted indexes for phrases and strings
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11, 2011
Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of indexes for querying phrases or sequence of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching. The central idea is to exclude document d in the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to the optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access in terms of frequency and document id based retrieval. Compression and speed tradeoffs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k based retrieval under relevance metrics like frequency and tf-idf.
Information Processing and Management, 2006
Similarity calculations and document ranking form the computationally expensive parts of query processing in ranking-based text retrieval. In this work, for these calculations, 11 alternative implementation techniques are presented under four different categories, and their asymptotic time and space complexities are investigated. To our knowledge, six of these techniques are not discussed in any other publication before. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of different implementations in terms of query processing time and space consumption. Advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments that investigate the scalability of the implementations are presented.
Efficient single‐pass index construction for text databases
Journal of the American Society for Information Science and Technology, 2003
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single‐pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single‐pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
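The general idea of building an index in one pass and in segments can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: in-memory lists stand in for on-disk segment files, the names `build_segments` and `merge_segments` and the posting budget are hypothetical, and documents are assumed to arrive in increasing ID order.

```python
import heapq
from collections import defaultdict


def build_segments(documents, max_postings_per_segment):
    """Read (doc_id, text) pairs once, flushing a segment when the budget is hit."""
    segments = []
    current = defaultdict(list)
    count = 0
    for doc_id, text in documents:
        for term in sorted(set(text.lower().split())):
            current[term].append(doc_id)
            count += 1
        if count >= max_postings_per_segment:
            # "Flush": a term-sorted list stands in for a segment file on disk.
            segments.append(sorted(current.items()))
            current, count = defaultdict(list), 0
    if current:
        segments.append(sorted(current.items()))
    return segments


def merge_segments(segments):
    """Merge term-sorted segments into one index with a k-way merge."""
    merged = {}
    # heapq.merge is stable, so postings of a term arrive in segment order.
    for term, postings in heapq.merge(*segments, key=lambda item: item[0]):
        merged.setdefault(term, []).extend(postings)
    return merged


documents = [
    (1, "text index"),
    (2, "index build"),
    (3, "text search"),
    (4, "search index"),
]
segments = build_segments(documents, max_postings_per_segment=4)
index = merge_segments(segments)
```

Only the vocabulary of the current segment is held in memory at any time; the merge step reads the segments sequentially.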
AN INVESTIGATIVE SCHEME FOR KEYWORD SEARCH USING INVERTED KEY TACTIC
Unsupervised classification of patterns (that is, data items, observations, or feature vectors) into groups is called clustering. To retrieve documents from a collection efficiently according to a set of keywords, inverted lists are the most commonly used structure. Keyword search is one of the most important activities in an information retrieval system. Studies in recent years show that keyword search is widely used to access text data. Keyword search is appropriate for document collections as well as for accessing structured or semi-structured data, such as XML documents and relational databases, whose tables can also be regarded as sets of documents. The user supplies keywords as a query and retrieves the matching documents. To retrieve documents efficiently, a data structure is used that maps each word in the dataset to a list of IDs of the documents in which the word appears. The inverted index for a document collection consists of a set of so-called inverted lists, also known as posting lists.
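Answering a multi-keyword query from posting lists comes down to intersecting sorted lists of document IDs. The sketch below shows one common way to do this with a two-pointer merge; the names `intersect` and `and_query` and the sample `posting_lists` are assumptions for illustration, not taken from the paper.

```python
def intersect(list_a, list_b):
    """Intersect two sorted posting lists with a two-pointer merge."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, keywords):
    """Return IDs of documents that contain every keyword."""
    lists = [index.get(keyword, []) for keyword in keywords]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


posting_lists = {
    "keyword": [1, 2, 4, 7],
    "search": [2, 3, 4, 9],
    "xml": [4, 5],
}
both = and_query(posting_lists, ["keyword", "search"])
all_three = and_query(posting_lists, ["keyword", "search", "xml"])
```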
Toward a multi-tier index for information retrieval system
International Conference on Telecommunications, 2005
Text Information Retrieval (TIR) is considered the heart of many applications, such as a Document Management System (DMS). TIR used for a DMS requires different data structure techniques from those used in a search engine. A search engine requires special hardware (supercomputers with large memory) to run information retrieval algorithms. In this paper, a new approach is developed to make it easy for a DMS to perform the retrieval process with high performance. Conventional approaches are based on a single inverted file, but our approach is based on an object and multi-tier inverted index file structure.
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, 2013
Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted index. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent investigations also study how an optimized document ordering can be used to assign document identifiers (docIDs) to the document database. This yields important improvements in index compression and query processing time. In this paper we follow this line of research, yet from a different perspective. We propose a docID reassignment method that allows one to focus on a given subset of inverted lists to improve their performance. We then use run-length encoding to compress these lists (as many consecutive 1s are generated). We show that by using this approach, not only the performance of the particular subset of inverted lists is improved, but also that of the whole inverted index. Our experimental results indicate a reduction of about 10% in the space usage of the whole index (just regarding docIDs), and up to 30% if we regard only the particular subset of lists on which the docID reassignment was focused. Also, decompression speed is up to 1.22 times faster if the runs must be explicitly decompressed and up to 4.58 times faster if implicit decompression of runs is allowed. Finally, we also improve the Document-at-a-Time query processing time of AND queries (by up to 12%), WAND queries (by up to 23%) and full (non-ranked) OR queries (by up to 86%).
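When docIDs are reassigned so that a list contains many consecutive identifiers, the gaps between neighbours are runs of 1s, and each run can be stored as a single pair. The sketch below illustrates that idea in its simplest form; the (start, length) representation and the names `encode_runs`, `decode_runs` and `contains` are assumptions for the example, not the paper's encoding.

```python
def encode_runs(doc_ids):
    """Collapse a sorted docID list into (start, length) runs of consecutive IDs."""
    runs = []
    for doc_id in doc_ids:
        if runs and doc_id == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((doc_id, 1))        # start a new run
    return runs


def decode_runs(runs):
    """Explicit decompression: expand every run back into docIDs."""
    return [start + offset for start, length in runs for offset in range(length)]


def contains(runs, doc_id):
    """Membership test on the runs themselves, without expanding them."""
    return any(start <= doc_id < start + length for start, length in runs)


posting = [3, 4, 5, 6, 10, 12, 13]
runs = encode_runs(posting)
```

Testing membership directly on the runs is a small-scale analogue of working with runs implicitly instead of decompressing them first.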