Implementation Of Information Retrieval Indonesian Text Documents Using The Vector Space Model

Information retrieval system in text-based skripsi document search file using vector space model method

Journal of physics, 2019

Speed and simplicity in finding documents and information have become a necessity in the campus library. Finding a skripsi (undergraduate thesis) document currently takes a long time, and the results are not displayed simply. We therefore aim to make searching for skripsi documents in the digital library of a private university in Jakarta faster and to display precisely what is being searched for. The approach uses the VSM method, which supports searches without complete keywords, and the effectiveness of the searches is evaluated. The results of the VSM method can be applied to the development of a prototype application that makes searching for the needed documents and information fast and simple.

Comparison of VSM, GVSM, and LSI in Information Retrieval for Indonesian Text

Jurnal Teknologi, 2016

The vector space model (VSM) is an information retrieval (IR) model that represents queries and documents as n-dimensional vectors. GVSM is an extension of VSM that represents documents based on the similarity between the query and the minterm vector space of the document collection. The minterm vectors are defined by the terms in the query, so documents can be retrieved based on the meaning of the words in the query. Conversely, a document can contain semantically equivalent information. LSI is a method used in IR systems to retrieve documents based on the overall meaning of the user's query rather than on a word-by-word translation. LSI uses a matrix algebra technique called Singular Value Decomposition (SVD). This study discusses the performance of VSM, GVSM, and LSI implemented in an IR system that retrieves Indonesian-language documents in .pdf, .doc, and .docx files, using the Nazief and Adriani stemming algorithm. Each method implemented either by thread or no-...
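As an illustration of the basic VSM idea described above, the following minimal sketch ranks documents by cosine similarity between TF-IDF vectors. It is not the implementation from any of the listed papers: the function names, the smoothed IDF formula, the whitespace tokenizer, and the three sample Indonesian documents are all assumptions made for this example, and no stemming is applied.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumed, simplistic tokenizer)."""
    return text.lower().split()


def build_idf(docs_tokens):
    """Smoothed inverse document frequency for every term in the collection."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unknown to the collection are dropped."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (document index, score) pairs sorted by descending similarity."""
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [(i, cosine(q_vec, tfidf_vector(toks, idf)))
              for i, toks in enumerate(docs_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample collection, for illustration only.
DOCS = [
    "sistem temu kembali informasi dokumen teks",
    "perpustakaan digital kampus",
    "model ruang vektor untuk temu kembali dokumen",
]

if __name__ == "__main__":
    for doc_id, score in rank("temu kembali dokumen", DOCS):
        print(doc_id, round(score, 3))
```

Documents that share no term with the query receive a score of zero, which is the limitation that GVSM and LSI try to address.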

Implementation of LSI Method on Information Retrieval for Text Document in Bahasa Indonesia

An information retrieval system is used to obtain information based on a user's requirements. In this study, the Latent Semantic Indexing (LSI) method is implemented to search for and collect documents based on the overall meaning of the documents instead of individual words. The documents to be retrieved are text documents in *.doc, *.docx, or *.pdf format. In the text preprocessing phase, the Nazief and Adriani algorithm is used to strip the affixes (prefixes, suffixes, etc.) from a word and then match the result against a root-word database. To evaluate retrieval performance, response time, recall, and precision are measured. Multithreading is applied from the 'read document' step through the stemming process to improve response time. The results show that, with multithreading, the larger the number of terms in the document collection, the greater the improvement in response time. In terms of response time, the document collection in docx format is the fastest, followed by doc and pdf. For 80 documents and beyond, the system raises an "OutOfMemoryError" during the matrix decomposition process. This means that the larger the number of documents in the collection, the more memory is needed to execute the retrieval process.
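For reference, the matrix decomposition step in LSI is usually written as follows. This is the standard textbook formulation, not one taken from the paper; the symbols (a term-document matrix A with m terms and n documents, a rank k, a query vector q) are assumptions introduced here.

```latex
% SVD of the m x n term-document matrix and its rank-k truncation
A = U \Sigma V^{\top}, \qquad A_k = U_k \Sigma_k V_k^{\top}

% Folding a query vector q (length m) into the k-dimensional latent space
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Document j is represented by \hat{d}_j, the j-th row of V_k (as a column vector);
% ranking uses cosine similarity in the latent space
\operatorname{sim}(q, d_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Because A grows with both the vocabulary and the number of documents, a dense decomposition of A is a plausible place for memory to run out as the collection grows.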

Improving The Effectiveness of Texts Retrieval using Knowledge-Based Approach

International Journal of Emerging Trends in Engineering Research, 2020

A predicate-based document query language is proposed to allow users to define their search criteria, and their knowledge of the documents to be retrieved, precisely and reliably. A guided search tool is built as an intelligent, natural-language-oriented user interface to help users formulate queries; it is supported by a generator of intelligent questions, an inference engine, and a query base. Modern IR systems face the vocabulary problem: inconsistencies between the terms used to describe the documents and the terms searchers use to describe their information needs. The researcher built an automated thesaurus, designed using the Vector Space Model (VSM) with cosine similarity. A selection of 242 Arabic abstracts in the field of information and computer science was used. This paper aimed to design and build automated Arabic thesauri based on term similarity, which could be employed in any particular domain or field to improve query expansion and obtain a greater number of relevant documents for the user query. In terms of recall and precision rates, the similarity thesaurus was found to improve on the conventional information retrieval system.
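The recall and precision rates used in comparisons like this one can be computed with a few lines of code. The sketch below uses the standard set-based definitions; the function name and the sample document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical result list and relevance judgments.
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = ["d2", "d4", "d7"]
    p, r = precision_recall(retrieved, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Query expansion with a thesaurus typically aims to raise recall by retrieving documents that use different terms for the same concept, so both values need to be reported together.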

Computational model for the processing of documents and support to the decision making in systems of information retrieval

2017

Having or lacking the necessary information at the right time can mean the success or failure of any operation. Since its inception in 1950, the field of information retrieval has provided tools that allow users to find answers to their needs and questions. Information retrieval systems are among the most widely used systems internationally, since their interfaces and functionality are easy to understand. The main function of these systems is to crawl the web, store the information found, and then respond to user queries. Because of the large amount of information they hold, search engines are a rich source of knowledge and support decision-making based on information published on the web. Companies like Google do not disclose concrete information about which models they use to develop the components of their search engines. In addition, the relevance calculation for their documents responds to commercial and governmental policies, which is why it is difficult to develop systems as complex as search engines without a computational model that supports their development. This article presents the design of a computational model for document processing and decision support in information retrieval systems, used to design, develop, and deploy search engines at the national and international level.

Documents Retrieval Using the Combination of Two Keywords

Search engines use NLP (Natural Language Processing) and statistically based systems to process the query. The statistical system recognizes the search terms and also provides the stems and the singular and plural forms of words. The statistically based system may also provide a weight for every term. The NLP system tags parts of speech and identifies objects, verbs, subjects, agents, synonyms, and alternate forms of proper nouns. It can then create an unambiguous representation of the submitted query, and the term weights are computed. For a particular query, the search engine retrieves a list of documents from the database, using the keywords to obtain results for the submitted query. Stemming algorithms and stop-lists (stop-words) are used to reduce disk space consumption. 'the', 'is', and 'an' are examples of stop-words, and 'reading', 'playing', and 'watches' are examples of words that stemming algorithms reduce to a stem. In information retrieval systems, the vector space model and the Boolean model are used for document ranking. Search begins with submitting keywords to the search engine; these should be clear and understandable for query processing, and it helps to know which keywords are more relevant and will perform well. In this paper, a new technique for retrieving documents from the database is proposed: a combination of 'two keywords' is used, and the list of documents is rearranged in order of weight.
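The stop-word and stemming steps mentioned above can be sketched as follows. The stop-word set contains only the three examples given in the abstract, and the suffix stripper is a deliberately crude stand-in for a real stemming algorithm (it is not Porter or any published stemmer). All names are hypothetical.

```python
# Only the three stop-words named in the abstract; real stop-lists are much longer.
STOP_WORDS = {"the", "is", "an"}

# Assumed suffix list for illustration; checked in this order.
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip surrounding punctuation, drop stop-words, then stem."""
    tokens = [t.strip(".,;:!?'\"").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The boy is reading an old book"))
    # ['boy', 'read', 'old', 'book']
    print([crude_stem(w) for w in ("reading", "playing", "watches")])
    # ['read', 'play', 'watch']
```

Both steps shrink the index because fewer distinct terms need to be stored, which is the disk-space saving the abstract refers to. A rule list this simple will mis-stem many words, so it should be read only as a picture of the idea.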

A Survey of Information Retrieval Techniques

Advances in Networks

The explosive growth of resources stored in various forms and transmitted over the internet has necessitated research into information retrieval technologies. The major information retrieval mechanisms commonly employed include the vector space model, the Boolean model, the fuzzy set model, and the probabilistic retrieval model. These models are used to find similarities between the query and the documents in order to retrieve documents that reflect the query. These approaches are keyword-based: they use lists of keywords to describe the information content. In this paper, a survey of these models is provided in order to understand their working mechanisms and shortcomings. This understanding is vital as it facilitates the choice of an information retrieval technique based on the underlying requirements. The results of this survey revealed that the current information retrieval models fall short of expectations in one way or another. As such, they are not ideal for high-precision information retrieval applications.
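Of the models listed, the Boolean model is the simplest to show in code. The following sketch builds an inverted index and answers AND, OR, and NOT queries with set operations. The function names and the four sample documents are assumptions made for this example, not material from the survey.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents containing every one of the given terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents containing at least one of the given terms."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result


def boolean_not(index, all_ids, term):
    """Documents that do not contain the given term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical sample collection.
DOCS = {
    1: "vector space model retrieval",
    2: "boolean model retrieval",
    3: "probabilistic retrieval model",
    4: "fuzzy set model",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(boolean_and(index, "model", "retrieval"))  # {1, 2, 3}
    print(boolean_or(index, "boolean", "fuzzy"))     # {2, 4}
    print(boolean_not(index, DOCS, "retrieval"))     # {4}
```

A Boolean query returns an unranked set: a document either matches or it does not. That lack of graded relevance is one commonly cited shortcoming of this model compared with the vector space and probabilistic models.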

Improving the Effectiveness of Information Retrieval System

American Scientific Research Journal for Engineering, Technology, and Sciences, 2016

With the rapid growth of information and the ease of access to it, in particular the boom of the World Wide Web, finding useful information and knowledge has become one of the most important topics in information and computer science. Information Retrieval (IR) systems, also called text retrieval systems, help users retrieve information that is relevant or close to their information needs. This research provides an effective IR system for retrieving not only relevant but also related documents. For retrieving relevant documents, a probabilistic model is applied. For retrieving related documents, a related index table is built that includes extracted keywords and lists of related documents. In constructing the related index table in the database, Shannon’s entropy difference between intrinsic and extrinsic mode is used to extract the highly significant keywords. Entropy threshold value was assigned to 0.5 of normalized entropy difference square ( ) according to the an...