Yi-Reun Kim | Korea Advanced Institute of Science and Technology
Papers by Yi-Reun Kim
Journal of KIISE:Computing Practices and Letters, 2008
As the number of electronic documents increases rapidly with the growth of the Internet, parallel search engines capable of handling a large number of documents are becoming ever more important. To implement a parallel search engine, we need to partition the inverted index and search through the partitioned index in parallel. There are two methods of partitioning the inverted index: 1) document-identifier based partitioning and 2) keyword-identifier based partitioning. However, each method alone has drawbacks. The former is convenient for inserting documents and has high throughput, but performs poorly for top-k query processing. The latter performs well for top-k query processing, but is inconvenient for inserting documents and has low throughput. In this paper, we propose a hybrid partitioning method to compensate for the drawbacks of each method. We design and implement a parallel search engine that supports the hybrid partitioning method using the Odysseus DBM...
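The two baseline partitioning schemes named in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not attempt the hybrid method: the function names, the modulo assignment of documents, and the CRC32 assignment of keywords are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: keyword -> sorted list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in sorted(set(docs[doc_id].split())):
            index[word].append(doc_id)
    return dict(index)

def partition_by_document(docs, n):
    """Document-identifier based: each partition indexes its own documents.
    Assumption: documents are assigned by doc_id modulo n."""
    shards = [{} for _ in range(n)]
    for doc_id, text in docs.items():
        shards[doc_id % n][doc_id] = text
    return [build_index(shard) for shard in shards]

def keyword_partition(word, n):
    """Assumption: keywords are assigned by a stable CRC32 hash."""
    return zlib.crc32(word.encode("utf-8")) % n

def partition_by_keyword(docs, n):
    """Keyword-identifier based: each partition holds whole posting lists."""
    parts = [{} for _ in range(n)]
    for word, postings in build_index(docs).items():
        parts[keyword_partition(word, n)][word] = postings
    return parts

def search_document_partitioned(parts, word):
    """Every partition must be consulted and the partial results merged."""
    hits = []
    for part in parts:
        hits.extend(part.get(word, []))
    return sorted(hits)

def search_keyword_partitioned(parts, word):
    """Only the single partition that owns the keyword is consulted."""
    return parts[keyword_partition(word, len(parts))].get(word, [])

docs = {
    0: "flash memory page",
    1: "search engine index",
    2: "parallel search engine",
    3: "flash page logging",
}
by_doc = partition_by_document(docs, 2)
by_kw = partition_by_keyword(docs, 2)
print(search_document_partitioned(by_doc, "search"))
print(search_keyword_partitioned(by_kw, "search"))
```

In this sketch, inserting a document under document-based partitioning touches one partition only, whereas under keyword-based partitioning it may touch every partition that owns one of the document's keywords. A single-keyword query shows the opposite pattern.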
Journal of KIISE:Databases, 2009
In a multiple-server DBMS using the shared-disk model, when a server process updates data, the updates are not immediately reflected in the buffers of the other server processes. Thus, the other server processes may read invalid data. In this paper, we propose a novel method to solve this problem. In this method, the server process stores the identifiers and timestamps of the pages that have been updated during a transaction into the coherency volume when the transaction commits. Then, when a lock is acquired, the server process accesses the coherency volume to invalidate its buffers of the pages updated by the other server processes and, subsequently, reads the up-to-date versions of the pages from disk. This method needs only a very small coherency volume and shows good performance because the amount of data that needs to be accessed is very small.
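The commit and lock-acquisition steps described in this abstract can be sketched as a toy, single-process Python model. The class names, the logical clock, and the in-memory dictionaries standing in for the shared disk and the coherency volume are assumptions for illustration. Real locking and concurrency are not modeled.

```python
import itertools

class SharedState:
    """Hypothetical stand-in for the shared disk and the coherency volume."""
    def __init__(self):
        self.disk = {}              # page_id -> page contents
        self.coherency_volume = {}  # page_id -> timestamp of last committed update
        self.clock = itertools.count(1)  # assumed logical commit clock

class Server:
    def __init__(self, shared):
        self.shared = shared
        self.buffer = {}  # page_id -> (contents, timestamp when cached)
        self.dirty = {}   # page_id -> contents written in the open transaction

    def acquire_lock(self):
        """Invalidate buffered pages that have a newer committed update."""
        for page_id, ts in self.shared.coherency_volume.items():
            cached = self.buffer.get(page_id)
            if cached is not None and cached[1] < ts:
                del self.buffer[page_id]

    def read(self, page_id):
        if page_id in self.dirty:
            return self.dirty[page_id]
        if page_id not in self.buffer:
            # Buffer miss: fetch the up-to-date version from the shared disk.
            ts = self.shared.coherency_volume.get(page_id, 0)
            self.buffer[page_id] = (self.shared.disk[page_id], ts)
        return self.buffer[page_id][0]

    def write(self, page_id, contents):
        self.dirty[page_id] = contents

    def commit(self):
        """Flush updates and record page IDs and timestamps at commit."""
        ts = next(self.shared.clock)
        for page_id, contents in self.dirty.items():
            self.shared.disk[page_id] = contents
            self.shared.coherency_volume[page_id] = ts
            self.buffer[page_id] = (contents, ts)
        self.dirty.clear()

shared = SharedState()
shared.disk["p1"] = "v0"
a, b = Server(shared), Server(shared)

first_read = b.read("p1")   # b caches the original page
a.acquire_lock()
a.write("p1", "v1")
a.commit()
stale_read = b.read("p1")   # b has not acquired the lock yet
b.acquire_lock()
fresh_read = b.read("p1")   # b's stale buffer entry was invalidated
print(first_read, stale_read, fresh_read)
```

Only page identifiers and timestamps pass through the coherency volume in this model, which mirrors the abstract's point that the volume stays small.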
As the size of the web is growing explosively, search engines are becoming increasingly important as the primary means to retrieve information from the Internet. A search engine periodically downloads web pages and stores them in the database to provide readers with up-to-date search results. The web crawler is a program that downloads and stores web pages for this purpose. A large-scale search engine uses a parallel web crawler to retrieve the collection of web pages while maximizing the download rate. However, neither the service architecture nor the experimental analysis of parallel web crawlers has been fully discussed in the literature. In this paper, we propose an architecture of the parallel web crawler and discuss implementation issues in detail. The proposed parallel web crawler is based on the coordinator/agent model using multiple machines to download web pages in parallel. The coordinator/agent model consists of multiple agent machines to collect web pages and a single coordinator mac...
Journal of Computing Science and Engineering, 2008
We propose a new query expansion method in the extended Boolean model that improves precision without degrading recall. To improve precision, our method promotes the ranks of documents having more query terms, since users typically prefer such documents. The proposed method consists of the following three steps: (1) expanding the query by adding new terms related to each term of the query, (2) further expanding the query by adding augmented terms, which are conjunctions of the terms, and (3) assigning a weight to each term so that augmented terms have higher weights than the other terms. We conduct extensive experiments to show the effectiveness of the proposed method. The experimental results show that the proposed method improves precision by up to 102% for the TREC-6 data compared with the existing query expansion method using a thesaurus proposed by Kwon et al. [Kwon et al. 1994].
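The three steps listed in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's method: the thesaurus contents, the weight values, the choice to build conjunctions from the original query terms only, and the plain weighted-sum scoring (used here in place of the extended Boolean model's ranking function) are all hypothetical.

```python
from itertools import combinations

def expand_query(terms, thesaurus,
                 base_weight=1.0, related_weight=0.5, augmented_weight=2.0):
    """Return a dict mapping conjunctions (tuples of terms) to weights.
    All weight values are illustrative assumptions."""
    weighted = {}
    # Step 1: add terms related to each query term (hypothetical thesaurus).
    for term in terms:
        weighted[(term,)] = base_weight
        for related in thesaurus.get(term, []):
            weighted.setdefault((related,), related_weight)
    # Step 2: add augmented terms, i.e. conjunctions of query terms.
    # Step 3: give augmented terms a higher weight than single terms.
    for size in range(2, len(terms) + 1):
        for combo in combinations(terms, size):
            weighted[combo] = augmented_weight
    return weighted

def score(doc_words, weighted):
    """Simple weighted sum over conjunctions fully contained in the document.
    This stands in for the extended Boolean ranking, which is not modeled."""
    return sum(weight for conj, weight in weighted.items()
               if all(term in doc_words for term in conj))

thesaurus = {"flash": ["nand"]}  # hypothetical related-term table
query = expand_query(["flash", "memory"], thesaurus)

both_terms = score({"flash", "memory", "storage"}, query)
one_term = score({"flash", "nand"}, query)
print(both_terms, one_term)
```

A document containing both query terms also matches the augmented conjunction, so it is ranked above a document that matches one query term plus a related term. That is the rank promotion the abstract describes.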
Proceedings of the Fourth BioASQ workshop, 2016
Information Sciences, 2009
IEICE Transactions on Information and Systems, 2009
Proceedings of the 2010 international conference on Management of data - SIGMOD '10, 2010
Flash memory is widely used as the secondary storage in lightweight computing devices due to its outstanding advantages over magnetic disks. Flash memory has many access characteristics different from those of magnetic disks, and how to take advantage of them is becoming an important research issue. There are two existing approaches to storing data in flash memory: page-based and log-based. The former has good performance for read operations, but poor performance for write operations. In contrast, the latter has good performance for write operations when updates are light, but poor performance for read operations. In this paper, we propose a new method of storing data, called page-differential logging, for flash-based storage systems that solves the drawbacks of the two methods. The primary characteristics of our method are: (1) writing only the difference (which we