Hyuk-Yoon Kwon - Academia.edu
Papers by Hyuk-Yoon Kwon
The datasets consist of four Windows registry files collected from four different Windows PCs. Because the Windows registry contains critical system- and application-dependent information, it is useful for research on potential system security issues and digital forensics. Because none of the PCs was in use at the time of collection, the issues contained in the files do not pose actual security problems. All the files were extracted using Windows built-in command-line tools. Three of the files are raw datasets extracted directly from Windows. The fourth file (i.e., registry 4) was transformed from the raw data into a key-value pair format for further processing, but it contains exactly the same information as the others.
International Journal of Information Security, 2021
This paper deals with a well-known problem in the area of smudge attacks: when a user draws a pattern to unlock the pattern lock on a smartphone screen, pattern extraction sometimes becomes difficult owing to oily residues around the pattern. These residues obscure the phone screen, which significantly lowers the guess rate for the pattern lock. To address this, this paper proposes a novel attack method based on a Convolutional Neural Network (CNN). CNNs are known to exhibit high accuracy in image classification. However, using only a CNN for the attack is not sufficient, because there are 389,112 possible patterns, and training the CNN for all the cases is difficult. We therefore propose two ideas to overcome this problem. The first is the application of 'Screen Segmentation,' where we divide the screen into four segments to reduce the number of possible patterns to 1470 in each segment. The second is the use of pruning rules, which reduce the total number of pattern cases when the patterns in each segment are combined. Furthermore, by applying the Android pattern lock constraints, we reduce the number of possible patterns. To validate the proposed idea, we collected 3500 images by photographing the screens of Android smartphones and generated 367,500 images based on their possible combinations. Using those data sets, we conducted an experiment in which we investigated the success rate of our attack in various situations, covering different pattern lock lengths and types of prior application usage. The result shows that, up to a pattern lock length of seven, the proposed method has a 79% success rate on average, which is a meaningful result for smudge attacks.
In addition, in an ideal case where only the actual pattern lock is entered, without oily residues, the proposed scheme achieves even higher performance, i.e., a 93% successful guess rate on average.
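The figure of 389,112 possible patterns can be reproduced by exhaustive enumeration. The sketch below is not the paper's attack method; it only counts patterns, assuming the commonly stated Android rules (a 3x3 grid, 4 to 9 dots, no dot used twice, and no stroke that jumps over a dot that has not yet been visited). The function names `midpoint` and `count_patterns` are hypothetical.

```python
# Dots are numbered 0..8 on a 3x3 grid, row by row.

def midpoint(a, b):
    """Return the dot lying exactly halfway between dots a and b, or None."""
    row_a, col_a = divmod(a, 3)
    row_b, col_b = divmod(b, 3)
    if a != b and (row_a + row_b) % 2 == 0 and (col_a + col_b) % 2 == 0:
        return ((row_a + row_b) // 2) * 3 + (col_a + col_b) // 2
    return None


def count_patterns(min_len=4, max_len=9):
    """Count valid patterns per length by depth-first enumeration."""
    counts = {n: 0 for n in range(min_len, max_len + 1)}

    def extend(last, visited, length):
        if length >= min_len:
            counts[length] += 1
        if length == max_len:
            return
        for nxt in range(9):
            if nxt in visited:
                continue  # assumed rule: each dot is used at most once
            mid = midpoint(last, nxt)
            if mid is not None and mid not in visited:
                continue  # assumed rule: cannot jump over an unvisited dot
            visited.add(nxt)
            extend(nxt, visited, length + 1)
            visited.remove(nxt)

    for start in range(9):
        extend(start, {start}, 1)
    return counts


if __name__ == "__main__":
    per_length = count_patterns()
    print(per_length, sum(per_length.values()))
```

Under these rules the per-length counts sum to 389,112, which matches the number quoted in the abstract and illustrates why training one classifier over every full pattern is impractical.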
We present key-value data sets where each data set is composed of various data types. There are eight data sets, synthetic and real, intended for storage in key-value stores such as LevelDB of Google, RocksDB of Facebook, and Berkeley DB of Oracle. A strength of key-value stores is that they can handle various data types by assigning data of an arbitrary type as the value and a unique ID as the key. When constructing the key-value data sets, we focus on various data types (i.e., variety) in the real data sets and various sizes (i.e., volume) in the synthetic data sets. We generate four synthetic data sets of different sizes: (1) KVData1, (2) KVData2, (3) KVData3, and (4) KVData4. The total number of objects varies from 10K to 10M. For each key-value pair, we generate a unique ID as the key and a random string of variable length as the value. For the real data sets, we crawled user tweets and relevant information from Twitter using the Tweepy library (https://www.tweepy.org/); the data sets cover various data types: 1) geo-locations, 2) hashtags, 3) tweets, and 4) the number of followers. That is, the data sets are designed to have different data types such as geo-locations, texts, and integers. Table 2 shows the characteristics of the real data sets. We crawled four kinds of real data sets: (1) ID-Geo, consisting of the tweet ID and the location information of the tweet, (2) ID-Hashtag, consisting of the tweet ID and the hashtags in the tweet, (3) ID-Tweet, consisting of the tweet ID and the tweet text, and (4) User-Followers, consisting of the user ID and the number of followers of the user.
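The synthetic generation step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' generator: the length bounds, the alphanumeric alphabet, the sequential integer IDs, and the name `generate_kv_pairs` are all assumptions.

```python
import random
import string


def generate_kv_pairs(num_objects, min_len=8, max_len=64, seed=0):
    """Yield (key, value) pairs: a unique integer ID and a random string.

    min_len, max_len, and the alphabet are assumed values; the description
    only says that values are random strings of variable length.
    """
    rng = random.Random(seed)  # seeded so a data set can be regenerated
    alphabet = string.ascii_letters + string.digits
    for object_id in range(num_objects):
        length = rng.randint(min_len, max_len)
        value = "".join(rng.choices(alphabet, k=length))
        yield object_id, value


if __name__ == "__main__":
    # 10K objects corresponds to the smallest size mentioned in the description.
    pairs = list(generate_kv_pairs(10_000))
    print(len(pairs), pairs[0])
```

Because the function is a generator, larger sizes can be streamed straight into a key-value store without holding the whole data set in memory.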
Sensors, 2021
Diabetic retinopathy (DR) is an eye disease that alters the blood vessels of a person suffering from diabetes. Diabetic macular edema (DME) occurs when DR affects the macula, which causes fluid accumulation in the macula. Efficient screening systems require experts to manually analyze images to recognize diseases. However, due to the challenging nature of the screening method and the lack of trained human resources, devising effective screening-oriented treatment is an expensive task. Automated systems try to cope with these challenges; however, these methods do not generalize well to multiple diseases and real-world scenarios. To solve these issues, we propose a new method comprising two main steps. The first involves dataset preparation and feature extraction, and the second relates to improving a custom deep-learning-based CenterNet model trained for eye disease classification. Initially, we generate annotations for suspected samples to locate the precise region of ...
2022 IEEE International Conference on Big Data and Smart Computing (BigComp), 2022
In this study, we propose a distributed architecture that dynamically updates the model for classifying tweet streams generated in real time. Our architecture ingests data streams through Apache Kafka and classifies them using Apache Spark Streaming. To reflect changes in the input stream in the classification model, we design the model so that its tokenizer and classifier can be updated dynamically for new tweet streams. The proposed architecture provides effective classification of data streams thanks to the dynamic update and processes them efficiently through parallel processing in a distributed environment. Through experiments using cyberattack-related tweets, we show that the F1-score of our classification model gradually improves from 0.8869 when the initial 50,000 tweets are used to 0.9094 when 200,000 tweets have been accumulated.
A top-k spatial keyword query returns the k objects having the highest (or lowest) scores with regard to spatial proximity as well as text relevancy. Approaches for answering top-k spatial keyword queries can be classified into two categories: the separate index approach and the hybrid index approach. The separate index approach maintains the spatial index and the text index independently and can accommodate new data types. However, it is difficult to support top-k pruning and merging efficiently at the same time, since they require two different orders for clustering the objects: the first based on scores for top-k pruning and the second based on object IDs for efficient merging. In this paper, we propose a new separate index method called Rank-Aware Separate Index Method (RASIM) for top-k spatial keyword queries. RASIM supports both top-k pruning and efficient merging at the same time by clustering each separate index in two different orders through the partitioning technique. Specifica...
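To make the query definition concrete, here is a brute-force baseline that scores every object and keeps the best k. It is not RASIM and uses no index. The scoring functions, the weight `alpha`, the normalizing distance `max_dist`, and all names are assumptions chosen for illustration; the abstract does not specify how proximity and relevancy are combined.

```python
import heapq
import math


def text_relevancy(query_terms, object_terms):
    """Fraction of distinct query terms found in the object's text (a stand-in measure)."""
    wanted = set(query_terms)
    if not wanted:
        return 0.0
    return len(wanted & set(object_terms)) / len(wanted)


def spatial_proximity(query_loc, object_loc, max_dist):
    """Map Euclidean distance to [0, 1], where 1 means the same location."""
    return max(0.0, 1.0 - math.dist(query_loc, object_loc) / max_dist)


def top_k(objects, query_loc, query_terms, k, alpha=0.5, max_dist=10.0):
    """Return the IDs of the k highest-scoring objects.

    objects is a list of (object_id, (x, y), terms) tuples.
    """
    def score(obj):
        _, loc, terms = obj
        return (alpha * spatial_proximity(query_loc, loc, max_dist)
                + (1.0 - alpha) * text_relevancy(query_terms, terms))

    return [obj[0] for obj in heapq.nlargest(k, objects, key=score)]


if __name__ == "__main__":
    places = [
        ("a", (0.0, 0.0), ["cafe", "wifi"]),
        ("b", (1.0, 0.0), ["cafe"]),
        ("c", (9.0, 0.0), ["cafe", "wifi"]),
        ("d", (0.0, 1.0), ["bar"]),
    ]
    print(top_k(places, (0.0, 0.0), ["cafe", "wifi"], k=2))
```

The baseline touches every object for every query, which is the cost that index-based approaches with top-k pruning aim to avoid.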
2021 IEEE International Conference on Big Data and Smart Computing (BigComp)
In this paper, we propose a scraping method for collecting tweets, which we call DeepScrap. DeepScrap completely scrapes the recent tweets that can be viewed on a specific user's page and crawls at a speed that overcomes the rate limits of the Twitter APIs. In particular, to improve the crawling speed of DeepScrap, we devise a multiprocessing architecture that assigns different IPs to the multiple processes in order to follow the robots.txt of Twitter. This allows us to maximize the parallelism of crawling on a single machine. By analyzing the tweets of 97 users, we show that DeepScrap can crawl all the tweets that the Twitter standard APIs crawl. Through extensive experiments, we show that DeepScrap can crawl all the tweets of the 97 users, which amount to 222,194 tweets, while the Twitter standard API can crawl only 12,586 of them because of its constraints. We also show that, when four processes run simultaneously, multiprocessing makes DeepScrap 2.97 times faster than single-process DeepScrap in crawling the 222,194 tweets of the 97 users.
Journal of KIISE:Computing Practices and Letters, 2008
As the number of electronic documents increases rapidly with the growth of the Internet, a parallel search engine capable of handling a large number of documents is becoming ever more important. To implement a parallel search engine, we need to partition the inverted index and search through the partitioned index in parallel. There are two methods of partitioning the inverted index: 1) document-identifier based partitioning and 2) keyword-identifier based partitioning. However, each method alone has drawbacks. The former is convenient for inserting documents and has high throughput, but has poor performance for top-k query processing. The latter has good performance for top-k query processing, but is inconvenient for inserting documents and has low throughput. In this paper, we propose a hybrid partitioning method to compensate for the drawbacks of each method. We design and implement a parallel search engine that supports the hybrid partitioning method using the Odysseus DBM...
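The two basic partitioning schemes named above can be contrasted with a small sketch. It does not implement the hybrid method, whose details are not given in the excerpt, and it has no relation to the Odysseus DBMS code. The modulo placement of documents, the CRC32 placement of keywords, and all function names are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            index[term].append(doc_id)
    return dict(index)


def partition_by_document(docs, num_nodes):
    """Document-identifier based: each node indexes only its own documents."""
    shards = [{} for _ in range(num_nodes)]
    for doc_id, terms in docs.items():
        shards[doc_id % num_nodes][doc_id] = terms  # assumed placement rule
    return [build_inverted_index(shard) for shard in shards]


def partition_by_keyword(docs, num_nodes):
    """Keyword-identifier based: each node holds whole posting lists."""
    nodes = [{} for _ in range(num_nodes)]
    for term, postings in build_inverted_index(docs).items():
        # CRC32 is used only as a deterministic hash for placing terms.
        node_id = zlib.crc32(term.encode("utf-8")) % num_nodes
        nodes[node_id][term] = postings
    return nodes


if __name__ == "__main__":
    docs = {0: ["a", "b"], 1: ["b", "c"], 2: ["a", "c"], 3: ["b"]}
    print(partition_by_document(docs, 2))
    print(partition_by_keyword(docs, 2))
```

With document-based partitioning, inserting a document touches one node but a query for a term must visit every node and merge partial posting lists. With keyword-based partitioning, a single-term query reads one complete posting list from one node, but inserting a document may update many nodes.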
J. Inf. Sci. Eng., 2015
For efficient large-scale Web crawlers, URL duplication checking is an important technique since it is a significant bottleneck. In this paper, we propose a new URL duplication checking technique for a parallel Web crawler; we call it full-coverage two-level URL duplication checking (full-coverage-2L-UDC). Full-coverage-2L-UDC provides efficient URL duplication checking while ensuring maximum coverage. First, we propose two-level URL duplication checking (2L-UDC). It makes URL duplication checking efficient by communicating at the Web site level rather than at the Web page level. Second, we present a solution for the so-called coverage problem, which is directly related to the recall of the search engine. It is the first solution for the coverage problem in the centralized parallel architecture. Third, we propose an architecture, FC2L-UDCbot, for a centralized parallel crawler using full-coverage-2L-UDC. We build a seven-agent FC2L-UDCbot for extensive experiments. We show tha...
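One plausible reading of the two-level idea is sketched below: a central coordinator decides only which agent owns a Web site, and each agent checks page-level duplicates locally, so page URLs never need to be exchanged centrally. This is an assumption about the design, not the paper's algorithm, and it does not address the coverage problem. The classes `Coordinator` and `Agent`, the round-robin site assignment, and the function `submit` are hypothetical.

```python
from urllib.parse import urlsplit


class Coordinator:
    """Site level: remembers which agent owns each Web site."""

    def __init__(self, num_agents):
        self.num_agents = num_agents
        self.site_owner = {}

    def owner_of(self, site):
        if site not in self.site_owner:
            # Assumed policy: hand new sites to agents in round-robin order.
            self.site_owner[site] = len(self.site_owner) % self.num_agents
        return self.site_owner[site]


class Agent:
    """Page level: remembers the URLs it has already seen for its own sites."""

    def __init__(self):
        self.seen_urls = set()

    def is_new(self, url):
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True


def submit(url, coordinator, agents):
    """Route a URL to the agent owning its site; return (agent_id, is_new)."""
    site = urlsplit(url).netloc.lower()
    agent_id = coordinator.owner_of(site)
    return agent_id, agents[agent_id].is_new(url)


if __name__ == "__main__":
    coordinator = Coordinator(num_agents=2)
    agents = [Agent(), Agent()]
    for url in ["http://a.example/x", "http://a.example/y",
                "http://a.example/x", "http://b.example/x"]:
        print(url, submit(url, coordinator, agents))
```

In this sketch the coordinator's table grows with the number of sites rather than the number of pages, which is the kind of saving that site-level communication is meant to provide.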
In this paper, we present a parallel algorithm for SLIC on Apache Spark, which we call PSLIC-on-Spark. To this end, we have extended the original SLIC algorithm to use the operations in Apache Spark, supporting parallel processing on multiple executors in an Apache Spark cluster. Then, we analyze the trade-off in PSLIC-on-Spark between processing speed and accuracy that results from partitioning the original image datasets. Through experiments, we verify the trade-off. Specifically, we show that PSLIC-on-Spark using 8 CPU cores reduces the processing time of SLIC by 2.24–2.93 times while it reduces the boundary recall (BR) of SLIC by 1.54–6.32% and increases the under-segmentation error (UE) by 1.79–6.2%. Then, we propose an improved algorithm, which we call PASLIC-on-Spark, that improves the accuracy of PSLIC-on-Spark. PASLIC-on-Spark has two main features: (1) image partitioning considering t...
In this paper, we deal with the problem of judging the credibility of movie reviews. The problem is challenging because even experts cannot clearly and efficiently judge the credibility of a movie review and the number of movie reviews is very large. To attack this problem, we propose a weakly supervised learning method for fast annotation. As the predefined criterion for weakly supervised learning, we present a simple and clear criterion based on the historical movie ratings associated with movie reviewers. The proposed method has the following two advantages. First, it is highly efficient because we can annotate entire data sets according to the predefined rule. Indeed, we show that the proposed method can annotate 8,000 movie reviews in only 0.712 seconds. Second, the criterion adopted for weakly supervised learning is simple but effective. As a comparison, we use a learning method that uses the helpfulness votes of other reviewers as the criterion to judge the credibility ...
With increasing numbers of GPS-equipped mobile devices, we are witnessing a deluge of spatial information that needs to be effectively and efficiently managed. Even though there are several distributed spatial data processing systems such as GeoSpark (Apache Sedona), the effects of underlying storage engines have not been well studied for spatial data processing. In this paper, we evaluate the performance of various distributed storage engines for processing large-scale spatial data using GeoSpark, a state-of-the-art distributed spatial data processing system running on top of Apache Spark. For our performance evaluation, we choose three distributed storage engines having different characteristics: (1) HDFS, (2) MongoDB, and (3) Amazon S3. To conduct our experimental study on a real cloud computing environment, we utilize Amazon EMR instances (up to 6 instances) for distributed spatial data processing. For the evaluation of big spatial data processing, we generate data sets consider...
Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS and demonstrate its commercial-level scalability and performance. In addition, we present a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We have built a five-node parallel search engine according to the proposed architecture using a DB-IR tightly-integrated DBMS. Through extensive experiments, we show the correctness of the model by comparing the projected output with the experimen...