Byoung Chul Ko - Academia.edu

Papers by Byoung Chul Ko

Automatic Classification Algorithm for Raw Materials using Mean Shift Clustering and Stepwise Region Merging in Color

방송공학회논문지 (Journal of Broadcast Engineering), May 30, 2016

In this paper, we propose a classification model that analyzes raw-material images recorded with a color CCD camera to automatically separate good and defective agricultural raw materials such as rice, coffee, and green tea. At present, the classification of agricultural products depends mainly on visual inspection by skilled laborers; however, inspection accuracy can drop owing to long periods of repetitive labor. To resolve the problems of existing human-dependent commercial products, we propose a vision-based automatic raw-material classification method combining mean-shift clustering with a stepwise region-merging algorithm. First, the image is divided into N cluster regions by applying the mean-shift clustering algorithm to the foreground map. Second, representative regions among the N cluster regions are selected, and the stepwise region-merging method integrates similar cluster regions by comparing both the color and the positional proximity of neighboring regions. The merged raw-material objects are then expressed as 2D color distributions in the RG, GB, and BR planes. Third, a threshold on the color-distribution ellipse of each merged object is used to separate good products from defective ones. Experiments on diverse raw-material images show that the proposed method requires less manual adjustment by the user than existing clustering and commercial methods, and improves classification accuracy on raw materials.
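The clustering stage described above can be sketched as follows. This is a minimal, illustrative mean-shift over 2D color coordinates, not the authors' implementation; the bandwidth and the mode-merging threshold are arbitrary choices for the example.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, n_iter=20):
    """Shift every point toward the mean of its neighbours, then merge the
    converged modes into cluster labels."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        for i, p in enumerate(modes):
            dist = np.linalg.norm(points - p, axis=1)
            modes[i] = points[dist < bandwidth].mean(axis=0)
    # Modes that land close together become one cluster.
    labels = -np.ones(len(points), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

# Two well-separated colour blobs in a 2D (e.g. RG) plane.
pts = np.array([[0.1, 0.1], [0.12, 0.09], [0.9, 0.9], [0.88, 0.92]])
labels, centers = mean_shift(pts, bandwidth=0.3)
```

The stepwise region-merging step of the paper would then compare such cluster centers for color and positional proximity before fusing regions.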

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

arXiv (Cornell University), Dec 11, 2022

An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention spatially combines attention with pose tokens, while temporal stride attention reduces the number of input tokens in the attention module and supports temporal expression learning without using all tokens simultaneously. The deformable transformer iterates L times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction datasets, and showed results better than or similar to pretrained state-of-the-art methods even without a pre-training process. In addition, by visualizing important joints and correlations during action recognition through spatial joint and temporal stride attention, we demonstrate the potential for explainable action recognition.

Image resizing using saliency strength map and seam carving for white blood cell analysis

Biomedical Engineering Online, Sep 20, 2010

Background: A new image-resizing method using seam carving and a Saliency Strength Map (SSM) is proposed to preserve important contents, such as the white blood cells included in blood cell images. Methods: To apply seam carving to cell images, an SSM is initially generated using a visual attention model, and the structural properties of white blood cells are then used to create an energy map for seam carving. As a result, the energy map maximizes the energies of the white blood cells, while minimizing the energies of the red blood cells and background. Thus, the use of an SSM allows the proposed method to reduce the image size efficiently, while preserving the important white blood cells. Results: Experimental results using the PSNR (Peak Signal-to-Noise Ratio) and ROD (Ratio of Distortion) of blood cell images confirm that the proposed method produces better resizing results than conventional methods, as the seam carving is performed based on the SSM and energy map. Conclusions: For further improvement, a faster medical image-resizing method is currently being investigated to reduce the computation time while maintaining the same image quality.
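Seam carving as used above relies on a dynamic-programming search for the minimum-energy vertical seam. A minimal sketch, assuming a precomputed energy map (the SSM construction itself is not reproduced here):

```python
import numpy as np

def remove_vertical_seam(energy):
    """Return the minimum-energy vertical seam (one column index per row)
    found by dynamic programming over the energy map."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(0, x - 1), min(w, x + 2)
            cost[y, x] += cost[y - 1, lo:hi].min()
    # Backtrack from the cheapest bottom cell.
    seam = [int(np.argmin(cost[-1]))]
    for y in range(h - 2, -1, -1):
        x = seam[-1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam.append(lo + int(np.argmin(cost[y, lo:hi])))
    return seam[::-1]
```

Because the SSM assigns high energy to white blood cells, seams like the one below route around them, so repeated seam removal shrinks the image without cutting through the cells.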

Fast Depth Estimation in a Single Image Using Lightweight Efficient Neural Network

Sensors, Oct 13, 2019

Depth estimation is a crucial and fundamental problem in the computer vision field. Conventional methods reconstruct scenes using feature points extracted from multiple images; however, these approaches require multiple images and thus are not easily implemented in various real-time applications. Moreover, the special equipment required by hardware-based approaches using 3D sensors is expensive. Therefore, software-based methods for estimating depth from a single image using machine learning or deep learning are emerging as new alternatives. In this paper, we propose an algorithm that generates a depth map in real time from a single image using an optimized lightweight efficient neural network (L-ENet) instead of physical equipment such as an infrared sensor or multi-view camera. Because depth values are continuous in nature and can produce locally ambiguous results, pixel-wise prediction with ordinal depth-range classification was applied in this study. In addition, our method applies various convolution techniques to extract a dense feature map, and the number of parameters is greatly reduced by trimming the network layers. Using the proposed L-ENet, an accurate depth map can be generated quickly from a single image, producing depth values close to the ground truth with small errors. Experiments confirmed that the proposed L-ENet achieves significantly improved estimation performance over state-of-the-art single-image depth-estimation algorithms.
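The ordinal depth-range classification mentioned above can be illustrated as follows. The log-spaced discretization and the 0.5 decision threshold are common conventions assumed for this sketch, not details taken from the paper.

```python
import numpy as np

def discretize_depth(d, d_min=1.0, d_max=80.0, k=10):
    """Map a metric depth to one of k ordinal bins using log-spaced edges
    (nearby depths get finer bins than distant ones)."""
    edges = np.exp(np.linspace(np.log(d_min), np.log(d_max), k + 1))
    bin_idx = int(np.clip(np.searchsorted(edges, d, side="right") - 1, 0, k - 1))
    return bin_idx, edges

def decode_ordinal(probs):
    """Decode per-threshold probabilities P(depth > edge_k) back to a bin
    index by counting how many thresholds the pixel exceeds."""
    return int((np.asarray(probs) > 0.5).sum())
```

At inference time, the decoded bin index per pixel is mapped back to a metric depth (e.g. the bin midpoint) to form the depth map.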

Estimation of Pedestrian Pose Orientation Using Soft Target Training Based on Teacher–Student Framework

Sensors, Mar 6, 2019

Semi-supervised learning is known to achieve better generalisation than a model learned solely from labelled data. Therefore, we propose a new method for estimating pedestrian pose orientation using a soft-target method, a type of semi-supervised learning. Because convolutional neural network (CNN) based pose-orientation estimation requires large numbers of parameters and operations, we apply the teacher-student algorithm to generate a compressed student model, with accuracy resembling that of the teacher model and greater compactness, by combining a deep network with a random forest. After the teacher model is generated using hard-target data, the softened outputs (soft-target data) of the teacher model are used for training the student model. Moreover, because the orientation of a pedestrian has specific shape patterns, a wavelet transform is applied to the input image as a pre-processing step owing to its good spatial-frequency localisation and its ability to preserve both the spatial and gradient information of an image. As benchmark datasets reflecting real driving situations based on a single camera, we used the TUD and KITTI datasets. We applied the proposed algorithm to various driving images in these datasets, and the results indicate that its classification performance with regard to pose orientation is better than that of other state-of-the-art CNN-based methods. In addition, the computational speed of the proposed student model is faster than that of other deep CNNs owing to its shorter model structure and smaller number of parameters.
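The soft-target idea can be sketched with temperature-scaled softmax outputs; the temperature value T=4 is an arbitrary example, not the paper's setting.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_targets(teacher_logits, T=4.0):
    """Soften teacher logits so the student also sees inter-class similarity
    (e.g. neighbouring pose orientations), not just the hard label."""
    return softmax(teacher_logits, T)

def soft_cross_entropy(soft_targets, student_probs, eps=1e-12):
    """Loss term that trains the student to match the softened teacher output."""
    return float(-(np.asarray(soft_targets) * np.log(np.asarray(student_probs) + eps)).sum())
```

The softened distribution keeps the same argmax as the hard prediction but spreads probability mass over similar classes, which is the extra signal the student model learns from.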

Microscopic Cell Nuclei Segmentation Based on Adaptive Attention Window

Journal of Digital Imaging, Jun 17, 2008

This paper presents an adaptive attention window (AAW)-based microscopic cell-nuclei segmentation method. For semantic AAW detection, a luminance map is used to create an initial attention window, which is then reduced close to the size of the real region of interest (ROI) using a quad-tree. The purpose of the AAW is to facilitate background removal and reduce the ROI segmentation processing time. Region segmentation is performed within the AAW, followed by region clustering and removal to produce segmentation of only the ROIs. Experimental results demonstrate that the proposed method can efficiently segment one or more ROIs and produce segmentation results similar to human perception. In future work, the proposed method will be used to support a region-based medical image retrieval system that can generate a combined feature vector of segmented ROIs based on feature extraction and patient data.
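A toy version of the quad-tree shrinking step might look like this; the luminance threshold and minimum block size are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def shrink_attention_window(lum, thresh=0.5, min_size=2):
    """Quad-tree shrink: recursively discard quadrants whose mean luminance is
    below thresh; return the bounding box (y0, y1, x0, x1) of what survives."""
    def recurse(y0, y1, x0, x1):
        if lum[y0:y1, x0:x1].mean() >= thresh:
            return (y0, y1, x0, x1)      # bright enough: keep the whole block
        if (y1 - y0) <= min_size:
            return None                  # dark leaf: discard
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        kept = [b for b in (recurse(y0, ym, x0, xm), recurse(y0, ym, xm, x1),
                            recurse(ym, y1, x0, xm), recurse(ym, y1, xm, x1)) if b]
        if not kept:
            return None
        ys0, ys1, xs0, xs1 = zip(*kept)
        return (min(ys0), max(ys1), min(xs0), max(xs1))
    return recurse(0, lum.shape[0], 0, lum.shape[1])
```

Segmentation then only needs to run inside the returned window, which is how the AAW cuts the processing time.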

Facial Expression Recognition in the Wild Using Face Graph and Attention

IEEE Access, 2023

Facial expression recognition (FER) in the wild, under various viewpoints, lighting conditions, face poses, scales, and occlusions, is an extremely challenging field of research. In this study, we construct a face graph by selecting action units that play an important role in changing facial expressions, and we propose an algorithm for recognizing facial expressions using a graph convolutional network (GCN). We first generate an attention map that highlights action units in order to extract important facial-expression features from faces in the wild. After feature extraction, a face graph is constructed by combining the attention map with face patches, and changes in expression in the wild are recognized using a GCN. Through comparative experiments conducted using both lab-controlled and wild datasets, we show that the proposed method is well suited to FER on image datasets captured in the wild as well as those captured under well-controlled indoor conditions.
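A single GCN layer of the kind used to propagate features over a face graph can be sketched as follows; the symmetric normalization is the standard Kipf-Welling form, assumed here rather than taken from the paper.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer: H = ReLU(D^-1/2 (A + I) D^-1/2 X W),
    where A is the adjacency of the face graph and X holds node features."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

# Tiny face graph: 3 action-unit nodes, fully connected, identity features.
A = np.array([[0.0, 1, 1], [1, 0, 1], [1, 1, 0]])
H = gcn_layer(np.eye(3), A, np.ones((3, 2)))
```

Each layer mixes a node's features with those of its connected action units, so expression-related patches influence one another before classification.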

STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

arXiv (Cornell University), Oct 14, 2022

In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balanced feature representation are required for cross-modal data. To solve these problems, we propose the Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a single recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into the STAR-transformer. The STAR-transformer encoder consists of a full spatio-temporal attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. The STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU-RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance over previous state-of-the-art methods.
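The cross-modal attention underlying such token mixing can be illustrated with plain scaled dot-product attention between two token sets (single-head and without learned projections, for brevity; the actual FAttn/ZAttn/BAttn modules are more elaborate).

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens):
    """Single-head scaled dot-product attention: tokens of one modality
    (queries, e.g. grid tokens) attend over tokens of the other modality
    (keys/values, e.g. joint map tokens)."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv_tokens
```

Each output row is a convex combination of the other modality's tokens, which is how the two streams are fused into one recognizable vector.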

STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balanced feature representation are required for cross-modal data. To solve these problems, we propose the Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a single recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into the STAR-transformer. The STAR-transformer encoder consists of a full spatio-temporal attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. The STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU-RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance over previous state-of-the-art methods.

Real time speed-limit sign recognition invariant to image scale

Abstract: In this paper, we propose an algorithm that recognizes the speed shown on traffic signs in an image by applying features generated with Multi-scale Block Local Binary Patterns (MB-LBP) and a spatial pyramid to a random forest classifier. Because sign regions in the input image appear at various positions and scales, and candidate regions include the surrounding background, a circular Hough transform is first applied to the input image to detect only circular sign candidate regions. Histogram equalization and morphological operations are then applied to improve image quality and increase the contrast between the digit region of the sign and its background. To make the system robust to changes in sign scale, MB-LBP, which outperforms standard LBP, is applied to the candidate regions, and a spatial pyramid is used to extract both local and global features so that speed signs of various sizes can be recognized. The extracted features are classified by a random forest into nine speed-sign classes, and the recognition performance is measured for each speed class.
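The MB-LBP feature can be sketched as follows: each code compares the mean intensity of the eight neighbouring blocks against the centre block, which is what makes it more robust to scale and noise than per-pixel LBP. The neighbour ordering below is an illustrative choice.

```python
import numpy as np

def mb_lbp_code(img, y, x, block=3):
    """MB-LBP at (y, x): compare the mean of each of the 8 surrounding
    block x block regions against the centre block, packed into 8 bits."""
    def mean_block(cy, cx):
        return img[cy:cy + block, cx:cx + block].mean()
    centre = mean_block(y, x)
    offsets = [(-block, -block), (-block, 0), (-block, block), (0, block),
               (block, block), (block, 0), (block, -block), (0, -block)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if mean_block(y + dy, x + dx) >= centre:
            code |= 1 << bit
    return code
```

Histograms of these codes, pooled over the spatial-pyramid cells, would form the feature vector fed to the random forest.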

Automatic Salient-Object Extraction Using the Contrast Map and Salient Points

Lecture Notes in Computer Science, 2004

In this paper, we propose a salient-object extraction method using a contrast map and salient points for object-based image retrieval. To build the contrast map, we generate three feature maps (luminance, color, and orientation) and extract salient points from the image. Using these features, we can easily decide the location of the Attention Window (AW). The purpose of the AW is to remove useless regions included in the image, such as the background, as well as to reduce the amount of image processing. To determine the exact location and flexible size of the AW, we use the above features with some proposed rules instead of pre-assumptions or heuristic parameters. After determining the AW, we apply image segmentation to its inner area and combine the candidate salient regions into one salient object.
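Combining the three feature maps into a contrast map and deriving an attention window might be sketched as below; the equal weights and the 0.9 peak fraction are assumptions for the example, not the paper's rules.

```python
import numpy as np

def contrast_map(luminance, color, orientation, weights=(1/3, 1/3, 1/3)):
    """Normalize each feature map to [0, 1] and combine by weighted sum."""
    def norm(m):
        m = m.astype(float)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng else np.zeros_like(m)
    return sum(w * norm(m) for w, m in zip(weights, (luminance, color, orientation)))

def attention_window(sal, frac=0.9):
    """Bounding box (y0, y1, x0, x1) of pixels above frac of the peak saliency."""
    ys, xs = np.where(sal >= frac * sal.max())
    return (ys.min(), ys.max(), xs.min(), xs.max())
```

Segmentation is then restricted to this window, which is the processing saving the abstract describes.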

X-Ray Image Classification and Retrieval Using Ensemble Combination of Visual Descriptors

Lecture Notes in Computer Science, 2009

In this paper, we propose a novel algorithm for the efficient classification and retrieval of medical images, especially X-ray images. Since medical images have a bright foreground against a dark background, we extract MPEG-7 visual descriptors from only the salient parts of the foreground. As the color descriptor, a Color Structure Descriptor (H-CSD) is extracted from salient points detected by the Harris corner detector. As the texture descriptor, an Edge Histogram Descriptor (EHD) is extracted from global and local parts of the images. The extracted feature vectors are then applied to a multi-class Support Vector Machine (SVM) to produce membership scores for each image. The membership scores of H-CSD and EHD are combined into one ensemble feature, which is used for similarity matching in our retrieval system, MISS (Medical Information Searching System). Experimental results using CLEF-Med2007 images show that our system can indeed improve retrieval performance compared to other global-property-based or classification-based retrieval methods.
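The ensemble combination of the two membership-score vectors can be illustrated as a weighted sum; the equal weighting is an assumption, and the paper's exact fusion rule may differ.

```python
import numpy as np

def combine_memberships(color_scores, texture_scores, alpha=0.5):
    """Fuse per-class SVM membership scores from two descriptors (here
    standing in for H-CSD and EHD) by weighted sum, then renormalize."""
    fused = alpha * np.asarray(color_scores, float) \
        + (1.0 - alpha) * np.asarray(texture_scores, float)
    return fused / fused.sum()

def predict_class(color_scores, texture_scores, alpha=0.5):
    """Class index with the highest fused membership score."""
    return int(np.argmax(combine_memberships(color_scores, texture_scores, alpha)))
```

The fused vector can also serve directly as the similarity-matching feature, as in the retrieval system described.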

Robust Face Detection and Tracking for Real-Life Applications

International Journal of Pattern Recognition and Artificial Intelligence, 2003

In this paper, we propose a new face detection and tracking algorithm for real-life telecommunication applications, such as video conferencing, cellular phones, and PDAs. We combine a template-based face detection and tracking method with color information to track a face regardless of lighting conditions, complex backgrounds, and race. Based on our experiments, we generate robust face templates from the wavelet-transformed lowpass and two highpass subimages at the second-level low resolution. However, since template matching is generally sensitive to changes in illumination conditions, we propose a new type of preprocessing method. A tracking method is applied to reduce the computation time and predict a precise face candidate region even when the movement is not uniform. Facial components are also detected using k-means clustering and their geometrical properties. Finally, from the relative distance of the two eyes, we verify the real face and estimate the size of the facial ...
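Template matching of the kind described is commonly implemented with zero-mean normalized cross-correlation; a brute-force sketch (not the wavelet-domain version used in the paper):

```python
import numpy as np

def ncc(patch, template):
    """Zero-mean normalized cross-correlation between a patch and a template,
    which gives some robustness to uniform brightness changes."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom else 0.0

def match_template(img, template):
    """Slide the template over the image; return (top-left corner, best score)."""
    th, tw = template.shape
    h, w = img.shape
    best, best_pos = -2.0, (0, 0)
    for y in range(h - th + 1):
        for x in range(w - tw + 1):
            score = ncc(img[y:y + th, x:x + tw], template)
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos, best
```

Tracking then restricts this search to a predicted candidate region around the previous position, which is where the computation saving comes from.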

시각장애인 보조를 위한 영상기반 휴먼 행동 인식 시스템 (Vision-Based Human Action Recognition System for Assisting the Visually Impaired)

Journal of KIISE, 2015

In this paper, we develop a novel human action recognition system based on communication between an ear-mounted Bluetooth camera and an action-recognition server to aid scene recognition for the blind. First, when a blind user captures an image of a specific location using the ear-mounted camera, the captured image is transmitted to the recognition server via a smartphone synchronized with the camera. The recognition server sequentially performs human detection, object detection, and action recognition by analyzing human poses. The recognized action information is retransmitted to the smartphone, and the user can hear it through text-to-speech (TTS). Experimental results using the proposed system showed a 60.7% action-recognition performance on test data captured in indoor and outdoor environments.

Survey of computer vision–based natural disaster warning systems

Optical Engineering, 2012

With the rapid development of information technology, natural disaster prevention is growing as a new research field dealing with surveillance systems. To forecast and prevent the damage caused by natural disasters, the development of systems that analyze natural disasters using remote sensing, geographic information systems (GIS), and vision sensors has received widespread interest over the last decade. This paper provides an up-to-date review of five different types of natural disasters and their corresponding warning systems using computer vision and pattern recognition techniques, including wildfire smoke and flame detection, water-level detection for flood prevention, coastal zone monitoring, and landslide detection. Finally, we conclude with some thoughts about future research directions.

Salient human detection for robot vision

Pattern Analysis and Applications, 2007

In this paper, we propose a salient human detection method that uses pre-attentive features and a support vector machine (SVM) for robot vision. From three pre-attentive features (color, luminance, and motion), we extract three feature maps and combine them into a salience map. Using these features, we estimate a given object's location without pre-assumptions or semi-automatic interaction, and we can choose the most salient object even when multiple objects exist. We also use the SVM to decide whether a given candidate object region is human. For the SVM, we use a new feature extraction method that reduces the feature dimensions and reflects the variations of local features in the classifier by using an edged-mosaic image. The main advantage of the proposed method is that it can detect salient humans regardless of the amount of movement and can distinguish salient humans from non-salient humans. The proposed algorithm can easily be applied to human-robot interfaces for human-like vision systems.

Deep Coupling of Random Ferns

Computer Vision and Pattern Recognition, 2019

Although deep neural networks (DNNs) are powerful algorithms for classification and regression problems, they require large amounts of memory and processing resources and their training is a black box. The purpose of this study is therefore to design a new lightweight, explainable deep model in place of a DNN. This study proposes a non-neural-network-style deep model based on a combination of deep coupling random ferns (DCRF). In the proposed DCRF, each neuron of a layer is replaced with a fern, and each layer consists of several types of ferns. The proposed method showed uniformly better performance in terms of the number of parameters and operations, without a loss of accuracy, compared to several related studies, including a DNN-based model-compression algorithm.
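A single random fern, the building block of such a model, can be sketched as a naive-Bayes-style classifier over binary feature tests; the pairwise-comparison test below is an illustrative choice, not the paper's exact design.

```python
import numpy as np

def fern_code(x, pairs):
    """A fern maps an input vector to a binary code via simple pairwise tests."""
    code = 0
    for bit, (i, j) in enumerate(pairs):
        if x[i] > x[j]:
            code |= 1 << bit
    return code

class RandomFern:
    """Count class frequencies per binary code; predict by the most likely class."""
    def __init__(self, pairs, n_classes):
        self.pairs = pairs
        self.counts = np.ones((2 ** len(pairs), n_classes))  # Laplace smoothing

    def fit(self, X, y):
        for x, c in zip(X, y):
            self.counts[fern_code(x, self.pairs), c] += 1
        return self

    def predict(self, x):
        return int(np.argmax(self.counts[fern_code(x, self.pairs)]))
```

Because a fern is just a lookup table of counts, its decisions are inspectable, which is the explainability argument behind replacing neurons with ferns.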

Research paper thumbnail of Automatic Classification Algorithm for Raw Materials using Mean Shift Clustering and Stepwise Region Merging in Color

방송공학회논문지, May 30, 2016

In this paper, we propose a classification model by analyzing raw material images recorded using ... more In this paper, we propose a classification model by analyzing raw material images recorded using a color CCD camera to automatically classify good and defective agricultural products such as rice, coffee, and green tea, and raw materials. The current classifying agricultural products mainly depends on visual selection by skilled laborers. However, classification ability may drop owing to repeated labor for a long period of time. To resolve the problems of existing human dependant commercial products, we propose a vision based automatic raw material classification combining mean shift clustering and stepwise region merging algorithm. In this paper, the image is divided into N cluster regions by applying the mean-shift clustering algorithm to the foreground map image. Second, the representative regions among the N cluster regions are selected and stepwise region-merging method is applied to integrate similar cluster regions by comparing both color and positional proximity to neighboring regions. The merged raw material objects thereby are expressed in a 2D color distribution of RG, GB, and BR. Third, a threshold is used to detect good and defective products based on color distribution ellipse for merged material objects. From the results of carrying out an experiment with diverse raw material images using the proposed method, less artificial manipulation by the user is required compared to existing clustering and commercial methods, and classification accuracy on raw materials is improved.

Research paper thumbnail of Cross-Modal Learning with 3D Deformable Attention for Action Recognition

arXiv (Cornell University), Dec 11, 2022

An important challenge in vision-based action recognition is the embedding of spatiotemporal feat... more An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention is applied to spatially combine attention and pose tokens. Temporal stride attention temporally reduces the number of input tokens in the attention module and supports temporal expression learning without the simultaneous use of all tokens. The deformable transformer iterates L-times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction datasets, and showed results better than or similar to pretrained state-of-the-art methods even without a pre-training process. In addition, by visualizing important joints and correlations during action recognition through spatial joint and temporal stride attention, the possibility of achieving an explainable potential for action recognition is presented.

Research paper thumbnail of Image resizing using saliency strength map and seam carving for white blood cell analysis

Biomedical Engineering Online, Sep 20, 2010

Background: A new image-resizing method using seam carving and a Saliency Strength Map (SSM) is p... more Background: A new image-resizing method using seam carving and a Saliency Strength Map (SSM) is proposed to preserve important contents, such as white blood cells included in blood cell images. Methods: To apply seam carving to cell images, a SSM is initially generated using a visual attention model and the structural properties of white blood cells are then used to create an energy map for seam carving. As a result, the energy map maximizes the energies of the white blood cells, while minimizing the energies of the red blood cells and background. Thus, the use of a SSM allows the proposed method to reduce the image size efficiently, while preserving the important white blood cells. Results: Experimental results using the PSNR (Peak Signal-to-Noise Ratio) and ROD (Ratio of Distortion) of blood cell images confirm that the proposed method is able to produce better resizing results than conventional methods, as the seam carving is performed based on an SSM and energy map. Conclusions: For further improvement, a faster medical image resizing method is currently being investigated to reduce the computation time, while maintaining the same image quality.

Research paper thumbnail of Fast Depth Estimation in a Single Image Using Lightweight Efficient Neural Network

Sensors, Oct 13, 2019

Depth estimation is a crucial and fundamental problem in the computer vision field. Conventional methods reconstruct scenes using feature points extracted from multiple images; however, these approaches require multiple images and thus are not easily implemented in various real-time applications. Moreover, the special equipment required by hardware-based approaches using 3D sensors is expensive. Therefore, software-based methods for estimating depth from a single image using machine learning or deep learning are emerging as new alternatives. In this paper, we propose an algorithm that generates a depth map in real time using a single image and an optimized lightweight efficient neural network (L-ENet) algorithm instead of physical equipment, such as an infrared sensor or multi-view camera. Because depth values have a continuous nature and can produce locally ambiguous results, pixel-wise prediction with ordinal depth range classification was applied in this study. In addition, in our method, various convolution techniques are applied to extract a dense feature map, and the number of parameters is greatly reduced by reducing the network layer. By using the proposed L-ENet algorithm, an accurate depth map can be generated quickly from a single image, producing depth values close to those of the ground truth with small errors. Experiments confirmed that the proposed L-ENet can achieve a significantly improved estimation performance over the state-of-the-art algorithms in depth estimation based on a single image.
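The ordinal depth range classification mentioned above can be sketched in two parts: discretizing continuous depth into ordered bins for training, and decoding per-pixel ordinal probabilities at inference. The log-spaced bin edges below are an assumption (a common choice for ordinal depth regression); the paper's exact discretization may differ.

```python
import numpy as np

def depth_to_ordinal_labels(depth, k, d_min, d_max):
    """Discretize continuous depth values into k ordered bins using
    log-spaced edges (an assumed spacing, denser at near range)."""
    edges = np.logspace(np.log10(d_min), np.log10(d_max), k + 1)
    return np.clip(np.searchsorted(edges, depth, side="right") - 1, 0, k - 1)

def decode_ordinal(probs):
    """probs: (k, H, W) array where probs[i] = P(depth > edge i).
    Assuming the sequence is monotone, the predicted bin is the number
    of thresholds passed with probability > 0.5."""
    return (probs > 0.5).sum(axis=0)
```

A pixel whose first two threshold probabilities exceed 0.5 decodes to bin 2, and so on up the ordered range.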

Research paper thumbnail of Estimation of Pedestrian Pose Orientation Using Soft Target Training Based on Teacher–Student Framework

Sensors, Mar 6, 2019

Semi-supervised learning is known to achieve better generalisation than a model learned solely from labelled data. Therefore, we propose a new method for estimating a pedestrian pose orientation using a soft-target method, which is a type of semi-supervised learning method. Because a convolutional neural network (CNN) based pose orientation estimation requires large numbers of parameters and operations, we apply the teacher-student algorithm to generate a compressed student model with high accuracy and compactness resembling that of the teacher model by combining a deep network with a random forest. After the teacher model is generated using hard target data, the softened outputs (soft-target data) of the teacher model are used for training the student model. Moreover, the orientation of the pedestrian has specific shape patterns, and a wavelet transform is applied to the input image as a pre-processing step owing to its good spatial frequency localisation property and the ability to preserve both the spatial information and gradient information of an image. For a benchmark dataset considering real driving situations based on a single camera, we used the TUD and KITTI datasets. We applied the proposed algorithm to various driving images in the datasets, and the results indicate that its classification performance with regard to the pose orientation is better than that of other state-of-the-art methods based on a CNN. In addition, the computational speed of the proposed student model is faster than that of other deep CNNs owing to the shorter model structure with a smaller number of parameters.
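The soft-target training step can be sketched as the standard distillation objective: the teacher's logits are softened with a temperature and the student is trained to match that distribution. The temperature value below is an assumption, not taken from the paper.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax; higher t gives softer targets."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def soft_target_loss(student_logits, teacher_logits, t=4.0):
    """Cross-entropy between the temperature-softened teacher
    distribution and the student distribution (T=4 is an assumed value)."""
    p = softmax(teacher_logits, t)   # soft targets from the teacher
    q = softmax(student_logits, t)
    return float(-(p * np.log(q + 1e-12)).sum())
```

By Gibbs' inequality the loss is smallest when the student reproduces the teacher's softened distribution, which is what drives the compressed student toward the teacher's behaviour.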

Research paper thumbnail of Microscopic Cell Nuclei Segmentation Based on Adaptive Attention Window

Journal of Digital Imaging, Jun 17, 2008

This paper presents an adaptive attention window (AAW)-based microscopic cell nuclei segmentation method. For semantic AAW detection, a luminance map is used to create an initial attention window, which is then reduced close to the size of the real region of interest (ROI) using a quad-tree. The purpose of the AAW is to facilitate background removal and reduce the ROI segmentation processing time. Region segmentation is performed within the AAW, followed by region clustering and removal to produce segmentation of only ROIs. Experimental results demonstrate that the proposed method can efficiently segment one or more ROIs and produce similar segmentation results to human perception. In future work, the proposed method will be used for supporting a region-based medical image retrieval system that can generate a combined feature vector of segmented ROIs based on extraction and patient data.
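The first step, deriving an initial attention window from a luminance map, can be sketched as a simple bounding box over bright pixels. The quad-tree refinement toward the true ROI size is omitted here, and the threshold is an assumed parameter.

```python
import numpy as np

def attention_window(lum, thresh):
    """Initial attention window from a luminance map: the bounding box
    (top, left, bottom, right) of pixels brighter than `thresh`."""
    ys, xs = np.nonzero(lum > thresh)
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1
```

Restricting later segmentation to this window is what removes most of the background and cuts processing time.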

Research paper thumbnail of Facial Expression Recognition in the Wild Using Face Graph and Attention

IEEE Access, 2023

Facial expression recognition (FER) in the wild from various viewpoints, lighting conditions, face poses, scales, and occlusions is an extremely challenging field of research. In this study, we construct a face graph by selecting action units that play an important role in changing facial expressions, and we propose an algorithm for recognizing facial expressions using a graph convolutional network (GCN). We first generated an attention map that can highlight action units to extract important facial expression features from faces in the wild. After feature extraction, a face graph is constructed by combining the attention map with face patches, and changes in expression in the wild are recognized using a GCN. Through comparative experiments conducted using both lab-controlled and wild datasets, we prove that the proposed method is the most suitable FER approach for use with image datasets captured in the wild and those under well-controlled indoor conditions.
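The propagation step of a GCN over such a face graph can be sketched as one symmetrically normalised graph-convolution layer. This uses the common Kipf-Welling normalisation as an assumption about the layer form; the weight matrix here is a placeholder, not trained values.

```python
import numpy as np

def gcn_layer(x, adj, w):
    """One graph-convolution layer over a face graph:
    out = ReLU(D^-1/2 (A + I) D^-1/2 . X . W), where nodes are face
    patches (action-unit regions) and A encodes their connectivity."""
    a = adj + np.eye(adj.shape[0])              # add self-loops
    d = np.diag(1.0 / np.sqrt(a.sum(axis=1)))   # symmetric degree normalisation
    return np.maximum(d @ a @ d @ x @ w, 0.0)   # ReLU activation
```

Each node's feature becomes a degree-weighted average of itself and its neighbours, so expression cues spread along the edges of the face graph.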

Research paper thumbnail of STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

arXiv (Cornell University), Oct 14, 2022

In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder consists of a full spatio-temporal attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU-RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods.

Research paper thumbnail of STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)


Research paper thumbnail of Real time speed-limit sign recognition invariant to image scale

In this paper, we propose an algorithm that recognizes the speed shown on signs in images by applying features generated with Multi-scale Block Local Binary Patterns (MB-LBP) and a spatial pyramid to a random forest classifier. Sign regions in an input image appear at various positions and sizes, and candidate regions include the surrounding background, so a circular Hough transform is first applied to the input image to detect only circular sign candidate regions. Histogram equalization and morphological operations are then applied to improve image quality and increase the contrast between the digit region of the sign and the background. To make the system robust to changes in sign size, MB-LBP, which outperforms plain LBP (Local Binary Patterns), is applied to the candidate regions, and a spatial pyramid is used to extract both local and global features so that speed signs of various sizes can be recognized. The extracted features are classified into nine speed-limit classes using a random forest, and the recognition performance for each speed class is measured.
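The MB-LBP feature described above replaces LBP's single-pixel comparisons with block-mean comparisons, which is what gives it robustness to scale. A minimal sketch of the code computation at one location (3x3 grid of blocks, clockwise bit ordering assumed):

```python
import numpy as np

def mb_lbp_code(patch, block):
    """MB-LBP at the patch centre: average each of the 3x3 blocks of
    size `block` x `block`, then threshold the 8 neighbouring block means
    against the centre block mean to form an 8-bit code.
    patch must have shape (3 * block, 3 * block)."""
    b = block
    means = patch.reshape(3, b, 3, b).mean(axis=(1, 3))   # 3x3 block means
    c = means[1, 1]                                       # centre block
    order = [(0, 0), (0, 1), (0, 2), (1, 2),
             (2, 2), (2, 1), (2, 0), (1, 0)]              # clockwise neighbours
    bits = [1 if means[i, j] >= c else 0 for i, j in order]
    return sum(bit << k for k, bit in enumerate(bits))
```

Sliding this over a candidate region at several block sizes, then histogramming the codes per spatial-pyramid cell, yields the local-plus-global feature vector fed to the random forest.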

Research paper thumbnail of Estimation of Pedestrian Pose Orientation Using Soft Target Training Based on Teacher–Student Framework

Sensors, 2019


Research paper thumbnail of Automatic Classification Algorithm for Raw Materials using Mean Shift Clustering and Stepwise Region Merging in Color

Journal of Broadcast Engineering, 2016


Research paper thumbnail of Automatic Salient-Object Extraction Using the Contrast Map and Salient Points

Lecture Notes in Computer Science, 2004

In this paper, we propose a salient object extraction method using the contrast map and salient points for object-based image retrieval. In order to make the contrast map, we generate three feature maps (luminance, color, and orientation) and extract salient points from the image. By using these features, we can decide the Attention Window (AW) location easily. The purpose of the AW is to remove the useless regions included in the image, such as background, as well as to reduce the amount of image processing. To determine the exact location and flexible size of the AW, we use the above features with some proposed rules instead of using pre-assumptions or heuristic parameters. After determining the AW, we apply image segmentation to the inner area of the AW and combine the candidate salient regions as one salient object.
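The fusion of the three feature maps into a contrast map can be sketched as normalise-and-average. Equal weights are an assumption here; the paper's rules for combining the maps may differ.

```python
import numpy as np

def contrast_map(luminance, color, orientation):
    """Fuse three feature maps into a contrast map by min-max
    normalising each to [0, 1] and averaging with equal (assumed)
    weights."""
    def norm(m):
        m = m.astype(float)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
    return (norm(luminance) + norm(color) + norm(orientation)) / 3.0
```

Locations that are salient in all three maps dominate the fused map, which is what anchors the Attention Window placement.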

Research paper thumbnail of X-Ray Image Classification and Retrieval Using Ensemble Combination of Visual Descriptors

Lecture Notes in Computer Science, 2009

In this paper, we propose a novel algorithm for the efficient classification and retrieval of medical images, especially X-ray images. Since medical images have a bright foreground against a dark background, we extract MPEG-7 visual descriptors from only the salient parts of the foreground. For the color descriptor, the Color Structure Descriptor (H-CSD) is extracted from salient points, which are detected by the Harris corner detector. For the texture descriptor, the Edge Histogram Descriptor (EHD) is extracted from global and local parts of images. The extracted feature vectors are then applied to a multi-class Support Vector Machine (SVM) to give membership scores for each image. The two membership scores of H-CSD and EHD are combined into one ensemble feature, which is used for similarity matching in our retrieval system, MISS (Medical Information Searching System). The experimental results using CLEF-Med2007 images show that our system can indeed improve retrieval performance compared to other global property-based or other classification-based retrieval methods.
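The ensemble step, combining per-class membership scores from the two descriptors, can be sketched as a weighted fusion. The equal weighting and the normalisation scheme are assumptions; the paper's exact combination rule may differ.

```python
import numpy as np

def combine_memberships(scores_color, scores_texture, alpha=0.5):
    """Fuse per-class SVM membership scores from two descriptors
    (standing in for H-CSD and EHD) into one ensemble feature via a
    weighted sum; alpha=0.5 is an assumed weight."""
    c = np.asarray(scores_color, dtype=float)
    t = np.asarray(scores_texture, dtype=float)
    c, t = c / c.sum(), t / t.sum()   # normalise each score vector
    return alpha * c + (1 - alpha) * t
```

The fused vector remains a valid per-class distribution, so it can be compared directly across images during similarity matching.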

Research paper thumbnail of Robust Face Detection and Tracking for Real-Life Applications

International Journal of Pattern Recognition and Artificial Intelligence, 2003

In this paper, we propose a new face detection and tracking algorithm for real-life telecommunication applications, such as video conferencing, cellular phones, and PDAs. We combine a template-based face detection and tracking method with color information to track a face regardless of various lighting conditions and complex backgrounds, as well as race. Based on our experiments, we generate robust face templates from wavelet-transformed lowpass and two highpass subimages at the second-level low resolution. However, since template matching is generally sensitive to changes in illumination conditions, we propose a new type of preprocessing method. A tracking method is applied to reduce the computation time and predict a precise face candidate region even though the movement is not uniform. Facial components are also detected using k-means clustering and their geometrical properties. Finally, from the relative distance of two eyes, we verify the real face and estimate the size of facial ...

Research paper thumbnail of 시각장애인 보조를 위한 영상기반 휴먼 행동 인식 시스템

Journal of KIISE, 2015

In this paper, we develop a novel human action recognition system based on communication between an ear-mounted Bluetooth camera and an action recognition server to aid scene recognition for the blind. First, when a blind user captures an image of a specific location using the ear-mounted camera, the captured image is transmitted to the recognition server using a smartphone that is synchronized with the camera. The recognition server sequentially performs human detection, object detection, and action recognition by analyzing human poses. The recognized action information is retransmitted to the smartphone, and the user can hear the action information through text-to-speech (TTS). Experimental results using the proposed system showed a 60.7% action recognition performance on test data captured in indoor and outdoor environments.

Research paper thumbnail of Survey of computer vision–based natural disaster warning systems

Optical Engineering, 2012

With the rapid development of information technology, natural disaster prevention is growing as a new research field dealing with surveillance systems. To forecast and prevent the damage caused by natural disasters, the development of systems to analyze natural disasters using remote sensing, geographic information systems (GIS), and vision sensors has been receiving widespread interest over the last decade. This paper provides an up-to-date review of five different types of natural disasters and their corresponding warning systems using computer vision and pattern recognition techniques, such as wildfire smoke and flame detection, water level detection for flood prevention, coastal zone monitoring, and landslide detection. Finally, we conclude with some thoughts about future research directions.

Research paper thumbnail of Salient human detection for robot vision

Pattern Analysis and Applications, 2007

In this paper, we propose a salient human detection method that uses pre-attentive features and a support vector machine (SVM) for robot vision. From three pre-attentive features (color, luminance and motion), we extracted three feature maps and combined them as a salience map. By using these features, we estimated a given object's location without pre-assumptions or semi-automatic interaction. We were able to choose the most salient object even if multiple objects existed. We also used the SVM to decide whether a given object was human (among the candidate object regions). For the SVM, we used a new feature extraction method to reduce the feature dimensions and reflect the variations of local features to classifiers by using an edged-mosaic image. The main advantage of the proposed method is that our algorithm was able to detect salient humans regardless of the amount of movement, and also distinguish salient humans from non-salient humans. The proposed algorithm can be easily applied to human robot interfaces for human-like vision systems.

Research paper thumbnail of Image resizing using saliency strength map and seam carving for white blood cell analysis

BioMedical Engineering OnLine, 2010


Research paper thumbnail of Deep Coupling of Random Ferns

Computer Vision and Pattern Recognition, 2019

The purpose of this study is to design a new lightweight explainable deep model as an alternative to deep neural networks (DNNs), which, although powerful for classification and regression problems, require large amounts of memory and processing resources and rely on black-box training. This study proposes a non-neural-network-style deep model based on a combination of deep coupling random ferns (DCRF). In the proposed DCRF, each neuron of a layer is replaced with a fern, and each layer consists of several types of ferns. The proposed method showed uniformly better performance in terms of the number of parameters and operations, without a loss of accuracy, compared to several related studies, including a DNN-based model compression algorithm.
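The fern building block can be sketched as follows: a fern is a small set of binary feature tests whose outcomes index a leaf table of class counts. This is a generic random-fern sketch, not the paper's exact DCRF layer; the pairing scheme and smoothing are assumptions.

```python
import numpy as np

class RandomFern:
    """A single fern: s binary feature-pair tests map a sample to one of
    2^s leaves, each holding per-class counts (naive-Bayes style)."""
    def __init__(self, n_features, s=4, n_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        a = rng.integers(0, n_features, size=s)
        # Offset in [1, n_features-1] guarantees b != a for every test.
        b = (a + 1 + rng.integers(0, n_features - 1, size=s)) % n_features
        self.pairs = np.stack([a, b], axis=1)
        self.table = np.ones((2 ** s, n_classes))     # Laplace smoothing

    def leaf(self, x):
        bits = x[self.pairs[:, 0]] > x[self.pairs[:, 1]]
        return int(sum(int(bit) << i for i, bit in enumerate(bits)))

    def fit(self, xs, ys):
        for x, y in zip(xs, ys):
            self.table[self.leaf(x), y] += 1          # count class per leaf
        return self

    def predict(self, x):
        return int(np.argmax(self.table[self.leaf(x)]))
```

Stacking many such ferns per layer, with each layer's outputs feeding the next, is the deep-coupling idea the abstract describes.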