Krung Sinapiromsaran - Academia.edu (original) (raw)

Papers by Krung Sinapiromsaran

Proceedings of the 2017 International Conference on Information Technology

Class imbalance problem is an important classification problem in machine learning which shows un... more Class imbalance problem is an important classification problem in machine learning which shows undesirable performance of a minority class. These minority instances have a tendency to be misclassified due to their tiny portion in a dataset. For a binary classification, they are labeled as positive while the rest are labeled as negative. This research proposes the novel parameter-free oversampling technique called the extreme anomalous oversampling technique, EXOT, based on an extreme anomalous score, EAS, and a negative anomalous score, NAS. This technique is used for rebalancing a data distribution that can be classified by any classifier. EAS of the instance p is the largest radius of an open ball centering at p containing only a single instance while NAS is the largest radius of an open ball centering at p without negative instances. EXOT synthesizes positive instances surrounding the original positive one using these two EAS and NAS. This work was conducted with three UCI datasets comparing SMOTE, borderline-SMOTE, safe-level SMOTE, and EXOT based on four classifiers which are C4.5, K-nearest neighbor classifier, multilayer perceptron, and naïve Bayes using precision, recall, F1-measure, and G-mean as performance measures. The results show the improvement of classification performances on a minority class on all three UCI datasets.

Outlier concept is one of the most significant topics in data mining. Many researches in outlier ... more Outlier concept is one of the most significant topics in data mining. Many researches in outlier detections address an algorithm to generate the outlier scores which can be used to measure the outlierness of an instance in a dataset. Ordered distance difference outlier factor (OOF) is the parameter-free outlier detection algorithm which was published in 2013. This thesis proposes a new parameter-free outlier detection algorithm called a weighted minimum consecutive pair of the extreme pole outlier factor (WOF). The new outlier score of an instance is generated along the extreme poles by considering the radial projection of this instance and its consecutive pair. The minimum on each side of the instance will be weighted and used to create the WOF. The WOF algorithm has the O(n2) time complexity. To compare the effectiveness and time, WOF algorithm was applied with generated synthetic datasets and three UCI datasets.

WSEAS TRANSACTIONS ON SYSTEMS AND CONTROL, 2020

The challenging endeavor of a time series forecast model is to predict the future time series dat... more The challenging endeavor of a time series forecast model is to predict the future time series data accurately. Traditionally, the fundamental forecasting model in time series analysis is the autoregressive integrated moving average model or the ARIMA model requiring a model identification of a three-component vector which are the autoregressive order, the differencing order, and the moving average order before fitting coefficients of the model via the Box-Jenkins method. A model identification is analyzed via the sample autocorrelation function and the sample partial autocorrelation function which are effective tools for identifying the ARMA order but it is quite difficult for analysts. Even though a likelihood based-method is presented to automate this process by varying the ARIMA order and choosing the best one with the smallest criteria, such as Akaike information criterion. Nevertheless the obtained ARIMA model may not pass the residual diagnostic test. This paper presents the r...

A decision tree is one of the famous classifiers based on a recursive partitioning algorithm. Thi... more A decision tree is one of the famous classifiers based on a recursive partitioning algorithm. This paper introduces the Boundary Expansion Algorithm (BEA) to improve a decision tree induction that deals with an imbalanced dataset. BEA utilizes all attributes to define non-splittable ranges. The computed means of all attributes for minority instances are used to find the nearest minority instance, which will be expanded along all attributes to cover a minority region. As a result, BEA can successfully cope with an imbalanced dataset comparing with C4.5, Gini, asymmetric entropy, top-down tree, and Hellinger distance decision tree on 25 imbalanced datasets from the UCI Repository.

SMOTE is an effective oversampling technique for a class imbalance problem due to its simplicity ... more SMOTE is an effective oversampling technique for a class imbalance problem due to its simplicity and relatively high recall value. One drawback of SMOTE is a requirement of the number of nearest neighbors as a key parameter to synthesize instances. This paper introduces a new adaptive algorithm called Adaptive neighbor Synthetic Minority Oversampling Technique (ANS) to dynamically adapt the number of neighbors needed for oversampling around different minority regions. This technique also defines a minority outcast as a minority instance having no minority class neighbors. Minority outcasts are neglected by most oversampling techniques but instead, an additional outcast handling method is proposed for the performance improvement via a 1-nearest neighbor model. Based on our experiments in UCI and PROMISE datasets, generated datasets from this technique have improved the accuracy performance of a classification, and the improvement can be verified statistically by the Wilcoxon signed-rank test.

2017 8th International Conference of Information and Communication Technology for Embedded Systems (IC-ICTES), 2017

Anomaly detection in time series is one of exciting topics in data mining. The aim is to find a d... more Anomaly detection in time series is one of exciting topics in data mining. The aim is to find a data point which is different from the majority, called an anomaly. In this paper, a novel anomaly score called Median-Difference Window subseries Score (MDWS) is proposed with its algorithm and the recommended window size for detecting the contextual anomalies on time series data. It is computed as the subtraction of the middle point with the median of all data points within the current window. The proposed MDWS algorithm is implemented as the median-update of the current window subseries to maintain the linear time complexity. Two anomaly thresholds are set as the mean plus/minus three standard deviation for extracting the anomalies. Furthermore, the suitable window size for detecting anomalies is investigated and suggests that it should be smaller than the seasonal period. The experimental results show that the MDWS has the highest accuracy performance on the benchmark datasets from Yahoo comparing with others existing anomaly detection methods.

Max-out-in pivot rule with cycling prevention for the simplex method

2017 9th International Conference on Knowledge and Smart Technology (KST), 2017

This paper proposes a new forecasting model for the amount of cash in ATM, called a multiple ARIM... more This paper proposes a new forecasting model for the amount of cash in ATM, called a multiple ARIMA subsequences aggregate time series model or MASA model. To assess the MASA model, the time series data is split into in-sample for building and out-sample data for evaluating. To build the MASA model, the aggregate in-sample data is subdivided into training time series data and validating time series data. The aggregate technique by its period is used to reduce a sway in the time series. A fixed number of subsequence patterns are generated and fitted using ARIMA models and then validated using the symmetric mean absolute percentage error or SMAPE to identify the best fitted model. The step of forecasting a future value in the aggregate group using the MASA model needs to identify the subsequence with the best fitted ARIMA model. The disaggregate process will distribute this group value into future values using a ratio from the past. The MASA model is compared using SMAPE with the SARIMA model and ETS exponential smoothing model. The results exhibit the better SMAPE than the SARIMA model and ETS exponential smoothing model.

Chiang Mai Journal of Science, 2016

Received: 10 June 2013 Accepted: 12 November 2014 ABSTRACT The redistribution of the target class... more Received: 10 June 2013 Accepted: 12 November 2014 ABSTRACT The redistribution of the target class by oversampling synthetic minority instances is one of the effective directions for a class imbalance problem. Safe-level SMOTE generates synthetic minority instances around original instances while avoiding nearby majority ones. However, despite of this intention, it is still possible that some synthetic instances can be placed too close to nearby majority instances which possibly confuse some classifiers. Moreover, Safe-Level SMOTE technically avoids using minority outcast instances for generating synthetic instances. This generated dataset may lose some precious information of minority class. Our paper aims to remedy these two drawbacks of Safe-Level SMOTE by combining two processes. The first one is checking and moving these synthetic instances away from possibly surrounding majority instances. The second is handling minority outcast with 1-nearest neighbor model. The empirical resu...

Proceedings of the 2017 International Conference on Information Technology - ICIT 2017, 2017

A clustering algorithm usually detects outliers as an aftermath of partitioning data points in a ... more A clustering algorithm usually detects outliers as an aftermath of partitioning data points in a finite dimensional continuous dataset such as AGNES, k-means, and DBSCAN. This research makes use of the extreme anomalous score which represents the outlierness of a data point based on the largest radius of a ball containing only that data point. The new clustering algorithm based on this score is proposed called the extreme anomalous score clustering algorithm (ESC). It searches for a cluster representative by combining two data points which are placed with the smallest extreme anomalous score. Then all extreme anomalous scores are updated and the algorithm stops when it reaches the number of clusters defined by a user. Otherwise, it continues to combine two data points having the smallest extreme anomalous scores. The experimental results on three groups of simulated datasets report the superior performance of ESC over AGNES, k-means, and DBSCAN based on the silhouette measurement and the homogeneity measurement.

2016 International Computer Science and Engineering Conference (ICSEC), 2016

Outlier concept is one significant topics in data mining. Many researches in outlier detections p... more Outlier concept is one significant topics in data mining. Many researches in outlier detections propose algorithm to generate the outlier scores which can be used to measure the outlierness of an instance in a dataset. This paper proposes a new parameter-free algorithm called a weighted minimum consecutive pair of the extreme pole outlier factor (WOF). The new outlier score of an instance is generated along the extreme poles by considering the projection of this instance and its consecutive pair. The minimum on each side of the instance will be weighted and used to create the WOF. The WOF algorithm is implemented and has the O(n2) time complexity. It has 100% accuracy on three sets of synthetic datasets.

Information Technology Journal, 2011

The winner determination problem (WDP) for a single object auction is a relatively easy problem t... more The winner determination problem (WDP) for a single object auction is a relatively easy problem to solve using the greedy algorithm. However, a booth auction is one of the nonidentical multiple-object auctions which is NP-hard problem. Formulation of the winner determination model for a linear arrangement of a multiple-object auction is explained in this study – the integer linear programming model. Moreover, this research improves the polynomial time complexity algorithm from the previous study of allocation in geometry-based structure. Finally, the comparison of a running time exhibits the advantage of our proposed algorithm. The simulation results are discussed.

The simplex algorithm is a widely used method for solving a linear programming problem (LP) which... more The simplex algorithm is a widely used method for solving a linear programming problem (LP) which is first presented by George B. Dantzig. One of the important steps of the simplex algorithm is applying an appropriate pivot rule, the rule to select the entering variable. An effective pivot rule can lead to the optimal solution of LP with the small number of iterations. In a minimization problem, Dantzig’s pivot rule selects an entering variable corresponding to the most negative reduced cost. The concept is to have the maximum improvement in the objective value per unit step of the entering variable. However, in some problems, Dantzig’s rule may visit a large number of extreme points before reaching the optimal solution. In this paper, we propose a pivot rule that could reduce the number of such iterations over the Dantzig’s pivot rule. The idea is to have the maximum improvement in the objective value function by trying to block a leaving variable that makes a little change in the ...

Thai Journal of Mathematics, 2019

A standard classifier acquired from a machine learning literature aims to categorize an instance ... more A standard classifier acquired from a machine learning literature aims to categorize an instance into a well-defined class having comparable number of instances while the data from real world problems tend to be imbalance. One way to deal with this imbalance problem is to modify the standard classification algorithm to capture minority instances and majority instances simultaneously. This work modifies the recursive partitioning algorithm based on a set of tubes, called the tube-tree algorithm. A tube-tree is a collection of tubes building from the combination of the input attributes where an internal node contains distinct class tubes corresponding to their respective classes. A tube composes of three components: a core vector, a tube length, and a tube radius built for each class regardless of its size which is suitable for imbalance. The forty six experiments are derived from the KEEL repository to compare the performance of the tube-tree with the support vector machine, the deci...

A C anomalous assemblage is a group of associated outliers having the number of instances less th... more A C anomalous assemblage is a group of associated outliers having the number of instances less than C percent of the total. Presently, known anomaly detection algorithms hardly detect these assemblages without appropriate setting. This paper proposes the new anomalous assemblage detection algorithm called CND which computes a score for an instance using a nearest neighbor distance. The algorithm computes the index k which is equal to the floor function of C percent times the total number of instances and uses the kth-nearest neighbor distance for representing an anomalous score. Moreover, a medcouple from a skew distribution is used to compute the upper threshold for detecting outliers. The algorithm is tested on two collections of datasets which are synthetic datasets and benchmark datasets from Multimedia Analysis and Data Mining website to evaluate the performance comparing with WOF and LOF. The performance of CND is similar to LOF, but it is better than the performance of WOF on...

Frontiers in Artificial Intelligence and Applications, 2021

A sampling method is one of the popular methods to deal with an imbalance problem appearing in ma... more A sampling method is one of the popular methods to deal with an imbalance problem appearing in machine learning. A dataset having an imbalance problem contains a noticeably different number of instances belonging to different classes. Three sampling techniques are used to solve this problem by balancing class distributions. The first one is an undersampling technique removing noises from a class having a large number of instances, called a majority class. The second one is an over-sampling technique synthesizing instances from a class having a small number of instances, called a minority class, and the third one is the combined technique of both undersampling and oversampling. This research applies the combined technique of both undersampling and oversampling via the mass ratio variance scores of instances from each individual class. For the majority class, instances with high mass ratio variances are removed whereas for the minority class, instances with high mass ratio variances a...

2021 18th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2021

An outlier of a finite dataset in statistics is defined as a data point that differs significantl... more An outlier of a finite dataset in statistics is defined as a data point that differs significantly from others. It is normally surrounded by a few data points while normal ones are engulfed by others. This behavior leads to the proposed outlier factor called Mass-ratio-variance Outlier Factor (MOF). A score is assigned to a data point from the variance of the mass-ratio distribution from the rest of data points. Within a sphere of an outlier, there will be few data points compared with a normal one. So, the mass-ratio of an outlier will be different from that of a normal data point. The algorithm to generate MOF requires no parameter and embraces the density concept. Experimental results show that top-10 highest scores from MOF could identify all outliers from synthesized datasets similar to those scores from the state-of-the-art outlier scoring methods such as LOF and FastABOD. Moreover, it could retrieve more outliers from three real World datasets.

WSEAS TRANSACTIONS ON SYSTEMS AND CONTROL, 2021

Building an effective classifier that could classify a target or class of instances in a dataset ... more Building an effective classifier that could classify a target or class of instances in a dataset from historical data has played an important role in machine learning for a decade. The standard classification algorithm has difficulty generating an appropriate classifier when faced with an imbalanced dataset. In 2019, the efficient splitting measure, minority condensation entropy (MCE) [1] is proposed that could build a decision tree to classify minority instances. The aim of this research is to extend the concept of a random forest to use both decision trees and minority condensation trees. The algorithm will build a minority condensation tree from a bootstrapped dataset maintaining all minorities while it will build a decision tree from a bootstrapped dataset of a balanced dataset. The experimental results on synthetic datasets apparent the results that confirm this proposed algorithm compared with the standard random forest are suitable for dealing with the binary-class imbalanced...

International Journal of Machine Learning and Computing, 2020

The problem of handling a class imbalanced problem by modifying decision tree algorithm has recei... more The problem of handling a class imbalanced problem by modifying decision tree algorithm has received widespread attention in recent years. A new splitting measure, called class overlapping-balancing entropy (OBE), is introduced in this paper that essentially pay attentions to all classes equally. Each step, the proportion of each class is balanced via the assigned weighted values. They not only depend on equalizing each class, but they also take into account the overlapping region between classes. The proportion of weighted values corresponding to each class is used as the component of Shannon's entropy for splitting the current dataset. From the experimental results, OBE significantly outperforms the conventional splitting measures like Gini index, gain ratio and DCSM, which are used in the well-known decision tree algorithms. It also exhibits superior performance compared to AE and ME that are designed for handling the class imbalanced problem specifically.