Tossapon Boongoen | Royal Thai Air Force Academy
Papers by Tossapon Boongoen
IEEE Access
Electronic Government (e-Government) systems constantly provide greater services to people, businesses, organisations, and societies by offering more information, opportunities, and platforms with the support of advances in information and communications technologies. This usually results in increased system complexity and sensitivity, necessitating stricter security and privacy-protection measures. The majority of the existing e-Government systems are centralised, making them vulnerable to privacy and security threats, in addition to suffering from a single point of failure. This study proposes a decentralised e-Government framework with integrated threat detection features to address the aforementioned challenges. In particular, the privacy and security of the proposed e-Government system are realised by the encryption, validation, and immutability mechanisms provided by Blockchain. The insider and external threats associated with blockchain transactions are minimised by the employment of an artificial immune system, which effectively protects the integrity of the Blockchain. The proposed e-Government system was validated and evaluated by using the framework of Ethereum Visualisations of Interactive, Blockchain, Extended Simulations (i.e. eVIBES simulator) with two publicly available datasets. The experimental results show the efficacy of the proposed framework in that it can mitigate insider and external threats in e-Government systems whilst simultaneously preserving the privacy of information.
Computers, Materials & Continua
As more business transactions and information services have been implemented via communication networks, both personal and organizational assets encounter a higher risk of attacks. To safeguard these, a perimeter defence like NIDS (network-based intrusion detection system) can be effective for known intrusions. There has been a great deal of attention within the joint community of security and data science to improve machine-learning based NIDS such that it becomes more accurate for adversarial attacks, where obfuscation techniques are applied to disguise patterns of intrusive traffic. The current research focuses on non-payload connections at the TCP (transmission control protocol) stack level, which is applicable to different network applications. In contrast to the wrapper method introduced with the benchmark dataset, three new filter models are proposed to transform the feature space without knowledge of class labels. These ECT (ensemble clustering based transformation) techniques, i.e., ECT-Subspace, ECT-Noise and ECT-Combined, are developed using the concept of ensemble clustering and three different ensemble generation strategies, i.e., random feature subspace, feature noise injection and their combination. Based on the empirical study with a published dataset and four classification algorithms, the new models usually outperform the original wrapper and other filter alternatives found in the literature. This holds both in the first experiment, with basic classification of legitimate traffic and direct attacks, and in the second, which focuses on recognizing obfuscated intrusions. In addition, an analysis of algorithmic parameters, i.e., ensemble size and level of noise, is provided as a guideline for practical use.
Computers, Materials & Continua
Attempts to determine the characteristics of astronomical objects have been one of the major and vibrant activities in both astronomy and data science. Instead of manual inspection, various automated systems have been invented to satisfy the need, including the classification of light curve profiles. A specific Kaggle competition, namely the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC), was launched to gather new ideas for tackling the abovementioned task using the data set collected from the Large Synoptic Survey Telescope (LSST) project. Almost all proposed methods fall into the supervised family with a common aim to categorize each object into one of the pre-defined types. As this challenge focuses on developing a predictive model that is robust when classifying unseen data, those previous attempts similarly encounter a lack of discriminative features, since the distributions of the training and actual test datasets are largely different. As a result, well-known classification algorithms prove to be sub-optimal, while more complicated feature extraction techniques may help to slightly boost the predictive performance. Given such a burden, this research is set to explore an unsupervised alternative to the difficult quest, where common classifiers fail to reach the 50% accuracy mark. A clustering technique is exploited to transform the space of training data, from which a more accurate classifier can be built. In addition to a single clustering framework that provides a comparable accuracy to the front runners of supervised learning, a multiple-clustering alternative is also introduced with improved performance. In fact, it raises the accuracy rate to 58.32% from the 51.36% obtained using a simple clustering. For this difficult problem, this is rather good considering the rates achieved by well-known models like support vector machine (SVM) with 51.80% and Naïve Bayes (NB) with only 2.92%.
Computational Methods with Applications in Bioinformatics Analysis, 2017
Information Processing & Management, 2022
The work presented in this paper aims to develop a new imputation method to better handle missing values encountered in astronomical data analysis, especially the classification of transient events in a sky survey from the GOTO project. In particular, the framework of cluster-directed selection of neighbors, which has proven effective for the benchmark local imputation techniques KNNimpute and LLSimpute, is extended to new multi-stage models. These combinations, namely Iterative-CKNN and Iterative-CLLS, are new, with an original application to analyzing sky survey data. They bring out advantages from both local approaches, where estimates are summarized from neighbors in the same data cluster, within an iterative process that refines previous guesses. Based on experiments with simulated datasets corresponding to different survey sizes and missing ratios between 1% and 20%, they usually outperform baseline models and BPCA, which is the well-known global technique. For instance, at a 10% missing rate, Iterative-CLLS appears to be the most accurate with an NRMSE score of 0.190, while BPCA and the best among its baseline models reach 0.351 and 0.249, respectively. As for their practical implications, these methods have also proven effective for classifying transients, using common algorithms like KNN, Naive Bayes and Random Forest.
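The NRMSE scores quoted above compare imputed values against the known true values. The abstract does not state which normalisation is used, so the following is only a minimal sketch assuming one common convention in the imputation literature: the root-mean-square error divided by the standard deviation of the true values. The function name `nrmse` is hypothetical.

```python
import math


def nrmse(true_values, imputed_values):
    """RMSE of the imputed entries divided by the std. deviation of the true entries.

    Assumption: this normalisation is one common convention, not necessarily
    the exact definition used in the paper.
    """
    if not true_values or len(true_values) != len(imputed_values):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_values)
    # Mean squared error between true and imputed entries.
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n
    # Population variance of the true entries.
    mean_true = sum(true_values) / n
    var_true = sum((t - mean_true) ** 2 for t in true_values) / n
    if var_true == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / var_true)


# Filling every missing entry with the mean of the true values scores 1.0,
# so scores well below 1.0 indicate an informative imputation.
score = nrmse([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this convention a lower score is better, which is consistent with the abstract ranking 0.190 ahead of 0.249 and 0.351.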
2017 Twelfth International Conference on Digital Information Management (ICDIM), 2017
Crime has become a critical issue for national security, especially for the security of borders and intelligent transportation systems (ITSs). It affects the economy, investment, tourism, and society. As a result, automatic suspect vehicle detection has emerged as one of the effective tools to tackle the problem. However, the traditional process normally compares criminal vehicle data in a blacklist with vehicle data gathered from various sensors. This comparison is neither effective nor accurate, possibly because the data in the blacklist are not up to date. Sometimes the blacklist is not available. This paper proposes a criminal behavior analysis method to detect suspect vehicles that are potentially involved in criminal activity, without relying on the blacklist. The analysis is conditional on the journey path and the involvement in criminal activities. In addition, public officials believe that a suspect vehicle will choose a journey path without a checkpoint. Therefore, we used journey path analysis techniques together with association rule mining to analyze such criminal behavior. From extensive experiments, the results show that the proposed method can increase the suspect detection accuracy rate by 17.24% over the traditional counterpart.
Q. Shen, T. Boongoen and C. Price. A fuzzy order-of-magnitude approach to qualitative link analysis. Proceedings of the 25th International Workshop on Qualitative Reasoning, pp. 147-158, 2011.
T. Boongoen and Q. Shen. 'Detecting False Identity through Behavioural Patterns', In Proceedings of International Crime Science Conference, British Library, London UK, 2008. Publisher's online version forthcoming. The full text is currently unavailable in CADAIR pending approval by the publisher. Sponsorship: UK EPSRC grant EP/D057086
2015 International Carnahan Conference on Security Technology (ICCST), 2015
For modern-age security, many have turned to biometrics such as face classification to verify authority. Despite this, the accuracy of existing classifiers has been constrained by the curse of dimensionality typically observed in face images. In order to simplify the task, one may reduce the original data to a more compact variation, where only key feature components are included in the classification process. Unlike conventional feature reduction techniques found in the literature, this paper presents a novel method that makes use of a cluster ensemble, specifically the summarizing information matrix, as the transformed data for a supervised learning step. Among different state-of-the-art methods, the link-based cluster ensemble approach (LCE) provides a highly accurate clustering, and is thus employed here. The performance of this transformation model is evaluated on a published face dataset and its noise-added variations, using different classifiers. The findings suggest that the new model can improve the classification accuracy beyond those of other benchmark methods investigated in this empirical study.
2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2012
Gene expressions measured using microarrays usually encounter the problem of missing values. Leaving this unsolved may critically degrade the reliability of any consequent downstream analysis or medical application. Yet, a further study of microarray data might be impossible with many analysis methods requiring a complete data set. This paper introduces a new methodology to impute missing values in microarray data. The proposed algorithm, CKNN impute, is an extension of k nearest neighbor imputation with local data clustering being incorporated for improved quality and efficiency. Gene expression data is typically represented as a matrix whose rows and columns correspond to genes and experiments, respectively. CKNN begins by finding a complete dataset via the removal of rows with missing value(s). Then, k clusters and their corresponding centroids are obtained by applying a clustering technique on the complete dataset. The set of genes similar to the target gene (with missing values) consists of those belonging to the cluster whose centroid is the closest to the target. Given this set, the target gene is imputed by applying the k nearest neighbor method with the similar genes previously determined. Empirical evaluation with published gene expression datasets suggests that the proposed technique performs better than the classical k nearest neighbor method and its extension found in the literature.
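The steps described above (drop incomplete rows, cluster the complete rows, pick the cluster with the nearest centroid, then apply k nearest neighbors inside that cluster) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the clustering technique is assumed to be a plain k-means seeded with the first rows, distances use only the observed columns of the target, and neighbors are averaged without weights. All function names are hypothetical.

```python
import math


def _dist(a, b, cols):
    """Euclidean distance between two rows, restricted to the given columns."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols))


def kmeans(rows, k, iters=20):
    """Plain Lloyd's k-means; the first k rows seed the centroids (assumption)."""
    centroids = [list(r) for r in rows[:k]]
    cols = range(len(rows[0]))

    def nearest(r):
        return min(range(k), key=lambda j: _dist(r, centroids[j], cols))

    for _ in range(iters):
        assign = [nearest(r) for r in rows]
        for j in range(k):
            members = [r for r, a in zip(rows, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(r) for r in rows]


def cknn_impute(data, n_clusters=2, n_neighbors=2):
    """Return a copy of data in which None entries are filled cluster-wise."""
    complete = [row for row in data if None not in row]
    if len(complete) < n_clusters:
        raise ValueError("not enough complete rows to form the clusters")
    centroids, assign = kmeans(complete, n_clusters)
    result = [list(row) for row in data]
    for row in result:
        missing = [c for c, v in enumerate(row) if v is None]
        if not missing:
            continue
        observed = [c for c, v in enumerate(row) if v is not None]
        # Cluster whose centroid is closest to the target on its observed columns.
        best = min(range(n_clusters), key=lambda j: _dist(row, centroids[j], observed))
        candidates = [r for r, a in zip(complete, assign) if a == best] or complete
        # k nearest neighbors are searched only among the similar rows.
        neighbors = sorted(candidates, key=lambda r: _dist(row, r, observed))[:n_neighbors]
        for c in missing:
            row[c] = sum(r[c] for r in neighbors) / len(neighbors)
    return result


genes = [
    [1.0, 1.0, 1.0],
    [9.0, 9.0, 9.0],
    [1.2, 0.8, 1.0],
    [9.2, 8.8, 9.0],
    [0.9, 1.1, 1.2],
    [9.0, None, 9.1],  # target gene with one missing value
]
filled = cknn_impute(genes)
```

In this toy matrix the target row falls in the high-valued cluster, so its missing entry is estimated only from the two high-valued rows. Restricting the neighbor search to one cluster is what distinguishes this from a plain k nearest neighbor search over the whole matrix.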
2013 13th International Symposium on Communications and Information Technologies (ISCIT), 2013
Gene expressions measured during a microarray experiment usually encounter the native problem of missing values. These are due to possible errors occurring in the primary experiments, image acquisition and interpretation processes. Leaving this unsolved may critically degrade the reliability of any consequent downstream analysis or medical application. Yet, a further study of microarray data may not be possible with many standard analysis methods that require a complete data set. This paper introduces a new method to impute missing values in microarray data. The proposed algorithm, CLLS impute, is an extension of local least squares imputation with local data clustering being incorporated for improved quality and efficiency. Gene expression data is typically represented as a matrix whose rows and columns correspond to genes and experiments, respectively. CLLS begins by finding a complete dataset via the removal of rows with missing value(s). Then, gene clusters and their corresponding centroids are obtained by applying a clustering technique on the complete dataset. The set of genes similar to the target gene (with missing values) consists of those belonging to the cluster whose centroid is the closest to the target. Given this set, the target gene is imputed by applying regression analysis with the similar genes previously determined. Empirical evaluation with several published gene expression datasets suggests that the proposed technique performs better than the classical local least squares method and recently developed techniques found in the literature.
IEEE Transactions on Knowledge and Data Engineering, 2012
Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with the results being competitive with conventional algorithms, it is observed that these techniques unfortunately generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster-data point relations, with many entries being left unknown. The paper presents an analysis that suggests this problem degrades the quality of the clustering result, and it presents a new link-based approach, which improves the conventional matrix by discovering unknown entries through similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph that is formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
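To make the "ensemble-information matrix" concrete, the sketch below builds the conventional binary matrix from several base clusterings: one row per data point, one column per cluster of each base clustering, with 1 where the point belongs to the cluster. The 0 entries are the ones the abstract describes as unknown. The link-based refinement that replaces them with cluster-similarity values is not shown, and the function name `association_matrix` is hypothetical.

```python
def association_matrix(base_clusterings):
    """Build the binary data-point-by-cluster matrix of a cluster ensemble.

    base_clusterings: list of label vectors, one per base clustering, all of
    the same length. Returns (columns, matrix), where columns identifies each
    matrix column as a (clustering index, cluster label) pair.
    """
    n_points = len(base_clusterings[0])
    columns = []
    for m, labels in enumerate(base_clusterings):
        if len(labels) != n_points:
            raise ValueError("all base clusterings must label the same points")
        for label in sorted(set(labels)):
            columns.append((m, label))
    # Entry is 1 when the point was assigned to that cluster, otherwise 0.
    matrix = [
        [1 if base_clusterings[m][i] == label else 0 for (m, label) in columns]
        for i in range(n_points)
    ]
    return columns, matrix


# Two base clusterings of four data points.
ensemble = [[0, 0, 1, 1], [0, 1, 1, 1]]
columns, matrix = association_matrix(ensemble)
```

Each row contains exactly one 1 per base clustering, so most entries are 0. A refinement of the kind the abstract describes would fill those positions with values derived from how similar the corresponding cluster is to the clusters the point does belong to.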
Computers, Materials & Continua, 2022
A mix of numerical and nominal data types is commonly present in many modern-age data collections. Examples of these include banking data, sales history and healthcare records, where both continuous attributes like age and nominal ones like blood type are exploited to characterize account details, business transactions or individuals. However, only a few standard clustering techniques and consensus clustering methods are provided to examine such data thus far. Given this insight, the paper introduces novel extensions of link-based cluster ensemble, LCE WCT and LCE WTQ, that are accurate for analyzing mixed-type data. They promote diversity within an ensemble through different initializations of the k-prototypes algorithm as base clusterings and then refine the summarized data using a link-based approach. Based on the evaluation metric of NMI (Normalized Mutual Information), averaged across different combinations of benchmark datasets and experimental settings, these new models reach an improved level of 0.34, while the best model found in the literature obtains only around 0.24. Besides, the parameter analysis included herein helps to enhance their performance even further, given relations between clustering quality and algorithmic variables specific to the underlying link-based models. Moreover, another significant factor, ensemble size, is examined to justify a tradeoff between complexity and accuracy.
Wireless Networks, 2021
Cloud computing enables ubiquitous and efficient on-demand access to information, data, and computational resources with the support of modern wired and wireless communication technologies. Cloud computing has been very widely used in education, autonomous vehicles, smart cities/homes, renewable energy, healthcare, engineering, business, and telecommunications, amongst others, with the support of the advances in Artificial Intelligence, Internet of Things (IoT), and Data Science. Such technologies and their applications have had a significant impact on the way people live and do business by offering online services and instant communications. Despite the intensive research effort in the field, some open challenges remain, such as inefficient load balancing and energy management in cloud data centres, high cost of access to the facility, insufficient security and privacy, and low-speed big data stream processing. This special issue focuses on recent advances in addressing these challenges and investigating all the aspects of emerging trends, in addition to reporting innovative real-world applications of cloud computing to deliver effective and efficient solutions. … & Longzhi Yang
PeerJ, 2020
Background: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. Materials and Methods: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM formatted alignment files for tomato were used to train machine learning models to detect incorrectly alig...
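For readers unfamiliar with MAPQ, the SAM format defines it as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below converts between the two quantities; the function names and the `cap` parameter are illustrative assumptions (aligners differ in the maximum score they report), and nothing here reproduces the recalibration models of the study.

```python
import math


def mapq_from_error_prob(p_wrong, cap=60):
    """Phred-scaled mapping quality for a misalignment probability.

    cap is a hypothetical upper bound, needed because p_wrong = 0 would
    otherwise give an infinite score.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability")
    if p_wrong == 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))


def error_prob_from_mapq(mapq):
    """Inverse mapping: probability that the read is incorrectly aligned."""
    return 10.0 ** (-mapq / 10.0)


# A 1% chance of misalignment corresponds to MAPQ 20, and 0.1% to MAPQ 30.
q20 = mapq_from_error_prob(0.01)
q30 = mapq_from_error_prob(0.001)
```

Because the score is a direct function of the estimated probability, a better-calibrated probability estimate changes the MAPQ values that downstream variant callers use for filtering.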
International Journal of Data Mining and Bioinformatics, 2016
DNA microarray has been the most widely used functional genomics approach in bioinformatics. However, microarray data suffer from frequent missing values due to various experimental and data handling reasons. Leaving this unsolved may degrade the reliability of any consequent downstream analysis. As such, missing value imputation has been recognised as an important pre-processing step, which can improve the quality of data and its interpretation. Several techniques found in the literature have successfully exploited the characteristics and relations among a set of genes closest to the one under examination. However, the selection of so-called nearest neighbours is based simply on proximity between gene pairs, without taking the structural or grouping information into account. In response, this paper proposes a novel cluster-directed framework (CFNI: Cluster-directed Framework for Neighbour-based Imputation), in which data clustering is uniquely used to guide the identification of nearest neighbours. This allows a more accurate imputed value to be derived. Not only does it perform better than several benchmark methods on published microarray data sets, it is also generalised such that any neighbour-based imputation technique can be coupled with the proposed model. This has been successfully demonstrated with both single-pass and iterative models.
2014 Joint 7th International Conference on Soft Computing and Intelligent Systems (SCIS) and 15th International Symposium on Advanced Intelligent Systems (ISIS), 2014
Dropout, or ceasing study prematurely, has been widely recognized as a serious issue, especially at the university level. A large number of higher education institutes face the common difficulty of a low graduation rate in comparison to the number of enrollments. As compared to western countries, this subject has attracted only a few studies in Thai universities, with educational data mining being limited to the use of conventional classification models. This paper presents the most recent investigation of student dropout at Mae Fah Luang University, Thailand, and the novel reuse of link-based cluster ensemble as a data transformation framework for more accurate prediction. The empirical study on students' personal, academic performance and enrollment data suggests that the proposed approach is usually more effective than several benchmark transformation techniques, across different classifiers.
Techniques and Applications
In the wake of recent terrorist atrocities, intelligence experts have commented that failures in detecting terrorist and criminal activities are not so much due to a lack of data, as they are due to difficulties in relating and interpreting the available intelligence. An intelligent tool for monitoring and interpreting intelligence data will provide a helpful means for intelligence analysts to consider emerging scenarios of plausible threats, thereby offering useful assistance in devising and deploying preventive measures against such possibilities. One of the major problems in need of such attention is detecting false identity that has become the common denominator of all serious crime, especially terrorism. Typical approaches to this problem rely on the similarity measure of textual and other content-based characteristics, which are usually not applicable in the case of deceptive and erroneous description. This barrier may be overcome through link information presented in communic...
2020 IEEE Symposium Series on Computational Intelligence (SSCI)
Cybercriminals are becoming more sophisticated, wearing a mask of anonymity and unleashing more destructive malware on a daily basis. The biggest challenge is coping with the abundance of malware created and filtering targeted samples of destructive malware for further investigation and analysis whilst discarding any inert samples, thus optimising the analysis by saving time, effort and resources. The most common technique is malware triaging to separate likely malware and unlikely malware samples. One such triaging technique is YARA rules, commonly used to detect and classify malware based on string and pattern matching; rules are triggered and alerted when their condition is satisfied. The detection rate of the pattern matching technique used by YARA rules can be improved in several ways; however, doing so can lead to bulky and complex rules that affect the performance of YARA rules. This paper proposes fuzzy-hashing-aided enhanced YARA rules to improve the detection rate of YARA rules without significantly increasing the complexity and overheads inherent in the process. The proposed approach only uses an additional fuzzy hashing step alongside basic YARA rules so that the two complement each other: when one method cannot detect a match, the other can. This work employs three triaging methods, namely fuzzy hashing, import hashing and YARA rules, to perform extensive experiments on the collected malware samples. The detection rate of enhanced YARA rules is compared against the detection rate of the employed triaging methods to demonstrate the improvement in the overall triaging results.
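The complementary logic described above (fall back to fuzzy similarity when no rule fires) can be illustrated with a small sketch. It deliberately uses standard-library stand-ins rather than real tooling: byte-pattern containment stands in for a YARA rule, and `difflib.SequenceMatcher` on raw bytes stands in for comparing fuzzy hashes. The function names, the 0 to 100 score scale and the threshold of 80 are all assumptions for illustration, not values from the paper.

```python
from difflib import SequenceMatcher


def rule_matches(sample, patterns):
    """Stand-in for a basic YARA rule: true if any byte pattern occurs."""
    return any(p in sample for p in patterns)


def similarity(a, b):
    """Stand-in for a fuzzy-hash comparison; returns a score from 0 to 100."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def triage(sample, patterns, known_malware, threshold=80.0):
    """Return which method flagged the sample: 'rule', 'fuzzy', or None."""
    if rule_matches(sample, patterns):
        return "rule"
    # The rule missed, so check similarity against known malicious samples.
    if any(similarity(sample, ref) >= threshold for ref in known_malware):
        return "fuzzy"
    return None


patterns = [b"EVIL_MARKER"]
known = [b"payload: encrypt all user files and demand ransom"]

direct = triage(b"xx EVIL_MARKER xx", patterns, known)
variant = triage(b"payload: encrypt all user filez and demand ransom", patterns, known)
benign = triage(b"hello world", patterns, known)
```

Here the variant sample contains no rule pattern but is nearly identical to a known sample, so only the similarity check flags it. This keeps the rule itself small, which is the trade-off the abstract argues for.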
IEEE Access
Electronic Government (e-Government) systems constantly provide greater services to people, busin... more Electronic Government (e-Government) systems constantly provide greater services to people, businesses, organisations, and societies by offering more information, opportunities, and platforms with the support of advances in information and communications technologies. This usually results in increased system complexity and sensitivity, necessitating stricter security and privacy-protection measures. The majority of the existing e-Government systems are centralised, making them vulnerable to privacy and security threats, in addition to suffering from a single point of failure. This study proposes a decentralised e-Government framework with integrated threat detection features to address the aforementioned challenges. In particular, the privacy and security of the proposed e-Government system are realised by the encryption, validation, and immutable mechanisms provided by Blockchain. The insider and external threats associated with blockchain transactions are minimised by the employment of an artificial immune system, which effectively protects the integrity of the Blockchain. The proposed e-Government system was validated and evaluated by using the framework of Ethereum Visualisations of Interactive, Blockchain, Extended Simulations (i.e. eVIBES simulator) with two publicly available datasets. The experimental results show the efficacy of the proposed framework in that it can mitigate insider and external threats in e-Government systems whilst simultaneously preserving the privacy of information.
Computers, Materials & Continua
As more business transactions and information services have been implemented via communication ne... more As more business transactions and information services have been implemented via communication networks, both personal and organization assets encounter a higher risk of attacks. To safeguard these, a perimeter defence like NIDS (network-based intrusion detection system) can be effective for known intrusions. There has been a great deal of attention within the joint community of security and data science to improve machine-learning based NIDS such that it becomes more accurate for adversarial attacks, where obfuscation techniques are applied to disguise patterns of intrusive traffics. The current research focuses on non-payload connections at the TCP (transmission control protocol) stack level that is applicable to different network applications. In contrary to the wrapper method introduced with the benchmark dataset, three new filter models are proposed to transform the feature space without knowledge of class labels. These ECT (ensemble clustering based transformation) techniques, i.e., ECT-Subspace, ECT-Noise and ECT-Combined, are developed using the concept of ensemble clustering and three different ensemble generation strategies, i.e., random feature subspace, feature noise injection and their combinations. Based on the empirical study with published dataset and four classification algorithms, new models usually outperform that original wrapper and other filter alternatives found in the literature. This is similarly summarized from the first experiment with basic classification of legitimate and direct attacks, and the second that focuses on recognizing obfuscated intrusions. In addition, analysis of algorithmic parameters, i.e., ensemble size and level of noise, is provided as a guideline for a practical use.
Computers, Materials & Continua
Attempts to determine characters of astronomical objects have been one of major and vibrant activ... more Attempts to determine characters of astronomical objects have been one of major and vibrant activities in both astronomy and data science fields. Instead of a manual inspection, various automated systems are invented to satisfy the need, including the classification of light curve profiles. A specific Kaggle competition, namely Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC), is launched to gather new ideas of tackling the abovementioned task using the data set collected from the Large Synoptic Survey Telescope (LSST) project. Almost all proposed methods fall into the supervised family with a common aim to categorize each object into one of pre-defined types. As this challenge focuses on developing a predictive model that is robust to classifying unseen data, those previous attempts similarly encounter the lack of discriminate features, since distribution of training and actual test datasets are largely different. As a result, well-known classification algorithms prove to be sub-optimal, while more complicated feature extraction techniques may help to slightly boost the predictive performance. Given such a burden, this research is set to explore an unsupervised alternative to the difficult quest, where common classifiers fail to reach the 50% accuracy mark. A clustering technique is exploited to transform the space of training data, from which a more accurate classifier can be built. In addition to a single clustering framework that provides a comparable accuracy to the front runners of supervised learning, a multiple-clustering alternative is also introduced with improved performance. In fact, it is able to yield a higher accuracy rate of 58.32% from 51.36% that is obtained using a simple clustering. 
For this difficult problem, it is rather good considering for those achieved by well-known models like support vector machine (SVM) with 51.80% and Naïve Bayes (NB) with only 2.92%.
Computational Methods with Applications in Bioinformatics Analysis, 2017
Information Processing & Management, 2022
The work presented in this paper aims to develop a new imputation method to better handle missing... more The work presented in this paper aims to develop a new imputation method to better handle missing values encountered in astronomical data analysis, especially the classification of transient events in a sky survey from the GOTO project. In particular, the framework of cluster directed selection of neighbors that has proven effective for benchmark local imputation techniques of KNNimpute and LLSimpute are extended to new multi-stage models. These combinations, namely Iterative-CKNN and Iterative-CLLS, are organic with an original application to analyze sky survey data. They bring out advantages from both local approaches, where estimates are summarized from neighbors in the same data cluster, within the iterative process to refine previous guesses. Based on experiments with simulated datasets corresponding to different survey sizes and missing rations between 1 to 20%, they usually outperform baseline models and BPCA, which is the well-known global technique. For instance, at 10% missing rate, Iterative-CLLS appears to be the most accurate with NRMSE score of 0.190, while BPCA and the best among its baseline models reaches 0.351 and 0.249, respectively. For their practical implications, these methods have also proven effective for classifying transients, using common algorithms like KNN, Naive Bayes and Random Forest.
2017 Twelfth International Conference on Digital Information Management (ICDIM), 2017
Crime has become a critical issue for national security, especially for border security and intelligent transportation systems (ITSs). It affects the economy, investment, tourism, and society. As a result, automatic suspect vehicle detection has emerged as one of the effective tools for tackling the problem. However, the traditional process normally compares criminal vehicle data in a blacklist with vehicle data gathered from various sensors. This comparison is neither effective nor accurate, possibly because the data in the blacklist are not up to date. Sometimes the blacklist is not available at all. This paper proposes a criminal behavior analysis method to detect suspect vehicles that are potentially involved in criminal activity, without relying on the blacklist. The analysis is conditional on the journey path and the involvement of criminal activities. In addition, public officials believe that a suspect vehicle will choose a journey path without a checkpoint. Therefore, we used journey path analysis techniques together with association rule mining to analyze such criminal behavior. In extensive experiments, the results show that the proposed method can increase the suspect detection accuracy rate by 17.24% over the traditional counterpart.
Q. Shen, T. Boongoen and C. Price. A fuzzy order-of-magnitude approach to qualitative link analysis. Proceedings of the 25th International Workshop on Qualitative Reasoning, pp. 147-158, 2011.
T. Boongoen and Q. Shen. 'Detecting False Identity through Behavioural Patterns', In Proceedings of International Crime Science Conference, British Library, London UK, 2008. Publisher's online version forthcoming. The full text is currently unavailable in CADAIR pending approval by the publisher. Sponsorship: UK EPSRC grant EP/D057086
2015 International Carnahan Conference on Security Technology (ICCST), 2015
For modern-age security, many have turned to biometrics such as face classification to verify authority. Despite this, the accuracy of existing classifiers has been constrained by the curse of dimensionality typically observed in face images. In order to simplify the task, one may reduce the original data to a more compact variation, where only key feature components are included in the classification process. Unlike conventional feature reduction techniques found in the literature, this paper presents a novel method that makes use of a cluster ensemble, specifically the summarizing information matrix, as the transformed data for a supervised learning step. Among different state-of-the-art methods, the link-based cluster ensemble approach (LCE) provides a highly accurate clustering, and is thus employed here. The performance of this transformation model is evaluated on a published face dataset and its noise-added variations, using different classifiers. The findings suggest that the new model can improve the classification accuracy beyond those of other benchmark methods investigated in this empirical study.
2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2012
Gene expressions measured using microarrays usually encounter the problem of missing values. Leaving this unsolved may critically degrade the reliability of any consequent downstream analysis or medical application. Yet, a further study of microarray data might be impossible with many analysis methods requiring a complete data set. This paper introduces a new methodology to impute missing values in microarray data. The proposed algorithm, CKNN impute, is an extension of k nearest neighbor imputation with local data clustering being incorporated for improved quality and efficiency. Gene expression data is typically represented as a matrix whose rows and columns correspond to genes and experiments, respectively. CKNN kicks off by finding a complete dataset via the removal of rows with missing value(s). Then, k clusters and their corresponding centroids are obtained by applying a clustering technique to the complete dataset. The set of genes similar to the target gene (with missing values) consists of those belonging to the cluster whose centroid is the closest to the target. Once these are known, the target gene is imputed by applying the k nearest neighbor method to the similar genes previously determined. Empirical evaluation with published gene expression datasets suggests that the proposed technique performs better than the classical k nearest neighbor method and its extension found in the literature.
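The steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans`, `cknn_impute`), the use of k-means as the clustering technique, the Euclidean distance restricted to observed columns, and the unweighted mean over neighbours are all assumptions made for illustration. Missing values are represented by `None`.

```python
import math
import random

def _dist(a, b, cols):
    # Euclidean distance restricted to the given column indices.
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in cols))

def kmeans(rows, n_clusters, n_iter=20, seed=0):
    # Plain Lloyd's k-means (assumed here; the abstract only says "a clustering technique").
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, n_clusters)]
    cols = range(len(rows[0]))
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda c: _dist(r, centroids[c], cols))
                  for r in rows]
        for c in range(n_clusters):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def cknn_impute(data, n_clusters=2, k=2, seed=0):
    # Step 1: complete dataset = rows without missing values.
    complete = [r for r in data if None not in r]
    # Step 2: cluster the complete rows and obtain centroids.
    centroids, labels = kmeans(complete, n_clusters, seed=seed)
    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        # Step 3: similar genes = members of the cluster with the closest centroid.
        nearest = min(range(n_clusters), key=lambda c: _dist(row, centroids[c], observed))
        candidates = [r for r, lab in zip(complete, labels) if lab == nearest] or complete
        # Step 4: k nearest neighbours among the similar genes; fill with their mean.
        neighbours = sorted(candidates, key=lambda r: _dist(row, r, observed))[:k]
        result.append([v if v is not None
                       else sum(n[j] for n in neighbours) / len(neighbours)
                       for j, v in enumerate(row)])
    return result

data = [
    [1.0, 1.0, 1.0], [1.2, 0.9, 1.1], [0.9, 1.1, 1.0],
    [10.0, 10.0, 10.0], [10.2, 9.9, 10.1], [9.9, 10.1, 10.0],
    [1.0, 1.0, None], [10.0, None, 10.0],
]
imputed = cknn_impute(data, n_clusters=2, k=2)
```

In this toy matrix the two incomplete rows each sit close to one of two well-separated groups, so restricting the neighbour search to the nearest cluster keeps rows from the other group out of the estimate.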
2013 13th International Symposium on Communications and Information Technologies (ISCIT), 2013
Gene expressions measured during a microarray experiment usually encounter the native problem of missing values. These are due to possible errors occurring in the primary experiments, image acquisition and interpretation processes. Leaving this unsolved may critically degrade the reliability of any consequent downstream analysis or medical application. Yet, a further study of microarray data may not be possible with many standard analysis methods that require a complete data set. This paper introduces a new method to impute missing values in microarray data. The proposed algorithm, CLLS impute, is an extension of local least squares imputation with local data clustering being incorporated for improved quality and efficiency. Gene expression data is typically represented as a matrix whose rows and columns correspond to genes and experiments, respectively. CLLS kicks off by finding a complete dataset via the removal of rows with missing value(s). Then, gene clusters and their corresponding centroids are obtained by applying a clustering technique to the complete dataset. The set of genes similar to the target gene (with missing values) consists of those belonging to the cluster whose centroid is the closest to the target. Once these are known, the target gene is imputed by applying regression analysis to the similar genes previously determined. Empirical evaluation with several published gene expression datasets suggests that the proposed technique performs better than the classical local least squares method and recently developed techniques found in the literature.
IEEE Transactions on Knowledge and Data Engineering, 2012
Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with the results being competitive to conventional algorithms, it is observed that these techniques unfortunately generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster-data point relations, with many entries being left unknown. The paper presents an analysis that suggests this problem degrades the quality of the clustering result, and it presents a new link-based approach, which improves the conventional matrix by discovering unknown entries through similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph that is formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
Computers, Materials & Continua, 2022
A mix of numerical and nominal data types is commonly present in many modern-age data collections. Examples of these include banking data, sales history and healthcare records, where both continuous attributes like age and nominal ones like blood type are exploited to characterize account details, business transactions or individuals. However, only a few standard clustering techniques and consensus clustering methods have been provided to examine such data thus far. Given this insight, the paper introduces novel extensions of the link-based cluster ensemble, LCE WCT and LCE WTQ, that are accurate for analyzing mixed-type data. They promote diversity within an ensemble through different initializations of the k-prototypes algorithm as base clusterings and then refine the summarized data using a link-based approach. Based on the evaluation metric of NMI (Normalized Mutual Information), averaged across different combinations of benchmark datasets and experimental settings, these new models reach an improved level of 0.34, while the best model found in the literature obtains only around 0.24. Besides, the parameter analysis included herein helps to enhance their performance even further, given the relations between clustering quality and algorithmic variables specific to the underlying link-based models. Moreover, another significant factor, the ensemble size, is examined in order to justify a tradeoff between complexity and accuracy.
Wireless Networks, 2021
Cloud computing enables ubiquitous and efficient on-demand access to information, data, and computational resources with the support of modern wired and wireless communication technologies. Cloud computing has been very widely used in education, autonomous vehicles, smart cities/homes, renewable energy, healthcare, engineering, business, and telecommunications, amongst others, with the support of the advances in Artificial Intelligence, Internet of Things (IoT), and Data Science. Such technologies and their applications have made a significant impact on the way people live and do business by offering online services and instant communications. Despite the intensive research effort in the field, some open challenges remain, such as inefficient load balancing and energy management in cloud data centres, high cost of access to the facility, insufficient security and privacy, and low-speed big data stream processing. This special issue focuses on recent advances in addressing these challenges and investigating all the aspects of emerging trends, in addition to reporting innovative real-world applications of cloud computing to deliver effective and efficient solutions & Longzhi Yang
PeerJ, 2020
Background: Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability a read is incorrectly aligned. This study investigated the recalibration of probability estimates used to compute MAPQ scores for improving variant calling performance in single-sample, low-coverage settings. Materials and Methods: Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM formatted alignment files for tomato were used to train machine learning models to detect incorrectly alig...
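As background to the MAPQ score mentioned above, the SAM format defines MAPQ as the Phred-scaled probability that the mapping position is wrong, i.e. -10 log10(p), rounded to the nearest integer. The short sketch below shows that conversion in both directions. The function names are hypothetical, and the cap of 60 is an assumption (some aligners cap MAPQ at this value; others do not); it does not reproduce the recalibration models of the study.

```python
import math

def mapq_from_error_prob(p_wrong, cap=60):
    # Phred-scaled mapping quality: -10 * log10(P(mapping position is wrong)).
    # cap=60 is an assumed upper bound, not part of the SAM definition.
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

def error_prob_from_mapq(mapq):
    # Inverse transform: probability that the alignment position is wrong.
    return 10 ** (-mapq / 10)

q = mapq_from_error_prob(0.01)   # a 1% chance of misalignment
p = error_prob_from_mapq(30)     # probability implied by MAPQ 30
```

Under this definition, any change to the estimated misalignment probability translates directly into a different MAPQ value.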
International Journal of Data Mining and Bioinformatics, 2016
DNA microarray has been the most widely used functional genomics approach in bioinformatics. However, microarray data suffer from frequent missing values due to various experimental and data handling reasons. Leaving this unsolved may degrade the reliability of any consequent downstream analysis. As such, missing value imputation has been recognised as an important pre-processing step, which can improve the quality of data and its interpretation. Several techniques found in the literature have successfully exploited the characteristics and relations among a set of genes closest to the one under examination. However, the selection of so-called nearest neighbours is based simply on proximity between gene pairs, without taking the structural or grouping information into account. In response, this paper proposes a novel cluster-directed framework (CFNI: Cluster-directed Framework for Neighbour-based Imputation), in which data clustering is uniquely used to guide the identification of nearest neighbours. This allows a more accurate imputed value to be derived. Not only does it perform better than several benchmark methods on published microarray data sets, it is also generalised such that any neighbour-based imputation technique can be coupled with the proposed model. This has been successfully demonstrated with both single-pass and iterative models.
2014 Joint 7th International Conference on Soft Computing and Intelligent Systems (SCIS) and 15th International Symposium on Advanced Intelligent Systems (ISIS), 2014
Dropout, or ceasing study prematurely, has been widely recognized as a serious issue, especially at the university level. A large number of higher education institutes face the common difficulty of a low graduation rate in comparison to the number of enrollments. Compared with western countries, this subject has attracted only a few studies in Thai universities, with educational data mining being limited to the use of conventional classification models. This paper presents the most recent investigation of student dropout at Mae Fah Luang University, Thailand, and the novel reuse of the link-based cluster ensemble as a data transformation framework for more accurate prediction. The empirical study on students' personal, academic performance and enrollment data suggests that the proposed approach is usually more effective than several benchmark transformation techniques, across different classifiers.
Techniques and Applications
In the wake of recent terrorist atrocities, intelligence experts have commented that failures in detecting terrorist and criminal activities are not so much due to a lack of data, as they are due to difficulties in relating and interpreting the available intelligence. An intelligent tool for monitoring and interpreting intelligence data will provide a helpful means for intelligence analysts to consider emerging scenarios of plausible threats, thereby offering useful assistance in devising and deploying preventive measures against such possibilities. One of the major problems in need of such attention is detecting false identity that has become the common denominator of all serious crime, especially terrorism. Typical approaches to this problem rely on the similarity measure of textual and other content-based characteristics, which are usually not applicable in the case of deceptive and erroneous description. This barrier may be overcome through link information presented in communic...
2020 IEEE Symposium Series on Computational Intelligence (SSCI)
Cybercriminals are becoming more sophisticated, wearing a mask of anonymity and unleashing more destructive malware on a daily basis. The biggest challenge is coping with the abundance of malware created and filtering targeted samples of destructive malware for further investigation and analysis whilst discarding any inert samples, thus optimising the analysis by saving time, effort and resources. The most common technique is malware triaging to separate likely malware and unlikely malware samples. One such triaging technique is YARA rules, commonly used to detect and classify malware based on string and pattern matching; rules are triggered and alerted when their condition is satisfied. The pattern matching technique used by YARA rules and its detection rate can be improved in several ways; however, this can lead to bulky and complex rules that affect the performance of YARA rules. This paper proposes fuzzy-hashing-aided enhanced YARA rules to improve the detection rate of YARA rules without significantly increasing the complexity and overheads inherent in the process. The proposed approach only uses an additional fuzzy hashing alongside basic YARA rules so that the two complement each other: when one method cannot detect a match, the other can. This work employs three triaging methods (fuzzy hashing, import hashing and YARA rules) to perform extensive experiments on the collected malware samples. The detection rate of enhanced YARA rules is compared against the detection rate of the employed triaging methods to demonstrate the improvement in the overall triaging results.
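The complementary idea described above, where a sample is flagged when either method detects a match, can be sketched as a simple logical OR of two detectors. The sketch below is only an illustration and uses stand-ins rather than the real tools: plain byte-substring checks replace YARA rules, and `difflib.SequenceMatcher` from the Python standard library replaces a fuzzy hashing scheme. All function names, the example patterns and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def rule_match(sample, patterns):
    # Stand-in for a YARA rule: fires if any byte pattern occurs in the sample.
    return any(p in sample for p in patterns)

def similarity_match(sample, known_samples, threshold=0.8):
    # Stand-in for fuzzy hashing: fires if the sample is sufficiently
    # similar to any known malicious sample.
    return any(SequenceMatcher(None, sample, known).ratio() >= threshold
               for known in known_samples)

def triage(sample, patterns, known_samples, threshold=0.8):
    # Flag the sample if either detector matches, so each covers the other's misses.
    return (rule_match(sample, patterns)
            or similarity_match(sample, known_samples, threshold))

patterns = [b"evil_payload"]
known = [b"MZ dropper variant alpha build 1"]

hit_by_rule = triage(b"xx evil_payload yy", patterns, known)
hit_by_similarity = triage(b"MZ dropper variant alpha build 2", patterns, known)
no_hit = triage(b"hello world", patterns, known)
```

The second sample contains none of the rule patterns but differs from a known sample by a single byte, so only the similarity check catches it; this is the situation the combined approach is meant to cover.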