Huawen Liu - Academia.edu
Papers by Huawen Liu
ACM Transactions on Knowledge Discovery from Data
Similarity representation plays a central role in increasingly popular anomaly detection techniques, which have been successfully applied in various real-world scenarios. Many low-rank representation techniques have been introduced to measure the similarity relations of data; yet they aim only to minimize reconstruction errors, without taking the structural information of the data into account. Besides, traditional low-rank representation methods often take the nuclear norm as their low-rank constraint, which easily yields a suboptimal solution. To address these problems, in this article, we propose a novel anomaly detection method that exploits kernel preserving embedding, as well as the double nuclear norm, to explore the similarity relations of data. Based on the similarity relations, a probability transition matrix is derived, and a tailored random walk is then adopted to reveal anomalies. The proposed method can not only preserve the manifold structural properties o...
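To make the last step of this abstract concrete, the following is a minimal sketch of how a random walk over a similarity-derived transition matrix can rank anomalies. It is not the authors' method: the similarity here is a plain Gaussian kernel rather than a kernel preserving embedding with a double nuclear norm, and the names `transition_matrix`, `visiting_probabilities`, `bandwidth` and `damping` are illustrative assumptions.

```python
import math


def transition_matrix(points, bandwidth=1.0):
    """Row-normalise Gaussian similarities into a probability transition matrix."""
    n = len(points)
    rows = []
    for i in range(n):
        sims = []
        for j in range(n):
            if i == j:
                sims.append(0.0)  # no self-transitions
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                sims.append(math.exp(-d2 / (2.0 * bandwidth ** 2)))
        total = sum(sims)
        if total == 0.0:
            # Similarities underflowed: fall back to a uniform step.
            rows.append([0.0 if j == i else 1.0 / (n - 1) for j in range(n)])
        else:
            rows.append([s / total for s in sims])
    return rows


def visiting_probabilities(P, damping=0.9, iters=200):
    """Power iteration for a damped random walk; returns one score per point."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [
            (1.0 - damping) / n
            + damping * sum(pi[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
    return pi


# Four tightly clustered points and one point far from the cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
scores = visiting_probabilities(transition_matrix(points))

# The walk rarely reaches a point that few others are similar to,
# so the lowest visiting probability marks the most anomalous point.
most_anomalous = min(range(len(points)), key=scores.__getitem__)
```

In this toy setting the isolated point receives almost no incoming probability mass, which is the intuition behind using a random walk to reveal anomalies.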
IEEE Transactions on Neural Networks and Learning Systems
Hashing offers a desirable and effective solution for efficiently retrieving the nearest neighbors from large-scale data because of its low storage and computation costs. One of the most appealing techniques for hashing learning is matrix factorization. However, most hashing methods focus only on building the mapping relationships between the Euclidean and Hamming spaces and, unfortunately, underestimate the naturally sparse structures of the data. In addition, parameter tuning is always a challenging and head-scratching problem for sparse hashing learning. To address these problems, in this article, we propose a novel hashing method termed adaptively sparse matrix factorization hashing (SMFH), which exploits sparse matrix factorization to explore the parsimonious structures of the data. Moreover, SMFH adopts an orthogonal transformation to minimize the quantization loss while deriving the binary codes. The most distinguished property of SMFH is that it is adaptive and parameter-free, that is, SMFH can automatically generate sparse representations and does not require human involvement to tune the regularization parameters for the sparse models. Empirical studies on four publicly available benchmark data sets show that the proposed method can achieve promising performance and is competitive with a variety of state-of-the-art hashing methods.
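The role of an orthogonal transformation in reducing quantization loss can be illustrated with a toy example. This sketch is not SMFH: it skips the sparse matrix factorization entirely and simply compares the loss of sign-binarizing four hypothetical 2-D embeddings before and after a fixed 45-degree rotation. The function names and data are assumptions made for illustration.

```python
import math


def rotate(v, theta):
    """Apply a 2-D rotation (an orthogonal transformation) to a vector."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)


def binarize(v):
    """Map each real coordinate to a binary code in {-1, +1} by its sign."""
    return tuple(1 if x >= 0 else -1 for x in v)


def quantization_loss(vectors):
    """Sum of squared differences between each vector and its binary code."""
    return sum(
        (b - x) ** 2 for v in vectors for b, x in zip(binarize(v), v)
    )


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(p != q for p, q in zip(a, b))


# Hypothetical low-dimensional embeddings lying on the coordinate axes,
# the worst case for sign-based binarization.
embedded = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
rotated = [rotate(v, math.pi / 4) for v in embedded]

loss_before = quantization_loss(embedded)
loss_after = quantization_loss(rotated)
codes = [binarize(v) for v in rotated]
```

Because a rotation preserves distances between the embeddings, it can move them closer to the vertices of the binary hypercube without distorting their neighborhood structure; here the rotated points all receive distinct codes and a lower loss.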
International Journal of Computational Intelligence Systems
Outlier detection is a hot topic in machine learning. With newly emerging technologies and diverse applications, interest in outlier detection is increasing greatly. Recently, a significant number of outlier detection methods have been proposed and successfully applied in a wide range of fields, including medical health, credit card fraud and intrusion detection. They can be used for conventional data analysis. However, it is not a trivial task to identify rare behaviors or patterns in complicated data. In this paper, we provide a brief overview of outlier detection methods for high-dimensional data, and offer practitioners a comprehensive understanding of the state-of-the-art techniques of outlier detection. Specifically, we first summarize the recent advances in outlier detection for high-dimensional data, and then make an extensive experimental comparison of the popular detection methods on public datasets. Finally, several challenging issues and future research directions are discussed.
Neural Computing and Applications
Complexity
Anomaly analysis is of great interest to diverse fields, including data mining and machine learning, and plays a critical role in a wide range of applications, such as medical health, credit card fraud, and intrusion detection. Recently, a significant number of anomaly detection methods of various types have been proposed. This paper intends to provide a comprehensive overview of the existing work on anomaly detection, especially for data with high dimensionality and mixed types, where identifying anomalous patterns or behaviours is a nontrivial task. Specifically, we first present recent advances in anomaly detection, discussing the pros and cons of the detection methods. Then we conduct extensive experiments on public datasets to evaluate several typical and popular anomaly detection methods. The purpose of this paper is to offer a better understanding of the state-of-the-art techniques of anomaly detection for practitioners. Finally, we conclude by providing some di...
IEEE Transactions on Systems, Man, and Cybernetics: Systems
IEEE Transactions on Multimedia
Pattern Recognition
Multi-label data are prevalent in the real world. Owing to its many potential applications, multi-label learning has been receiving more and more attention from many fields. However, how to effectively exploit the correlations of variables and labels, and how to tackle the high-dimensional problems of data, are two major challenges for multi-label learning. In this paper we attempt to cope with these two problems by proposing an effective multi-label learning algorithm. Specifically, we make use of partial least squares discriminant analysis to identify a common latent space between the variable space and the label space of multi-label data. Moreover, considering that the label space of multi-label data is sparse, an l1-norm penalty is further imposed on the Y-loadings of the partial least squares optimization problem, making them sparse. The merit of our method is that it can capture the correlations and perform dimension reduction at the same time. The experimental results on eleven public data sets show that our method is promising and superior to the state-of-the-art multi-label classifiers in most cases.
International Journal of Machine Learning and Cybernetics
Lecture Notes in Computer Science, 2016
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, Jan 14, 2016
Discovering causal relationships from observational data is a crucial problem and it has applications in many research areas. The PC algorithm is the state-of-the-art constraint-based method for causal discovery. However, the runtime of the PC algorithm, in the worst case, is exponential in the number of nodes (variables), and thus it is inefficient when applied to high-dimensional data, e.g. gene expression datasets. On another note, the advancement of computer hardware in the last decade has resulted in the widespread availability of multi-core personal computers. There is a significant motivation for designing a parallelised PC algorithm that is suitable for personal computers and does not require end users' parallel computing knowledge beyond their competency in using the PC algorithm. In this paper, we develop parallel-PC, a fast and memory-efficient PC algorithm using the parallel computing technique. We apply our method to a range of synthetic and real-world high dimens...
IEEE transactions on cybernetics, Jan 8, 2016
Multilabel learning has a wide range of potential applications in reality. It has attracted a great deal of attention in recent years and has been extensively studied in many fields, including image annotation and text categorization. Although many efforts have been made for multilabel learning, two challenging issues remain, i.e., how to exploit the correlations and how to tackle the high-dimensional problems of multilabel data. In this paper, an effective algorithm is developed for multilabel classification by utilizing those data that are relevant to the targets. The key is the construction of a coefficient-based mapping between training and test instances, where the mapping relationship exploits the correlations among the instances, rather than the explicit relationship between the variables and the class labels of data. Further, an ℓ1-norm penalty is imposed on the mapping relationship to make the model sparse, weakening the impacts of noisy data. O...
2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2014
Lecture Notes in Computer Science, 2011
With the rapid development of information technology, the dimensionality of data in many applications is getting higher and higher. However, many features in high-dimensional data are redundant. Their presence may pose a great number of challenges to traditional learning algorithms. Thus, it is necessary to develop an effective technique to remove irrelevant features from data. Currently, many efforts have been made in this field. In this paper, we propose a new feature selection method by using conditional mutual information estimated dynamically. Its advantage is that it can exactly represent the correlation between features along with the selection procedure. Our performance evaluations on eight benchmark datasets show that our proposed method achieves comparable performance to other well-established feature selection algorithms in most cases.
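As a rough illustration of the quantity this abstract relies on, the sketch below estimates the conditional mutual information I(X; Y | Z) from discrete samples by counting. It is not the paper's selection procedure or its dynamic estimation scheme; the function name and the XOR toy data are assumptions chosen so the values can be checked by hand.

```python
import math
from collections import Counter


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in bits from three equal-length
    sequences of discrete values."""
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    total = 0.0
    for (xi, yi, zi), c in c_xyz.items():
        # p(x,y,z) * log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], with the
        # factors of n cancelling inside the logarithm.
        ratio = c * c_z[zi] / (c_xz[(xi, zi)] * c_yz[(yi, zi)])
        total += (c / n) * math.log2(ratio)
    return total


# Toy data: the label is the XOR of two binary features a and b.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
label = [0, 1, 1, 0]

# With a already selected, b still carries one full bit about the label,
# whereas an exact copy of a carries nothing new.
gain_b = conditional_mutual_information(b, label, a)
gain_copy_of_a = conditional_mutual_information(list(a), label, a)
```

A candidate feature that merely duplicates an already selected one scores zero, which is why conditioning on the selected features helps a selection procedure avoid redundancy.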
PloS one, 2015
microRNAs (miRNAs) are important gene regulators at the post-transcriptional level, and inferring miRNA-mRNA regulatory relationships is a crucial problem. Consequently, several computational methods of predicting miRNA targets have been proposed using expression data with or without sequence-based miRNA target information. A typical procedure for applying and evaluating such a method is i) collecting matched miRNA and mRNA expression profiles in a specific condition, e.g. a cancer dataset from The Cancer Genome Atlas (TCGA), ii) applying the new computational method to the selected dataset, iii) validating the predictions against knowledge from literature and third-party databases, and comparing the performance of the method with some existing methods. This procedure is time-consuming given the time elapsed when collecting and processing data, repeating the work from existing methods, searching for knowledge from literature and third-party databases to validate the results, and compari...
Lecture Notes in Computer Science, 2007
Flow graph (FG) is a new mathematical model which can be used for representing, analyzing, and discovering knowledge in databases. Due to its well-structured characteristics of network, FG is naturally consistent with granular computing (GrC). Meanwhile, GrC ...
International Journal of Pattern Recognition and Artificial Intelligence, 2007
The purpose of this paper is to start a conceptual investigation of approximation rules based on VPRS, because the certainty degree of rules in a complete information system cannot exactly express the uncertainty of those in an incomplete information system; an efficient approximation rule induction algorithm under the rough set framework is then presented. Instead of focusing on the minimal rule set, this algorithm hierarchically extracts rules from data sets in multiple stages to suit changing environments in learning and classification. In addition, a heuristic strategy is employed in the algorithm to improve its performance and reduce the time consumed in induction. Experiments are carried out, and the results show that the proposed algorithm is effective in inducing rules which can enhance their adaptive capacities.
Lecture Notes in Computer Science, 2014
Lecture Notes in Computer Science, 2014