Yanchang Zhao - Academia.edu
Papers by Yanchang Zhao
Expert Systems With Applications, Nov 1, 2023
2022 IEEE International Conference on Big Data (Big Data), Dec 17, 2022
Case studies: The case studies are not included in this online version; they are reserved exclusively for a book version. Latest version: The latest online version is available at
Proceedings of The Web Conference 2020, 2020
In the modern tourism industry, next point-of-interest (POI) recommendation is an important mobile service, as it effectively helps hesitant travelers decide the next POI to visit. Currently, most next POI recommender systems are built upon a cloud-based paradigm, where the recommendation models are trained and deployed on powerful cloud servers. When a recommendation request is made by a user via a mobile device, the current contextual information is uploaded to the cloud servers to help the well-trained models generate personalized recommendation results. However, in practice, this paradigm relies heavily on high-quality network connectivity, incurs a high energy footprint in operation, and raises increasing privacy concerns among the public. To bypass these defects, we propose a novel Light Location Recommender System (LLRec) to perform next POI recommendation locally on resource-constrained mobile devices. To make LLRec fully compatible with limited computing resources and memory space, we leverage FastGRNN, a lightweight but effective gated Recurrent Neural Network (RNN), as its main building block, and significantly compress the model size by adopting tensor-train composition in the embedding layer. As a compact model, LLRec maintains its robustness via an innovative teacher-student training framework, where a powerful teacher model is trained on the cloud to learn essential knowledge from available contextual data, and the simplified student model LLRec is trained under the guidance of the teacher model. The final LLRec is downloaded and deployed on users' mobile devices to generate accurate recommendations using only users' local data. As a result, LLRec significantly reduces the dependency on cloud servers, allowing for next POI recommendation in a stable, cost-effective and secure way. Extensive experiments on two large-scale recommendation datasets further demonstrate the superiority of our proposed solution.
CCECE 2003 - Canadian Conference on Electrical and Computer Engineering. Toward a Caring and Humane Technology (Cat. No.03CH37436)
In many fields, the datasets used in data mining applications are of high dimensionality. Most existing clustering algorithms are effective and efficient when the dimensionality is low, but their performance and effectiveness degrade when the data space is high-dimensional. One reason is that their complexity increases exponentially with the dimensionality. To solve this problem, we put forward a ...
Data Mining for Business Applications
This chapter presents four applications of data mining in social security. The first is an application of decision trees and association rules to find the demographic patterns of customers. Sequence mining is used in the second application to find activity sequences ...
Journal of Software, 2005
Proceedings of the ACM Web Conference 2023
IEEE Transactions on Knowledge and Data Engineering, 2006
Recent-biased approximations have recently received increased attention as a mechanism for learning trend patterns from time series or data streams. They have shown promise for clustering time series and for incremental pattern maintenance. In this paper, we design a generalized dimension-reduction framework for recent-biased approximations, aiming to make traditional dimension-reduction techniques applicable to recent-biased time series analysis. The framework is designed in two ways: an equi-segmented scheme and a vari-segmented scheme. In both schemes, time series data are first partitioned into segments and a dimension-reduction technique is applied to each segment. Then, more coefficients are kept for more recent data and fewer for older data. Thus, more details are preserved for recent data while fewer coefficients are kept for the whole time series, which greatly improves efficiency. We experimentally evaluate the proposed approach, and demonstrate that traditional dimension-reduction techniques, such as SVD, DFT, DWT, PIP, PAA, and landmarks, can be embedded into our framework for recent-biased approximations over streaming time series.
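The equi-segmented scheme described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it uses Piecewise Aggregate Approximation (PAA) as the embedded dimension-reduction technique, and the segment count and coefficient counts are invented for demonstration.

```python
# Equi-segmented recent-biased approximation: partition the series into
# equal-length segments, then keep more PAA coefficients for more recent
# segments than for older ones. Sketch only; parameters are illustrative.

def paa(segment, n_coeffs):
    """Reduce a segment to n_coeffs averages (Piecewise Aggregate Approximation)."""
    size = len(segment) / n_coeffs
    return [
        sum(segment[int(i * size):int((i + 1) * size)]) /
        len(segment[int(i * size):int((i + 1) * size)])
        for i in range(n_coeffs)
    ]

def recent_biased_paa(series, n_segments=4, base_coeffs=1):
    """Apply PAA per segment, keeping (i + 1) * base_coeffs coefficients
    for segment i, so the most recent segment keeps the most detail."""
    seg_len = len(series) // n_segments
    approx = []
    for i in range(n_segments):
        segment = series[i * seg_len:(i + 1) * seg_len]
        approx.append(paa(segment, base_coeffs * (i + 1)))
    return approx

series = [float(x) for x in range(16)]  # a toy "stream": 0.0 .. 15.0
print(recent_biased_paa(series))
# → [[1.5], [4.5, 6.5], [8.0, 9.0, 10.5], [12.0, 13.0, 14.0, 15.0]]
```

Note how the oldest segment collapses to a single average while the newest is kept at full resolution, which is the recent-biased trade-off the framework formalises.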
Some of these packages are not specifically for data mining, but they are included here because they are useful in data mining applications. 1. Clustering packages: fpc, cluster, pvclust, mclust. Partitioning-based clustering: kmeans, pam, pamk, clara. Hierarchical clustering: hclust ...
IEEE/WIC/ACM International Conference on Web Intelligence
Online job boards have greatly improved the efficiency of job searching and have also provided valuable data for labour market research. However, there is a high proportion of duplicate job postings in most (if not all) job boards, because recruiters and job boards seek to improve their coverage of the market by integrating job postings from many different sources. These duplicate postings undermine the usability of job boards and the quality of labour market analytics derived from them. In this paper, we tackle the challenging problem of duplicate detection in online job postings. Specifically, we design a framework for duplicate detection and, under the framework, implement and test 24 methods built with four different tokenisers, three vectorisers and six similarity measures. We conduct a comparative study and experimental evaluation of the 24 methods and compare their performance with a baseline approach. All methods are tested on a real-world dataset from a job board platform and are evaluated with six performance metrics. The experiments reveal that the top two methods are Overlap with skip-gram (OS) and Overlap with n-gram (OG), followed by TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS), and that all four of these methods outperform the baseline approach in detecting duplicates. CCS CONCEPTS • Applied computing → Document analysis; • Computing methodologies → Information extraction; • Information systems → Data cleaning.
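One of the method families evaluated above, a character n-gram tokeniser paired with the overlap coefficient (the "OG" configuration), can be sketched as follows. The n-gram size and decision threshold here are illustrative assumptions, not values taken from the paper.

```python
# Duplicate detection sketch: tokenise each posting into character n-grams,
# then compare the two token sets with the overlap coefficient.

def char_ngrams(text, n=3):
    """Tokenise a string into the set of its lower-cased character n-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_coefficient(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def is_duplicate(posting1, posting2, n=3, threshold=0.7):
    """Flag two postings as duplicates when their n-gram overlap
    meets the (illustrative) threshold."""
    return overlap_coefficient(char_ngrams(posting1, n),
                               char_ngrams(posting2, n)) >= threshold

a = "Senior Data Scientist - Sydney, full time"
b = "Senior Data Scientist (Sydney) full-time"
print(is_duplicate(a, b))  # → True
```

Character n-grams make the comparison robust to small formatting differences (punctuation, hyphenation), which is exactly the kind of variation duplicate postings exhibit.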
Unlike traditional positive sequential pattern mining, negative sequential pattern mining considers both positive and negative relationships between items. Negative sequential pattern mining does not necessarily follow the Apriori principle, and its search space is much larger than that of positive pattern mining. After giving definitions and some constraints of negative sequential patterns, this paper proposes a new method for mining negative sequential patterns, called Negative-GSP. Negative-GSP can find negative sequential patterns effectively and efficiently by joining and pruning, and extensive experimental results show the efficiency of the method.
Communications in Computer and Information Science
Clustering is one of the most important techniques in data mining. This chapter presents a survey of popular approaches to data clustering, including well-known techniques such as partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering, as well as recent advances such as subspace clustering, text clustering and data stream clustering. The major challenges and future trends of data clustering are also introduced. The remainder of this chapter is organized as follows. The background of data clustering is introduced in Section 2, including the definition of clustering, categories of clustering techniques, features of good clustering algorithms, and the validation of clustering. Section 3 presents the main approaches to clustering, which range from classic partitioning and hierarchical clustering to recent approaches of bi-clustering and semi-supervised clustering. Challenges and future trends...
Energy costs can be a major component of operational costs for water utilities. Improving operational efficiency, including optimising energy costs while maintaining continuity of supply, is one way to reduce overall operational costs. To address this challenge, we propose an effective optimisation model to minimise the energy cost of water distribution networks. A simulation of the model over a water distribution network in Sydney demonstrated that a 15% saving in energy cost could be achieved using this approach, compared with the existing rule-based method.
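The flavour of this optimisation can be illustrated with a toy model: schedule a single pump over a short horizon so that a storage tank never runs dry, while minimising energy cost under a time-of-use tariff. All numbers (tariff, demand, tank limits) are invented for illustration; the paper's actual model and network are far more complex.

```python
# Toy pump-scheduling optimisation: brute-force search over on/off schedules,
# keeping only those that respect tank limits, and picking the cheapest.
from itertools import product

TARIFF = [0.10, 0.10, 0.30, 0.30]  # $/kWh per period (off-peak, then peak)
DEMAND = [2.0, 3.0, 3.0, 2.0]      # water drawn from the tank per period
PUMP_RATE = 5.0                    # water pumped per period when the pump is on
PUMP_ENERGY = 10.0                 # kWh used per period when the pump is on
TANK_MIN, TANK_MAX, TANK_START = 1.0, 12.0, 4.0

def feasible_cost(schedule):
    """Return the energy cost of an on/off schedule, or None if the tank
    level ever leaves [TANK_MIN, TANK_MAX]."""
    level, cost = TANK_START, 0.0
    for on, tariff, demand in zip(schedule, TARIFF, DEMAND):
        level += (PUMP_RATE if on else 0.0) - demand
        if not (TANK_MIN <= level <= TANK_MAX):
            return None
        cost += PUMP_ENERGY * tariff if on else 0.0
    return cost

# Exhaustive enumeration is fine for a toy horizon of 4 periods.
best = min(
    (s for s in product([0, 1], repeat=len(TARIFF)) if feasible_cost(s) is not None),
    key=feasible_cost,
)
print(best, feasible_cost(best))  # → (1, 1, 0, 0) 2.0
```

The cheapest feasible plan pumps only during the off-peak periods, shifting energy use away from the peak tariff, which is the behaviour a real network-scale optimiser seeks at much larger scale.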
AI 2019: Advances in Artificial Intelligence
Incomplete data are quite common and can degrade statistical inference, often affecting evidence-based policymaking. A typical example is the Business Longitudinal Analysis Data Environment (BLADE), an Australian Government national data asset. In this paper, motivated by helping BLADE practitioners select and implement advanced imputation methods with a solid understanding of the impact different methods will have on data accuracy and reliability, we implement and examine the performance of data imputation techniques based on 12 machine learning algorithms, ranging from linear regression to neural networks. We compare the performance of these algorithms and assess the impact of various settings, including the number of input features and the length of time spans. To examine generalisability, we also impute two features with distinct characteristics. Experimental results show that three ensemble algorithms (extra trees regressor, bagging regressor and random forest) consistently maintain high imputation performance over the benchmark linear regression across a range of performance metrics. Among them, we recommend the extra trees regressor for its accuracy and computational efficiency.
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
In Disability Employment Services (DES), a growing problem is recommending to disabled job seekers which skill to upgrade, and to what level, to maximise their employment potential. This problem involves counterfactual reasoning to infer the causal effects of factors on employment status and to recommend the most effective intervention. Related methods cannot solve our problem adequately, since they are developed for non-counterfactual challenges, for binary causal factors, or for randomized trials. In this paper, we present a causality-based method to tackle the problem. The method includes two stages: causal factors of employment status are first detected from data; we then combine a counterfactual reasoning framework with a machine learning approach to build an interpretable model for generating personalized recommendations. Experiments on both synthetic datasets and a real case study from a DES provider show consistently promising performance in improving the employability of disabled job seekers. Results from the case study disclose effective factors and their best levels for intervention to increase employability. The most effective intervention varies among job seekers, and our model can separate job seekers by the degree of employability increase, which helps DES providers allocate resources for employment assistance. Moreover, causal interpretability makes our recommendations actionable in DES business practice.
Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011
Mining Negative Sequential Patterns (NSP) is much more challenging than mining Positive Sequential Patterns (PSP) due to the high computational complexity and huge search space required in calculating Negative Sequential Candidates (NSC). Very few approaches are available for mining NSP, and they mainly rely on re-scanning databases after identifying PSP, which makes them very inefficient. In this paper, we propose an efficient algorithm for mining NSP, called e-NSP, which mines NSP by only involving the identified PSP, without re-scanning databases. First, negative containment is defined to determine whether or not a data sequence contains a negative sequence. Second, an efficient approach is proposed to convert the negative containment problem into a positive containment problem; the supports of NSC are then calculated based only on the corresponding PSP. Finally, a simple but efficient approach is proposed to generate NSC. With e-NSP, mining NSP does not require additional database scans, and existing PSP mining algorithms can be integrated into e-NSP to mine NSP efficiently. e-NSP is compared with two currently available NSP mining algorithms on 14 synthetic and real-life datasets. Extensive experiments show that e-NSP takes as little as 3% of the runtime of the baseline approaches and is applicable for efficient mining of NSP in large datasets.
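The negative-containment idea can be illustrated with a simplified checker. The semantics below (a negated item must not appear between the matches of its neighbouring positive items) are a deliberate simplification of the paper's formal definition, for intuition only; e-NSP itself avoids such per-sequence scans entirely.

```python
# Simplified negative containment: does a data sequence contain a pattern
# like <a, not-b, c>? Positive items must occur in order; each negated
# item must be absent between its neighbouring positive matches.
# This is an illustrative simplification, not e-NSP's definition.

def contains_negative_sequence(data, pattern):
    """data: a list of items, e.g. ['a', 'x', 'c'].
    pattern: a list of (item, is_negative) pairs, e.g.
             [('a', False), ('b', True), ('c', False)] for <a, not-b, c>."""
    # Match the positive items as an ordinary subsequence, recording positions.
    positions, start = [], 0
    for item, neg in pattern:
        if neg:
            positions.append(None)  # placeholder for a negative element
            continue
        try:
            idx = data.index(item, start)
        except ValueError:
            return False  # the positive part is not contained
        positions.append(idx)
        start = idx + 1
    # Check each negated item is absent between its neighbouring matches.
    for i, (item, neg) in enumerate(pattern):
        if not neg:
            continue
        lo = positions[i - 1] + 1 if i > 0 else 0
        hi = next((p for p in positions[i + 1:] if p is not None), len(data))
        if item in data[lo:hi]:
            return False
    return True

pattern = [('a', False), ('b', True), ('c', False)]  # <a, not-b, c>
print(contains_negative_sequence(['a', 'x', 'c'], pattern))  # → True
print(contains_negative_sequence(['a', 'b', 'c'], pattern))  # → False
```

Even this toy version shows why naive NSP mining is expensive: every candidate requires such a check against every data sequence, which is exactly the re-scanning cost e-NSP eliminates.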
Lecture Notes in Computer Science, 2003
The clustering algorithm GDILC relies on grid-based density clustering and is designed to discover clusters of arbitrary shapes and to eliminate noise. However, it is not scalable to large high-dimensional datasets. In this paper, we improve this algorithm in five important directions. Through these improvements, AGRID achieves high scalability and can process large high-dimensional datasets. It can discover clusters ...