Han Liu | Cardiff University

Papers by Han Liu

Induction of Classification Rules by Gini-Index Based Rule Generation

Rule learning is one of the most popular areas in machine learning research, because the outcome of learning is to produce a set of rules, which not only provides accurate predictions but also shows a transparent process of mapping inputs to outputs. In general, rule learning approaches can be divided into two main types, namely, 'divide and conquer' and 'separate and conquer'. The former type of rule learning is also known as Top-Down Induction of Decision Trees, which means to learn a set of rules represented in the form of a decision tree. This approach results in the production of a large number of complex rules (usually due to the replicated sub-tree problem), which lowers the computational efficiency in both the training and testing stages, and leads to the overfitting of training data. Due to this problem, researchers have been gradually motivated to develop 'separate and conquer' rule learning approaches, also known as covering approaches, by learning a set of rules on a sequential basis. In particular, a rule is learned and the instances covered by this rule are deleted from the training set, such that the learning of the next rule is based on a smaller training set. In this paper, we propose a new algorithm, GIBRG, which employs Gini-Index to measure the quality of each rule being learned, in the context of 'separate and conquer' rule learning. Our experiments show that the proposed algorithm outperforms both decision tree learning algorithms (C4.5, CART) and 'separate and conquer' approaches (Prism). In addition, it also leads to a smaller number of rules and rule terms, thus being more computationally efficient and less prone to overfitting.
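The covering loop described in this abstract (learn a rule, delete the instances it covers, repeat on the smaller training set) can be sketched as follows. This is only an illustrative sketch and not the GIBRG algorithm itself: the function names, the greedy choice of attribute=value terms by lowest Gini impurity, and the tie-breaking by coverage are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def learn_rule(rows, target):
    """Greedily add attribute=value terms until the covered rows are pure."""
    terms, covered = {}, rows
    while gini([r[target] for r in covered]) > 0:
        candidates = {(a, r[a]) for r in covered for a in r
                      if a != target and a not in terms}
        if not candidates:
            break  # no attributes left: accept an impure rule

        def score(term):
            # Lower impurity is better; ties go to the larger coverage.
            subset = [r for r in covered if r[term[0]] == term[1]]
            return (gini([r[target] for r in subset]), -len(subset))

        # sorted() makes the choice deterministic when scores tie.
        attr, value = min(sorted(candidates), key=score)
        terms[attr] = value
        covered = [r for r in covered if r[attr] == value]
    label = Counter(r[target] for r in covered).most_common(1)[0][0]
    return terms, label, covered

def separate_and_conquer(rows, target):
    """Learn one rule, delete the rows it covers, repeat on the remainder."""
    rules, remaining = [], list(rows)
    while remaining:
        terms, label, covered = learn_rule(remaining, target)
        rules.append((terms, label))
        remaining = [r for r in remaining if r not in covered]
    return rules

# Made-up toy data; attribute values are plain strings.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
rules = separate_and_conquer(rows, "play")
```

The result is an ordered rule list: each rule is learned on a smaller training set than the one before it, so the rules are meant to be tried in order, and a final rule with no terms acts as a default.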

Fuzzy Rule Based Systems for Gender Classification from Blog Data

International Conference on Advanced Computational Intelligence, Mar 29, 2018

Gender classification is a popular machine learning task, which has been undertaken in various domains, e.g. business intelligence, access control and cyber security. In the context of information granulation, gender related information can be divided into three types, namely, biological information, vision based information and social network based information. In traditional machine learning, gender identification has typically been treated as a discriminative classification task, i.e. it is aimed at learning a classifier that discriminates between male and female. In this paper, we argue that it is not always appropriate to identify gender by means of discriminative classification, especially given that both male and female people are highly diverse and thus individuals of different genders could be highly similar to each other in terms of their characteristics. In order to address the above issue, we propose the use of a fuzzy method for generative classification of gender. In particular, we focus on gender classification based on social network information. We conduct an experimental study using a blog data set, and compare the fuzzy method with C4.5, Naive Bayes and Support Vector Machine in terms of classification performance. The results show that the fuzzy method outperforms the other methods and is also capable of capturing the diversity of both male and female people and dealing with the fuzziness in terms of gender identification.

Unified Framework for Control of Machine Learning Tasks towards Effective and Efficient Processing of Big Data

Witold Pedrycz and Shyi-Ming Chen (eds) Data Science and Big Data: An Environment of Computational Intelligence

Big data can be generally characterised by 5 Vs: Volume, Velocity, Variety, Veracity and Variability. Many studies have been focused on using machine learning as a powerful tool of big data processing. In the machine learning context, learning algorithms are typically evaluated in terms of accuracy, efficiency, interpretability and stability. These four dimensions can be strongly related to veracity, volume, variety and variability and are impacted by both the nature of learning algorithms and characteristics of data. This chapter analyses in depth how the quality of computational models can be impacted by data characteristics as well as strategies involved in learning algorithms. This chapter also introduces a unified framework for control of machine learning tasks towards appropriate employment of algorithms and efficient processing of big data. In particular, this framework is designed to achieve effective selection of data pre-processing techniques towards effective selection of relevant attributes, sampling of representative training and test data, and appropriate dealing with missing values and noise. More importantly, this framework allows the employment of suitable machine learning algorithms on the basis of the training data provided from the data pre-processing stage towards building of accurate, efficient and interpretable computational models.

Fuzzy Rule Based Systems for Interpretable Sentiment Analysis

International Conference on Advanced Computational Intelligence

Sentiment analysis, which is also known as opinion mining, aims to recognise the attitude or emotion of people through natural language processing, text analysis and computational linguistics. In recent years, many studies have focused on sentiment classification in the context of machine learning, e.g. to identify that a sentiment is positive or negative. In particular, the bag-of-words method has been popularly used to transform textual data into structured data, in order to enable the direct use of machine learning algorithms for sentiment classification. Through the bag-of-words method, each single term in a text document is turned into a single attribute to make up a structured data set, which results in high dimensionality of the data set and thus negative impact on the interpretability of computational models for sentiment analysis. This paper proposes the use of fuzzy rule based systems as computational models towards accurate and interpretable analysis of sentiments. The use of fuzzy logic is better aligned with the inherent uncertainty of language, while the "white box" characteristic of the rule based learning approaches leads to better interpretability of the results. The proposed approach is tested on four datasets containing movie reviews; the aim is to compare its performance in terms of accuracy with two other approaches for sentiment analysis that are known to perform very well. The results indicate that the fuzzy rule based approach performs marginally better than the well-known machine learning techniques, while reducing the computational complexity and increasing the interpretability.
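To illustrate why the bag-of-words method leads to high dimensionality, the sketch below turns each distinct term into one attribute (column). The tokenisation rule, the function name and the two example reviews are assumptions made for this illustration and are not taken from the paper.

```python
import re

def bag_of_words(documents):
    """Return (vocabulary, matrix) with one count column per distinct term."""
    # Assumed tokenisation: lowercase runs of letters and apostrophes.
    tokenised = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenised:
        row = [0] * len(vocabulary)
        for term in tokens:
            row[index[term]] += 1
        matrix.append(row)
    return vocabulary, matrix

reviews = ["A great film, great cast", "A dull film"]  # made-up examples
vocabulary, matrix = bag_of_words(reviews)
# Every distinct term becomes an attribute, so the number of columns
# grows with the vocabulary of the whole collection.
```

Even these two short reviews produce five attributes; a realistic review collection produces thousands, which is the interpretability problem the abstract refers to.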

Transformation of Discriminative Single-Task Classification into Generative Multi-Task Classification in Machine Learning Context

International Conference on Advanced Computational Intelligence

Classification is one of the most popular tasks of machine learning, which has been involved in broad applications in practice, such as decision making, sentiment analysis and pattern recognition. It involves the assignment of a class/label to an instance and is based on the assumption that each instance can only belong to one class. This assumption does not always hold, especially for indexing problems (when an item, such as a movie, can belong to more than one category) or for complex items that reflect more than one aspect, e.g. a product review outlining advantages and disadvantages may be at the same time positive and negative. To address this problem, multi-label classification has been increasingly used in recent years, by transforming the data to allow an instance to have more than one label; the nature of learning, however, is the same as traditional learning, i.e. learning to discriminate one class from other classes, and the output of a classifier is still single (although the output may contain a set of labels). In this paper we propose a fundamentally different type of classification in which the membership of an instance to all classes (or labels) is judged by a multiple-input-multiple-output classifier through generative multi-task learning. An experimental study is conducted on five UCI data sets to show empirically that an instance can belong to more than one class, by using the theory of fuzzy logic and checking the extent to which an instance belongs to each single class, i.e. the fuzzy membership degree. The paper positions new research directions on multi-task classification in the context of both supervised learning and semi-supervised learning.
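The idea of a fuzzy membership degree, where one instance belongs to several classes to different extents, can be sketched with a triangular membership function. The abstract does not say which membership functions the paper uses, so the triangular shape, the two fuzzy sets and the 0 to 10 score scale below are all assumptions chosen for illustration.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set.

    The set has feet at a and c (degree 0) and a peak at b (degree 1).
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy sets for a review score on a 0-10 scale.
fuzzy_sets = {"negative": (-1, 2, 6), "positive": (4, 8, 11)}

def memberships(x):
    """Membership degree of x in every class, judged independently."""
    return {label: triangular(x, *abc) for label, abc in fuzzy_sets.items()}

mixed = memberships(5)   # non-zero degree for both classes
clear = memberships(2)   # fully negative, not positive at all
```

A score of 5 falls in the overlap of the two sets, so it receives a non-zero degree for both classes, whereas a discriminative classifier would be forced to pick exactly one.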

Granular Computing Based Approach for Classification towards Reduction of Bias in Ensemble Learning

Machine learning has become a powerful approach in practical applications such as decision making, sentiment analysis and ontology engineering. In order to improve the overall performance in machine learning tasks, ensemble learning has become increasingly popular by combining different learning algorithms or models. Popular approaches of ensemble learning include Bagging and Boosting, which involve voting towards the final classification. The voting in both Bagging and Boosting could result in incorrect classification due to the bias in the way voting takes place. In order to reduce the bias in voting, this paper proposes a probabilistic approach of voting in the context of granular computing towards improvement of overall accuracy of classification. An experimental study is reported to validate the proposed approach of voting by using 15 data sets from the UCI repository. The results show that probabilistic voting is effective in increasing the accuracy through reduction of the bias in voting. This paper contributes to the theoretical and empirical analysis of causes of bias in voting, towards advancing ensemble learning approaches through the use of probabilistic voting.
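The contrast between deterministic and probabilistic voting can be sketched as below. The abstract does not define the voting procedure, so this sketch assumes that probabilistic voting means drawing the final class at random with probability proportional to its vote weight; the function names and the vote weights are likewise made up for the example.

```python
import random
from collections import Counter

def majority_vote(votes):
    """Return the class with the largest total vote weight."""
    return max(sorted(votes), key=lambda label: votes[label])

def probabilistic_vote(votes, rng):
    """Draw a class with probability proportional to its vote weight."""
    labels = sorted(votes)
    weights = [votes[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

votes = {"yes": 0.55, "no": 0.45}  # made-up weights from an ensemble
rng = random.Random(0)             # fixed seed for repeatability
draws = Counter(probabilistic_vote(votes, rng) for _ in range(10_000))
```

Majority voting returns "yes" every time, however narrow the margin. Under the assumed scheme, "no" is still selected in roughly 45% of the draws, so a near-tie is no longer treated as a certainty.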

Complexity Control in Rule Based Models for Classification in Machine Learning Context

UK Workshop on Computational Intelligence, Sep 2016

A rule based model is a special type of computational model, which can be built by using expert knowledge or learning from real data. In this context, rule based modelling approaches can be divided into two categories: expert based approaches and data based approaches. Due to the vast and rapid increase in data, the latter approach has become increasingly popular for building rule based models. In the machine learning context, rule based models can be evaluated in three main dimensions, namely accuracy, efficiency and interpretability. All these dimensions are usually affected by the key characteristic of a rule based model which is typically referred to as model complexity. This paper focuses on theoretical and empirical analysis of complexity of rule based models, especially for classification tasks. In particular, the significance of model complexity is argued and a list of factors that impact on the complexity is identified. This paper also proposes several techniques for effective control of model complexity, and experimental studies are reported for presentation and discussion of results in order to analyze critically and comparatively the extent to which the proposed techniques are effective in control of model complexity.

Rule Based Networks: An Efficient and Interpretable Representation of Computational Models

Due to the vast and rapid increase in the size of data, data mining has been an increasingly important tool for the purpose of knowledge discovery to prevent the presence of rich data but poor knowledge. In this context, machine learning can be seen as a powerful approach to achieve intelligent data mining. In practice, machine learning is also an intelligent approach for predictive modelling. Rule learning methods, a special type of machine learning method, can be used to build a rule based system as a special type of expert system for both knowledge discovery and predictive modelling. A rule based system may be represented through different structures. The techniques for representing rules are known as rule representation, which is significant for knowledge discovery in relation to the interpretability of the model, as well as for predictive modelling with regard to efficiency in predicting unseen instances. This paper justifies the significance of rule representation and presents several existing representation techniques. Two types of novel networked topologies for rule representation are developed against existing techniques. This paper also includes complexity analysis of the networked topologies in order to show their advantages compared with the existing techniques in terms of model interpretability and computational efficiency.

Nature and Biology Inspired Approach of Classification towards Reduction of Bias in Machine Learning

International Conference on Machine Learning and Cybernetics

Machine learning has become a powerful tool in real applications such as decision making, sentiment prediction and ontology engineering. In the form of learning strategies, machine learning can be specialized into two types: supervised learning and unsupervised learning. Classification is a special type of supervised learning task, which can also be referred to as categorical prediction. In other words, classification tasks involve predictions of the values of discrete attributes. Some popular classification algorithms include Naïve Bayes and K Nearest Neighbour. The above type of classification algorithms generally involves voting towards classifying unseen instances. In traditional ways, the voting is made on the basis of any employed statistical heuristics such as probability. In Naïve Bayes, the voting is made through selecting the class with the highest posterior probability on the basis of the values of all independent attributes. In K Nearest Neighbour, majority voting is usually used towards classifying test instances. This kind of voting is considered to be biased, which may lead to overfitting. In order to avoid such overfitting, this paper proposes to employ a nature and biology inspired approach of voting referred to as probabilistic voting towards reduction of bias. An extended experimental study is reported to show how the probabilistic voting can manage to effectively reduce the bias towards improvement of classification accuracy.

Rule Based Systems: A Granular Computing Perspective

A rule based system is a special type of expert system, which typically consists of a set of if-then rules. Such rules can be used in the real world for both academic and practical purposes. In general, rule based systems are involved in knowledge discovery tasks for both purposes and predictive modelling tasks for the latter purpose. In the context of granular computing, each of the rules that make up a rule based system can be seen as a granule. This is due to the fact that granulation in general means decomposition of a whole into several parts. Similarly, each rule consists of a number of rule terms. From this point of view, each rule term can also be seen as a granule. As mentioned above, rule based systems can be used for the purpose of knowledge discovery, which means to extract information or knowledge discovered from data. Therefore, rules and rule terms that make up a rule based system are considered as information granules. This paper positions the research of rule based systems in the granular computing context, which explores ways of achieving advances in the former area through the novel use of theories and techniques in the latter area. In particular, this paper gives a certain perspective on how to use set theory for management of information granules for rules/rule terms and different types of computational logic for reduction of learning bias. The effectiveness is critically analyzed and discussed. Further directions of this research area are recommended towards achieving advances in rule based systems through the use of granular computing theories and techniques.

Interpretability of Computational Models for Sentiment Analysis

Witold Pedrycz and Shyi-Ming Chen (eds.), Sentiment Analysis and Ontology Engineering: An Environment of Computational Intelligence, Mar 23, 2016

Sentiment analysis, which is also known as opinion mining, has been an increasingly popular research area focusing on sentiment classification/regression. In many studies, computational models have been considered as effective and efficient tools for sentiment analysis. Computational models could be built by using expert knowledge or learning from data. From this viewpoint, the design of computational models could be categorized into expert based design and data based design. Due to the vast and rapid increase in data, the latter approach of design has become increasingly popular for building computational models. A data based design typically follows machine learning approaches, each of which involves a particular strategy of learning. Therefore, the resulting computational models are usually represented in different forms. For example, neural network learning results in models in the form of a multi-layer perceptron network, whereas decision tree learning results in a rule set in the form of a decision tree. On the basis of the above description, interpretability has become a main problem that arises with computational models. This chapter explores the significance of interpretability for computational models and analyzes the factors that impact on interpretability. This chapter also introduces several ways to evaluate and improve the interpretability of computational models which are used as sentiment analysis systems. In particular, rule based systems, a special type of computational model, are used as an example for illustration with respect to evaluation and improvements through the use of computational intelligence methodologies.

Induction of Modular Classification Rules by Information Entropy Based Rule Generation

Innovative Issues in Intelligent Systems, edited by Vassil Sgurev, Ronald Yager, Janusz Kacprzyk, Vladimir Jotsov, Feb 3, 2016

Prism has been developed as a modular classification rule generator following the separate and conquer approach since 1987 due to the replicated sub-tree problem occurring in Top-Down Induction of Decision Trees (TDIDT). A series of experiments have been done to compare the performance between Prism and TDIDT which proved that Prism may generally provide a similar level of accuracy as TDIDT but with fewer rules and fewer terms per rule. In addition, Prism is generally more tolerant to noise with consistently better accuracy than TDIDT. However, the authors have identified through some experiments that Prism may also give rule sets which tend to underfit training sets in some cases. This paper introduces a new modular classification rule generator, which follows the separate and conquer approach, in order to avoid the problems which arise with Prism. In this paper, the authors review the Prism method and its advantages compared to TDIDT as well as its disadvantages that are overcome by a new method using Information Entropy Based Rule Generation (IEBRG). The authors also set up an experimental study on the performance of the new method in classification accuracy and computational efficiency. The method is also evaluated comparatively with Prism.

Collaborative Rule Generation: An Ensemble Learning Approach

Due to the vast and rapid increase in data, data mining has become an increasingly important tool for the purpose of knowledge discovery in order to prevent the presence of rich data but poor knowledge. Data mining tasks can be undertaken in two ways, namely, manual walkthrough of data and use of machine learning approaches. Due to the presence of big data, machine learning has thus become a powerful tool to do data mining in intelligent ways. A popular approach of machine learning is inductive learning, which can be used to generate a rule set (a set of rules) using a particular algorithm. Inductive learning can involve a single base algorithm learning from a single data set following a standard learning approach. In this approach, the learning algorithm can generate a single rule set such as decision trees. On the other hand, the inductive learning can also involve a single base algorithm learning from multiple data sets following an ensemble learning approach. In this approach, the learning algorithm can generate multiple rule sets such as random forests. The latter approach is usually designed to reduce overfitting of models that usually arises when the former approach is adopted. In this context, the ensemble learning approach usually enables the improvement of the overall accuracy in prediction. The aim of this paper is to introduce a new approach of ensemble learning called Collaborative Rule Generation. In the new approach, the inductive learning involves multiple base algorithms learning from a single data set to generate a single rule set, which aims to enable each rule to have a higher quality. This paper also includes an experimental study validating the Collaborative Rule Generation approach and discusses the results in both quantitative and qualitative ways.

Rule Based Systems for Big Data: A Machine Learning Approach

Studies in Big Data 13, Springer, Sep 10, 2015

The ideas introduced in this book explore the relationships among rule based systems, machine learning and big data. Rule based systems are seen as a special type of expert systems, which can be built by using expert knowledge or learning from real data.
The book focuses on the development and evaluation of rule based systems in terms of accuracy, efficiency and interpretability. In particular, a unified framework for building rule based systems, which consists of the operations of rule generation, rule simplification and rule representation, is presented. Each of these operations is detailed using specific methods or techniques. In addition, this book also presents some ensemble learning frameworks for building ensemble rule based systems.

Hybrid Ensemble Learning Approach for Generation of Classification Rules

International Conference on Machine Learning and Cybernetics 2015, Jul 2015

Due to the daily increase in the size of data, machine learning has become a popular approach for intelligent processing of data. In particular, machine learning algorithms are used to discover meaningful knowledge or build predictive models from data. For example, inductive learning algorithms involve generation of rules which can be in the form of either a decision tree or if-then rules. However, most learning algorithms suffer from overfitting of training data. In other words, these learning algorithms can build models that perform extremely well on training data but poorly on other data. The overfitting problem originates from both learning algorithms and data. In this context, the nature of the machine learning problem can be referred to as bias and variance. The former originates from learning algorithms whereas the latter originates from data. Therefore, reduction of overfitting can be achieved through scaling up algorithms on one side or scaling down data on the other side. Both bias and variance can be reduced through use of ensemble learning approaches. This paper introduces particular ways to address the issues of overfitting of rule based classifiers through both scaling up algorithms and scaling down data in the context of ensemble learning.

Network Based Rule Representation for Knowledge Discovery and Predictive Modelling

IEEE International Conference on Fuzzy Systems 2015, Aug 2015

Due to the vast and rapid increase in data, data mining has been an increasingly important tool for the purpose of knowledge discovery to prevent the presence of rich data but poor knowledge. In this context, machine learning can be seen as a powerful approach to achieve intelligent data mining. In practice, machine learning is also an intelligent approach for predictive modelling. A special type of machine learning method, known as rule based methods (e.g. decision trees), can be used to build a rule based system as a special type of expert system for both knowledge discovery and predictive modelling. A rule based system may be represented through different structures. The techniques for representing rules are known as rule representation, which is significant for knowledge discovery in relation to the interpretability of the model, as well as for predictive modelling with regard to efficiency in predicting unseen instances. This paper justifies the significance of rule representation. Some networked topologies for rule representation are introduced against existing techniques. The network topologies are validated using complexity analysis in order to show their advantages compared with the existing techniques in terms of model interpretability and computational efficiency.

Collaborative Decision Making by Ensemble Rule Based Classification Systems

Witold Pedrycz and Shyi-Ming Chen (eds), Granular Computing and Decision-Making, Studies in Big Data 10, Springer, Apr 2015

Rule based classification is a popular approach for decision making. Multiple rule based classifiers can also work together for group decision making by using an ensemble learning approach. This kind of expert system is referred to as an ensemble rule based classification system by means of a system of systems. In machine learning, an ensemble learning approach is usually adopted in order to improve overall predictive accuracy, which means to provide highly trusted decisions. This chapter introduces basic concepts of ensemble learning and reviews Random Prism to analyze its performance. This chapter also introduces an extended framework of ensemble learning, which is referred to as Collaborative and Competitive Random Decision Rules (CCRDR) and includes Information Entropy Based Rule Generation (IEBRG) and original Prism in addition to PrismTCS as base classifiers. This is in order to overcome the identified limitations of Random Prism. Each of the base classifiers mentioned above is also introduced with respect to its essence and applications. An experimental study is undertaken towards comparative validation between the CCRDR and Random Prism. Contributions, ongoing work and future work are also highlighted.

Categorization and Construction of Rule Based Systems

International Conference on Engineering Applications of Neural Networks 2014, Sep 2014

Expert systems have become increasingly popular owing to their commercial importance. A rule based system is a special type of expert system, which consists of a set of 'if-then' rules and can be applied as a decision support system in many areas such as healthcare, transportation and security. Rule based systems can be constructed based on both expert knowledge and data. This paper aims to introduce the theory of rule based systems, especially the categorization and construction of such systems, from a conceptual point of view. This paper also introduces rule based systems for classification tasks in detail.

Unified Framework for Construction of Rule Based Classification Systems

Witold Pedrycz and Shyi-Ming Chen (eds), Information Granularity, Big Data and Computational Intelligence, Studies in Big Data 8, Springer, Jul 2014

Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule based expert systems and decision making systems. However, a principal problem that arises with most methods for generation of classification rules is the overfitting of training data. When Big Data is dealt with, this may result in the generation of a large number of complex rules. This may not only increase computational cost but also lower the accuracy in predicting further unseen instances. This has led to the necessity of developing pruning methods for the simplification of rules. In addition, classification rules are used further to make predictions after the completion of their generation. As far as efficiency is concerned, the first rule that fires should be found as soon as possible when searching through a rule set. Thus a suitable structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for construction of rule based classification systems consisting of three operations: rule generation, rule simplification and rule representation, particularly on Big Data. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations, as well as introduce some novel methods and techniques developed in their more recent research. The novel methods and techniques are also discussed in comparison to those existing ones reviewed earlier with respect to effective and efficient processing of Big Data.
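The prediction step mentioned in this abstract, searching a rule set for the first rule that fires, can be sketched with the simplest possible representation: an ordered list scanned from the top. This is a baseline for illustration only and not one of the representations proposed in the chapter; the rule format (a dictionary of attribute=value conditions plus a class label) and the example rules are assumptions.

```python
def first_firing_rule(rules, instance):
    """Return (position, label) of the first rule whose conditions all hold.

    Each rule is a (conditions, label) pair, where conditions maps an
    attribute name to the value it must take. Returns None if no rule fires.
    """
    for position, (conditions, label) in enumerate(rules):
        if all(instance.get(attr) == value
               for attr, value in conditions.items()):
            return position, label
    return None

# Made-up rule list; the last rule has no conditions and acts as a default.
rules = [
    ({"outlook": "rainy"}, "no"),
    ({"windy": "no"}, "yes"),
    ({}, "no"),
]
result = first_firing_rule(rules, {"outlook": "sunny", "windy": "no"})
```

With a flat list, the work per prediction grows with the number of rules checked before one fires and with the number of terms in each of those rules, which is why a large set of complex rules makes the choice of representation matter.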

Homogeneous and heterogeneous distributed classification for pocket data mining

Transactions on Large-Scale Data and Knowledge-Centered Systems V, Springer, Jan 2012

Pocket Data Mining (PDM) describes the full process of analysing data streams in mobile ad hoc distributed environments. Advances in mobile devices like smart phones and tablet computers have made it possible for a wide range of applications to run in such an environment. In this paper, we propose the adoption of data stream classification techniques for PDM. A thorough experimental study provides evidence that running heterogeneous/different, or homogeneous/similar, data stream classification techniques over vertically partitioned data (data partitioned according to the feature space) results in comparable performance to batch and centralised learning techniques.
