Abdullah Sonmez - Academia.edu
Papers by Abdullah Sonmez
Pattern Recognition Letters, 2014
Networked data consist of nodes and links between the nodes which indicate their dependencies. Node content features are available for all the data, whereas labels are available only for the training data. Given the features for all nodes and the labels for the training nodes, transductive classification predicts the labels of all remaining nodes. Learning algorithms that use both node content features and links have been developed. For example, collective classification algorithms use aggregated labels of neighbors (such as their sum or average), in addition to node features, as inputs to a classifier. The classifier is trained using the training data only. At test time, since the neighbors' labels are used as classifier inputs, the labels for the test set need to be determined through an iterative procedure. While it is usually very difficult to obtain labels for the whole dataset, features are usually much easier to obtain. In this paper, we introduce a new method of transductive network classification which can use the test node features when training the classifier. We train our classifier using enriched node features. The enriched node features include, in addition to the node's own features, the aggregated neighbors' features and aggregations of node and neighbor features passed through the simple logical operators OR and AND. Enriched features may contain irrelevant or redundant features, which could decrease classifier performance. Therefore, we employ feature selection to determine whether a feature among the set of enriched features should be used for classifier training. Our feature selection method, called FCBF#, is a fast, mutual-information-based filter method. Experimental results on three different network datasets show that classification accuracies obtained using network-enriched and selected features are comparable to or better than those of content-only or collective classification.
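As a concrete illustration of the enrichment step, here is a minimal Python sketch under the assumption of binary node features; the function names are ours, and the plain mutual-information filter is a simplified stand-in for FCBF#, which additionally removes redundant features.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def enrich_features(X, A):
    """X: (n, d) binary node-feature matrix; A: (n, n) 0/1 adjacency matrix.
    Returns each node's own features, its neighbors' averaged features,
    and elementwise OR / AND combinations of the two."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # avoid divide-by-zero
    neigh = (A @ X) / deg                              # mean neighbor features
    neigh_bin = (neigh >= 0.5).astype(int)             # binarize for OR / AND
    f_or = np.logical_or(X, neigh_bin).astype(int)
    f_and = np.logical_and(X, neigh_bin).astype(int)
    return np.hstack([X, neigh, f_or, f_and])

def select_top_k(X_enriched, y, train_idx, k=100):
    """Stand-in for FCBF#: keep the k enriched features with the highest
    mutual information with the training labels (a plain MI filter only)."""
    mi = mutual_info_classif(X_enriched[train_idx], y[train_idx])
    keep = np.argsort(mi)[::-1][:k]
    return X_enriched[:, keep], keep
```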
Lecture Notes in Computer Science, 2012
This paper develops PAC (probably approximately correct) error bounds for network classifiers in the transductive setting, where the network node inputs and links are all known, the training nodes' class labels are known, and the goal is to classify a working set of nodes that have unknown class labels. The bounds are valid for any model of network generation. They require working nodes to be selected independently, but not uniformly at random. For example, they allow different regions of the network to have different densities of unlabeled nodes.
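As a hedged illustration of the general form such transductive bounds can take (this is not the paper's actual result), independent selection of the working nodes permits a Hoeffding-style concentration argument even when the nodes are not identically distributed:

```latex
% Generic illustration only -- not the bound proved in the paper.
% If the m working nodes are selected independently (not necessarily
% uniformly) and Z_i = 1{h(x_i) != y_i} is the error indicator on node i,
% Hoeffding's inequality for independent bounded variables gives
\[
\Pr\left[\frac{1}{m}\sum_{i=1}^{m} Z_i \;\ge\;
         \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[Z_i] + \epsilon\right]
\;\le\; \exp\!\left(-2m\epsilon^{2}\right),
\]
% so the working-set error concentrates around its expectation at rate
% O(1/\sqrt{m}), regardless of the network-generation model.
```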
Many networks grow by adding successive cohorts – layers of nodes. Often, the nodes in each layer are selected independently of each other, but from a distribution that can depend on which nodes were selected for previous cohorts. For example, successive waves of friends invite their friends to join social networks. We present error bounds for collective classification over these networks.
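To make the cohort-growth model concrete, here is a toy generator; all specifics, such as one inviter per new node, are our own illustrative assumptions rather than the paper's model.

```python
import random

def grow_cohort_network(num_cohorts, cohort_size, seed=0):
    """Grow a network cohort by cohort: each node in a new layer attaches to
    an 'inviter' drawn from the existing network. Within a cohort, the draws
    are independent, but their distribution depends on earlier cohorts."""
    rng = random.Random(seed)
    nodes = [(0, i) for i in range(cohort_size)]  # seed cohort, no inviters
    edges = []
    for c in range(1, num_cohorts):
        existing = list(nodes)  # freeze: new-cohort choices are independent
        for i in range(cohort_size):
            v = (c, i)
            inviter = rng.choice(existing)  # placeholder distribution
            nodes.append(v)
            edges.append((inviter, v))
    return nodes, edges
```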
Collective classification algorithms [Macskassy et al. 2007, Sen et al. 2008] can be used for better classification of networked data when unlabeled test node features and links are available. In this study, we provide detailed results on the performance of collective classification algorithms when content or link noise is present. First, we show that collective classification algorithms are more robust to content noise than content-only classification. We also evaluate the performance of collective classification when additive link noise is present. We show that, especially when content and/or link noise is present, feature and/or node selection is essential for better collective classification.
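The abstract does not specify the exact noise models; a plausible sketch, assuming binary features with random bit flips for content noise and spurious random edges for additive link noise:

```python
import numpy as np

def add_content_noise(X, flip_prob, rng):
    """Content noise: flip each binary feature independently with
    probability flip_prob."""
    mask = rng.random(X.shape) < flip_prob
    return np.where(mask, 1 - X, X)

def add_link_noise(A, num_extra_edges, rng):
    """Additive link noise: insert edges between uniformly chosen,
    previously unconnected node pairs."""
    A = A.copy()
    n = A.shape[0]
    added = 0
    while added < num_extra_edges:
        i, j = rng.integers(0, n, size=2)
        if i != j and A[i, j] == 0:
            A[i, j] = A[j, i] = 1
            added += 1
    return A

# usage: rng = np.random.default_rng(0); X_noisy = add_content_noise(X, 0.1, rng)
```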
Lecture Notes in Social Networks, 2014
The production of social network data in many different forms and in huge amounts brings with it classification problems that need to be solved. In this chapter, we introduce a framework for classification in a social network. Aggregation of neighbor labels and sampling are two important aspects of classification in a social network. We give details of different aggregation and sampling methods. Then, we discuss different graph properties, especially homophily, which may be helpful in determining which type of classification algorithm should be used. We give details of a collective classification algorithm, ICA (Iterative Classification Algorithm), which can be used for semi-supervised learning in general and transductive learning in particular on a social network. We present classification results on three different datasets, using different aggregation methods, sampling methods, and classifiers.
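A compact sketch of ICA in the spirit of the chapter: a base classifier over content plus aggregated neighbor labels, applied iteratively until test predictions stabilize. The count aggregation and the stopping rule shown here are one common choice, not necessarily the chapter's.

```python
import numpy as np

def ica(clf, X, A, y, train_idx, test_idx, num_classes, max_iter=10):
    """y holds true labels for train_idx and -1 for test_idx.
    clf is any classifier with sklearn-style fit/predict."""
    y_hat = y.copy()

    def aggregate(idx):
        # per-node counts of each label among neighbors (ignoring unknowns)
        counts = np.zeros((len(idx), num_classes))
        for row, i in enumerate(idx):
            for j in np.flatnonzero(A[i]):
                if y_hat[j] >= 0:
                    counts[row, y_hat[j]] += 1
        return counts

    # train once, on the training nodes only
    clf.fit(np.hstack([X[train_idx], aggregate(train_idx)]), y[train_idx])

    for _ in range(max_iter):
        preds = clf.predict(np.hstack([X[test_idx], aggregate(test_idx)]))
        if np.array_equal(preds, y_hat[test_idx]):
            break                   # labels stabilized
        y_hat[test_idx] = preds     # update, then re-aggregate next round
    return y_hat
```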
Lecture Notes in Computer Science, 2011
Collective classification algorithms have been used to improve classification performance when network training data with content, link, and label information and test data with content and link information are available. Collective classification algorithms use a base classifier which is trained on training content and link data. The base classifier inputs usually consist of the content vector concatenated with an aggregation vector of neighborhood class information. In this paper, instead of using a single base classifier, we propose using different types of base classifiers for content and link. We then combine the content and link classifier outputs using different classifier combination methods. Our experiments show that using heterogeneous classifiers for link and content classification and combining their outputs gives accuracies as good as collective classification. Our method can also be extended to collective classification scenarios with multiple types of content and links.
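As one plausible instantiation (the classifier choices and the sum rule are our assumptions, not necessarily the paper's), heterogeneous content and link classifiers can be combined by averaging their posterior estimates:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

def combine_content_link(X_content, X_link, y, train_idx, test_idx):
    """Train one classifier per view and combine with the sum rule.
    X_link could be, e.g., an aggregation vector of neighbor labels."""
    content_clf = BernoulliNB().fit(X_content[train_idx], y[train_idx])
    link_clf = LogisticRegression(max_iter=1000).fit(X_link[train_idx], y[train_idx])
    # sum rule: average the two posterior estimates, then take the argmax;
    # the product rule or a trained combiner are drop-in alternatives
    p = (content_clf.predict_proba(X_content[test_idx])
         + link_clf.predict_proba(X_link[test_idx])) / 2.0
    return content_clf.classes_[p.argmax(axis=1)]
```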
Representation in Music/Musical …, 2005
We evaluate music composer classification using an approximation of the Kolmogorov distance between different music pieces. The distance approximation was recently suggested by Vitanyi and his colleagues, who use a clustering method to evaluate the distance metric. However, the clustering is too slow for large (>60) data sets. We suggest using the distance metric together with a k-nearest-neighbor classifier, and we measure the performance of the metric based on the test classification accuracy of the classifier. A classification accuracy of 79% is achieved for a training data set of 57 MIDI files from three different classical composers. We find that classification accuracy increases with training set size. The performance of the metric also seems to depend on the pre-processing used; hence domain knowledge and input representation could make a difference in how the distance metric performs.
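A minimal sketch of this approach using zlib as the compressor (the NCD formula below is the standard approximation; k = 1 is shown for brevity, and a stronger compressor such as bzip2 would better approximate Kolmogorov complexity):

```python
import zlib

def clen(x: bytes) -> int:
    """Compressed length, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)

def predict_composer(query: bytes, training):
    """1-nearest-neighbor by NCD; training is a list of (bytes, composer)."""
    return min(training, key=lambda item: ncd(query, item[0]))[1]
```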
EURASIP Journal on Applied …, 2007
We report our findings on using MIDI files and audio features derived from MIDI, separately and in combination, for MIDI music genre classification. We use McKay and Fujinaga's 3-root and 9-leaf genre data set. To compute distances between MIDI pieces, we use the normalized compression distance (NCD), which uses the compressed length of a string as an approximation to its Kolmogorov complexity and has previously been used for music genre and composer clustering. We convert the MIDI pieces to audio and then use the audio features to train different classifiers. The MIDI and audio-from-MIDI classifiers alone achieve much lower accuracies than those reported by McKay and Fujinaga, who used not NCD but a number of domain-based MIDI features for their classification. Combining the MIDI and audio-from-MIDI classifiers improves accuracy, approaching, but still falling short of, McKay and Fujinaga's results. The best root-genre accuracies achieved using MIDI, audio, and their combination are 0.75, 0.86, and 0.93, respectively, compared to McKay and Fujinaga's 0.98. Successful classifier combination requires diversity among the base classifiers. We achieve diversity by using a certain number of seconds of the MIDI file, different sample rates and sizes for the audio file, and different classification algorithms.
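One simple way to combine such diverse base classifiers (a sketch of a generic combiner, not necessarily the rule used in the paper) is majority voting over their per-piece predictions:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of predicted genre labels per base classifier,
    all aligned on the same test pieces; ties resolve to the first-seen label."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# e.g. majority_vote([midi_ncd_preds, audio_preds_22khz, audio_preds_44khz])
```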