Abdullah Sonmez - Academia.edu

Papers by Abdullah Sonmez

Research paper thumbnail of Feature enrichment and selection for transductive classification on networked data

Pattern Recognition Letters, 2014

Networked data consist of nodes and links between the nodes that indicate their dependencies. Nodes have content features that are available for all the data; the labels, on the other hand, are available only for the training data. Given the features for all nodes and the labels for the training nodes, transductive classification predicts the labels of all remaining nodes. Learning algorithms that use both node content features and links have been developed. For example, collective classification algorithms use aggregated labels of neighbors (such as their sum or average), in addition to node features, as inputs to a classifier. The classifier is trained using the training data only. At test time, since the neighbors' labels are used as classifier inputs, the labels for the test set need to be determined through an iterative procedure. While it is usually very difficult to obtain labels for a whole dataset, features are usually easier to obtain. In this paper, we introduce a new method of transductive network classification that can use the test node features when training the classifier. We train our classifier using enriched node features. The enriched node features include, in addition to the node's own features, the aggregated neighbors' features and aggregations of node and neighbor features passed through the simple logical operators OR and AND. Enriched features may contain irrelevant or redundant features, which could decrease classifier performance. Therefore, we employ feature selection to determine whether a feature among the set of enriched features should be used for classifier training or not. Our feature selection method, called FCBF#, is a mutual-information-based, filter-type, fast feature selection method. Experimental results on three different network datasets show that classification accuracies obtained using network-enriched and selected features are comparable to or better than content-only or collective classification.
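Below is a minimal sketch of the enrichment step described in the abstract, assuming binary bag-of-words content features. The adjacency-list graph representation, the element-wise average as the neighbor aggregator, and the threshold used inside the OR/AND combinations are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np

def enrich_features(X, adj):
    """Build enriched node features: own features, averaged neighbor
    features, and element-wise OR/AND of node and neighbor features.
    X: (n_nodes, n_feats) binary feature matrix; adj: list of neighbor
    index lists. Aggregation choices here are illustrative."""
    enriched = []
    for i, neighbors in enumerate(adj):
        own = X[i]
        if neighbors:
            agg = X[neighbors].mean(axis=0)       # averaged neighbor features
        else:
            agg = np.zeros_like(own, dtype=float) # isolated node: no neighbor info
        f_or = np.logical_or(own, agg > 0).astype(float)   # node OR any-neighbor
        f_and = np.logical_and(own, agg > 0).astype(float) # node AND any-neighbor
        enriched.append(np.concatenate([own, agg, f_or, f_and]))
    return np.asarray(enriched)

# Toy usage: 4 nodes, 3 binary features, a small chain graph.
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]], dtype=float)
adj = [[1], [0, 2], [1, 3], [2]]
print(enrich_features(X, adj).shape)  # (4, 12): four feature blocks of width 3
```

In the paper, the enriched matrix would then be pruned by FCBF#, the fast mutual-information-based filter named above; any mutual-information feature ranking could stand in for it in a quick experiment.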

Research paper thumbnail of Validation of Network Classifiers

Lecture Notes in Computer Science, 2012

This paper develops PAC (probably approximately correct) error bounds for network classifiers in the transductive setting, where the network node inputs and links are all known, the training nodes' class labels are known, and the goal is to classify a working set of nodes whose class labels are unknown. The bounds are valid for any model of network generation. They require working nodes to be selected independently, but not necessarily uniformly at random; for example, they allow different regions of the network to have different densities of unlabeled nodes.
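For orientation, here is a textbook bound of the same general flavor, stated under the simplifying assumption of a finite classifier family H and n labeled nodes drawn independently from the same distribution as the working nodes; the paper's actual bounds are derived under its own, weaker selection assumptions. With probability at least 1 − δ, every h in H satisfies

```latex
\varepsilon_{\mathrm{work}}(h) \;\le\; \hat{\varepsilon}_{\mathrm{train}}(h)
  \;+\; \sqrt{\frac{\ln\lvert H\rvert + \ln(1/\delta)}{2n}}
```

where the first term is the observed training error and the left-hand side is the error on the working set.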

Research paper thumbnail of Validating Collective Classification Using Cohorts

Many networks grow by adding successive cohorts: layers of nodes. Often, the nodes in each layer are selected independently of each other, but from a distribution that can depend on which nodes were selected for previous cohorts. For example, successive waves of friends invite their own friends to join social networks. We present error bounds for collective classification over these networks.
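A toy simulation of the growth process described above, assuming the simplest possible invitation rule (each new node links to a uniformly chosen member of the previous cohort); the paper's setting allows the selection distribution to depend on earlier cohorts in more general ways.

```python
import random

def grow_by_cohorts(n_cohorts, cohort_size, seed=0):
    """Grow a network in layers (cohorts): each new node independently
    picks an inviter from the previous cohort and links to it. Uniform
    choice is used here purely for illustration."""
    rng = random.Random(seed)
    cohorts, edges = [[0]], []          # seed network: a single node
    next_id = 1
    for _ in range(n_cohorts):
        prev, layer = cohorts[-1], []
        for _ in range(cohort_size):
            inviter = rng.choice(prev)  # nodes within a cohort are selected independently
            edges.append((inviter, next_id))
            layer.append(next_id)
            next_id += 1
        cohorts.append(layer)
    return cohorts, edges

cohorts, edges = grow_by_cohorts(n_cohorts=3, cohort_size=4)
print(len(edges), "edges across", len(cohorts), "cohorts")  # 12 edges across 4 cohorts
```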

Research paper thumbnail of Collective Classification with Content and Link Noise

itu.edu.tr

Collective classification algorithms [Macskassy et al. 2007, Sen et al. 2008] can be used for better classification of networked data when unlabeled test node features and links are available. In this study, we provide detailed results on the performance of collective classification algorithms when content or link noise is present. First, we show that collective classification algorithms are more robust to content noise than content-only classification. We also evaluate the performance of collective classification when additive link noise is present. We show that, especially when content and/or link noise is present, feature and/or node selection is essential for better collective classification.
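A minimal sketch of the two noise models named above, under the assumption that content noise means flipping binary feature values and additive link noise means inserting spurious random edges; the exact noise processes used in the study may differ.

```python
import numpy as np

def add_content_noise(X, p, rng):
    """Flip each binary feature value independently with probability p."""
    flips = rng.random(X.shape) < p
    return np.where(flips, 1 - X, X)

def add_link_noise(n_nodes, edges, n_extra, rng):
    """Additive link noise: insert n_extra random spurious edges."""
    noisy = set(edges)
    while len(noisy) < len(edges) + n_extra:
        i, j = rng.integers(n_nodes, size=2)
        if i != j:
            noisy.add((min(i, j), max(i, j)))  # store edges as sorted pairs
    return sorted(noisy)

rng = np.random.default_rng(0)
X = np.array([[1, 0, 1], [0, 1, 0]], dtype=int)
print(add_content_noise(X, p=0.2, rng=rng))
print(add_link_noise(4, [(0, 1), (1, 2)], n_extra=2, rng=rng))
```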

Research paper thumbnail of Classification in Social Networks

Lecture Notes in Social Networks, 2014

The production of social network data in many different forms and in huge amounts brings with it classification problems that need to be solved. In this chapter, we introduce a framework for classification in a social network. Aggregation of neighbor labels and sampling are two important aspects of classification in a social network. We give details of different aggregation and sampling methods. We then discuss different graph properties, especially homophily, which may help determine which type of classification algorithm should be used. We give details of a collective classification algorithm, ICA (Iterative Classification Algorithm), which can be used for semi-supervised learning in general, and transductive learning in particular, on a social network. We present classification results on three different datasets, using different aggregation methods, sampling methods, and classifiers.
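A compact sketch of ICA as described above, using scikit-learn's logistic regression as the base classifier. The feature layout (content concatenated with a per-class neighbor-label count vector), the −1 marker for unknown labels, and the fixed iteration count are common choices, not necessarily the chapter's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def label_counts(adj, labels, n_classes):
    """Aggregate neighbor labels as a per-class count vector (one row per node)."""
    agg = np.zeros((len(adj), n_classes))
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            if labels[j] >= 0:          # -1 marks an unknown label
                agg[i, labels[j]] += 1
    return agg

def ica(X, adj, labels, n_classes, n_iter=10):
    """Iterative Classification Algorithm: train on labeled nodes, then
    repeatedly re-estimate unknown labels until neighbor aggregates stabilize."""
    labels = labels.copy()
    train = labels >= 0
    clf = LogisticRegression(max_iter=1000)
    feats = np.hstack([X, label_counts(adj, labels, n_classes)])
    clf.fit(feats[train], labels[train])
    for _ in range(n_iter):
        feats = np.hstack([X, label_counts(adj, labels, n_classes)])
        labels[~train] = clf.predict(feats[~train])  # synchronous update of unknowns
    return labels
```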

Research paper thumbnail of Collective Classification Using Heterogeneous Classifiers

Lecture Notes in Computer Science, 2011

Collective classification algorithms have been used to improve classification performance when network training data with content, link, and label information and test data with content and link information are available. Collective classification algorithms use a base classifier that is trained on the training content and link data. The base classifier inputs usually consist of the content vector concatenated with an aggregation vector of neighborhood class information. In this paper, instead of using a single base classifier, we propose using different types of base classifiers for content and link. We then combine the content and link classifier outputs using different classifier combination methods. Our experiments show that using heterogeneous classifiers for link and content classification and combining their outputs gives accuracies as good as collective classification. Our method can also be extended to collective classification scenarios with multiple types of content and link.
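A minimal sketch of the idea, assuming a naive Bayes content classifier and a logistic-regression link classifier combined by averaging posteriors; the specific base classifiers and combination rules evaluated in the paper may differ.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

def fit_heterogeneous(X_content, X_link, y):
    """Train separate base classifiers on content and link features."""
    content_clf = BernoulliNB().fit(X_content, y)
    link_clf = LogisticRegression(max_iter=1000).fit(X_link, y)
    return content_clf, link_clf

def combine_mean(content_clf, link_clf, X_content, X_link):
    """Combine by averaging class posteriors (the 'mean' rule);
    product and max rules are common alternatives."""
    p = (content_clf.predict_proba(X_content)
         + link_clf.predict_proba(X_link)) / 2
    return p.argmax(axis=1)
```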

Research paper thumbnail of Music Classification Using Kolmogorov Distance

Representation in Music/Musical …, 2005

We evaluate music composer classification using an approximation of the Kolmogorov distance between different music pieces. The distance approximation has recently been suggested by Vitanyi and his colleagues, who use a clustering method to evaluate the distance metric. However, the clustering is too slow for large (>60) data sets. We suggest using the distance metric together with a k-nearest-neighbor classifier, and we measure the performance of the metric by the test classification accuracy of the classifier. A classification accuracy of 79% is achieved for a training data set of 57 MIDI files from three different classical composers. We find that classification accuracy increases with training set size. The performance of the metric also seems to depend on the pre-processing method, so domain knowledge and input representation can make a difference in how the distance metric performs.
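A minimal sketch of the compression-based distance plus nearest-neighbor scheme described above, using the normalized compression distance NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)) with zlib as the compressor C; the compressor and the 1-NN default are illustrative stand-ins for the setup in the paper.

```python
import zlib

def clen(data: bytes) -> int:
    """Compressed length C(x), used as a Kolmogorov-complexity proxy."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(query: bytes, train: list, k: int = 1) -> str:
    """k-nearest-neighbor vote under NCD over (bytes, label) pairs."""
    ranked = sorted(train, key=lambda item: ncd(query, item[0]))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy usage with byte strings standing in for raw MIDI file contents.
train = [(b"abcabcabc" * 20, "bach"), (b"xyzxyzxyz" * 20, "chopin")]
print(knn_classify(b"abcabc" * 15, train))  # expected: 'bach'
```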

Research paper thumbnail of Music genre classification using midi and audio features

EURASIP Journal on Applied …, 2007

We report our findings on using MIDI files and audio features derived from MIDI, separately and combined, for MIDI music genre classification. We use McKay and Fujinaga's 3-root and 9-leaf genre data set. To compute distances between MIDI pieces, we use the normalized compression distance (NCD). NCD uses the compressed length of a string as an approximation to its Kolmogorov complexity and has previously been used for music genre and composer clustering. We convert the MIDI pieces to audio and then use the audio features to train different classifiers. The MIDI and audio-from-MIDI classifiers alone achieve much lower accuracies than those reported by McKay and Fujinaga, who used not NCD but a number of domain-based MIDI features for their classification. Combining the MIDI and audio-from-MIDI classifiers improves accuracy and comes closer to, but still falls short of, McKay and Fujinaga's results. The best root-genre accuracies achieved using MIDI, audio, and their combination are 0.75, 0.86, and 0.93, respectively, compared to McKay and Fujinaga's 0.98. Successful classifier combination requires diversity of the base classifiers. We achieve diversity by using a certain number of seconds of the MIDI file, different sample rates and sizes for the audio file, and different classification algorithms.
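A small sketch of combining the diverse base classifiers mentioned above by majority vote; the views shown (an NCD-based MIDI classifier plus audio-feature classifiers at different sample rates) and the voting rule are illustrative, and the paper also reports other combiners.

```python
from collections import Counter

def majority_vote(predictions_per_view):
    """Combine genre predictions from diverse base classifiers, one list
    of predictions per view, by plurality vote over each test piece."""
    combined = []
    for votes in zip(*predictions_per_view):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

views = [["rock", "jazz", "classical"],   # MIDI/NCD-based predictions
         ["rock", "rock", "classical"],   # audio features at one sample rate
         ["jazz", "rock", "classical"]]   # audio features at another rate
print(majority_vote(views))  # ['rock', 'rock', 'classical']
```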
