H. Altay Güvenir - Academia.edu
Papers by H. Altay Güvenir
Lecture Notes in Computer Science, 2004
Inducing classification rules on domains from which information is gathered at regular periods generally yields so many rules that selecting the interesting ones among all discovered rules becomes an important task. At each period, new classification rules are induced using the newly gathered information from the domain. These rules therefore stream through time and are called streaming classification rules. In this paper, an interactive rule interestingness-learning algorithm (IRIL) is developed to automatically label the classification rules as either "interesting" or "uninteresting" with limited user interaction. In our study, VFP (Voting Feature Projections), a feature projection based incremental classification learning algorithm, is also developed in the framework of IRIL. The concept description learned by the VFP algorithm constitutes a novel approach for interestingness analysis of streaming classification rules.
Machine Learning, Apr 1, 1996
This paper presents a new form of exemplar-based learning, based on a representation scheme called feature partitioning, and a particular implementation of this technique called CFP (for Classification by Feature Partitioning). Learning in CFP is accomplished by storing the objects separately in each feature dimension as disjoint sets of values called segments. A segment is expanded through generalization or specialized by dividing it into sub-segments. Classification is based on a weighted voting among the individual predictions of the features, which are simply the class values of the segments corresponding to the values of a test instance for each feature. An empirical evaluation of CFP and its comparison with two other classification techniques that consider each feature separately are given.
International Journal of Pattern Recognition and Artificial Intelligence, Feb 1, 2013
Many machine learning algorithms require the features to be categorical. Hence, they require all numeric-valued data to be discretized into intervals. In this paper, we present a new discretization method based on the area under the receiver operating characteristics (ROC) curve (AUC) measure. Maximum area under ROC curve-based discretization (MAD) is a global, static and supervised discretization method. MAD uses the sorted order of the continuous values of a feature and discretizes the feature in such a way that the AUC based on that feature is maximized. The proposed method is compared with alternative discretization methods such as ChiMerge, Entropy-Minimum Description Length Principle (MDLP), Fixed Frequency Discretization (FFD), and Proportional Discretization (PD). FFD and PD have been recently proposed and are designed for Naïve Bayes learning. ChiMerge is a merging discretization method, as is MAD. Evaluations are performed in terms of M-Measure, an AUC-based metric for multi-class classification, and accuracy values obtained from Naïve Bayes and Aggregating One-Dependence Estimators (AODE) algorithms by using real-world datasets. Empirical results show that MAD is a strong candidate to be a good alternative to other discretization methods.
arXiv (Cornell University), Jul 26, 1996
This paper proposes a mechanism for learning pattern correspondences between two languages from a corpus of translated sentence pairs. The proposed mechanism uses analogical reasoning between two translations. Given a pair of translations, the similar parts of the sentences in the source language must correspond to the similar parts of the sentences in the target language. Similarly, the different parts should correspond to the respective parts in the translated sentences. The correspondences between the similarities, and also the differences, are learned in the form of translation rules. The system is tested on a small training dataset and produced promising results for further investigation.
Distributed Artificial Intelligence (DAI) research is concerned with solving problems using both AI techniques and distributed processing. DAI techniques can be used to solve many complex problems such as engineering design problems, interpretation problems, planning problems, etc. In this paper, a survey of the DAI approach to problem solving is given first, followed by a description of how a DAI tool, the CEF, is used to implement a system for designing steam condensers, called STEAMER.
Biophysical Journal, 2015
In most real-world domains, the benefits and costs of classifications can depend on the characteristics of individual examples. In such cases, there is no static benefit matrix available in the domain and each classification benefit is calculated separately. This situation, called feature dependency, is evaluated in the framework of our newly proposed classification algorithm, Benefit Maximizing classifier with Feature Intervals (BMFI), which uses a feature projection based knowledge representation. This new approach has been evaluated over bank loan applications and experimental results are presented.
Due to the increase in data mining research and applications, selection of interesting rules among a huge number of learned rules is an important task in data mining applications. In this paper, the metrics for the interestingness of a rule are investigated and an algorithm that can classify the learned rules according to their interestingness is developed. Classification algorithms were designed to maximize the number of correctly classified instances, given a set of unseen test cases. Furthermore, feature projection based classification algorithms were tested and shown to be successful in a large number of real domains. So, in this work, a feature projection based classification algorithm (VFI, Voting Feature Intervals) is adapted to the rule interestingness problem, and the FPRC (Feature Projection Based Rule Classification) algorithm is developed.
Voting Features based Classifiers, VFC for short, have been shown to perform well on most real-world data sets. They are robust to irrelevant features and missing feature values. In this paper, we introduce an extension to VFC, called Voting Features based Classifier with feature Construction, VFCC for short, and show its application to the problem of predicting whether a bank will encounter financial distress, by analyzing current financial statements. The previously developed VFC learn a set of rules that contain a single condition based on a single feature in their antecedent. The VFCC algorithm proposed in this work, on the other hand, constructs rules whose antecedents may contain conjuncts based on several features. Experimental results on recent financial ratios of banks in Turkey show that the VFCC algorithm achieves better accuracy than other well-known rule learning classification algorithms.
Systems for inducing concept descriptions from examples are valuable tools for assisting in the task of knowledge acquisition for expert systems. In this research, three machine learning techniques are applied to the problem of predicting the daily changes in the index of the Istanbul Stock Market, given the price changes in other investment instruments such as foreign currencies and gold, as well as changes in the interest rates of government bonds and bank certificate of deposit accounts. The techniques used are instance-based learning (IBL), nested-generalized exemplars (NGE), and neural networks (NN). These techniques are applied to the actual data comprising the values between January 1991 and July 1992. The most important characteristic of this data is the large amount of noise inherent in its domain. In this paper we compare these three learning techniques in terms of efficiency, ability to cope with noisy data, and human friendliness of the learned concepts.
Springer eBooks, 1993
Several representation techniques have been used to describe concepts for supervised learning tasks. Exemplar-based learning techniques store only specific examples that are representatives of other several similar instances. Previous implementations of this approach usually ...
Europace, Oct 12, 2016
The aims of this study include (i) pursuing data-mining experiments on the Angiotensin II-Antagonist in Paroxysmal Atrial Fibrillation (ANTIPAF-AFNET 2) trial dataset containing atrial fibrillation (AF) burden scores of patients with many clinical parameters and (ii) revealing possible correlations between the estimated risk factors of AF and other clinical findings or measurements provided in the dataset. Methods: Ranking Instances by Maximizing the Area under a Receiver Operating Characteristics (ROC) Curve (RIMARC) is used to determine the predictive weights (Pw) of baseline variables on the primary endpoint. Chi-square automatic interaction detector algorithm is performed for comparing the results of RIMARC. The primary endpoint of the ANTIPAF-AFNET 2 trial was the percentage of days with documented episodes of paroxysmal AF or with suspected persistent AF. Results: By means of the RIMARC analysis algorithm, baseline SF-12 mental component score (Pw = 0.3597), age (Pw = 0.2865), blood urea nitrogen (BUN) (Pw = 0.2719), systolic blood pressure (Pw = 0.2240), and creatinine level (Pw = 0.1570) of the patients were found to be predictors of AF burden. Atrial fibrillation burden increases as baseline SF-12 mental component score gets lower; systolic blood pressure, BUN and creatinine levels become higher; and the patient gets older. The AF burden increased significantly at age >76. Conclusions: With the ANTIPAF-AFNET 2 dataset, the present data-mining analyses suggest that a baseline SF-12 mental component score, age, systolic blood pressure, BUN, and creatinine level of the patients are predictors of AF burden. Additional studies are necessary to understand the distinct kidney-specific pathophysiological pathways that contribute to AF burden.
Lecture Notes in Computer Science, 2006
This volume contains the papers presented at the 8th European Conference on Case-Based Reasoning (ECCBR 2006). Case-Based Reasoning (CBR) is an artificial intelligence approach where new problems are solved by remembering, adapting and reusing solutions to a previously solved, similar problem. The collection of previously solved problems and their associated solutions is stored in the case base. New or adapted solutions are learned and updated in the case base as needed. ECCBR and its sister conference ICCBR alternate every year. ECCBR 2006 followed a series of seven successful European Workshops previously held in
This paper describes an ongoing research project to develop an intelligent tutoring system for teaching Turkish grammar as a foreign language. The teaching strategy is based on drill and practice. The learner is expected to translate a given meaning into a sentence in Turkish. The meaning to be translated is either randomly and meaningfully generated by the system or set up by the learner. To be able to evaluate the learner's translation, or explain how to translate a meaning into a Turkish sentence, the system must be able to generate the correct sentence in Turkish as well. The paper describes a generative grammar which is both easy for human learners and suitable for computer generation of sentences in the Turkish language. The grammar developed here is based on the agglutinative structure and the vowel harmonic feature of the Turkish language.
Knowledge Based Systems, 2009
In a typical application of association rule learning from market basket data, a set of transactions for a fixed period of time is used as input to rule learning algorithms. For example, the well-known Apriori algorithm can be applied to learn a set of association rules from such a transaction set. However, learning association rules from a set of transactions is not a one-time-only process. For example, a market manager may perform the association rule learning process once every month over the set of transactions collected through the last month. For this reason, we will consider the problem where transaction sets are input to the system as a stream of packages. The sets of transactions may come in varying sizes and in varying periods. Once a set of transactions arrives, the association rule learning algorithm is executed on the last set of transactions, resulting in new association rules. Therefore, the set of association rules learned will accumulate and increase in number over time, making the mining of interesting ones out of this enlarging set of association rules impractical for human experts. We refer to this sequence of rules as "association rule set stream" or "streaming association rules" and the main motivation behind this research is to develop a technique to overcome the interesting rule selection problem. A successful association rule mining system should select and present only the interesting rules to the domain experts. However, the definition of interestingness of association rules on a given domain usually differs from one expert to another and also over time for a given expert. This paper proposes a post-processing method to learn a subjective model for the interestingness concept description of the streaming association rules.
The uniqueness of the proposed method is its ability to formulate the interestingness issue of association rules as a benefit-maximizing classification problem and obtain a different interestingness model for each user. In this new classification scheme, the determining features are the selective objective interestingness factors related to the interestingness of the association rules, and the target feature is the interestingness label of those rules. The proposed method works incrementally and employs user interactivity at a certain level. It is evaluated on a real market dataset. The results show that the model can successfully select the interesting ones.
Gazi Journal of Economics and Business
Türkiye Bilişim Derneği (TBD), 2020
2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016
In this paper, we propose an efficient privacy-preserving solution for a bipartite ranking algorithm. The bipartite ranking problem can be considered as finding a function that ranks positive instances (in a dataset) higher than the negative ones. However, one common concern for all the existing schemes is the privacy of individuals in the dataset. That is, one (e.g., a researcher) needs to access the records of all individuals in the dataset in order to run the algorithm. This privacy concern puts limitations on the use of sensitive personal data for such analysis. The RIMARC (Ranking Instances by Maximizing Area under the ROC Curve) algorithm solves the bipartite ranking problem by learning a model to rank instances. As part of the model, it learns weights for each feature by analyzing the area under the receiver operating characteristic (ROC) curve. The RIMARC algorithm is shown to be more accurate and efficient than its counterparts. Thus, we use this algorithm as a building block and provide a privacy-preserving version of the RIMARC algorithm using homomorphic encryption and secure multi-party computation. Our proposed algorithm lets a data owner outsource the storage and processing of its encrypted dataset to a semi-trusted cloud. Then, a researcher can get the results of his/her queries (to learn the ranking function) on the dataset by interacting with the cloud. During this process, neither the researcher nor the cloud learns any information about the raw dataset. We prove the security of the proposed algorithm and show its efficiency via experiments on real data.
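As a rough illustration of the bipartite ranking objective described above, the following sketch computes the area under the ROC curve as the fraction of positive-negative pairs that a scoring function orders correctly. This is the standard pairwise definition of AUC, not the RIMARC algorithm itself and not its privacy-preserving variant; the function name `pairwise_auc` and the sample scores are assumptions made for illustration only.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical helper: `scores` are ranking scores, `labels` are 1
    (positive) or 0 (negative). Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie: half credit
    return correct / (len(pos) * len(neg))


# Illustrative data (not from the paper): one of the four pairs is misordered.
auc = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A ranking function that places every positive instance above every negative one scores 1.0 under this measure, and a random ordering scores about 0.5.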
This paper presents the results of the application of an instance-based learning algorithm, k-Nearest Neighbor Method on Feature Projections (k-NNFP), to text categorization and compares it with the k-Nearest Neighbor Classifier (k-NN). k-NNFP is similar to k-NN except that it finds the nearest neighbors according to each feature separately.
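The per-feature neighbor search described above can be sketched as follows. This is a minimal illustration, assuming numeric features and a simple unweighted vote in which each of the k nearest neighbors on each feature contributes one vote for its class; the abstract does not specify the voting scheme, so that part, along with the function name `knnfp_predict` and the toy data, is an assumption.

```python
from collections import Counter


def knnfp_predict(train_X, train_y, query, k=3):
    """Classify `query` by voting over per-feature nearest neighbors.

    For every feature, the k training instances closest to the query
    on that single feature are found, and each casts one vote for its
    class (assumed voting scheme). The class with most votes wins.
    """
    votes = Counter()
    for f in range(len(query)):
        # Rank training instances by distance on feature f only.
        order = sorted(range(len(train_X)),
                       key=lambda i: abs(train_X[i][f] - query[f]))
        for i in order[:k]:
            votes[train_y[i]] += 1
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): two well-separated classes.
train_X = [[1.0, 10.0], [1.2, 11.0], [5.0, 50.0], [5.5, 52.0]]
train_y = ["a", "a", "b", "b"]
label = knnfp_predict(train_X, train_y, [1.1, 10.5], k=2)
```

In contrast, a conventional k-NN classifier would compute one distance over all features together and take a single set of k neighbors.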