
Improving Accuracy of Classification Models Induced from Anonymized Datasets

2013

The performance of classifiers and other data mining models can be significantly enhanced using the large repositories of digital data collected nowadays by public and private organizations. However, the original records stored in those repositories cannot be released to the data miners as they frequently contain sensitive information. The emerging field of Privacy Preserving Data Publishing (PPDP) deals with this important challenge. In this paper, we present NSVDist (Non-homogeneous generalization with Sensitive Value Distributions), a new anonymization algorithm that, given minimal anonymity and diversity parameters along with an information loss measure, issues corresponding non-homogeneous anonymizations in which the sensitive attribute is published as frequency distributions over the sensitive domain rather than in the usual form of exact sensitive values. In our experiments with eight datasets and four different classification algorithms, we show that classifiers induced from data generalized by NSVDist tend to be more accurate than classifiers induced using state-of-the-art anonymization algorithms.
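The output form described above can be illustrated with a minimal sketch: instead of attaching an exact sensitive value to each generalized record, each group of records sharing the same generalized quasi-identifier values publishes a frequency distribution over the sensitive domain. This is only an illustration of the published form, not the NSVDist algorithm itself; the attribute names and the `sensitive_distributions` helper are hypothetical.

```python
from collections import Counter

def sensitive_distributions(records, quasi_ids, sensitive):
    """Group records by their (generalized) quasi-identifier values and
    publish the sensitive attribute as a per-group frequency distribution
    (illustrative sketch of the output form, not the NSVDist algorithm)."""
    groups = {}
    for rec in records:
        key = tuple(rec[a] for a in quasi_ids)
        groups.setdefault(key, []).append(rec[sensitive])
    published = {}
    for key, values in groups.items():
        counts = Counter(values)
        total = sum(counts.values())
        published[key] = {v: c / total for v, c in counts.items()}
    return published

# Hypothetical generalized records sharing one quasi-identifier group:
records = [
    {"Zip": "481**", "Age": "30-39", "Disease": "Flu"},
    {"Zip": "481**", "Age": "30-39", "Disease": "Flu"},
    {"Zip": "481**", "Age": "30-39", "Disease": "Cancer"},
]
dist = sensitive_distributions(records, ["Zip", "Age"], "Disease")
# dist[("481**", "30-39")] maps "Flu" to 2/3 and "Cancer" to 1/3
```

A classifier trained on such a release can weight each group's records by these frequencies rather than committing to a single sensitive value per record.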

Anonymizing Classification Data for Privacy Preservation

IEEE Transactions on Knowledge and Data Engineering, 2007

Classification is a fundamental problem in data analysis. Training a classifier requires access to a large collection of data. Releasing person-specific data, such as customer data or patient records, may pose a threat to individuals' privacy. Even after removing explicit identifying information such as Name and SSN, it is still possible to link released records back to their identities by matching some combination of non-identifying attributes such as {Sex, Zip, Birthdate}. A useful approach to combat such linking attacks, called k-anonymization [1], is anonymizing the linking attributes so that at least k released records match each value combination of the linking attributes. Previous work attempted to find an optimal k-anonymization that minimizes some data distortion metric. We argue that minimizing the distortion to the training data is not relevant to the classification goal, which requires extracting the structure of prediction on the "future" data.
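The k-anonymity condition the abstract states ("at least k released records match each value combination of the linking attributes") can be checked directly. The sketch below is a minimal illustration of that check; the attribute names and sample release are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Return True if every combination of quasi-identifier values that
    appears in the release is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical generalized release: each quasi-identifier combination
# appears twice, so the release is 2-anonymous but not 3-anonymous.
release = [
    {"Sex": "F", "Zip": "481**", "Birthdate": "1980-*"},
    {"Sex": "F", "Zip": "481**", "Birthdate": "1980-*"},
    {"Sex": "M", "Zip": "481**", "Birthdate": "1975-*"},
    {"Sex": "M", "Zip": "481**", "Birthdate": "1975-*"},
]
is_k_anonymous(release, ["Sex", "Zip", "Birthdate"], 2)  # True
is_k_anonymous(release, ["Sex", "Zip", "Birthdate"], 3)  # False
```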

Practical Privacy-Preserving Data Mining

2008

Data anonymization is of increasing importance for allowing the sharing of individual data for a variety of data analysis and mining applications. Most existing work on data anonymization optimizes the anonymization in terms of data utility, typically through one-size-fits-all measures such as data discernibility. Our primary viewpoint in this paper is that each target application may have a unique need for the data, and that the best way of measuring data utility is based on the analysis task for which the anonymized data will ultimately be used. We take a top-down analysis of typical application scenarios and derive application-oriented anonymization criteria. We propose a prioritized anonymization scheme in which we prioritize the attributes for anonymization based on how important and critical they are to the application needs. Finally, we present preliminary results that show the benefits of our approach.

Accuracy and Utility Balanced Privacy Preserving Classification Mining by Improving K-Anonymization

International Journal of Simulation: Systems, Science & Technology

The data available today is vast, and it is being analyzed to improve businesses; this analysis also benefits society in different ways. At the same time, new challenges have arisen in protecting the privacy of data. Privacy Preserving Data Mining (PPDM) techniques have therefore evolved, which protect the privacy of data while data analysis is carried out. Privacy Preserving Data Publishing (PPDP) is a part of PPDM and a major research area; several anonymization algorithms have been proposed for PPDP, k-anonymization among them. In this paper, a new method for privacy preserving data mining is proposed that improves on applying k-anonymization alone. The present research work focuses on an approach that decreases the risk of various attacks while providing greater utility of the data.

Efficient Anonymizations with Enhanced Utility

2009 IEEE International Conference on Data Mining Workshops, 2009

The k-anonymization method is a commonly used privacy-preserving technique. Previous studies used various measures of utility that aim at enhancing the correlation between the original public data and the generalized public data. Bearing in mind that a primary goal in releasing the anonymized database for data mining is to deduce methods of predicting the private data from the public data, we propose a new information-theoretic measure that aims at enhancing the correlation between the generalized public data and the private data. Such a measure significantly enhances the utility of the released anonymized database for data mining. We then describe a new and highly efficient algorithm designed to achieve k-anonymity with high utility. That algorithm is based on a modified version of sequential clustering, which is the method of choice in clustering, and it is independent of the underlying measure of utility.
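One standard information-theoretic way to quantify the correlation between a generalized public attribute and the private attribute is empirical mutual information. The sketch below is an illustrative stand-in for the kind of measure the abstract describes, not the paper's exact formulation; the sample data is hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(X; Y) in bits, where each pair is
    (generalized public value x, private value y). Higher values mean the
    public release is more predictive of the private attribute."""
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly correlated public/private pairs carry one full bit:
hi = mutual_information([("A", "Flu"), ("A", "Flu"), ("B", "HIV"), ("B", "HIV")])
# Independent pairs carry none:
lo = mutual_information([("A", "Flu"), ("A", "HIV"), ("B", "Flu"), ("B", "HIV")])
```

Under a measure of this kind, a generalization that collapses groups with very different private-value distributions scores worse, even if it distorts the public attributes only slightly.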

Attacks on Anonymization-Based Privacy-Preserving: A Survey for Data Mining and Data Publishing

2013

Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks. Privacy preservation for both data mining (PPDM) and data publishing (PPDP) has become increasingly popular because it allows the sharing of privacy-sensitive data for analysis purposes. One well-studied approach is the k-anonymity model [1], which in turn led to other models such as confidence bounding, l-diversity, t-closeness, (α,k)-anonymity, etc. In particular, all known mechanisms try to minimize information loss, and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey of the most common attack techniques against anonymization-based PPDM & PPDP and to explain their effects on data privacy.

Effects of data anonymization on the data mining results

2012

This article examines the possibility of publishing students' data, such as secondary school success, state graduation exam scores, and success during the first year of university study, for analysis. In order to discover data patterns and relationships using data mining techniques, the data must be released in the form of original tuples instead of pre-aggregated statistics. These records contain sensitive and even confidential personal information, which raises significant privacy concerns regarding the disclosure of such data. Removing explicit identifiers prior to data release cannot guarantee anonymity, since the datasets still contain information that can be used to link the released records with publicly available collections that include students' identities. One of the privacy-preserving techniques proposed in the literature is k-anonymization. The process of anonymizing a data set usually involves generalizing data records and, consequently, incurs a loss of relevant information. In the primary research undertaken on the University of Dubrovnik's student database, the effect of anonymization has been measured by comparing the results of mining the original data set with the results of mining the altered data set, to determine whether anonymized data can be used for research purposes.

An adaptive mechanism for anonymizing set-valued data

Classification is a fundamental problem in data analysis. Training a classifier requires access to a large collection of data. Releasing person-specific data, such as customer data or patient records, may pose a threat to individuals' privacy. Even after removing explicit identifying information such as Name and SSN, it is still possible to link released records back to their identities by matching some combination of non-identifying attributes such as {Sex, Zip, Birthdate}. A useful approach to combat such linking attacks, called k-anonymization, is anonymizing the linking attributes so that at least k released records match each value combination of the linking attributes. Previous work attempted to find an optimal k-anonymization that minimizes some data distortion metric. We argue that minimizing the distortion to the training data is not relevant to the classification goal, which requires extracting the structure of prediction on the "future" data. In this paper, we propose an anonymization solution for classification. Our goal is to find an anonymization, not necessarily optimal in the sense of minimizing data distortion, that preserves the classification structure. We conducted intensive experiments to evaluate the impact of anonymization on the classification of future data. Experiments on real-life data show that the quality of classification can be preserved even under highly restrictive anonymity requirements.

Impact of Anonymization on Analysis of Data

The anonymization of data often results in a loss of utility. Different variants of anonymization are available in the literature. We studied various techniques and their impact on utility. In addition, we propose a new technique that merges two anonymization techniques to achieve higher utility. Our proposed technique performs better than other techniques for a class of queries called trend queries.

A General Survey of Privacy-Preserving Data Mining Models and Algorithms

In recent years, privacy-preserving data mining has been studied extensively, because of the wide proliferation of sensitive information on the internet. A number of algorithmic techniques have been designed for privacy-preserving data mining. In this paper, we provide a review of the state-of-the-art methods for privacy. We discuss methods for randomization, k-anonymization, and distributed privacy-preserving data mining. We also discuss cases in which the output of data mining applications needs to be sanitized for privacy-preservation purposes. We discuss the computational and theoretical limits associated with privacy-preservation over high dimensional data sets.