Publishing microdata with a robust privacy guarantee
Related papers
2017
Nowadays, data and knowledge extracted by data mining techniques represent a key asset driving research, innovation, and policy-making activities. Many agencies and organizations have recognized the need to accelerate such trends and are therefore willing to release the data they have collected to other parties, for purposes such as research and the formulation of public policies. However, data publication remains very difficult today. Data often contain personally identifiable information, so releasing them may result in privacy breaches; this is the case for microdata such as census data and medical data. This thesis studies how microdata can be published and shared in a privacy-preserving manner. It presents an extensive study of this problem along three dimensions: designing a simple, intuitive, and robust privacy model; designing an effective anonymization technique that works on sparse and high-dimensional data; and developing a methodolo...
Extended K-Anonymity Model for Privacy Preserving on Micro Data
International Journal of Computer Network and Information Security, 2015
Today, information collectors, particularly statistical organizations, face two conflicting demands. On the one hand, given their natural responsibilities and the increasing demand for the collected data, they are committed to disseminating the information more extensively and with higher quality. On the other hand, due to public concern about the privacy of personal information and the legal responsibility of these organizations to protect the private information of their users, they must guarantee that privacy is reasonably preserved while all the information is provided to the population. This issue becomes more crucial when the datasets published by data mining methods are at risk of attribute and identity disclosure attacks. To overcome this problem, several approaches, called p-sensitive k-anonymity, p+-sensitive k-anonymity, and (p, α)-sensitive k-anonymity, were proposed. The drawbacks of these methods include the inability to protect micro datasets against attribute disclosure and a high distortion ratio. To eliminate these drawbacks, this paper proposes an algorithm that fully protects the published microdata against identity and attribute disclosure and significantly reduces the distortion ratio during the anonymization process.
A framework for efficient data anonymization under privacy and accuracy constraints
ACM Transactions on Database Systems, 2009
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual's record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem.
Personalised anonymity for microdata release
IET Information Security, 2018
Protecting individual privacy in released data sets has become an important issue in recent years. The release of microdata provides a significant information resource for researchers, whereas the release of person-specific data poses a threat to individual privacy. Unfortunately, microdata can be linked with publicly available information to re-identify individuals exactly. To relieve privacy concerns, data must be protected with a privacy protection mechanism before disclosure. The k-anonymity model is an important privacy protection method for reducing the risk of re-identification in microdata release. This model requires each tuple to be indistinguishable from at least k − 1 other tuples in the released data. While k-anonymity preserves the truthfulness of the released data, the privacy level of anonymisation is the same for every individual. However, different individuals have different privacy needs in the real world, so personalisation plays an important role in supporting individual privacy protection. This study proposes a personalised anonymity model that provides distinct privacy levels for each individual by allowing them to control their anonymity in the released data. To satisfy the personal anonymity requirements with low information loss, the authors introduce a clustering-based algorithm.
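The k-anonymity requirement described in this abstract can be illustrated with a minimal sketch. It is not the clustering-based algorithm the authors propose; it only checks whether a table already satisfies the property. The function name `is_k_anonymous`, the attribute names, and the sample records are hypothetical and chosen for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times.

    A record is then indistinguishable from at least k - 1 other records
    with respect to the chosen quasi-identifier attributes.
    """
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical, already generalised table (illustrative values only).
records = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "asthma"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "cancer"},
]

print(is_k_anonymous(records, ["zip", "age"], 2))  # each group has 2 records
print(is_k_anonymous(records, ["zip", "age"], 3))  # no group reaches 3 records
```

A personalised model, as the abstract describes it, would replace the single global k with a per-individual requirement; that extension is not shown here.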
Handicapping attacker's confidence: an alternative to k-anonymization
Knowledge and Information Systems, 2007
We present an approach for limiting the confidence of inferring sensitive properties, to protect against the threats posed by data mining capabilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are specified by a set of "privacy templates". Each template specifies the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the confidence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum confidence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. Prior work on k-anonymization focuses on personal identities. This work focuses on the association between personal identities and sensitive properties.
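A small sketch may make the notion of a privacy template concrete. It assumes that the confidence of an inference is the fraction of records in a group (defined by the identifying attributes) that carry the sensitive value, and that a template is satisfied when the maximum of that fraction over all groups stays at or below the threshold. The function names, attribute names, and data are hypothetical; the paper's own suppression algorithm is not reproduced.

```python
from collections import Counter


def max_confidence(records, id_attrs, sensitive_attr, sensitive_value):
    """Largest fraction of records holding the sensitive value in any group."""
    totals = Counter()
    hits = Counter()
    for r in records:
        key = tuple(r[a] for a in id_attrs)
        totals[key] += 1
        if r[sensitive_attr] == sensitive_value:
            hits[key] += 1
    return max((hits[key] / totals[key] for key in totals), default=0.0)


def satisfies_template(records, id_attrs, sensitive_attr, sensitive_value, threshold):
    """Check one template: (identifying attributes -> sensitive value, threshold)."""
    conf = max_confidence(records, id_attrs, sensitive_attr, sensitive_value)
    return conf <= threshold


# Hypothetical raw table (illustrative values only).
raw = [
    {"job": "engineer", "sex": "M", "disease": "HIV"},
    {"job": "engineer", "sex": "M", "disease": "HIV"},
    {"job": "lawyer", "sex": "M", "disease": "flu"},
    {"job": "lawyer", "sex": "M", "disease": "flu"},
]

# Suppressing the job values merges the two groups into one.
suppressed = [{**r, "job": "*"} for r in raw]

print(max_confidence(raw, ["job", "sex"], "disease", "HIV"))
print(max_confidence(suppressed, ["job", "sex"], "disease", "HIV"))
print(satisfies_template(suppressed, ["job", "sex"], "disease", "HIV", 0.5))
```

In this toy table, suppression lowers the maximum confidence from 1.0 to 0.5, which matches the direction of the monotonicity result stated in the abstract.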
Fast Data Anonymization with Low Information Loss
2007
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual's record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) The information loss metrics are counter-intuitive and fail to capture data inaccuracies inflicted for the sake of privacy. (ii) ℓ-diversity is solved by techniques developed for the simpler k-anonymity problem, which introduces unnecessary inaccuracies. (iii) The anonymization process is inefficient in terms of computation and I/O cost.
Acta Universitatis Apulensis. Mathematics - Informatics, 2008
New privacy regulations, together with ever-increasing data availability and computational power, have created a huge interest in data privacy research. One major research direction is built around the k-anonymity property, which is required for the released data. Although k-anonymity protects against identity disclosure, it fails to provide an adequate level of protection with respect to attribute disclosure. We introduced a new privacy protection property called p-sensitive k-anonymity that avoids this shortcoming. We developed new algorithms (GreedyPKClustering and EnhancedPKClustering) and adapted an existing algorithm (Incognito) to generate masked microdata with the p-sensitive k-anonymity property. All these algorithms try to reduce the amount of information lost while transforming data to conform to p-sensitive k-anonymity. They differ in the masking methods they use. The new algorithms are based on local recoding masking methods. Incognito, initially designed for k-anonymity, uses global recoding for masking. This paper's goal is to compare the impact of the masking method on the quality of the masked microdata obtained. To this end, we compare the quality of the results (cost measures based on data utility) and the efficiency (running time) of these three algorithms for masking both real and synthetic data sets.
An adaptive mechanism for anonymizing set-valued data
Classification is a fundamental problem in data analysis. Training a classifier requires access to a large collection of data. Releasing person-specific data, such as customer data or patient records, may pose a threat to individuals' privacy. Even after removing explicit identifying information such as Name and SSN, it is still possible to link released records back to their identities by matching some combination of non-identifying attributes such as {Sex, Zip, Birth date}. A useful approach to combat such linking attacks, called k-anonymization, is anonymizing the linking attributes so that at least k released records match each value combination of the linking attributes. Previous work attempted to find an optimal k-anonymization that minimizes some data distortion metric. We argue that minimizing the distortion to the training data is not relevant to the classification goal, which requires extracting the structure of prediction on the "future" data. In this paper, we propose a k-anonymization solution for classification. Our goal is to find a k-anonymization, not necessarily optimal in the sense of minimizing data distortion, that preserves the classification structure. We conducted intensive experiments to evaluate the impact of anonymization on the classification of future data. Experiments on real-life data show that the quality of classification can be preserved even for highly restrictive anonymity requirements.
Efficient Personalized Privacy Preservation Using Anonymization
The k-anonymity privacy requirement for publishing microdata demands that each equivalence class contain at least k records. Many authors have shown that k-anonymity cannot prevent attribute disclosure. The technique of l-diversity has been introduced to address this; l-diversity requires that each equivalence class have at least l well-represented values for every sensitive attribute. In this paper, we show that l-diversity has many limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. Motivated by these limitations, we propose a new privacy notion called closeness. We first present the base model, t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). We then present a variant of t-closeness that gives higher utility. We describe our method for designing a distance measure between two probability distributions and give two distance measures. We discuss how to implement closeness as a privacy requirement and illustrate its advantages through examples and experiments.
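The two requirements contrasted in this abstract can be sketched as simple checks. The sketch makes two assumptions that are not taken from the abstract: "well-represented" is interpreted in its simplest form as "distinct" (distinct l-diversity), and the distance between distributions is the variational distance, half the sum of absolute differences, used as a simple stand-in for the distance measures the paper designs. All function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Empirical probability distribution of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two distributions (assumed distance measure)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def equivalence_classes(records, quasi_identifiers, sensitive_attr):
    """Map each quasi-identifier combination to its list of sensitive values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[a] for a in quasi_identifiers)].append(r[sensitive_attr])
    return classes


def is_distinct_l_diverse(records, quasi_identifiers, sensitive_attr, l):
    """Every equivalence class holds at least l distinct sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers, sensitive_attr)
    return all(len(set(vals)) >= l for vals in classes.values())


def is_t_close(records, quasi_identifiers, sensitive_attr, t):
    """Every class distribution is within distance t of the overall distribution."""
    overall = distribution([r[sensitive_attr] for r in records])
    classes = equivalence_classes(records, quasi_identifiers, sensitive_attr)
    return all(
        variational_distance(distribution(vals), overall) <= t
        for vals in classes.values()
    )


# Hypothetical table with two equivalence classes (illustrative values only).
records = [
    {"zip": "130**", "disease": "flu"},
    {"zip": "130**", "disease": "flu"},
    {"zip": "148**", "disease": "flu"},
    {"zip": "148**", "disease": "cancer"},
]

print(is_distinct_l_diverse(records, ["zip"], "disease", 2))
print(is_t_close(records, ["zip"], "disease", 0.25))
print(is_t_close(records, ["zip"], "disease", 0.2))
```

In this toy table the first class contains only "flu", so it fails distinct 2-diversity, while both classes lie at distance 0.25 from the overall distribution and therefore satisfy closeness for t = 0.25 but not for t = 0.2.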
Avoiding Attribute Disclosure with the (Extended) p-Sensitive k-Anonymity Model
Annals of Information Systems, 2009
Existing privacy regulations together with large amounts of available data created a huge interest in data privacy research. A main research direction is built around the k-anonymity property. Several shortcomings of the k-anonymity model were addressed by new privacy models such as p-sensitive k-anonymity, l-diversity, (α, k)-anonymity, and t-closeness. In this chapter we describe two algorithms (GreedyPKClustering and EnhancedPKClustering) for generating (extended) p-sensitive k-anonymous microdata. In our experiments, we compare the quality of generated microdata obtained with the mentioned algorithms and with another existing anonymization algorithm (Incognito). Also, we present two new branches of p-sensitive k-anonymity, the constrained p-sensitive k-anonymity model and the p-sensitive k-anonymity model for social networks.