New Multi-dimensional Sorting Based K-Anonymity Microaggregation for Statistical Disclosure Control (original) (raw)

Microaggregation Sorting Framework for K-Anonymity Statistical Disclosure Control in Cloud Computing

IEEE Transactions on Cloud Computing, 2015

In cloud computing, there have led to an increase in the capability to store and record personal data (microdata) in the cloud. In most cases, data providers have no/little control that has led to concern that the personal data may be beached. Microaggregation techniques seek to protect microdata in such a way that data can be published and mined without providing any private information that can be linked to specific individuals. An optimal microaggregation method must minimize the information loss resulting from this replacement process. The challenge is how to minimize the information loss during the microaggregation process. This paper presents a sorting framework for Statistical Disclosure Control (SDC) to protect microdata in cloud computing. It consists of two stages. In the first stage, an algorithm sorts all records in a data set in a particular way to ensure that during microaggregation very dissimilar observations are never entered into the same cluster. In the second stage a microaggregation method is used to create k-anonymous clusters while minimizing the information loss. The performance of the proposed techniques is compared against the most recent microaggregation methods. Experimental results using benchmark datasets show that the proposed algorithms perform significantly better than existing associate techniques in the literature.

Novel Iterative Min-Max Clustering to Minimize Information Loss in Statistical Disclosure Control

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2015

In recent years, there has been an alarming increase of online identity theft and attacks using personally identifiable information. The goal of privacy preservation is to de-associate individuals from sensitive or microdata information. Microaggregation techniques seeks to protect microdata in such a way that can be published and mined without providing any private information that can be linked to specific individuals. Microaggregation works by partitioning the microdata into groups of at least k records and then replacing the records in each group with the centroid of the group. An optimal microaggregation method must minimize the information loss resulting from this replacement process. The challenge is how to minimize the information loss during the microaggregation process. This paper presents a new microaggregation technique for Statistical Disclosure Control (SDC). It consists of two stages. In the first stage, the algorithm sorts all the records in the data set in a particular way to ensure that during microaggregation very dissimilar observations are never entered into the same cluster. In the second stage an optimal microaggregation method is used to create k-anonymous clusters while minimizing the information loss. It works by taking the sorted data and simultaneously creating two distant clusters using the two extreme sorted values as seeds for the clusters. The performance of the proposed technique is compared against the most recent microaggregation methods. Experimental results using benchmark datasets show that the proposed algorithm has the lowest information loss compared with a basket of techniques in the literature.

A pairwise-systematic microaggregation for statistical disclosure control

Proceedings of the 10th IEEE …, 2011

Microdata protection in statistical databases has recently become a major societal concern and has been intensively studied in recent years. Statistical Disclosure Control (SDC) is often applied to statistical databases before they are released for public use. Microaggregation for SDC is a family of methods to protect microdata from individual identification. SDC seeks to protect microdata in such a way that can be published and mined without providing any private information that can be linked to specific individuals. Microaggregation works by partitioning the microdata into groups of at least records and then replacing the records in each group with the centroid of the group. An optimal microaggregation method must minimize the information loss resulting from this replacement process. The challenge is how to minimize the information loss during the microaggregation process. This paper presents a pairwise systematic (P-S) microaggregation method to minimize the information loss. The proposed technique simultaneously forms two distant groups at a time with the corresponding similar records together in a systematic way and then anonymized with the centroid of each group individually. The structure of P-S problem is defined and investigated and an algorithm of the proposed problem is developed. The performance of the P-S algorithm is compared against the most recent microaggregation methods. Experimental results show that P-S algorithm incurs less than half information loss than the latest microaggregation methods for all of the test situations.

K−Means Clustering Microaggregation for Statistical Disclosure Control

Advances in Intelligent Systems and Computing, 2012

This paper presents a K-means clustering technique that satisfies the biobjective function to minimize the information loss and maintain k-anonymity. The proposed technique starts with one cluster and subsequently partitions the dataset into two or more clusters such that the total information loss across all clusters is the least, while satisfying the k-anonymity requirement. The structure of K− means clustering problem is defined and investigated and an algorithm of the proposed problem is developed. The performance of the K− means clustering algorithm is compared against the most recent microaggregation methods. Experimental results show that K− means clustering algorithm incurs less information loss than the latest microaggregation methods for all of the test situations.

Extended K-Anonymity Model for Privacy Preserving on Micro Data

International Journal of Computer Network and Information Security, 2015

Today, information collectors, particularly statistical organizations, are faced with two conflicting issues. On one hand, according to their natural responsibilities and the increasing demand for the collected data, they are committed to propagate the information more extensively and with higher quality and on the other hand, due to the public concern about the privacy of personal information and the legal responsibility of these organizations in protecting the private information of their users, they should guarantee that while providing all the information to the population, the privacy is reasonably preserved. This issue becomes more crucial when the datasets published by data mining methods are at risk of attribute and identity disclosure attacks. In order to overcome this problem, several approaches, called p-sensitive k-anonymity, p+-sensitive k-anonymity, and (p, α)-sensitive k-anonymity, were proposed. The drawbacks of these methods include the inability to protect micro datasets against attribute disclosure and the high value of the distortion ratio. In order to eliminate these drawbacks, this paper proposes an algorithm that fully protects the propagated micro data against identity and attribute disclosure and significantly reduces the distortion ratio during the anonymity process.

Efficient Anonymization Algorithms to Prevent Generalized Losses and Membership Disclosure in Microdata

2017

Nowadays, data and knowledge extracted by data mining techniques represent a key asset driving research, innovation, and policy-making activities. Many agencies and organizations have recognized the need of accelerating such trends and are therefore willing to release the data they collected to other parties, for purposes such as research and the formulation of public policies. However, the data publication processes are today still very difficult. Data often contains personally identifiable information and therefore releasing such data may result privacy breaches, this is the case for the examples of micro-data, e.g., census data and medical data. This thesis studies how we can publish and share micro data in privacy-preserving manner. This present a next ensive study of this problem along three dimensions: Designing a simple, intuitive, and robust privacy model, designing an effective anonymization technique that works on sparse and high-dimensional data and developing a methodolo...

A DENSITY-BASED MICRO AGGREGATION TECHNIQUE FOR PRIVACY-PRESERVING DATA MINING

Microaggregation is an effective means of protecting the microdata in the statistical databases. Microaggragation protects the microdata by partitioning them into groups of at least k records in each group and substituting the records in each group with the centroid of the group. An optimal microaggregation can be achieved by minimizing the information loss incurred from the aggregation. This paper presents a density-based microagregation method for protecting the numeric data employing the densitybased notion of clustering. In this work we provide a microaggragation method which reduces the risk of microdata disclosure and incurs the minimum information loss with incresed data utility. In addition to it the paper also shows an experimental compariosn with the existing heuristics for microaggregation.

Privacy-preserving data publishing for cluster analysis

Data & Knowledge Engineering, 2009

Releasing person-specific data could potentially reveal sensitive information about individuals. k-anonymization is a promising privacy protection mechanism in data publishing. Although substantial research has been conducted on k-anonymization and its extensions in recent years, only a few prior works have considered releasing data for some specific purpose of data analysis. This paper presents a practical data publishing framework for generating a masked version of data that preserves both individual privacy and information usefulness for cluster analysis. Experiments on real-life data suggest that by focusing on preserving cluster structure in the masking process, the cluster quality is significantly better than the cluster quality of the masked data without such focus. The major challenge of masking data for cluster analysis is the lack of class labels that could be used to guide the masking process. Our approach converts the problem into the counterpart problem for classification analysis, wherein class labels encode the cluster structure in the data, and presents a framework to evaluate the cluster quality on the masked data.

TFRP: An efficient microaggregation algorithm for statistical disclosure control

Journal of Systems and Software, 2007

Recently, the issue of Statistic Disclosure Control (SDC) has attracted much attention. SDC is a very important part of data security dealing with the protection of databases. Microaggregation for SDC techniques is widely used to protect confidentiality in statistical databases released for public use. The basic problem of microaggregation is that similar records are clustered into groups, and each group contains at least k records to prevent disclosure of individual information, where k is a pre-defined security threshold. For a certain k, an optimal multivariable microaggregation has the lowest information loss. The minimum information loss is an NP-hard problem. Existing fixed-size techniques can obtain a low information loss with O(n 2) or O(n 3 /k) time complexity. To improve the execution time and lower information loss, this study proposes the Two Fixed Reference Points (TFRP) method, a two-phase algorithm for microaggregation. In the first phase, TFRP employs the pre-computing and median-of-medians techniques to efficiently shorten its running time to O(n 2 /k). To decrease information loss in the second phase, TFRP generates variable-size groups by removing the lower homogenous groups. Experimental results reveal that the proposed method is significantly faster than the Diameter and the Centroid methods. Running on several test datasets, TFRP also significantly reduces information loss, particularly in sparse datasets with a large k.

Personalised anonymity for microdata release

IET Information Security, 2018

Individual privacy protection in the released data sets has become an important issue in recent years. The release of microdata provides a significant information resource for researchers, whereas the release of person-specific data poses a threat to individual privacy. Unfortunately, microdata could be linked with publicly available information to exactly re-identify individuals' identities. In order to relieve privacy concerns, data has to be protected with a privacy protection mechanism before its disclosure. The k-anonymity model is an important method in privacy protection to reduce the risk of re-identification in microdata release. This model necessitates the indistinguishably of each tuple from at least k − 1 other tuples in the released data. While k-anonymity preserves the truthfulness of the released data, the privacy level of anonymisation is same for each individual. However, different individuals have different privacy needs in the real world. Thereby, personalisation plays an important role in supporting the notion of individual privacy protection. This study proposes a personalised anonymity model that provides distinct privacy levels for each individual by offering them to control their anonymity on the released data. To satisfy the personal anonymity requirements with low information loss, the authors introduce a clustering based algorithm.