2018 Differential Privacy Synthetic Data Challenge

Challenge Details

The Differential Privacy Synthetic Data Challenge tasked participants with creating new methods, or improving existing ones, for de-identifying data while preserving the dataset's utility for analysis. Competitors participated in three marathon matches on the Topcoder platform with the goal of designing, implementing, and proving that their synthetic data generation algorithms satisfied differential privacy: a mathematically provable guarantee of individual privacy protection that every solution was required to meet.
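For reference, the standard definition: a randomized algorithm M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in a single individual's record and for every set S of possible outputs:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S]

The privacy parameter ε bounds how much any one person's record can shift the algorithm's output distribution; smaller values of ε give stronger privacy.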

Differential Privacy Video Explanation

Video: What is Differential Privacy?

View Winner Open-Source Code

The Demand for Data Privacy

The growing influence of data is accompanied by an increased risk of exploitation when information falls into the wrong hands. Weaknesses in the security of the original data threaten the privacy of individuals, and even a dataset stripped of direct identifiers remains vulnerable to misuse, since records can often be re-identified by linking them with outside data sources.

NIST PSCR remains committed to the mission of advancing public safety research and has worked extensively to preserve security and privacy, achieved in part through de-identification. This challenge aimed to protect individual privacy while allowing public safety data to be used by researchers for beneficial purposes. Privacy in data release is an important concern for the Federal Government (which has an Open Data Policy), state governments, the public safety sector, and many commercial and non-governmental organizations. Developments from this competition were designed to drive major advances in the practical application of differential privacy for these organizations.

Privacy-preserving synthetic data tools support the use of datasets with significant public value that otherwise could not be shared due to the presence of sensitive personal information. For common analytics tasks such as clustering, classification, and regression, a synthetic (or artificial) dataset can serve as a practical replacement for the original sensitive data. By mathematically proving that a synthetic data generator satisfies the rigorous differential privacy guarantee, we can be confident that the synthetic data it produces won't contain any information that can be traced back to specific individuals in the original data.
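To make this concrete, here is a minimal sketch of one classic way to build a differentially private synthetic data generator: add Laplace noise to a histogram of the sensitive data, then sample synthetic records from the noisy histogram. This illustrates the general technique only; it is not the method of any challenge winner, and the function name and parameters are illustrative.

```python
import numpy as np

def dp_synthetic_column(data, bins, epsilon, n_samples, rng=None):
    """Draw synthetic values for one numeric column via a noisy histogram.

    data      : 1-D array of sensitive numeric records
    bins      : data-independent bin edges for the histogram
    epsilon   : privacy budget (smaller = stronger privacy)
    n_samples : number of synthetic records to generate
    """
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(data, bins=bins)

    # Adding or removing one individual changes exactly one bin count
    # by 1, so the histogram has L1 sensitivity 1 and Laplace noise
    # with scale 1/epsilon yields epsilon-differential privacy.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)  # negative counts are meaningless
    if noisy.sum() == 0:             # degenerate case: fall back to uniform
        noisy[:] = 1.0

    # Everything below is post-processing of the noisy histogram, so
    # it consumes no additional privacy budget.
    probs = noisy / noisy.sum()
    idx = rng.choice(len(probs), size=n_samples, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])
```

Real challenge entries had to handle high-dimensional, mixed-type data and split the privacy budget across many such measurements, but the privacy argument follows the same pattern: bound each individual's influence on what is measured, then add noise calibrated to that bound.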

The Challenge Impact

The challenge built upon Phase 1, the Unlinkable Data Challenge, in which contestants proposed mechanisms to protect personally identifiable information while maintaining a dataset's utility for analysis. Phase 2 applied these concepts in an algorithmic competition that produced differentially private synthetic data tools to de-identify privacy-sensitive datasets. As a result, this challenge marked the first time that competing differential privacy approaches were clearly benchmarked against one another. It was also a new opportunity for leaders in the field to convert differential privacy theory into practical, applied algorithms. These outcomes have significant implications for differential privacy research and data privacy.

Read more about the details of this challenge on Challenge.gov and the Topcoder forum (account login required).