Jianwei Qian | Illinois Institute of Technology
Papers by Jianwei Qian
We are speeding toward a not-too-distant future when we will perform human-computer interaction using voice alone. Speech recognition is the key technology that powers voice input, and it is usually outsourced to the cloud for the best performance. However, user privacy is at risk because voiceprints are directly exposed to the cloud, which gives rise to security issues such as spoofing attacks on speaker-authentication systems, as well as privacy issues: the speech content could, for instance, be abused for user profiling. To address this unexplored problem, we propose adding an intermediary between users and the cloud, named VoiceMask, that anonymizes speech data before sending it to the cloud for speech recognition. It mitigates the security and privacy risks by concealing voiceprints from the cloud. VoiceMask is built upon voice conversion but goes much further: it is resistant to two de-anonymization attacks and satisfies differential privacy. It performs anonymization on resource-limited mobile devices while still maintaining the usability of the cloud-based voice input service. We implement VoiceMask on Android and present extensive experimental results. The evaluation substantiates the efficacy of VoiceMask: for example, it reduces the chance of a user's voice being identified from 50 people by a mean of 84%, while reducing voice input accuracy by no more than 14.2%.
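To make the flavor of the mechanism concrete, here is a minimal Python sketch of voiceprint perturbation with a differentially private parameter choice. The frequency-warping approach, the parameter names (epsilon, max_warp), and the clipping rule are illustrative assumptions, not the paper's exact VoiceMask pipeline.

```python
import numpy as np
from scipy.signal import resample

def sanitize_utterance(samples, epsilon=1.0, max_warp=0.1):
    """Resample the waveform by a randomized warp factor; a factor != 1
    shifts pitch and formants, masking the speaker's voiceprint while
    keeping the content largely intelligible."""
    # Laplace noise of scale max_warp / epsilon makes the choice of warp
    # factor epsilon-differentially private; clipping keeps speech usable.
    noise = np.random.laplace(loc=0.0, scale=max_warp / epsilon)
    w = 1.0 + float(np.clip(noise, -max_warp, max_warp))
    return resample(samples, int(len(samples) * w))

# Example: perturb one second of a synthetic 16 kHz tone.
utterance = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
masked = sanitize_utterance(utterance, epsilon=0.5)
```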
While the European Union is implementing strict new data protection regulations, China's data trading and sharing markets are booming. Here, we survey the status of these developing markets, which are driven by growing demand from artificial intelligence (AI)-related industries, covering government encouragement as well as critical concerns and research opportunities, including privacy and security.
Deep learning has shown promising performance in a variety of pattern recognition tasks, owing to large quantities of training data and complex neural network structures. However, conventional deep neural network (DNN) training involves centrally collecting and storing the training data and then centrally training the network, which raises serious privacy concerns for the data producers. In this paper, we study how to enable deep learning without disclosing individual data to the DNN trainer. We analyze the risks in conventional deep learning training and then propose a novel idea, Crowdlearning, which decentralizes the heavy-load training procedure and deploys it onto a crowd of computation-restricted mobile devices, the very devices that generate the training data. Finally, we propose SliceNet, which ensures that mobile devices can afford the computation cost while minimizing the total communication cost. The combination of Crowdlearning and SliceNet ensures that sensitive data generated by mobile devices never leave the devices and that the training procedure discloses hardly any inferable content. We numerically simulate a prototype of SliceNet that crowdlearns an accurate DNN for image classification and demonstrate high performance, acceptable computation and communication cost, satisfactory privacy protection, and a favorable convergence rate on a benchmark DNN structure and dataset.
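The sketch below illustrates the core Crowdlearning idea under simplifying assumptions: each device computes a model update on its own data and shares only the update, never the raw data. The single-layer softmax model and plain gradient averaging are stand-ins for SliceNet's actual sliced-DNN architecture.

```python
import numpy as np

def local_gradient(w, X, y, num_classes):
    """One softmax-regression gradient computed entirely on-device."""
    logits = X @ w
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    onehot = np.eye(num_classes)[y]
    return X.T @ (p - onehot) / len(X)                # only this leaves the device

def crowd_round(w, devices, lr=0.1, num_classes=3):
    """Average per-device gradients; the raw (X, y) pairs stay local."""
    grads = [local_gradient(w, X, y, num_classes) for X, y in devices]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
devices = [(rng.normal(size=(32, 8)), rng.integers(0, 3, 32)) for _ in range(5)]
w = np.zeros((8, 3))
for _ in range(100):
    w = crowd_round(w, devices)
```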
The rapid propagation of information facilitates our work and lives to a degree without precedent in history, but it has also tremendously magnified the risk and consequences of privacy invasion. Today's attackers are increasingly powerful at gathering personal information from many sources and mining these data to further uncover users' private details. A great number of previous works have shown that, with adequate background knowledge, attackers can even infer sensitive information that was never revealed to any malicious party before. In this paper, we model the attacker's knowledge using a knowledge graph and formally define the privacy inference problem. We show its #P-hardness and design an approximation algorithm that performs privacy inference in an iterative fashion, which also reflects real-life network evolution. Simulations on two datasets demonstrate the feasibility and efficacy of privacy inference using knowledge graphs.
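As a rough illustration of iterative inference over an attacker's knowledge graph, the toy sketch below propagates belief about a hidden attribute along weighted relations until it stabilizes. The graph, the weights, and the max-product update rule are assumptions for illustration; the paper's approximation algorithm is more involved.

```python
# edges: (source, target, weight), read as "source's attribute is evidence
# about target's attribute with strength weight".
edges = [("alice", "bob", 0.8), ("bob", "carol", 0.6), ("alice", "carol", 0.3)]
belief = {"alice": 0.9, "bob": 0.0, "carol": 0.0}  # attacker starts knowing alice

for _ in range(20):  # iterate until the beliefs (approximately) stabilize
    new_belief = dict(belief)
    for src, dst, w in edges:
        # push each node's belief toward the strongest weighted evidence
        new_belief[dst] = max(new_belief[dst], w * belief[src])
    belief = new_belief

print(belief)  # carol's attribute is inferred with confidence 0.9*0.8*0.6 = 0.432
```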
Voice input has tremendously improved the user experience of mobile devices by freeing our hands from typing on small screens. Speech recognition is the key technology that powers voice input, and it is usually outsourced to the cloud for the best performance. However, the cloud might compromise users' privacy by identifying them by voice, learning their sensitive input content via speech recognition, and then profiling them based on that content. In this paper, we design an intermediary between users and the cloud, named VoiceMask, to sanitize users' voice data before sending it to the cloud for speech recognition. We analyze the potential privacy risks and aim to protect users' identities and sensitive input content from being disclosed to the cloud. VoiceMask adopts a carefully designed voice conversion mechanism that is resistant to several attacks. Meanwhile, it utilizes an evolution-based keyword substitution technique to sanitize the voice input content. Both sanitization phases are performed on the resource-limited mobile device while still maintaining the usability and accuracy of the cloud-supported speech recognition service. We implement the voice sanitizer on Android and present extensive experimental results that validate the effectiveness and efficiency of our app. We demonstrate that it reduces the chance of a user's voice being identified from 50 people by 84% while keeping the drop in speech recognition accuracy within 14.2%.
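The following toy sketch illustrates the keyword-substitution half of the pipeline under stated assumptions: sensitive terms in the recognized text are swapped for innocuous placeholders before content leaves the device and swapped back locally afterwards. The fixed keyword table is a placeholder; the paper derives substitutes with an evolution-based search.

```python
# Illustrative substitution table; VoiceMask evolves its substitutes.
SUBSTITUTES = {"diabetes": "weather", "bankruptcy": "gardening"}
REVERSE = {safe: sensitive for sensitive, safe in SUBSTITUTES.items()}

def sanitize(text: str) -> str:
    """Replace sensitive keywords before anything leaves the device."""
    for sensitive, safe in SUBSTITUTES.items():
        text = text.replace(sensitive, safe)
    return text

def restore(text: str) -> str:
    """Undo the substitution locally, after the cloud round-trip."""
    for safe, sensitive in REVERSE.items():
        text = text.replace(safe, sensitive)
    return text

query = "search for diabetes clinics near me"
masked = sanitize(query)       # "search for weather clinics near me"
# ... only the masked content is exposed to the cloud recognizer ...
recovered = restore(masked)    # original intent restored on-device
```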
Following the trend of data trading and data publishing, many online social networks have enabled potentially sensitive data to be exchanged or shared on the web. As a result, users' privacy could be exposed to malicious third parties, since such data are extremely vulnerable to de-anonymization attacks, in which the attacker links the anonymous nodes in the social network to their real identities with the help of background knowledge. Previous work on social network de-anonymization mostly focuses on designing accurate and efficient de-anonymization methods. We study this topic from a different perspective and investigate the intrinsic relation between the attacker's knowledge and the expected de-anonymization gain. One common intuition is that the more auxiliary information the attacker has, the more accurate de-anonymization becomes. However, the relation is much more complicated than that. To simplify the problem, we quantify background knowledge and de-anonymization gain under several assumptions. Our theoretical analysis and simulations on synthetic and real network data show that more background knowledge may not necessarily lead to more de-anonymization gain in certain cases. Though our analysis rests on a few assumptions, the findings still carry intriguing implications for attackers seeking to make better use of background knowledge when performing de-anonymization, and for data owners seeking to better measure the privacy risk when releasing their data to third parties.
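A small simulation in the spirit of the paper's question is sketched below, under illustrative assumptions: an Erdos-Renyi-style random graph and a naive neighbor-overlap attack in which the attacker knows a random fraction of each target's neighbors and matches anonymous nodes by best overlap.

```python
import random

def make_graph(n=200, p=0.05, seed=1):
    """Random undirected graph as adjacency sets."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def deanonymize(adj, knowledge_frac, rng):
    """The attacker knows a random fraction of each target's neighbor set
    and guesses the anonymous node with the largest overlap."""
    hits = 0
    for target, nbrs in adj.items():
        known = {v for v in nbrs if rng.random() < knowledge_frac}
        guess = max(adj, key=lambda c: len(adj[c] & known))
        hits += (guess == target)
    return hits / len(adj)

rng = random.Random(2)
adj = make_graph()
for frac in (0.2, 0.5, 0.8):
    print(frac, deanonymize(adj, frac, rng))  # accuracy vs. knowledge level
```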
Privacy-preserving data publishing has been an active research topic over the last decade. Numerous ingenious attacks on users' privacy and defensive measures have been proposed for the sharing of various kinds of data, ranging from relational data, social network data, and spatiotemporal data to images and videos. Speech data publishing, however, remains untouched in the literature. To fill this gap, we study the privacy risk in speech data publishing and explore the possibility of performing data sanitization to achieve privacy protection while simultaneously preserving data utility. We formulate this optimization problem in a general fashion and present thorough quantifications of privacy and utility. We analyze the subtle impacts of possible sanitization methods on privacy and utility, and design a novel method, key term perturbation, for speech content sanitization. A heuristic algorithm is proposed to personalize the sanitization for each speaker so as to restrict their privacy leak (a p-leak limit) while minimizing the utility loss. Simulations of linkage attacks and sanitization on real datasets validate the necessity and feasibility of this work.
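The sketch below gives one plausible reading of key term perturbation: terms that are rare across speakers identify a speaker most strongly, so the rarest terms are replaced first until the estimated leak falls under a per-speaker budget p. The rarity-based leak estimate and the generic placeholder token are illustrative assumptions, not the paper's exact p-leak formulation.

```python
from collections import Counter

def perturb(transcript, corpus_counts, total_docs, p_limit=0.3):
    words = transcript.split()

    def leak(w):
        """Rarer terms leak more identity (inverse document frequency)."""
        if w == "<TERM>":
            return 0.0  # placeholders are assumed to leak nothing
        return 1.0 - corpus_counts.get(w, 0) / total_docs

    # Replace highest-leak terms first, until the average leak fits the budget.
    order = sorted(range(len(words)), key=lambda i: leak(words[i]), reverse=True)
    for i in order:
        if sum(map(leak, words)) / len(words) <= p_limit:
            break
        words[i] = "<TERM>"  # the real method picks a utility-aware substitute
    return " ".join(words)

counts = Counter({"the": 95, "hospital": 40, "oncology": 3, "visit": 60})
print(perturb("visit the oncology hospital", counts, total_docs=100))
# -> "visit the <TERM> hospital": only the most identifying term is perturbed
```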
We propose AccountTrade, a set of accountable protocols for big data trading among dishonest consumers. To secure the big data trading environment, our protocols achieve bookkeeping ability and accountability against dishonest consumers who may misbehave during dataset transactions. Specifically, we study the responsibilities of consumers in dataset trading and design AccountTrade to achieve accountability against dishonest consumers who may try to deviate from those responsibilities. In particular, we propose the uniqueness index, a new rigorous measurement of data uniqueness, as well as several accountable trading protocols that enable data brokers to blame the dishonest consumer when misbehavior is detected. We formally define, prove, and evaluate the accountability of our protocols with an automatic verification tool as well as extensive evaluation on real-world datasets. Our evaluation shows that AccountTrade incurs negligible constant storage overhead per file (<10 KB) and can handle 8 to 1,000 concurrent data uploads per server, depending on the data type.
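The following sketch illustrates one plausible form of a uniqueness index: score a newly uploaded dataset by its maximum Jaccard similarity to datasets the broker already holds, and flag low-uniqueness uploads as likely resales. The shingling granularity and the 0.5 threshold are illustrative assumptions.

```python
def shingles(text: str, k: int = 8) -> set:
    """k-character shingles: a cheap fingerprint of the file's content."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def uniqueness_index(new_data: str, catalog: list) -> float:
    """1 minus the best Jaccard match against the broker's catalog."""
    new_sig = shingles(new_data)

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    max_sim = max((jaccard(new_sig, shingles(d)) for d in catalog), default=0.0)
    return 1.0 - max_sim

catalog = ["user,age,city\n1,34,chicago\n2,29,boston"]
upload = "user,age,city\n1,34,chicago\n2,29,boston\n3,41,denver"
if uniqueness_index(upload, catalog) < 0.5:
    print("flag upload: likely duplicate of an existing dataset")
```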
Social network data is widely shared, transferred, and published for research purposes and business interests, but this has raised much concern about users' privacy. Even when users' identity information is removed, attackers can still de-anonymize users with the help of auxiliary information. To protect against de-anonymization attacks, various privacy protection techniques for social networks have been proposed. However, most existing approaches assume specific and restricted network structures as background knowledge and ignore the attacker's semantic-level prior belief, assumptions that are not always realistic in practice and do not apply to arbitrary privacy scenarios. Moreover, privacy inference attacks in the presence of semantic background knowledge are barely investigated. To address these shortcomings, we introduce knowledge graphs to explicitly express arbitrary prior beliefs of the attacker about any individual user. The processes of de-anonymization and privacy inference are accordingly formulated based on knowledge graphs. Our experiments on data from real social networks show that knowledge graphs can power de-anonymization and inference attacks, and thus increase the risk of privacy disclosure. This suggests the validity of knowledge graphs as a general, effective model of attackers' background knowledge for social network attack and privacy preservation. Index Terms: social network data publishing, attack and privacy preservation, knowledge graph.
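To make the formulation concrete, the sketch below expresses an attacker's prior belief as weighted (subject, relation, object) triples and scores anonymous profiles against it. The triples, confidences, and additive scoring are illustrative assumptions rather than the paper's full formulation.

```python
# What the attacker believes about the target, with a confidence per triple.
prior = {
    ("target", "lives_in", "chicago"): 0.9,
    ("target", "works_at", "hospital"): 0.6,
    ("target", "likes", "cycling"): 0.4,
}

# Attribute sets observed on anonymized profiles in the published data.
anonymous_profiles = {
    "user_17": {("lives_in", "chicago"), ("works_at", "hospital")},
    "user_42": {("lives_in", "boston"), ("likes", "cycling")},
}

def match_score(profile: set) -> float:
    """Sum the confidence of every prior belief the profile confirms."""
    return sum(conf for (_, rel, obj), conf in prior.items()
               if (rel, obj) in profile)

best = max(anonymous_profiles, key=lambda u: match_score(anonymous_profiles[u]))
print(best)  # user_17 (score 1.5) is picked as the likely target
```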
Enormous volumes of online user behavior data are generated every day on the booming and ubiquitous Internet. Growing efforts have been devoted to mining this abundant behavior data to extract valuable information for research purposes or business interests. However, online users' privacy is thereby at risk of being exposed to third parties. The last decade has witnessed a body of research trying to perform data aggregation in a privacy-preserving way. Most existing methods guarantee strong privacy protection but at the cost of very limited aggregation operations, such as allowing only summation, which hardly satisfies the needs of behavior analysis. In this paper, we propose PPSA, a scheme that encrypts users' sensitive data to prevent privacy disclosure to both outside analysts and the aggregation service provider, and that fully supports selective aggregate functions for online user behavior analysis while guaranteeing differential privacy. We have implemented our method and evaluated its performance using a trace-driven evaluation based on a real online behavior dataset. Experimental results show that our scheme effectively supports both overall aggregate queries and various selective aggregate queries with acceptable computation and communication overheads.
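The toy sketch below captures the shape of such a scheme under simplifying assumptions: each user's selected value is blinded with masks that cancel in the sum, so the aggregator learns only the total, and Laplace noise makes the released aggregate differentially private. The mask-based blinding is a simple stand-in for PPSA's encryption-based construction.

```python
import numpy as np

def blind(values, rng):
    """Additive masks that sum to zero: each share alone reveals nothing
    about a user's value, yet the shares' total equals the true total."""
    masks = rng.uniform(-1e6, 1e6, size=len(values))
    masks -= masks.mean()  # force the masks to cancel in aggregate
    return values + masks

def dp_selective_sum(values, selected, epsilon=0.5, sensitivity=1.0, seed=0):
    rng = np.random.default_rng(seed)
    shares = blind(np.asarray(values, dtype=float) * selected, rng)
    total = shares.sum()  # the aggregator only ever sees blinded shares
    return total + rng.laplace(scale=sensitivity / epsilon)  # DP release

purchases = [1, 0, 1, 1, 0, 1]            # did each user buy? (sensitive)
is_mobile = np.array([1, 1, 0, 1, 0, 1])  # selective predicate: mobile users
print(dp_selective_sum(purchases, is_mobile))  # noisy count of mobile buyers
```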
Online user behavior analysis is becoming increasingly important and offers valuable information to analysts for developing better e-commerce strategies. However, it also raises significant privacy concerns. Recently, growing efforts have been devoted to protecting the privacy of individuals during data aggregation, a critical operation in behavior analysis. Unfortunately, existing methods allow very limited aggregation over user data, such as allowing only summation, which hardly satisfies the needs of behavior analysis. In this paper, we propose PPSA, a scheme that encrypts users' sensitive data to prevent privacy leakage to both analysts and the aggregation service provider, and that fully supports selective aggregate functions for differentially private data analysis. We have implemented our design and evaluated its performance using a trace-driven evaluation based on an online behavior dataset. Evaluation results show that our scheme effectively supports various selective aggregate queries with acceptable computation and communication overheads.
We propose a graph-based framework for privacy-preserving data publication that is a systematic abstraction of existing anonymity approaches and privacy criteria. Graphs are used for dataset representation, background knowledge specification, anonymity operation design, and attack inference analysis. The framework is designed to accommodate various datasets, including social networks, relational tables, temporal and spatial sequences, and even possibly unknown data models. The privacy and utility of the anonymized datasets are also quantified in terms of graph features. Our experiments show that the framework is capable of facilitating privacy protection by different anonymity approaches for various datasets with desirable performance.
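The sketch below illustrates the framework's core idea under illustrative assumptions: represent a dataset as a graph of records connected to attribute-value nodes, so that an anonymity operation such as generalization becomes a graph operation (merging value nodes). The table and the ZIP-code generalization are hypothetical examples.

```python
# Records as sets of attribute-value nodes they connect to in the graph.
records = {
    "r1": {"zip:60616", "age:34"},
    "r2": {"zip:60617", "age:35"},
    "r3": {"zip:60616", "age:34"},
}

def generalize(records, prefix_len=8):
    """Merge value nodes that share a prefix, e.g. zip:60616 and zip:60617
    both become zip:6061*, enlarging each record's anonymity set."""
    out = {}
    for r, vals in records.items():
        out[r] = {v[:prefix_len] + "*" if v.startswith("zip:") else v
                  for v in vals}
    return out

anonymized = generalize(records)
# r1 and r2 now attach to the same ZIP node, so structural queries on the
# graph can no longer distinguish them by ZIP code.
```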
Social network data is widely shared, transferred, and published for research purposes and business interests, but this has raised much concern about users' privacy. Even when users' identity information is removed, attackers can still de-anonymize users with the help of auxiliary information. To protect against de-anonymization attacks, various privacy protection techniques for social networks have been proposed. However, most existing approaches assume specific and restricted network structures as background knowledge and ignore the attacker's semantic-level prior belief, assumptions that are not always realistic in practice and do not apply to arbitrary privacy scenarios. Moreover, privacy inference attacks in the presence of semantic background knowledge are barely investigated. To address these shortcomings, we introduce knowledge graphs to explicitly express arbitrary prior beliefs of the attacker about any individual user. The processes of de-anonymization and privacy inference are accordingly formulated based on knowledge graphs. Our experiments on data from real social networks show that knowledge graphs can strengthen de-anonymization and inference attacks, and thus increase the risk of privacy disclosure. This suggests the validity of knowledge graphs as a general, effective model of attackers' background knowledge for social network privacy preservation. Index Terms: social network data publishing, attack and privacy preservation, knowledge graph.