Privacy-Preserving Important Passage Retrieval

Privacy-Preserving Multi-Document Summarization

State-of-the-art extractive multi-document summarization systems are usually designed without any concern for privacy, meaning that all documents are open to third parties. In this paper we propose a privacy-preserving approach to multi-document summarization. Our approach enables other parties to obtain summaries without learning anything else about the content of the original documents. We use a hashing scheme known as Secure Binary Embeddings to convert document representations containing key phrases and bag-of-words features into bit strings, which allows approximate rather than exact distances to be computed. Our experiments indicate that our system yields results similar to those of its non-private counterpart on standard multi-document evaluation datasets.

Privacy-Preserving Text Indexing for Search of Documents

Protecting the content of sensitive text documents is important in enterprise intranets. An index structure is needed to support efficient search and retrieval, but it can leak information: through statistical attacks, an adversary can draw probabilistic inferences about the contents of the document collection. Zerr and others present a confidential index structure and a ranking of the documents retrieved for a query, but only for single-term queries. The solution proposed in the paper generalizes Zerr’s method by using an anonymization parameter and query-dependent anonymized inverse document frequency factors; it thereby provides better ranking and makes multi-term queries possible.

A Privacy-Preserving Similarity Search Scheme over Encrypted Word Embeddings

Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services

The recent evolution of cloud computing platforms has attracted more data than ever before. Today, even the most sensitive data are being outsourced, so protection is essential to ensure that privacy is not traded for the convenience provided by cloud platforms. Traditional symmetric encryption schemes provide good protection; however, they negate the merits of cloud computing. Attempts have been made to obtain a scheme in which both functionality and protection can be achieved. However, the features provided by existing searchable encryption schemes tend to lag behind the latest findings in the information retrieval (IR) area. In this study, we propose a privacy-preserving similar-document search system based on Simhash. Our scheme is open to the latest machine-learning-based IR schemes, and its performance has been tuned using a VP-tree-based index, which is optimized for security. Analysis and various tests on real-world datasets demonstrate the scheme's security and efficiency. CCS Concepts: • Security and privacy → Privacy-preserving protocols; Public key (asymmetric) techniques; • Information systems → Retrieval models and ranking.
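As a rough illustration of the Simhash fingerprinting that the abstract above builds on, the following is a minimal sketch of plain Simhash over text tokens. It does not reproduce the paper's scheme: there is no encryption, no word embeddings, and no VP-tree index here. The function names `simhash` and `hamming`, the use of SHA-256 as the per-token hash, and the 64-bit width are all assumptions made for this example.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a `bits`-wide Simhash fingerprint for an iterable of string tokens.

    Assumption for this sketch: `bits` is a multiple of 8 and at most 256,
    because each token hash is taken from the first bits/8 bytes of SHA-256.
    """
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the bits whose vote total is positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the cloud stores encrypted documents for search".split()
doc_b = "the cloud stores encrypted documents for retrieval".split()
print(hamming(simhash(doc_a), simhash(doc_b)))
```

Documents that share most of their tokens tend to produce fingerprints with a small Hamming distance, which is what makes the fingerprint usable as a similarity-search key.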

Towards Robust and Privacy-preserving Text Representations

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018

Written text often provides sufficient clues to identify the author, their gender, age, and other important attributes. Consequently, the authorship of training and evaluation corpora can have unforeseen impacts, including differing model performance for different user groups, as well as privacy implications. In this paper, we propose an approach to explicitly obscure important author characteristics at training time, such that the learned representations are invariant to these attributes. Evaluating on two tasks, we show that this leads to increased privacy in the learned representations, as well as models that are more robust to varying evaluation conditions, including out-of-domain corpora.

PRIVACY PRESERVING NATURAL LANGUAGE PROCESSING IN THE CLOUD SUPPORTING SIMILARITY BASED TEXT RETRIEVAL THROUGH BLIND STORAGE

A fundamental application in cloud computing is to protect outsourced data in the cloud through gateway encryption and blind storage, and to support secure multi-keyword ranked search over the encrypted data by means of natural language processing (NLP). The NLP step, which supports multi-keyword search in the cloud, extracts word meanings using the WordNet tool. In this paper, we develop searchable encryption for multi-keyword ranked search over the stored data. The resulting multi-keyword search scheme is efficient and returns search results ranked by accuracy. Within this framework, we leverage an efficient index to further improve search efficiency, and adopt the blind storage system to conceal the access pattern of the search user. Security analysis demonstrates that our scheme achieves confidentiality of documents and index, trapdoor privacy, trapdoor unlinkability, and concealment of the search user's access pattern. Finally, using extensive simulations, we show that our proposal achieves much improved efficiency in terms of authentication and access control compared with existing proposals.

Preserving Privacy in Analyses of Textual Data

2020

Amazon prides itself on being the most customer-centric company on earth. That means maintaining the highest possible standards of both security and privacy when dealing with customer data. This month, at the ACM Web Search and Data Mining (WSDM) Conference, my colleagues and I will describe a way to protect privacy during large-scale analyses of textual data supplied by customers. Our method works by, essentially, re-phrasing the customer-supplied text and basing analysis on the new phrasing, rather than on the customers’ own language.

Lightweight, Secure, Similar-Document Retrieval over Encrypted Data

Applied Sciences, 2021

Applications for document similarity detection are widespread in diverse communities, including institutions and corporations. However, currently available detection systems fail to take into account the private nature of material or documents that have been outsourced to remote servers. None of the existing solutions can be described as lightweight techniques that are compatible with lightweight client implementation, and this deficiency can limit the effectiveness of these systems. For instance, the discovery of similarity between two conferences or journals must maintain the privacy of the submitted papers in a lightweight manner to ensure that the security and application requirements for limited-resource devices are fulfilled. This paper considers the problem of lightweight similarity detection between document sets while preserving the privacy of the material. The proposed solution permits documents to be compared without disclosing the content to untrusted servers. The finger...

Calibrating Mechanisms for Privacy Preserving Text Analysis

2020

This talk presents a formal approach to privacy-preserving text perturbation using a variant of Differential Privacy (DP) known as Metric DP (mDP). Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space, as defined by word embedding models. We present a privacy proof that satisfies mDP, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible deniability statistics, backed up by large-scale analysis on GloVe and fastText embeddings. We also conduct experiments on well-known datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results provide insights into carrying out practical privatization in text-based applications for a broad range of tasks.
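The word-level perturbation described above can be sketched as follows. This is an illustration under stated assumptions, not the talk's calibrated mechanism: the noise is drawn with a uniformly random direction and a Gamma-distributed radius with scale 1/ε (one common way to obtain a density proportional to exp(-ε‖z‖)), the noisy vector is mapped back to the nearest vocabulary word by Euclidean distance, and the two-dimensional `toy_embeddings` table and the names `sample_noise` and `privatize_word` are hypothetical.

```python
import math
import random


def sample_noise(dim, epsilon, rng):
    """Draw a noise vector with density proportional to exp(-epsilon * ||z||)."""
    # Uniform direction: normalize a standard Gaussian vector.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    # Radius: Gamma with shape `dim` and scale 1/epsilon.
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * d / norm for d in direction]


def privatize_word(word, embeddings, epsilon, rng):
    """Perturb a word's vector, then return the nearest word in the vocabulary."""
    vec = embeddings[word]
    noise = sample_noise(len(vec), epsilon, rng)
    noisy = [v + n for v, n in zip(vec, noise)]
    return min(embeddings, key=lambda w: math.dist(embeddings[w], noisy))


# Hypothetical toy vectors; real use would load GloVe or fastText embeddings.
toy_embeddings = {
    "doctor": [1.0, 0.0],
    "nurse": [1.0, 1.0],
    "banana": [-3.0, 4.0],
}

rng = random.Random(0)
print([privatize_word("doctor", toy_embeddings, epsilon=1.0, rng=rng) for _ in range(5)])
```

Smaller values of ε add more noise, so the output word is more likely to differ from the input; very large values of ε leave the word essentially unchanged.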

Privacy-preserving Query-by-Example Speech Search

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

This paper investigates a new privacy-preserving paradigm for the task of Query-by-Example Speech Search using Secure Binary Embeddings, a hashing method that converts vector data to bit strings through a combination of random projections followed by banded quantization. The proposed method allows performing spoken query search in an encrypted domain, by analyzing ciphered information computed from the original recordings. Unlike other hashing techniques, the embeddings allow the computation of the distance between vectors that are close enough, but are not perfect matches. This paper shows how these hashes can be combined with Dynamic Time Warping based on posterior derived features to perform secure speech search. Experiments performed on a sub-set of the Speech-Dat Portuguese corpus showed that the proposed privacy-preserving system obtains similar results to its non-private counterpart.
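The abstract above describes Secure Binary Embeddings as random projections followed by banded quantization. A minimal sketch of that idea is given below. The specifics are assumptions for illustration rather than the paper's implementation: a Gaussian projection matrix, a uniform dither in [0, Δ], a band width Δ (`delta`), and the hypothetical names `make_sbe` and `normalized_hamming`. The projection matrix and dither play the role of the secret key.

```python
import math
import random


def make_sbe(dim, n_bits, delta, seed=0):
    """Build an embedding function x -> bit list using banded quantization."""
    rng = random.Random(seed)
    # Secret parameters: random projection rows and one dither value per bit.
    projection = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    dither = [rng.uniform(0.0, delta) for _ in range(n_bits)]

    def embed(x):
        bits = []
        for row, w in zip(projection, dither):
            proj = sum(a * xi for a, xi in zip(row, x))
            # Alternating bands of width delta map to 0, 1, 0, 1, ...
            bits.append(math.floor((proj + w) / delta) % 2)
        return bits

    return embed


def normalized_hamming(a, b):
    """Fraction of positions in which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


embed = make_sbe(dim=4, n_bits=256, delta=1.0)
x = [0.2, -0.1, 0.4, 0.3]
near = [0.2, -0.1, 0.4, 0.3 + 1e-6]
far = [5.0, -7.0, 3.0, 9.0]
print(normalized_hamming(embed(x), embed(near)))
print(normalized_hamming(embed(x), embed(far)))
```

Because the bands repeat, the Hamming distance between two hashes tracks the original distance only while the vectors are close relative to `delta`; for distant vectors it saturates near 0.5, which is the property the abstract refers to when it says distances are computable only for vectors that are close enough.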

Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text

2019 IEEE International Conference on Data Mining (ICDM), 2019

Guaranteeing a certain level of user privacy in an arbitrary piece of text is a challenging issue. However, with this challenge comes the potential of unlocking access to vast data stores for training machine learning models and supporting data-driven decisions. We address this problem through the lens of dχ-privacy, a generalization of Differential Privacy to non-Hamming distance metrics. In this work, we explore word representations in Hyperbolic space as a means of preserving privacy in text. We provide a proof satisfying dχ-privacy, then we define a probability distribution in Hyperbolic space and describe a way to sample from it in high dimensions. Privacy is provided by perturbing vector representations of words in high-dimensional Hyperbolic space to obtain a semantic generalization. We conduct a series of experiments to demonstrate the tradeoff between privacy and utility. Our privacy experiments illustrate protections against an authorship attribution algorithm, while our utility experiments highlight the minimal impact of our perturbations on several downstream machine learning models. Compared to the Euclidean baseline, we observe > 20x greater guarantees on expected privacy against comparable worst-case statistics.