Privacy-Preserving Machine Learning Algorithms for Big Data Systems

Efficient Secure Building Blocks With Application to Privacy Preserving Machine Learning Algorithms

IEEE Access, 2021

Nowadays, different entities (such as hospitals, cybersecurity companies, and banks) collect data of the same nature but often with different statistical properties. It has been shown that if these entities combine their privately collected datasets to train a machine learning model, the resulting model often outperforms the human experts of the corresponding field(s) in terms of classification accuracy. However, due to legal, privacy, and cost reasons, no entity is willing to share its data with others. The same problem arises at the classification (inference) stage: the user does not want to reveal any information about the query or its final classification, while the owner of the trained model wants to keep the model private. In this article we overcome these drawbacks by first introducing novel, efficient, general-purpose secure building blocks, which can also be used to build privacy-preserving machine learning algorithms for both training and classification (inference) under strict privacy and security requirements. Our theoretical analysis and experimental results show that our building blocks (and hence also the privacy-preserving algorithms built on top of them) are more efficient than most, if not all, state-of-the-art schemes in terms of computation and communication cost, as well as security characteristics, in the semi-honest model. Furthermore, and to the best of our knowledge, for the Naïve Bayes model we extend this efficiency for the first time to also handle actively malicious users, who may arbitrarily deviate from the protocol.
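
As a concrete illustration of one classic building block in this family (not the authors' specific construction), the sketch below implements additive secret sharing over a prime field; the modulus P and the two-secret example are assumptions chosen for brevity.

```python
# Toy illustration of additive secret sharing, one classic
# "secure building block" (not the authors' actual protocol).
import secrets

P = 2**61 - 1  # a public prime modulus; all arithmetic is mod P

def share(x, n=2):
    """Split secret x into n additive shares that sum to x mod P."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Each party holds one share of each secret; additions can be
# done locally on shares, so x + y is computed without any
# party ever seeing x or y in the clear.
xs, ys = share(42), share(100)
zs = [(a + b) % P for a, b in zip(xs, ys)]
assert reconstruct(zs) == 142
```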

Privacy-preserving Machine Learning in Cloud

Proceedings of the 2017 on Cloud Computing Security Workshop, 2017

Machine learning algorithms based on deep neural networks (NNs) have achieved remarkable results and are being extensively used in different domains. At the same time, with the increasing growth of cloud services, several Machine Learning as a Service (MLaaS) offerings have emerged in which training and deployment of machine learning models are performed on cloud providers' infrastructure. However, machine learning algorithms require access to raw data, which is often privacy-sensitive and can create potential security and privacy risks. To address this issue, we develop new techniques for applying deep neural network algorithms to encrypted data. In this paper, we show that it is feasible and practical to train neural networks using encrypted data, to make encrypted predictions, and to return the predictions in encrypted form. We demonstrate the applicability of the proposed techniques and evaluate their performance. The empirical results show that they provide accurate privacy-preserving training and classification.
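
To illustrate the kind of additively homomorphic encryption that makes encrypted predictions possible (the paper's exact scheme may differ), here is a minimal textbook Paillier implementation; the primes p and q are deliberately tiny, insecure illustrative assumptions, and real keys use primes of roughly 1024 bits or more.

```python
# Minimal textbook Paillier cryptosystem (additively homomorphic),
# shown with insecure toy parameters to demonstrate computing on
# ciphertexts.
from math import gcd
import secrets

p, q = 293, 433                 # toy primes (illustrative assumption)
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)            # valid because g = n + 1 below

def enc(m):
    """Encrypt m in [0, n) under public key n."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if gcd(r, n) == 1:
            return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt with private key (lam, mu)."""
    return (pow(c, lam, n2) - 1) // n * mu % n

# Enc(a) * Enc(b) decrypts to a + b, and Enc(a)^k to k * a --
# exactly what an encrypted linear layer or dot product needs.
a, b = enc(20), enc(22)
assert dec(a * b % n2) == 42
assert dec(pow(a, 3, n2)) == 60
```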

A Secure Collaborative Machine Learning Framework Based on Data Locality

2015 IEEE Global Communications Conference (GLOBECOM), 2015

Advancements in big data analysis offer cost-effective opportunities to improve decision-making in numerous areas such as health care, economic productivity, crime, and resource management. Nowadays, data holders increasingly tend to share their data for better outcomes from the aggregated data. However, the current tools and technologies developed to manage big data are often not designed to incorporate adequate security or privacy measures during data sharing. In this paper, we consider a scenario where multiple data holders intend to learn predictive models from their joint data without revealing their own data to one another. The data locality property is used as an alternative to secure multi-party computation (SMC) techniques. Specifically, we distribute the centralized learning task to each data holder as local learning tasks, in such a way that each local learning task depends only on local data. Along with that, we propose an efficient and secure protocol to reassemble the local results into the final result. The correctness of our scheme is proved theoretically and numerically, and its security is analyzed from an information-theoretic perspective.
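
A hedged sketch of the "reassemble local results" step: the pairwise-masking trick below lets a coordinator learn only the sum of the holders' local values, with each individual contribution hidden behind masks that cancel. The field modulus and three-party example are assumptions for illustration, not the paper's actual protocol.

```python
# Secure aggregation by pairwise masking: party i adds mask m[i][j]
# for each peer j, with m[i][j] = -m[j][i] mod P, so all masks
# cancel in the global sum and only the total is revealed.
import secrets

P = 2**61 - 1

def masked_updates(local_values):
    k = len(local_values)
    m = [[0] * k for _ in range(k)]   # m[i][j]: mask between i and j
    for i in range(k):
        for j in range(i + 1, k):
            r = secrets.randbelow(P)
            m[i][j], m[j][i] = r, (-r) % P
    return [(v + sum(m[i])) % P for i, v in enumerate(local_values)]

local = [5, 11, 26]                 # each holder's private local result
masked = masked_updates(local)      # individually these look random
assert sum(masked) % P == sum(local) % P   # but the sum is exact
```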

New Directions in Efficient Privacy-Preserving Machine Learning

2020

Applications of machine learning have become increasingly common in recent years. For instance, navigation systems like Google Maps use machine learning to better predict traffic patterns, while Facebook, LinkedIn, and other social media platforms use machine learning to customize users' news feeds. Central to all these systems is user data. However, the sensitive nature of the collected data has also led to a number of privacy concerns. Privacy-preserving machine learning enables systems that can...

Privacy-Preserving Big Data Analytics: From Theory to Practice

2019

In the last decade, with the advent of the Internet of Things (IoT) and the Big Data phenomenon, data security and privacy have become very crucial issues. A significant portion of the problem is due to the failure to employ appropriate security and privacy measures in data and computational infrastructures. Secure multiparty computation (secure MPC) is a cryptographic tool that can be used to deal with these problems. This computational approach has attracted increasing attention, and there has been a significant amount of advancement in this domain. In this paper, we review the important theoretical foundations and practical advancements of secure multiparty computation. In particular, we briefly review three common cryptographic primitives used in secure MPC and highlight the main arithmetic operations that are performed at the core of secure MPC protocols. We also highlight the strengths and weaknesses of different secure MPC approaches, as well as the fundamental challenges in this domain. ...
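
One of those core arithmetic operations is secure multiplication on additively shared values using a Beaver triple. The toy two-party sketch below, with a trusted dealer standing in for the offline preprocessing phase, shows the idea; the modulus and party count are assumptions for brevity.

```python
# Illustrative Beaver-triple multiplication on additive shares,
# one of the core MPC arithmetic operations surveys discuss.
import secrets

P = 2**61 - 1

def share(x):
    r = secrets.randbelow(P)
    return [r, (x - r) % P]        # two-party additive shares

def open_(sh):
    return sum(sh) % P

# A trusted dealer (or offline phase) supplies a triple c = a * b.
a, b = secrets.randbelow(P), secrets.randbelow(P)
A, B, C = share(a), share(b), share(a * b % P)

x, y = 6, 7
X, Y = share(x), share(y)

# The parties open d = x - a and e = y - b; these leak nothing,
# since a and b act as uniformly random one-time pads.
d = open_([(X[i] - A[i]) % P for i in range(2)])
e = open_([(Y[i] - B[i]) % P for i in range(2)])

# Shares of x*y = c + d*b + e*a + d*e; the public constant d*e
# is added by party 0 only, so it is not counted twice.
Z = [(C[i] + d * B[i] + e * A[i] + (d * e if i == 0 else 0)) % P
     for i in range(2)]
assert open_(Z) == x * y % P
```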

An Efficient Outsourced Privacy Preserving Machine Learning Scheme With Public Verifiability

IEEE Access, 2019

Cloud computing has been widely applied in numerous applications for storage and data analytics tasks. However, cloud servers engaged through a third party cannot be fully trusted by multiple data users. Thus, security and privacy concerns become the main obstacles to using machine learning services, especially with multiple data providers. Several outsourced machine learning schemes have recently been proposed to preserve the privacy of data providers; yet, these schemes cannot satisfy the property of public verifiability. In this paper, we present an efficient privacy-preserving machine learning scheme for multiple data providers. The proposed scheme allows all participants in the system model to publicly verify the correctness of the encrypted data. Furthermore, a unidirectional proxy re-encryption (UPRE) scheme is employed to reduce the high computational costs that arise with multiple data providers. The cloud server embeds noise in the encrypted data, allowing analysts to apply machine learning techniques while preserving the privacy of data providers' information. The results and experimental tests demonstrate that the proposed scheme reduces computational costs and communication overheads.
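
The abstract does not specify the noise construction, so the sketch below illustrates one standard possibility: Laplace noise in the style of differential privacy, added to an aggregate before release. The sensitivity and epsilon parameters are assumptions for illustration, and the paper's own mechanism may differ.

```python
# Hedged illustration of noise embedding for privacy, using the
# Laplace mechanism from differential privacy as a stand-in.
import random

def laplace(scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_release(true_value, sensitivity=1.0, epsilon=0.5):
    """Perturb an aggregate with noise calibrated to sensitivity/epsilon."""
    return true_value + laplace(sensitivity / epsilon)

exact = sum([3, 1, 4, 1, 5])        # the private aggregate
print(noisy_release(exact))          # ~14, randomly perturbed
```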

Decentralized Machine Learning Models with Cryptographic Techniques

Everything from medical screening to disease outbreak detection could benefit from machine learning models trained on sensitive real-world data. And, thanks to the widespread use of mobile devices, even more detailed (and sensitive) information is becoming available. Traditional machine learning, however, involves a data pipeline that uses a central server (on-premises or in the cloud) to host the trained model and make predictions. Federated Learning (FL), in contrast, is a method of downloading the current model and computing an updated model using local data at the device itself (a.k.a. edge computing). These locally trained models are then sent back to the central server, where they are aggregated (i.e., their weights are averaged), and a single consolidated and improved global model is sent back to the devices. The exchanged parameters and the resulting model, however, may still reveal information about the training data used. Two approaches, based on Homomorphic Encryption and Secret Sharing techniques, among others, have been used in this report to address these privacy concerns. The report summarises previous research in these areas and makes recommendations for future research.
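
The aggregation step described above (weights averaged at the central server) can be sketched in a few lines. The learning rate, gradients, and two-client setup below are hypothetical; a real FL system would combine this averaging with the homomorphic-encryption or secret-sharing protections the report surveys.

```python
# Sketch of federated averaging: clients train locally, and the
# server only ever sees model weights, never the raw training data.

def local_update(weights, grads, lr=0.1):
    """One toy SGD step computed on-device with local data."""
    return [w - lr * g for w, g in zip(weights, grads)]

def fed_avg(client_weights):
    """Server-side aggregation: element-wise mean of client updates."""
    k = len(client_weights)
    return [sum(ws) / k for ws in zip(*client_weights)]

global_model = [0.5, -1.2, 3.0]
# hypothetical per-client gradients from private local data
client_grads = [[0.1, -0.2, 0.4], [0.3, 0.0, -0.1]]
updates = [local_update(global_model, g) for g in client_grads]
global_model = fed_avg(updates)     # [0.48, -1.19, 2.985]
```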

Towards Privacy Preserving Big Data Analytics

Big data is large and complex data that is difficult to process using traditional data processing systems. Big data analytics is the process of generating knowledge from large, varied datasets collected from multiple sources, using platforms such as high-performance computing clusters, Hadoop, and Spark. Because data is collected from multiple sources, the chances of a privacy breach have increased. It is difficult to apply existing privacy models (privacy-preserving techniques) in big data analytics because of the 3Vs that characterize big data: Volume (large amounts of data), Variety (structured, semi-structured, or unstructured data), and Velocity (fast generation and processing of data). This paper discusses a general architecture of big data analytics that shows the different stages of big data analytics, which can help identify the stages where privacy models can be applied. Based on a survey of existing privacy models, a summary has been prepared that shows the relation between privacy models and the 3Vs of big data.
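
As an example of one widely used privacy model that such a survey covers, the sketch below checks k-anonymity over a toy table; the quasi-identifier columns, generalized values, and choice of k are assumptions for illustration.

```python
# Toy k-anonymity check: a table is k-anonymous if every
# combination of quasi-identifier values occurs at least k times.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True if every quasi-identifier combination occurs >= k times."""
    groups = Counter(tuple(r[c] for c in quasi_ids) for r in rows)
    return all(count >= k for count in groups.values())

records = [
    {"age": "30-39", "zip": "120**", "disease": "flu"},
    {"age": "30-39", "zip": "120**", "disease": "cold"},
    {"age": "40-49", "zip": "121**", "disease": "flu"},
]
print(is_k_anonymous(records, ["age", "zip"], k=2))  # False: one group of 1
```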

Privacy-Preserving Machine Learning: Need, Methods, And Research Trends

International Journal of Current Science, 2022

Privacy-preserving machine learning (PPML) techniques are gaining more attention nowadays because most small, medium, and large companies are now shifting their data to the public cloud. The ability to store data in the cloud and access it from anywhere at any time creates value, but at the same time it also invites threats to data privacy. Most companies train ML models directly on their data in the public cloud. As data becomes an asset for companies, data theft aimed at obtaining other companies' data and extracting insightful information from it has grown accordingly. This paper mainly focuses on the need for PPML and the various tools and techniques used to protect data. It also explains several open challenges.

Privacy-Preserving Data Mining and Analytics in Big Data Environments

The exponential growth of Big Data has revolutionized numerous industries by enabling the extraction of valuable insights from vast and diverse datasets. However, this advancement is accompanied by significant privacy and security challenges that impede the full potential of data analytics. Privacy-Preserving Data Mining (PPDM) emerges as a critical approach to mitigate these challenges, ensuring individual privacy while maintaining data utility. This paper presents a comprehensive survey of state-of-the-art PPDM methodologies within Big Data environments, encompassing privacy models, data transformation techniques, privacy-preserving machine learning algorithms, and privacy economics. Through an extensive literature review and analysis of real-world applications in healthcare and finance, we identify key challenges and gaps in current practices. Additionally, we propose a cohesive privacy framework aimed at guiding researchers and practitioners in implementing robust privacy-preserving mechanisms. The study also explores emerging trends such as advanced cryptographic techniques, privacy-preserving query processing, and the integration of privacy in machine learning. By addressing the balance between data utility and privacy, this research contributes to the advancement of ethical and secure Big Data analytics, paving the way for future innovations and interdisciplinary collaborations in the field.