TY - GEN
T1 - Privacy-Preserving Data Classification and Similarity Evaluation for Distributed Systems
AU - Jia, Qi
AU - Guo, Linke
AU - Jin, Zhanpeng
AU - Fang, Yuguang
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/8/8
Y1 - 2016/8/8
AB - Data classification is a widely used data mining technique for big data analysis. By training on massive data collected from the real world, data classification helps learners discover hidden data patterns. Beyond training, given a model trained on collected data, a user can determine whether new incoming data belongs to an existing class, or multiple distributed entities can collaborate to test the similarity of their trained results. However, due to data locality and privacy concerns, it is infeasible for entities in large-scale distributed systems to share their individual datasets with one another for data similarity checks. On the one hand, the trained model is an entity's private asset and may leak private information, so it should be well protected from all other non-collaborative entities. On the other hand, the new incoming data may contain sensitive information that cannot be disclosed directly for classification. To address these privacy issues, we propose a privacy-preserving data classification and similarity evaluation scheme for distributed systems. With our scheme, neither newly arriving data nor trained models are directly revealed during the classification and similarity evaluation procedures. The proposed scheme can be applied to many fields that use data classification and evaluation. Based on extensive real-world experiments, we have also evaluated the privacy preservation, feasibility, and efficiency of the proposed scheme.
KW - Data Classification
KW - Machine Learning
KW - Privacy Preservation
KW - Similarity Evaluation
UR - https://www.scopus.com/pages/publications/84986003682
U2 - 10.1109/ICDCS.2016.94
DO - 10.1109/ICDCS.2016.94
M3 - Conference contribution
AN - SCOPUS:84986003682
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 690
EP - 699
BT - Proceedings - 2016 IEEE 36th International Conference on Distributed Computing Systems, ICDCS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Conference on Distributed Computing Systems, ICDCS 2016
Y2 - 27 June 2016 through 30 June 2016
ER -