
Research on Privacy Preserving Data Mining

【Author】 Yang Weijia (杨维嘉)

【Advisor】 Huang Shangteng (黄上腾)

【Author information】 Shanghai Jiao Tong University, Computer Application Technology, 2008, Ph.D.

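【Abstract, translated from the Chinese】 Data mining is one of the most important knowledge discovery tools in use today. It reveals hidden patterns in data and creates wealth, but it also demands large amounts of data of every kind. With the emergence and growth of the Internet, collecting, exchanging, and publishing the required data has become increasingly convenient. These rich data resources, however, also contain a great deal of personal private information, business intelligence, and government secrets. More worrying still, when the data is actually used, and especially during mining, much of this information can be exploited without any restriction; leaks of personal and confidential information seriously affect people's daily lives and even social stability. Because so much information is readily available during mining, public concern about the misuse of private information has come to focus on the use of mining tools.

Traditional protection methods cannot meet the urgent demand for privacy protection in data mining, because in protecting sensitive information they also obstruct the extraction of knowledge from the data. To address this difficult conflict between privacy protection and knowledge acquisition, we study and propose a series of procedures, protocols, and methods for transforming the original data. They prevent participants in the mining process from obtaining private information directly or indirectly, while still allowing mining algorithms to recover, from the transformed data, the information and knowledge contained in the original data. Results from extensive simulation experiments, together with comparisons against existing methods, confirm the effectiveness of our methods. We thereby eliminate the privacy leakage risks of the traditional mining process while still allowing the mining process to produce accurate results. The innovations and main contributions of this thesis are summarized as follows:

1. We propose that private information essentially consists of associations between data items, and we propose two strategies for protecting it. By studying the different data objects used in existing privacy protection models, we find that no single kind of data attribute can accurately represent the private information contained in a data set. Through further examples, theoretical analysis, and comparison, we identify the essential property of private information, namely the associations among data, and from it derive two classes of protection strategy: decomposing private information and transforming private information. These serve as the guiding ideas of our privacy protection research. We also describe in detail the reasons for and significance of privacy protection, and the scope and scenarios in which its models apply.

2. We propose a method that uses randomization to decompose private information, together with an adjustable mechanism that balances privacy protection against knowledge acquisition and removes the threat that prior knowledge poses to privacy. For the data publishing problem, we apply the decomposition strategy and propose a randomization-based privacy protection method. The method uses the distribution of the original data and randomly selects a portion of the original values to transform. Compared with the anonymization and diversity privacy models, it greatly increases the user's uncertainty about the original data while preserving most of the useful knowledge in the data. To counter privacy leaks that a user's prior knowledge might cause, we provide an adjustable method that balances privacy protection against mining accuracy.

3. We propose data transformation protocols and a data integration method that transform private information, protect privacy even under malicious collusion, and allow the degree of privacy protection to be customized on demand. Applying the transformation strategy, we give each data owner a way to transform its original data and a protocol for transmitting it, and we give the miner a method for integrating the different data sources. Both the transformation method and the protocols are based on transformations of the data matrix. In a semi-honest computing environment, the orthogonality of the transformation avoids the conflict between privacy protection and accurate mining entirely; under malicious collusion, our randomized transformation keeps the risk of privacy leakage within a bounded range. In addition, different attributes of a data set usually differ in importance in practice, so we also implement a way to customize the degree of protection, letting data owners protect different attributes flexibly according to their actual needs.

4. We propose a scalable privacy protection method that accommodates large numbers of participants and balances privacy protection, accurate mining, and scalability, and we further propose a protection method suited to high-dimensional data sets. Scalability has long been a challenge for privacy protection research. We quantitatively analyze how the number of participants in data mining affects privacy protection and mining accuracy, and propose a transformation of the original data that accommodates large numbers of data providers, so that the performance of the privacy protection method is independent of the number of participants. We also study how the independence of the perturbation affects privacy protection, and from this propose a privacy protection method that adapts flexibly to data sets of different dimensionality.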
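The randomization idea in item 2 above can be illustrated with a small sketch. It is not the thesis's actual algorithm, which the abstract does not spell out. It only shows the general technique of replacing a randomly chosen share of values with draws from the empirical distribution of the original column. The function name `randomize_column` and the tuning parameter `replace_prob` are hypothetical, introduced here as a stand-in for the adjustable privacy/accuracy mechanism.

```python
import random


def randomize_column(values, replace_prob, rng):
    """Return a copy of `values` in which each entry is replaced, with
    probability `replace_prob`, by a value drawn from the column itself.

    Drawing uniformly from the original entries follows the column's
    empirical distribution, so value frequencies are roughly preserved.
    `replace_prob` is an assumed knob: higher means more uncertainty
    about any single record, lower means closer to the original data.
    """
    pool = list(values)
    return [
        rng.choice(pool) if rng.random() < replace_prob else v
        for v in values
    ]


# Hypothetical sensitive column, for illustration only.
rng = random.Random(0)
original = ["flu"] * 50 + ["cold"] * 30 + ["asthma"] * 20
published = randomize_column(original, replace_prob=0.5, rng=rng)
```

Under these assumptions, setting `replace_prob` to 0 publishes the column unchanged, while setting it to 1 redraws every entry, which breaks the link between a record and its value but keeps the overall distribution approximately intact.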

【Abstract】 The recent development of networking and storage technologies makes it increasingly convenient to collect, process, or publish large volumes of data, which also contain a great amount of personal private information, business secrets, and classified information. Once the data is obtained, and especially during the mining process, most of it can be used without any restriction. As a result, once the sensitive part is disclosed, it seriously invades our privacy, disturbs our normal life, and may even threaten the security of our society. Data mining, as one of the most powerful technologies for knowledge discovery, reveals hidden information and data patterns in ordinary data. Although it brings knowledge and profit, the way it handles data raises severe problems. Concerns over data privacy have grown sharply, since anyone with access to the mining process can obtain the original data records, which in turn leads to a high risk of data misuse.

Therefore, in recent years, a number of techniques have been proposed to solve these problems. In our research, we aim to provide a privacy-preserving way of mining data by transforming the original data sets before the mining process. We have also developed several novel transformation techniques, so that we can still obtain accurate mining results while privacy is well protected. We summarize our main contributions as follows:

1. We have proposed a characterization of the essence of data privacy and two strategies for protecting it. In our research, we analyzed most of the current privacy-preserving methods, discussing in detail the structure of the privacy objects they use. We found that few of their definitions accurately describe the essence of data privacy, which makes it difficult for the corresponding methods to provide comprehensive protection. Based on this understanding, we redefined data privacy in terms of data associations, which are much closer to the concept of privacy in everyday life. We also proposed two kinds of strategies to protect privacy under this new definition. In addition, at the beginning of the thesis, we introduced in detail the background of privacy protection and its field of application.

2. We have proposed a novel method of randomized anonymization to decompose data privacy. Moreover, we have proposed a mechanism to trade off the level of accuracy against the level of privacy, so that the threats from prior knowledge are eliminated. In the data publishing scenario, we proposed a data randomization method that applies our first strategy. It randomly replaces the data in each record by using the distribution of the original data. Compared with the well-known k-anonymization techniques, our method not only offers a much higher level of privacy protection but also maintains the useful knowledge in the original data set. Furthermore, a user may use his prior knowledge to infer sensitive information that he is not allowed to know. We therefore developed a method to counteract the threats from this kind of knowledge in the data publishing problem. While the method brings more uncertainty to the inference of original values, it also provides a mechanism to balance privacy and accuracy.

3. We have proposed protocols for data transmission and data integration that transform data privacy, so that the threats from malicious adversaries are counteracted. Moreover, we have implemented customized privacy. Applying the second strategy, we presented an efficient clustering method for distributed multi-party data sets that uses orthogonal transformation and perturbation techniques. The miner, although receiving only the perturbed data, can still obtain accurate clustering results. This method protects data privacy not only in the semi-honest setting but also in the presence of collusion. Moreover, each attribute in a data set usually involves its own level of privacy concern, so it is necessary to give the data owner a mechanism to customize the perturbation of his own data. We implemented customized privacy, so that each variable in the data set can be perturbed according to its importance as specified by the owner.

4. We have proposed a scalable privacy-preserving method that adapts to different numbers of participants. Moreover, we have proposed a method to generate an independent perturbation. One of the main technical challenges for privacy-preserving data mining is to make its algorithms adaptable to the number of participants while still keeping the privacy and accuracy guarantees. We analyzed the influence on accuracy and privacy protection as the number of participants increases in the conventional method, and we proposed an improved method to handle a large number of participants. Moreover, we proved the importance of independent perturbation and proposed a method that adapts to data of large dimension.
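Item 3 of the abstract relies on the fact that an orthogonal transformation preserves Euclidean distances, which is why distance-based clustering on transformed data can match clustering on the original data. The sketch below illustrates only that property, using a 2-D rotation as the orthogonal map. It is a simplified illustration under stated assumptions, not the thesis's multi-party protocol, and it omits the additional perturbation the abstract mentions. The names `rotate`, `pairwise_distances`, and the sample records are hypothetical.

```python
import math
import random


def rotate(points, theta):
    """Apply the 2-D rotation matrix [[c, -s], [s, c]] to each point.

    A rotation matrix is orthogonal, so it preserves Euclidean distances.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


rng = random.Random(0)
theta = rng.uniform(0.0, 2.0 * math.pi)  # assumed secret of the data owner
records = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]  # illustrative data only
transformed = rotate(records, theta)

# The miner sees only `transformed`, yet all pairwise distances agree
# with those of the original records up to floating-point error.
before = pairwise_distances(records)
after = pairwise_distances(transformed)
distances_preserved = all(
    math.isclose(a, b, rel_tol=1e-9) for a, b in zip(before, after)
)
```

Because the distances are unchanged, any clustering algorithm that depends only on pairwise Euclidean distances would group the transformed records the same way as the originals, while the coordinate values themselves are hidden from the miner.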

  • 【Classification number】TP311.13
  • 【Times cited】10
  • 【Downloads】1275