节点文献

基于粗糙集理论的隐私保护数据挖掘研究

Privacy Preserving Data Mining Based on Rough Set Theory

【作者】 叶明全

【导师】 吴信东; 胡学钢;

【作者基本信息】 合肥工业大学 , 计算机应用技术, 2013, 博士

【摘要】 数据挖掘在各种数据库应用中扮演着非常重要的角色,它在为人们揭示出数据中的隐含知识并创造出财富的同时,也对人们的隐私和数据安全构成了威胁。随着人们隐私保护意识的加强以及相关法律法规的健全,隐私保护数据挖掘已成为数据挖掘领域中一个比较活跃的研究方向。k-匿名模型是隐私保护中最重要的模型之一,其基本思想是对发布数据集中准标识符的属性值作概化操作,以消除链接攻击,实现个体隐私的保护。但是,数据概化将会增加属性值的不确定性,而且不恰当的概化操作将会造成不必要的信息损失,降低匿名数据的效用性。因此,如何对属性值概化选择合适的粒度层次,实现在保护隐私的同时获得良好的数据效用性,这是亟需解决的一个问题。粗糙集理论是一种能够有效地分析不精确、不一致和不完备信息的粒计算模型。然而,传统粗糙集模型及其扩展研究没有考虑决策表中数据具有层次结构。现实世界中,层次型数据是普遍存在的,并广泛应用于数据仓库、层次知识发现和k-匿名模型中。如何扩展现有的粗糙集理论模型及方法,以适应层次型数据处理,这是粗糙集研究中的一个重要问题。另外,分布式挖掘是数据挖掘应用增长较快的领域,分布式环境下隐私保护属性约简是粗糙集研究中的另一个重要问题。本文分别以决策表中层次型数据和多源决策表作为研究对象,对粗糙集扩展模型及其在隐私保护数据挖掘方面的应用进行了比较系统地研究,主要完成了以下几项工作:(1)针对决策表中层次型数据,结合传统粗糙集理论与全子树概化模式,提出了一种新颖的基于属性值分类的多层粗糙集模型。该模型将传统粗糙集从条件属性集上一个原始单层值域扩展为一个具有不同粒度层次的值域,并将传统粗糙集从论域上一个等价关系扩展为一个嵌套的等价关系序列。同时,研究了多层粗糙集的基本性质和相关嵌套序列度量。这些研究成果将进一步完善和扩展粗糙集理论。(2)针对决策表中属性值概化的粒度层次选择问题,结合粗糙集理论的属性约简思想,从粗糙集理论的正区域、信息熵和知识粒度三个角度,提出一种基于多层粗糙集的概化约简概念。概化约简能够在保持原始决策表分类能力不变的前提下,将所有属性值概化到最粗粒度层次,从而避免数据过度概化或概化不足问题。分析了属性约简和概化约简之间的内在关系,揭示属性约简是概化约简的一种特例,提出两种自顶向下逐步细化的正区域概化约简启发式算法。为了避免概化约简的过程,实现直接从正区域嵌套序列中依次提取层次决策规则,提出一种自顶向下逐步细化的多层决策规则挖掘算法。通过标准UCI数据集验证了上述方法的有效性和实用性。这些工作将进一步促进多层粗糙集模型在数据挖掘中的实际应用。(3)为了权衡隐私保护和数据效用性,结合粗糙集理论的属性约简思想,引入条件信息熵来度量匿名数据质量的准则,分析了该准则在准标识符的属性值自顶向下逐步细化或自底向上逐步粗化时的重要性质。然后,提出一种高效的引导匿名化操作过程的搜索指标,在此基础上,提出一种自顶向下逐步细化的基于层次条件信息熵的k-匿名启发式算法HCE-TDR。通过理论分析和仿真实验,结果表明HCE-TDR算法能够在保证数据发布满足k-匿名要求的同时,提高在匿名数据上进行分类建模的效果。这些研究工作将拓展多层粗糙集模型的应用范围,同时进一步促进k-匿名模型在隐私保护数据挖掘中的实际应用。(4)针对垂直划分多源异构决策表,采用半可信第三方和可交换加密机制,设计了一个安全交集基数协议和安全条件信息熵协议,然后提出了一种基于条件信息熵的垂直分布隐私保护属性约简算法。该算法利用粗糙集信息观的约简理论实现了分布式环境下合作约简任务,使各参与方在不共享其隐私信息的前提下达到集中式属性约简的效果。针对垂直划分多源异构决策表和水平划分多源同构决策表,分别设计一种基于安全交集基数协议的安全相对粒度协议和基于安全点积协议的安全相对粒度协议,并提出一种基于相对粒度的分布式隐私保护分布式属性约简算法。实例验证了上述算法的可行性和有效性。这些研究工作促进了粗糙集理论在分布式环境下隐私保护特征选择方面的应用。

【Abstract】 Data mining plays a key role in many database applications, reveals to us the hidden information and data patterns from the normal data. However, when data mining brings us knowledge and profits, it also poses the threat to people’s privacy and the data security. With the strengthening of people’s privacy protection awareness and the establishment of relevant laws and regulations, Privacy Preserving Data Mining (PPDM for short) has become an active area in data mining research. K-anonymity is one of the most important anonymity models to prevent privacy leakage. The principle is to generalize attribute values on quasi-identifiers before data publishing to avoid linking attacks, and thus to achieve the protection of individual privacy. However, data generalization increases the uncertainty of attribute values, and an improper generalization operation will result in unnecessary loss of information, reduce the utility of the anonymous data. Therefore, it is important to select the right granularity level for providing privacy preservation while improving the data utility.Rough set theory is a granular computing model, and can effectively analyze imprecise, inconsistent and incomplete information. One important application of rough sets is attribute reduction and classification rule acquisition in decision tables. However, most of the previous studies on the traditional rough set model and its extensions do not consider decision tables with hierarchical attribute values. Data with hierarchical attribute values are, however, commonly seen in real-world applications, and have been widely used in data warehouses, knowledge discovery at different levels of granularity and the K-anonymity model. It is becoming an important research topic in rough sets to extend the existing theories and approaches of rough sets to deal with hierarchical data. In addition, distributed mining is an area of rapid growth in data mining applications, and privacy preserving attribute reduction in distributed environments is necessary to address important research topic in rough sets.With decision tables with hierarchical attribute values and multi-sources as the research object, this dissertation systematically studies an extension model of rough sets and its application in PPDM. The main contributions can be summarized as follows.(1) To deal with hierarchical data, we extend Pawlak’s rough set model, and propose a novel Multi-Level Rough Set (MLRS for short) based on attribute value taxonomies and full-subtree generalization. MLRS extends the value domains of condition attributes under the original single-level granulation to a domain generalization hierarchy under a multi-level granulation, and extends one equivalence relation on the universe to a nested sequence of equivalence relations. The relationships between Pawlak’s rough set and MLRS are studied, and the nested sequence measures of MLRS are discussed. The results of these studies will be further refined and extended in rough set theory.(2) To select generalization level of attribute value in a decision table, based on attribute reduction in rough set theory, a novel concept of generalization reduction based on MLRS is presented from three perspectives of the rough set theory, such as positive region, information entropy, and knowledge granulation. Generalization reduction can induce the most abstract multi-level decision table with the same classification ability on the raw decision table, and no other multi-level decision table exists that is more abstract. It can avoid over-generalization or under-generalization. Furthermore, the relationships between attribute reduction in Pawlak’s rough set model and generalization reduction in MLRS are discussed, and develop two kinds of positive region heuristic algorithms for computing the generalization reduction which adopt the top-down refinement. Finally, an approach named RMTDR for mining multi-level decision rules is provided. It can avoid the generalization reduction process and mine decision rules from different concept levels. Through standard UCI data sets, experiment results show the validity and effectiveness of the proposed methods. These efforts will further promote the practical application of MLRS in data mining.(3) To protect individual privacy while maintaining the utility of the data in building classification models, we make use of attribute reduction in rough set theory, apply the conditional entropy to measure the classification ability of an anonymous table, and develop k-anonymity for classification analysis from the information view of rough sets. We extend general conditional entropy under single-level granulation to a hierarchical conditional entropy under multi-level granulation, and study its properties by dynamically bottom-up coarsening or top-down refining attribute values. Guided by these properties, we develop an efficient search metric for the tradeoff principle between information gain and anonymity loss, and present a novel algorithm for achieving k-anonymity, Hierarchical Conditional Entropy-based Top-Down Refinement (HCE-TDR). Theoretical analysis and experiments on real world datasets show that our algorithm is efficient and improves data utility. These studies will expand the application of MLRS, and further promote the practical application of the k-anonymity model for PPDM.(4) To solve the privacy preserving attribute reduction problem for multi-source heterogeneous decision tables, we design a secure set intersection cardinality protocol using semi-trusted third party and commutative encryption, develop a novel secure conditional information entropy protocol, and present a vertical privacy-preserving attribute reduction algorithm based on conditional information entropy. This algorithm can achieve co-reduction using attribute reduction based on the information view of rough set theory, which can perform accurate attribute reduction on the premise of no sharing of private information among participants. Also, to solve the privacy preserving attribute reduction problem for multi-source heterogeneous decision tables, we develop a novel secure relative granularity protocol based on secure set intersection cardinality protocol, and propos a vertical privacy-preserving attribute reduction algorithm based on relative granularity; and to solve the privacy preserving attribute reduction problem for multi-source isomorphic decision tables, we develop a novel secure relative granularity protocol based on secure scalar product protocol, and propos a horizontal privacy-preserving attribute reduction algorithm based on relative granularity. Examples demonstrate the feasibility and effectiveness of the proposed algorithms. These studies promote the rough set theory to applications of privacy protection feature selection in a distributed environment.

  • 【分类号】TP311.13;TP18
  • 【被引频次】1
  • 【下载频次】822
  • 攻读期成果
节点文献中: 

本文链接的文献网络图示:

本文的引文网络