
不完备信息系统的数据挖掘研究

Research on Incomplete Information System Data Mining

【Author】 Tian Hong (田宏)

【Advisor】 Wang Xiukun (王秀坤)

【Author information】 Dalian University of Technology, Computer Application Technology, 2010, Ph.D.

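【Abstract (translated from the Chinese)】 Because some data are missing or real data cannot always be obtained, data mining often has to work with an incomplete information system, that is, one in which some attribute values of some objects are unknown or the true data cannot be acquired. Rough set theory is a mathematical theory for characterizing uncertain and vague data; it can effectively analyze and process imprecise, inconsistent, and incomplete information and discover the knowledge hidden in it. Taking incomplete information systems as its object and data mining and knowledge discovery as its goal, this dissertation studies a generalized rough set theory based on the weak fuzzy similarity relation, a rough set model based on the valued similarity relation, and privacy-preserving data mining algorithms for incomplete information systems. The specific work is as follows:

1. Extensions of rough set theory to incomplete information systems are the current theoretical basis for data mining in such systems. Rough sets based on the tolerance relation treat a null value as equal to any known attribute value. Rough sets based on the similarity relation treat a null value as nonexistent and ignore it. Rough sets based on the limited tolerance relation accept that null values exist and can be compared, but they exclude the case, allowed under the tolerance relation, in which two objects whose values are not all null share no common attribute value. To address these problems, the dissertation proposes a generalized rough set model based on the weak fuzzy similarity relation. The study shows that, without altering the information in the original system, this model characterizes the true information of objects in an incomplete information system more objectively, and it proves that the weak fuzzy similarity relation is a more general binary relation.

2. Knowledge discovery in incomplete information systems based on the tolerance relation and the similarity relation is studied. The rough set models built on these two relations cannot precisely describe differences in how similar objects are, so knowledge cannot be discovered precisely. To address this, the dissertation proposes a knowledge discovery method for incomplete information systems under a rough set model based on the similarity relation of attribute values. By computing the similarity degree between the attribute values of the objects, the method accurately determines each object's upper and lower approximation with respect to a concept set. If the user chooses a suitable similarity threshold, the set of objects meeting that threshold can be found through the upper and lower approximations, and the knowledge rules satisfying the conditions can then be determined precisely. Experimental results indicate that the method is effective for knowledge discovery in incomplete information systems.

3. Privacy-preserving data mining algorithms for incomplete information systems are studied: the MASK algorithm based on random transformation, the PARD algorithm based on an attribute transition probability matrix, and the RRPH algorithm based on randomized response with partial hiding. These algorithms are analyzed in detail and, to address their limitations, the dissertation proposes an efficient privacy-preserving association rule mining algorithm, PRRPM, a partial randomized response method based on a transition probability matrix. Theoretical analysis and experimental results show that PRRPM has advantages in privacy, accuracy, complexity, and applicability.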
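The three existing relations named in item 1 can be illustrated with a minimal Python sketch. It uses the textbook definitions of the tolerance, similarity, and limited tolerance relations, not the dissertation's weak fuzzy similarity relation, which the abstract does not define. The table `TABLE`, the concept `CONCEPT`, and all function names are hypothetical and chosen only for illustration; `None` stands for a null value.

```python
MISSING = None  # assumption: an unknown attribute value is encoded as None


def tolerance(x, y):
    """Tolerance relation: a null value matches any value."""
    return all(a == b or a is MISSING or b is MISSING for a, b in zip(x, y))


def similarity(x, y):
    """Similarity relation (not symmetric): every known value of x
    must equal the corresponding value of y; nulls in x are ignored."""
    return all(a is MISSING or a == b for a, b in zip(x, y))


def limited_tolerance(x, y):
    """Limited tolerance: both objects are entirely null, or they share
    at least one known attribute and agree on all shared known attributes."""
    if all(a is MISSING and b is MISSING for a, b in zip(x, y)):
        return True
    shared = [(a, b) for a, b in zip(x, y)
              if a is not MISSING and b is not MISSING]
    return bool(shared) and all(a == b for a, b in shared)


def approximations(table, relation, concept):
    """Lower and upper approximation of `concept` from the neighbourhood
    of each object under a reflexive, symmetric `relation`."""
    lower, upper = set(), set()
    for name, row in table.items():
        neighbours = {other for other, other_row in table.items()
                      if relation(row, other_row)}
        if neighbours <= concept:   # neighbourhood lies inside the concept
            lower.add(name)
        if neighbours & concept:    # neighbourhood overlaps the concept
            upper.add(name)
    return lower, upper


# Hypothetical incomplete information system with three attributes.
TABLE = {
    "x1": ("a", "b", MISSING),
    "x2": ("a", MISSING, "c"),
    "x3": (MISSING, MISSING, "c"),
    "x4": ("d", "b", "c"),
}
CONCEPT = {"x1", "x2"}

if __name__ == "__main__":
    for rel in (tolerance, limited_tolerance):
        lower, upper = approximations(TABLE, rel, CONCEPT)
        print(rel.__name__, sorted(lower), sorted(upper))
```

In this toy table, `x1` and `x3` have no attribute on which both are known, so the tolerance relation relates them while the limited tolerance relation does not. That is why the lower approximation of `CONCEPT` is empty under tolerance but contains `x1` under limited tolerance.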

【Abstract】 Because data are missing or access to real data is restricted, data mining often faces an incomplete information system, in which some attribute values are unknown or the real data cannot be obtained. Rough set theory is a mathematical approach to the analysis of uncertain and vague data. It can effectively deal with imprecise, inconsistent, and incomplete information and can discover hidden knowledge. To study data mining and knowledge discovery in incomplete information systems, this dissertation investigates a general rough set theory based on the weak fuzzy similarity relation and rough set models based on the valued similarity relation. It also studies privacy-preserving data mining techniques and algorithms for incomplete information systems. The research work is as follows:

1. The extension of rough set theory to incomplete information systems is the current theoretical foundation for data mining in such systems. In rough sets based on the tolerance relation, a null value is equal to any known attribute value. In rough sets based on the similarity relation, a null value is treated as nonexistent. In rough sets based on the limited tolerance relation, null values exist and can be compared, but the relation excludes the case in which two objects whose attribute values are not all null have no attribute value in common. In light of these shortcomings, we propose a general rough set based on the weak fuzzy similarity relation. Its properties, and its objectivity in describing objects in an incomplete information system, are studied and examined. It is proved that the weak fuzzy similarity relation is a more general binary relation.

2. Knowledge mining in incomplete information systems based on the tolerance relation and the similarity relation cannot accurately describe the difference between two similar objects and therefore cannot discover knowledge accurately. We present an approach to knowledge mining based on the valued similarity relation, which objectively reflects the inherent relationships among objects in incomplete information systems. First, by computing the similarity degree of attribute values between objects, we accurately identify the upper and lower approximation of each object relative to a concept set. Second, if the user selects an appropriate similarity threshold, we find the set of objects meeting that threshold by computing the upper and lower approximations. Finally, we precisely determine the knowledge rules that meet the conditions. Experimental results show that this model is a valid model for knowledge discovery in incomplete information systems.

3. Privacy-preserving data mining algorithms for incomplete information systems are studied: the MASK algorithm based on randomized transition strategies, the PARD algorithm based on an attribute transfer probability matrix, and the RRPH algorithm based on randomized response with partial hiding. In light of the shortcomings of these algorithms, we propose an effective privacy-preserving association rule mining method, the partial randomized response based on probability matrix (PRRPM). The PRRPM algorithm is explored and its validity examined through theoretical analysis and experiments. The results show that it has advantages in accuracy, privacy, complexity, and applicability.
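The randomized-distortion idea behind MASK-style methods in item 3 can be sketched for the simplest case, the support of a single item. The sketch assumes the commonly described scheme in which each Boolean entry is kept with probability `p` and flipped with probability `1 - p`, so the observed support `s'` relates to the true support `s` by `s' = p*s + (1 - p)*(1 - s)`. It is not the PRRPM method, which the abstract does not specify. The names `distort` and `estimate_support` are hypothetical.

```python
import random


def distort(bits, p, rng):
    """Keep each 0/1 entry with probability p, flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]


def estimate_support(distorted, p):
    """Invert s' = p*s + (1 - p)*(1 - s) to estimate the true support s
    of one item from the distorted column."""
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information; cannot invert")
    observed = sum(distorted) / len(distorted)
    return (observed - (1 - p)) / (2 * p - 1)


if __name__ == "__main__":
    rng = random.Random(0)                    # fixed seed for repeatability
    true_bits = [1] * 3000 + [0] * 7000       # hypothetical item, true support 0.3
    noisy = distort(true_bits, p=0.9, rng=rng)
    print("observed support:", sum(noisy) / len(noisy))
    print("estimated true support:", estimate_support(noisy, p=0.9))
```

The miner sees only the distorted column and `p`. Values of `p` closer to 0.5 hide individual entries better but make the estimate noisier, which is the privacy and accuracy trade-off the abstract refers to.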
