The Study on KDD Technologies Based on Rough Set Theory
【Author】 Zhao Jun (赵军);
【Author Information】 Chongqing University, Computer Software and Theory, 2003, PhD dissertation
【Abstract】 Although research on KDD (Knowledge Discovery in Databases) technologies has already produced rich results, further study of KDD remains of real practical significance. Many theories and tools have been successfully applied to specific problems in the KDD process, and rough set theory is among the most promising of them. Because intelligent data analysis based on the rough set model requires no external parameters or prior knowledge, rough set theory holds advantages that other tools cannot match, and KDD technologies built on it promise more satisfactory solutions for the KDD process.

In the data integration stage of KDD, data discretization is a very important task. Effective discretization significantly improves a system's ability to cluster samples and strengthens its robustness to data noise, and rough set theory has been applied to it successfully. The dissertation studies the typical heuristic, rough-set-based discretization process in depth. First, it proposes a new method for computing the set of candidate cuts: while still preserving the discernibility relation of the system, the candidate cut set obtained this way has a far smaller cardinality than that produced by traditional methods. Second, it studies heuristics that measure the importance of candidate cuts through the "cut discernibility matrix". Measuring a candidate cut's importance must take into account not only the column-wise characteristics of this matrix but also, in a suitable way, its row-wise characteristics; the two directions, however, reflect cut importance asymmetrically, the rows less accurately than the columns. On this basis the dissertation defines the "cut selection probability", which has a clear physical meaning, fully accounts for the difference between the column-wise and row-wise characteristics of the matrix, and unifies the two directions in a reasonable way. Finally, a method that computes the result cut set from the cut selection probability is proposed. Algorithm analysis and simulation experiments show that the proposed algorithm solves the data discretization problem efficiently and effectively.
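The abstract names the thesis's candidate-cut reduction and "cut selection probability" but gives neither formula, so the Python sketch below only illustrates the setting under stand-in choices: candidate cuts are restricted to class-boundary midpoints (one standard way to shrink the candidate set without harming discernibility), and a cut's importance is approximated by the number of differently-labelled object pairs it separates.

```python
from collections import defaultdict

def boundary_cuts(values, labels):
    """Candidate cuts for one numeric attribute: midpoints between adjacent
    distinct values, kept only where the two neighbouring value groups are
    not all of one decision class. Cuts inside single-class runs cannot
    discern anything, so dropping them shrinks the candidate set while the
    discernibility of the data is preserved."""
    classes_at = defaultdict(set)      # value -> decision classes seen there
    for v, y in zip(values, labels):
        classes_at[v].add(y)
    vs = sorted(classes_at)
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])
            if len(classes_at[a] | classes_at[b]) > 1]

def cut_importance(cut, values, labels):
    """Stand-in importance measure: the number of differently-labelled
    object pairs that the cut separates."""
    left = [y for v, y in zip(values, labels) if v < cut]
    right = [y for v, y in zip(values, labels) if v >= cut]
    return sum(l != r for l in left for r in right)

# Toy usage: one numeric attribute with a binary decision.
values = [1.0, 1.2, 1.5, 2.0, 2.1, 3.0]
labels = [0, 0, 1, 1, 0, 1]
cuts = boundary_cuts(values, labels)    # [1.35, 2.05, 2.55]
best = max(cuts, key=lambda c: cut_importance(c, values, labels))
```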
Feature subset selection is another very important task in the data integration stage of KDD. It not only reduces the scale of the learning system but also removes redundant information from it, thereby exposing the latent relations among the data and ultimately improving the performance and accuracy of the data mining results in applications. The dissertation studies feature subset selection in depth: it proposes an efficient method for computing the attribute core, defines the notion of "system entropy", and uses an attribute's influence on the system entropy as the heuristic measure of the relative importance of attributes. The system entropy is simpler to compute than the conditional entropy and effectively overcomes the latter's shortcomings: it can measure the relative importance not only of non-redundant attributes but also of redundant ones.
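The definition of "system entropy" appears in the dissertation body, not in this abstract, so the following sketch substitutes the classical rough-set conditional entropy H(D|B) into the backward-elimination loop the abstract describes; under the thesis's own measure only the entropy function would change. The record format and all names here are illustrative.

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, features, target):
    """Classical rough-set conditional entropy H(target | features) over a
    list of dict records: partition the rows by their feature values, then
    average the class-distribution entropy of each block."""
    n = len(rows)
    key = lambda r: tuple(r[f] for f in features)
    blocks = Counter(key(r) for r in rows)
    h = 0.0
    for k, cnt in blocks.items():
        dist = Counter(r[target] for r in rows if key(r) == k)
        h -= (cnt / n) * sum((c / cnt) * log2(c / cnt) for c in dist.values())
    return h

def backward_eliminate(rows, features, target, eps=1e-12):
    """Backward elimination: drop every feature whose removal leaves
    H(target | kept) at its original value, i.e. every redundant feature."""
    kept = list(features)
    base = conditional_entropy(rows, kept, target)
    for f in list(kept):
        trial = [g for g in kept if g != f]
        if conditional_entropy(rows, trial, target) <= base + eps:
            kept = trial                    # f was redundant; discard it
    return kept

# Toy usage: "a" alone already determines "d"; "b" and "c" are redundant.
rows = [
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "no"},
]
print(backward_eliminate(rows, ["a", "b", "c"], "d"))    # ['a']
```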
The dissertation reveals some algebraic properties of the system entropy and studies its inherent bias in value. After this bias is effectively counteracted, an "attribute significance" measure is defined on top of the system entropy and applied in a feature subset selection algorithm of the backward elimination type. Algorithm analysis and simulation experiments show that the proposed algorithm solves the feature subset selection problem efficiently and yields satisfactory results.

Since decision rules are in essence classification rules labelled by the decision attribute set, learning decision rules is in essence mining classification rules from samples. The decision rules obtained by traditional rough-set-based learning algorithms mainly describe the features that discern samples of different classes and fail to reflect the features that samples of the same class have in common. The dissertation therefore proposes a new decision rule learning algorithm that produces a complete decision rule system: during learning it considers not only the discerning features between classes but also attends to extracting the common features within each class (one possible reading of this is sketched below). Simulation tests show that the algorithm achieves high learning accuracy and adapts well to inconsistency in the system.

Because any intelligent processing of a system may change its uncertainty, how to measure system uncertainty is an important problem of practical significance. Quantifying uncertainty makes it possible to observe and track how it evolves, to analyse the direction and degree of a processing step's influence on the system, and even, to some extent, to assess how reasonable that processing is. The dissertation first analyses the existing rough-set-based uncertainty measures. For decision information systems it proposes to measure uncertainty by the conditional entropy and analyses the consistency between the behaviour of the conditional entropy and the intuitive notion of system uncertainty (the formula is recalled below). For decision rule systems it divides uncertainty into randomness and conflict, characterises their concrete manifestations, and gives corresponding measures. Finally, it studies how system uncertainty affects the performance of typical decision rule learning algorithms and draws some useful conclusions.
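The abstract states the new rule learner's two goals, discerning the other classes while capturing what same-class samples share, without giving the concrete algorithm, so the sketch below is only one plausible greedy reading of those goals; every name in it is hypothetical.

```python
def induce_rule(rows, idx, attrs, target):
    """Grow a rule for rows[idx]: repeatedly add the attribute-value pair
    that excludes the most remaining counter-examples (samples of other
    classes), breaking ties in favour of values shared by many samples of
    the object's own class; discernibility between classes comes first,
    within-class commonality second."""
    obj, label = rows[idx], rows[idx][target]
    cond, cover = {}, list(rows)   # cover: samples matching the rule so far
    while any(r[target] != label for r in cover):
        free = [a for a in attrs if a not in cond]
        if not free:
            break                  # inconsistent data: rule cannot be finished
        def score(a):
            excluded = sum(r[target] != label and r[a] != obj[a] for r in cover)
            shared = sum(r[target] == label and r[a] == obj[a] for r in rows)
            return (excluded, shared)
        best = max(free, key=score)
        if score(best)[0] == 0:
            break                  # no free attribute discerns anything more
        cond[best] = obj[best]
        cover = [r for r in cover if r[best] == obj[best]]
    return cond, label

# Usage: induce_rule(rows, 0, ["a", "b", "c"], "d") on the table from the
# previous sketch yields ({'a': 1}, 'yes'): a=1 excludes both 'no' samples
# and is shared by all 'yes' samples.
```

Inducing one rule per sample and de-duplicating the resulting (condition, label) pairs gives a rule set under which every training sample matches some rule, which is one informal sense of the "complete" rule system mentioned above.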
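The conditional entropy the abstract proposes for decision information systems is standard and can be stated directly. For a decision system S = (U, C ∪ D) with condition classes U/IND(C) = {X_1, …, X_m} and decision classes U/IND(D) = {Y_1, …, Y_k}, and with 0·log 0 read as 0:

```latex
H(D \mid C) \;=\; -\sum_{i=1}^{m} \frac{|X_i|}{|U|}
                   \sum_{j=1}^{k} \frac{|Y_j \cap X_i|}{|X_i|}
                   \log_2 \frac{|Y_j \cap X_i|}{|X_i|}
```

H(D | C) = 0 exactly when every condition class lies inside a single decision class, i.e. when the system is consistent, and it grows as condition classes mix decision classes; this is the agreement with the intuitive notion of uncertainty that the abstract refers to. The randomness and conflict measures for decision rule systems are specific to the thesis and are not reproduced here.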
【Key words】 Rough Set Theory; Data Discretization; Feature Subset Selection; System Uncertainty Measure; Decision Rule Induction;