Research on Classification Knowledge Discovery Algorithms for Incomplete Data Sets
Original title: 不完整数据分类知识发现算法研究
【Author】 Qi Ruihua (祁瑞华);
【Advisor】 Yang Deli (杨德礼);
【Author information】 Dalian University of Technology, Management Science and Engineering, 2011, Ph.D.
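【Abstract (translated from the Chinese)】 Classification knowledge discovery is a basic task of data mining and one of the most important goals of knowledge discovery. According to statistics, understanding incomplete data consumes a large amount of time and effort in machine learning and data mining applications, so handling incomplete data is an important problem that real-world classification knowledge mining must take seriously. Taking improved performance of classification knowledge discovery algorithms on incomplete data as its starting point, this thesis explores ways to make full use of the information implicit in incomplete data sets and to improve data mining efficiency. The specific work is as follows. (1) To improve classification accuracy, and because the Naive Credal classifier ignores the voting weights of attribute variables, a weighted conservative inference rule based on correlation coefficients is proposed. The rule uses weights to quantify the degree of correlation between attribute variables and classes in incomplete data. The Naive Credal classifier is improved on this basis and compared experimentally with the main existing classification algorithms on internationally public data sets. The results show that the algorithm needs no imputation of the incomplete data, and so avoids the data skew caused by unreasonable imputation methods, while still learning the hidden information contained in the incomplete data. Its learning performance is better than that of the Naive Credal and Naive Bayes classifiers, and on some data sets it is on a par with support vector machines. In particular, on samples where Naive Bayes classifies poorly, the weighted Naive Credal classifier obtains good results under incomplete data. (2) Because current semi-supervised classification algorithms do not consider the information implicit in data items with missing attributes and have high complexity, a two-stage semi-supervised weighted Naive Credal classification model is proposed. The model divides the semi-supervised classification process into two stages of weighted Naive Credal classification. Comparison experiments with the transductive support vector machine on internationally public standard data sets show that the model effectively reduces classification time, while its accuracy on the samples it can classify determinately is comparable to that of the transductive support vector machine. (3) To strengthen the robustness of the Naive Credal classifier and to remedy its low proportion of determinately classified samples, an incomplete-data classification model based on relaxed interval dominance is proposed. The model improves the Naive Credal classifier on the basis of a relaxed definition of interval dominance. Comparison experiments on internationally public standard data sets show that, on most data sets, the model raises the proportion of determinately classified samples relative to the Naive Credal and weighted Naive Credal classifiers, which helps in making definite classification decisions, while maintaining high classification accuracy. Its overall classification performance is better than that of the Naive Credal classifier, the weighted Naive Credal classifier, Naive Bayes and the nearest neighbor method; whether it outperforms support vector machines depends on the data set. Finally, the weighted Naive Credal classifier, the two-stage semi-supervised weighted Naive Credal classifier and the relaxed interval dominance Naive Credal classifier are each applied to an incomplete data set for writing-style identification. The experimental results are satisfactory and verify the effectiveness of the algorithms.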
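The weighting idea in item (1) can be illustrated with a minimal sketch. This is not the thesis's rule, whose exact definition the abstract does not give. It assumes numerically coded attributes and class labels, marks missing values with `None`, and takes each attribute's weight to be the absolute Pearson correlation with the class, computed only over the rows where that attribute is observed, then normalized to sum to 1. The function names `pearson` and `attribute_weights` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def attribute_weights(rows, labels):
    """Assumed weighting scheme: |correlation with the class| per attribute,
    using only rows where the attribute is observed (no imputation),
    normalized so the weights sum to 1."""
    n_attrs = len(rows[0])
    raw = []
    for j in range(n_attrs):
        # Skip rows where attribute j is missing instead of filling them in.
        pairs = [(r[j], y) for r, y in zip(rows, labels) if r[j] is not None]
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        raw.append(abs(pearson(xs, ys)))
    total = sum(raw)
    if total == 0:
        return [1.0 / n_attrs] * n_attrs  # fall back to equal weights
    return [w / total for w in raw]

# Toy data: attribute 0 tracks the class exactly, attribute 1 only loosely.
rows = [[0, 1], [1, None], [0, 0], [1, 1], [None, 0]]
labels = [0, 1, 0, 1, 0]
weights = attribute_weights(rows, labels)
```

In this toy example the first attribute receives the larger weight, so its vote would count for more in a weighted classifier. How the thesis combines such weights with the conservative inference rule is not stated in the abstract.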
【Abstract】 Classification knowledge discovery is a fundamental task of data mining and one of the most important goals of knowledge discovery. According to statistics, understanding incomplete data takes a great deal of time and effort in machine learning and data mining applications, so the processing of real-world incomplete data must be treated as an important issue in classification knowledge discovery. Taking the classification of incomplete data as its starting point, this thesis focuses on making full use of the hidden information in incomplete data sets and on ways to improve the efficiency of data mining. The detailed contents of the research are as follows. (1) A weighted conservative inference rule based on the correlation coefficient is proposed. The rule uses the correlation coefficient to quantify the relationship between the attributes and the categories. Based on this idea, the weighted Naive Credal classifier is proposed and tested on international public data sets. Compared with the Naive Bayes classifier and the Naive Credal classifier, this algorithm has better learning performance. On some data sets, the weighted Naive Credal classifier is comparable with the support vector machine. Compared with other existing classification algorithms, the weighted Naive Credal classifier performs better owing to its full use of the hidden information in incomplete data. (2) A two-stage semi-supervised weighted Naive Credal classification model is presented. Current semi-supervised classifiers ignore the implicit information in incomplete data and have high complexity; to address this, the model divides the semi-supervised classification process into two weighted Naive Credal classification stages. Compared with the transductive Support Vector Machine (TSVM), this algorithm has lower time complexity and almost the same accuracy. (3) A Naive Credal classifier based on a relaxed conservative inference rule for incomplete data is presented. To address the low proportion of determinately classified samples, the definition of interval dominance is relaxed in this model. Compared with the Naive Credal classifier and the weighted Naive Credal classifier, this algorithm effectively increases the proportion of determinately classified samples while keeping almost the same accuracy. Overall, this algorithm has better classification performance than the Naive Bayes classifier, the nearest neighbor method, the Naive Credal classifier and the weighted Naive Credal classifier. Whether it outperforms the support vector machine depends on the data set. Finally, the weighted Naive Credal classifier, the two-stage semi-supervised weighted Naive Credal classifier and the relaxed conservative inference rule based weighted Naive Credal classifier are applied to a style identification data set. Their experimental results are better than those of the main existing classification algorithms, which verifies the validity of the algorithms.
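Item (3) can be made concrete with a small sketch of interval dominance, the standard criterion under which a class is discarded when another class's lower probability exceeds its upper probability. A sample is classified determinately when exactly one class survives. The abstract does not state how the thesis relaxes this definition, so the tolerance `eps` and the one-sided relaxed rule below are assumptions made purely for illustration, and the name `non_dominated` is hypothetical.

```python
def non_dominated(intervals, eps=0.0):
    """Return the classes that no other class dominates.

    intervals maps each class to a (lower, upper) probability interval.
    With eps == 0 this is plain interval dominance: a dominates b when
    lower(a) > upper(b). With eps > 0 (an assumed relaxation, not the
    thesis's definition) a also dominates b when the intervals overlap
    by less than eps, provided b does not beat a by the same test.
    """
    def dominates(a, b):
        la, ua = intervals[a]
        lb, ub = intervals[b]
        return la + eps > ub and not (lb + eps > ua)

    return [
        c for c in intervals
        if not any(dominates(o, c) for o in intervals if o != c)
    ]

# Hypothetical probability intervals for three classes.
intervals = {"A": (0.50, 0.70), "B": (0.30, 0.55), "C": (0.05, 0.15)}

strict = non_dominated(intervals)            # A and B overlap: indeterminate
relaxed = non_dominated(intervals, eps=0.1)  # small overlap tolerated
```

Under the strict rule the sample stays indeterminate between A and B, while the assumed tolerance of 0.1 leaves only A, which is the kind of increase in determinately classified samples the abstract reports. A larger tolerance yields more determinate answers at some risk to accuracy, which is why the abstract reports both quantities.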
【Key words】 Data Mining; Incomplete Data; Classification Algorithm; Knowledge Discovery;