Research and Optimization of Decision Tree Algorithms in Data Mining

【Author】 Ding Yue

【Advisor】 Wu Haitao

【Author Information】 Shanghai Normal University, Computer Application Technology, 2008, Master's degree

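【Abstract (translated from the Chinese)】 Data classification is an important method in data mining. It analyzes training-set data to produce a precise description or model of each class; this class description can then be used to classify future data, and it has broad application prospects. Commonly used methods for mining classification rules currently include decision trees, Bayesian classification, genetic algorithms, and rough set theory. Among these, decision tree algorithms are relatively simple to describe and easy to convert into classification rules, but they cannot guarantee a globally optimal solution. Genetic algorithms can handle highly complex problems with large search spaces, multiple peaks, and nonlinearity, but they suffer from premature convergence to a local minimum. This thesis therefore proposes a classification decision tree method based on a hybrid genetic and simulated annealing algorithm (the GSDA algorithm). GSDA introduces the genetic algorithm into existing classification decision tree mining algorithms, yielding a new algorithm based on the hybrid genetic and simulated annealing approach. For encoding decision trees, the algorithm replaces the commonly used binary encoding with a direct encoding of the decision tree, which improves computational accuracy. GSDA also adopts the idea of hybrid optimization to remedy the local-optimum problem of commonly used algorithms. A corresponding fitness function and a pruning operation suited to this work are proposed, so that the mined rules are more accurate and the overall algorithm is simpler and easier to understand. In the subsequent preliminary experiments, four databases were used: the weather database, the Cleveland database, the Heart Disease database, and the Breast Cancer-W database. The GSDA results were compared with those of the classic ID3 algorithm, and GSDA obtained better results.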

【Abstract】 Data classification has become one of the important research areas of data mining. Data classification generates a precise description or model of a predetermined set of data classes or concepts by partitioning data objects according to their features. These models can then be used to classify future data objects, which gives the approach good application prospects. The most popular classification methods at present include genetic algorithms, decision trees, and neural networks. Among these methods, the decision tree algorithm is simple to describe and easy to translate into classification rules; however, it can hardly find the globally optimal solution. Although the genetic algorithm can handle problems with huge search spaces, multiple peaks, and nonlinearity, it has the drawback of premature convergence. Therefore, a classification rule mining method called GSDA, based on a hybrid genetic and simulated annealing algorithm, is proposed. This algorithm introduces a direct tree encoding method to improve accuracy. It also introduces hybrid optimization to address the problem of local optima. We also improve the fitness function and the pruning operation so that the mined rules are more accurate and the algorithm is simpler and easier to understand. These points are examined in the subsequent experiments. We use four databases (the weather database, the Cleveland database, the Heart Disease database, and the Breast Cancer-W database) to compare the results of the GSDA algorithm with those of the classic ID3 algorithm. In these experiments, the GSDA algorithm performs better than the ID3 algorithm.
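The ideas named in the abstract (direct tree encoding, a fitness function, pruning, and a genetic search combined with simulated annealing) can be illustrated with a minimal sketch. This is not the thesis's GSDA implementation: the abstract does not give its operators or fitness function, so the toy data, the size-penalized accuracy fitness, the `MAX_NODES` cap, the tournament selection, and every function name below are assumptions chosen for illustration. Trees are encoded directly as nested Python values rather than bit strings, and a Metropolis test decides whether an offspring replaces its parent.

```python
import math
import random

# Hypothetical toy data in the spirit of a weather dataset (not from the thesis).
ATTRS = {"outlook": ["sunny", "rainy", "overcast"], "windy": ["no", "yes"]}
LABELS = ["yes", "no"]
DATA = [
    ({"outlook": "sunny", "windy": "no"}, "yes"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rainy", "windy": "no"}, "no"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
    ({"outlook": "overcast", "windy": "no"}, "yes"),
    ({"outlook": "overcast", "windy": "yes"}, "yes"),
]
MAX_NODES = 15  # assumed cap that keeps candidate trees small

# Direct encoding: a leaf is a label string, an inner node is (attribute, {value: subtree}).
def random_tree(rng, depth=2):
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LABELS)
    attr = rng.choice(sorted(ATTRS))
    return (attr, {v: random_tree(rng, depth - 1) for v in ATTRS[attr]})

def classify(tree, row):
    while not isinstance(tree, str):
        attr, children = tree
        tree = children[row[attr]]
    return tree

def size(tree):
    if isinstance(tree, str):
        return 1
    return 1 + sum(size(c) for c in tree[1].values())

def fitness(tree):
    # Assumed fitness: training accuracy minus a small penalty per node.
    correct = sum(classify(tree, row) == label for row, label in DATA)
    return correct / len(DATA) - 0.01 * size(tree)

def prune(tree):
    # Collapse any node whose branches all end in the same leaf label.
    if isinstance(tree, str):
        return tree
    attr, children = tree
    pruned = {v: prune(c) for v, c in children.items()}
    vals = list(pruned.values())
    if all(isinstance(c, str) and c == vals[0] for c in vals):
        return vals[0]
    return (attr, pruned)

def random_subtree(tree, rng):
    while not isinstance(tree, str) and rng.random() < 0.5:
        tree = rng.choice(list(tree[1].values()))
    return tree

def crossover(a, b, rng):
    # Graft a random subtree of b onto one randomly chosen branch of a.
    if isinstance(a, str):
        return b
    attr, children = a
    new_children = dict(children)
    new_children[rng.choice(ATTRS[attr])] = random_subtree(b, rng)
    return (attr, new_children)

def mutate(tree, rng):
    # Replace a randomly chosen subtree with a fresh random one.
    if isinstance(tree, str) or rng.random() < 0.3:
        return random_tree(rng)
    attr, children = tree
    v = rng.choice(ATTRS[attr])
    new_children = dict(children)
    new_children[v] = mutate(children[v], rng)
    return (attr, new_children)

def evolve(seed=0, pop_size=20, generations=30, t0=1.0, cooling=0.9):
    rng = random.Random(seed)
    pop = [random_tree(rng) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    initial_best = fitness(best)
    temp = t0
    for _ in range(generations):
        next_pop = [best]  # elitism: the best tree so far always survives
        while len(next_pop) < pop_size:
            # Two binary tournaments pick the parents.
            p1, p2 = (max(rng.sample(pop, 2), key=fitness) for _ in range(2))
            child = prune(mutate(crossover(p1, p2, rng), rng))
            if size(child) > MAX_NODES:
                child = p1
            delta = fitness(child) - fitness(p1)
            # Metropolis test: a worse child is kept with probability exp(delta / temp).
            accept = delta >= 0 or rng.random() < math.exp(delta / temp)
            next_pop.append(child if accept else p1)
        pop = next_pop
        best = max(pop, key=fitness)
        temp *= cooling  # geometric cooling schedule
    return best, initial_best
```

Calling `best, start = evolve(seed=0)` and then printing `best` and `fitness(best)` shows the tree found for the toy data. Because the best tree is copied into every new generation, its fitness never drops below the starting value `start`, while the annealing step lets weaker offspring survive early on, which is one common way to counter the premature convergence the abstract describes.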

  • 【Classification Code】TP311.13
  • 【Times Cited】1
  • 【Downloads】228