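Research on the Improvement and Application of the Decision Tree Classification Algorithm
【Author】 Pan Yongli;
【Advisor】 Wang Yuanliang;
【Author Information】 Yunnan University of Finance and Economics, Computer Application Technology, 2011, Master's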
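【Abstract (translated from the Chinese)】 With the continued study of data mining theory, data mining techniques are being applied ever more widely and maturely across industries. Among the many data mining techniques and methods, the decision tree method is one of the most important for data classification and prediction. It is an instance-based inductive learning algorithm that infers classification rules, in the form of a decision tree, from a set of unordered, irregular instances and then uses them to predict unknown data. ID3 is the most commonly used method for constructing decision trees and is widely applied in data classification and prediction; in practice, however, it has been found to have many shortcomings. This thesis therefore focuses on the ID3 algorithm, analyzes the strengths and weaknesses of ID3 and its improved variants, and proposes a reasonable optimization scheme to refine ID3 so that it classifies better. The optimization has two main aspects. First, the heuristic function of ID3 is simplified. Using approximation, the thesis derives an approximate form of the ID3 information gain formula that eliminates the costly logarithm operations, arriving at a simplified heuristic function that applies to multi-class problems and is general. The new simplified ID3 algorithm selects the attribute with the smallest information gain as the test attribute; because computing the information gain no longer involves logarithms and uses only basic arithmetic operations that a computer handles easily, the computation needed to select the best attribute is reduced to some extent and the algorithm runs more efficiently. Second, the multi-value bias of ID3 is addressed. The thesis introduces the concept of a weight function to overcome the multi-value bias of ID3 at its root. The core idea is to introduce a monotonic weight function based on the number of values an attribute takes, which automatically assigns different weights to different attributes so as to balance the number of attribute values against the information gain, yielding a new criterion for selecting the best attribute. Example analysis and comparison of algorithms show that the improved ID3 algorithm selects more reasonable test attributes, so the rules extracted from the resulting decision tree better match practical needs. Finally, the thesis applies the optimized ID3 algorithm to a student fee-renewal decision problem through a case study. Following the student classification workflow, a new data set formed by merging the student basic information table and the student feedback table is used as the mining sample set for the optimized ID3 algorithm; a decision tree is built and knowledge rules are extracted from it. The knowledge rules mined from the large volume of student data can help managers make more accurate judgments and decisions and improve the company's performance.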
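The first optimization above concerns the logarithms in ID3's information gain. The Python sketch below shows the textbook information gain next to one well-known log-free stand-in, the Gini impurity, which follows (up to a constant factor) from the first-order approximation ln p ≈ p − 1. The abstract does not state the thesis's own simplified formula, so the Gini form, the function names, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def partition(rows, labels, attr):
    """Group class labels by the value each row takes at attribute index `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups

def information_gain(rows, labels, attr):
    """Textbook ID3 gain: entropy of the labels minus the weighted entropy after the split."""
    total = len(labels)
    groups = partition(rows, labels, attr)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def logfree_remainder(rows, labels, attr):
    """Assumed log-free stand-in (NOT the thesis's formula): weighted Gini
    impurity 1 - sum(p^2) of the subsets after the split; smaller is better."""
    total = len(labels)
    score = 0.0
    for g in partition(rows, labels, attr).values():
        n = len(g)
        gini = 1.0 - sum((c / n) ** 2 for c in Counter(g).values())
        score += n / total * gini
    return score

# Hypothetical toy data: one attribute (satisfaction) that separates the classes perfectly.
rows = [("high",), ("high",), ("low",), ("low",)]
labels = ["renew", "renew", "quit", "quit"]

gain = information_gain(rows, labels, 0)        # uses log2
impurity = logfree_remainder(rows, labels, 0)   # uses only +, -, *, /
```

In this sketch an attribute is preferred when its `information_gain` is large or, for the log-free score, when its `logfree_remainder` is small; only the second function avoids logarithms.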
【Abstract】 Data mining techniques have become more widely applied and more mature as research on data mining theory has progressed. Among the many data mining techniques and methods, the decision tree method is an important one for data classification and prediction. It is an instance-based inductive learning algorithm that infers classification rules from unordered, irregular instances; these rules can then be used to predict unknown data. ID3 is the most frequently used method for constructing decision trees and is widely applied in data classification and prediction, but many shortcomings emerge in practical use. This thesis therefore studies the shortcomings of ID3 and of its improved variants and proposes a reasonable optimization scheme to refine ID3. The scheme has two aspects. First, we simplify the heuristic function of ID3. The thesis derives an approximation of the information gain formula that removes the logarithm operations, yielding a simplified heuristic function that applies to multi-class problems and is general. The simplified ID3 algorithm selects the attribute whose information gain is smallest as the test attribute and avoids logarithm operations when calculating information gain, which reduces the amount of computation and improves the algorithm's execution efficiency. Second, the thesis introduces a weight function to overcome the multi-value bias problem. By assigning different weights to different attributes, the weight function balances the number of attribute values against the information gain, which gives a new criterion for choosing attributes. Example analysis and algorithm comparison show that the modified ID3 selects more reasonable test attributes, so the rules extracted from the decision tree better meet practical needs. Finally, the thesis applies the optimized ID3 algorithm to a student fee-renewal decision problem through a case study. Following the application process, we merge the students' basic information table and feedback table into a new data set that serves as input to the optimized ID3 algorithm. We then build the decision tree and extract rules from it. With these rules, company managers can make more accurate judgments and decisions, and the rules can improve the company's performance.
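The multi-value bias and the weighting idea described in the abstract can be illustrated with a small Python sketch. The abstract does not give the thesis's weight function, so `weight(k) = 1/k` below is a hypothetical monotone-decreasing choice, and the two-attribute toy data set (a satisfaction level and a unique student ID) is likewise made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Textbook ID3 information gain for the attribute at index `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def weight(num_values):
    """Hypothetical weight (an assumption, not the thesis's function):
    decreases monotonically as the number of attribute values grows."""
    return 1.0 / num_values

def weighted_gain(rows, labels, attr):
    """Information gain scaled by a weight based on the attribute's value count."""
    num_values = len({row[attr] for row in rows})
    return weight(num_values) * information_gain(rows, labels, attr)

# Hypothetical data: column 0 = satisfaction (2 values), column 1 = student ID (8 values).
rows = [("high", i) for i in range(4)] + [("low", i) for i in range(4, 8)]
labels = ["renew"] * 4 + ["quit"] * 3 + ["renew"]

# Plain ID3 favors the ID column, which splits every row into its own subset.
best_plain = max(range(2), key=lambda a: information_gain(rows, labels, a))
# With the weight applied, the two-valued satisfaction attribute scores higher.
best_weighted = max(range(2), key=lambda a: weighted_gain(rows, labels, a))
```

The unique ID column attains the maximum possible gain while carrying no predictive value, which is the bias the weighting is meant to counter; how strongly to penalize many-valued attributes depends on the weight function chosen.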
【Key words】 ID3 algorithm; multi-value bias; weight function; optimized ID3 algorithm; student fee renewal;