Optimization Research on the ID3 Algorithm and Its Application in a Component Library

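【Author】 Li Dong (李冬)

【Advisor】 Liu Xiaoyan (刘晓燕)

【Author information】 Kunming University of Science and Technology, Computer Application Technology, 2011, Master's thesis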

【Abstract】 With the rapid development of information technology and the growing variety of ways to acquire data, every industry keeps accumulating large volumes of data. How to make better use of these data resources and uncover the information and knowledge hidden behind them has become a widely discussed problem in the business world. Driven by this practical demand, data mining emerged and has since flourished in many areas of society. Among the many data mining techniques, the decision tree method for data classification is an important research topic. ID3 (Iterative Dichotomiser 3) is one of the most commonly used decision tree algorithms, and its advantages have led to wide use in machine learning. Practical use, however, has also exposed a number of shortcomings. This thesis therefore studies ID3 in depth, analyzes the strengths and weaknesses of ID3 and its improved variants, and proposes optimizations in two areas: simplifying the heuristic function of ID3 and resolving its multi-value bias.

First, the thesis uses an approximation to simplify the attribute selection criterion of ID3, eliminating the costly logarithm operations and yielding a simplified heuristic function that is general and applies to multi-class problems. The simplified ID3 algorithm selects as the test attribute the one whose information gain, as computed by the simplified function, is smallest. Because the computation avoids logarithms and uses only basic arithmetic operations that a computer handles easily, it reduces to some extent the cost of selecting the best attribute and improves the efficiency of the algorithm.

Second, the thesis introduces the concept of a balance function to overcome the multi-value bias of ID3, that is, its tendency to favor attributes with many distinct values. The core idea is to introduce a monotonic balance function based on the number of values an attribute takes, so as to balance that number against the information gain and obtain a new criterion for selecting the best attribute. Worked examples and comparisons between algorithms show that the improved ID3 algorithm selects more reasonable test attributes, so the rules extracted from the resulting decision tree better match practical needs.

Finally, the thesis applies the optimized ID3 algorithm to a component library through a worked example. Following the application workflow, the basic component information table and the user feedback table are merged into a new data set that serves as the mining sample set for the optimized algorithm. A decision tree is built from this set, and component reuse rules are extracted from it. The knowledge rules mined from a large collection of components help component reusers understand and select components and reduce the time they spend on decisions.
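The multi-value bias mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the balance function used in the thesis, so the code below assumes, purely for illustration, a balance function of log2(1 + k), where k is the number of distinct values of an attribute. The attribute names (`id`, `category`, `reused`) and the four sample rows are likewise hypothetical. The information gain shown is the standard logarithmic ID3 form, not the simplified log-free function derived in the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attr, target):
    """Standard ID3 information gain of splitting `rows` on `attr`.

    Returns the gain together with the number of distinct values of `attr`.
    """
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder, len(groups)


def balanced_gain(rows, attr, target):
    """Illustrative balanced criterion (an assumption, not the thesis's formula):
    information gain divided by log2(1 + number of attribute values)."""
    gain, n_values = information_gain(rows, attr, target)
    return gain / math.log2(1 + n_values)


# Hypothetical component records: `id` is unique per row, `category` has two values.
rows = [
    {"id": 1, "category": "ui", "reused": "yes"},
    {"id": 2, "category": "ui", "reused": "yes"},
    {"id": 3, "category": "ui", "reused": "yes"},
    {"id": 4, "category": "db", "reused": "no"},
]

for attr in ("id", "category"):
    gain, n_values = information_gain(rows, attr, "reused")
    print(attr, n_values, round(gain, 4), round(balanced_gain(rows, attr, "reused"), 4))
```

In this sample both attributes separate the classes perfectly, so plain information gain cannot tell the uninformative, unique-valued `id` apart from `category`. Dividing by a function that grows with the number of values ranks the two-valued attribute higher, which is the kind of correction a balance function is meant to provide.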
