
Construction and Deployment of an Ensemble Classifier Based on Decision Trees

Construction and Disposition of Combinatorial Classifier Based on Decision Tree

【Author】 胡记兵

【Supervisors】 蔡家楣; 江颉

【Author Information】 Zhejiang University of Technology, Computer Software and Theory, 2008, Master's thesis

【Abstract (from the Chinese)】 The decision tree is one of the most widely used data mining methods; research centers on prediction accuracy, processing efficiency, and dimensionality reduction, and incremental learning is also a principal feature of decision tree algorithms. SURPASS is an efficient incremental algorithm that can handle large-scale datasets exceeding main memory, but it too suffers from low efficiency on massive data. Moreover, decision trees use impurity measures to choose the best split attribute; when the dataset is very large, several attributes may tie as best at each split step, which opens the possibility of building a forest of decision trees on a single dataset. Traditional single classifiers cannot satisfy the demand for high prediction accuracy, and changes in how data is generated, stored, and used have also driven the continued improvement of classifiers. Some scholars have found that traditional classifiers carry mutually complementary information that can be exploited to improve classifier performance. To address the efficiency problem of SURPASS, this thesis proposes an information-amount index grounded in information theory: at each split step of the decision tree, the index is computed for every attribute, and the algorithm can select attributes with larger index values as best attributes, reducing disk access and thereby improving running efficiency. Experimental results show that this approach is effective. To give the information-amount index a theoretical foundation, the thesis derives it by a differential method, obtaining two ways of computing the index and pointing out its advantage in running efficiency. The thesis also implements a random forest with SURPASS as the base classifier, and finally verifies the properties of the random forest through experiments.

【Abstract】 The decision tree is one of the most widely used data mining methods. Research on decision trees focuses on prediction accuracy, efficiency, and dimensionality reduction, and scalability is a primary feature of decision tree algorithms. SURPASS is a scalable decision tree algorithm able to handle datasets whose size exceeds the capacity of main memory, but its efficiency degrades when the data volume becomes very large. In addition, a decision tree uses an impurity measure to select the best split attribute; on a large dataset there may be several equally good attributes at each split step, which makes it possible to build a random forest over a single dataset. A traditional single classifier may not meet the requirement of high prediction accuracy, and changes in the manner in which data is generated, stored, and utilized also drive the continual improvement of classifiers. Some scholars have discovered that single classifiers carry mutually complementary information and have suggested exploiting that information to improve classification performance. To address the efficiency of SURPASS, this thesis proposes an index based on the amount of information in information theory: at each split step the index is computed for every candidate attribute, and attributes with larger index values are selected as the best split attributes, reducing the frequency of disk access. Experiments show that this method is effective. To give the index a theoretical basis, the thesis derives it using differential calculus, obtaining two ways of computing the index and demonstrating its advantage in running efficiency. Finally, the thesis builds a random forest with SURPASS as the base classifier and verifies its properties experimentally.
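The internals of SURPASS and the thesis's information-amount index are not reproduced in this record, but the split-selection step it describes can be sketched generically. The sketch below uses Shannon entropy and information gain as the impurity-based criterion and returns the full set of tied best attributes, illustrating the abstract's observation that several attributes may tie at one split step (the ties are what allow different trees of a forest to be grown from the same dataset). All names (`entropy`, `information_gain`, `best_attributes`) and the toy dataset are hypothetical, not from the thesis.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Impurity reduction achieved by splitting on a categorical attribute."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attributes(rows, labels, attrs, tol=1e-9):
    """Return every attribute whose gain ties for the maximum (within tol)."""
    gains = {a: information_gain(rows, labels, a) for a in attrs}
    best = max(gains.values())
    return [a for a, g in gains.items() if best - g <= tol]

# Toy dataset: each row is a dict of attribute -> value.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain",  "windy": "no"},
    {"outlook": "rain",  "windy": "yes"},
]
labels = ["play", "stay", "play", "stay"]

# "windy" perfectly separates the labels (gain 1 bit); "outlook" gains nothing.
print(best_attributes(rows, labels, ["outlook", "windy"]))  # -> ['windy']
```

When the returned tie set contains more than one attribute, a forest builder can grow a different tree per choice; a scalable variant in the spirit of SURPASS would additionally rank candidates so that fewer disk passes are needed per split.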

  • 【CLC Number】 TP311.13
  • 【Cited by】 1
  • 【Downloads】 205