
Study on Optimization of Random Forests Algorithm

【Author】 Cao Zhengfeng (曹正凤)

【Supervisors】 Ji Hong (纪宏); Xie Bangchang (谢邦昌)

【Author Information】 Capital University of Economics and Business, Statistics, 2014, PhD


【Abstract】 Random forests are an ensemble classifier grounded in statistical learning theory. The algorithm combines bootstrap resampling with decision-tree induction: its essence is to build a collection of tree classifiers {h_k(x), k = 1, 2, ...} and then classify and predict by majority vote over the collection. Because this design breaks through the performance ceiling of a single classifier, random forests perform well and can be applied to a wide range of classification and prediction tasks. The algorithm nevertheless leaves room for improvement, and research on optimizing it has concentrated on three directions: introducing new algorithms, integrating data preprocessing into the algorithm, and optimizing the forest-construction process itself. Drawing on a thorough review of the relevant literature from China and abroad, this thesis studies the latter two directions.

In terms of data preprocessing, the thesis presents two optimization algorithms that improve random forests.

First, to address the random forest's poor handling of unbalanced data, and drawing on the ideas of clustering and the physical notion of a center of gravity, the thesis proposes the C_SMOTE algorithm, which reduces the imbalance of a data set and thereby improves the classification performance of the random forest. Whereas the SMOTE algorithm selects "synthetic" samples somewhat blindly and tends to push them toward the class margin, C_SMOTE constructs synthetic samples purposefully, starting from the center of gravity of the negative-class samples, so that newly generated samples tend to converge toward that center; this effectively remedies the defects of SMOTE.
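The abstract does not spell out the full C_SMOTE procedure, but its core idea, generating synthetic samples that drift toward the class's center of gravity rather than toward a randomly chosen neighbor as in SMOTE, can be sketched as follows. The function name and the uniform interpolation step are assumptions for illustration, not the thesis's exact algorithm:

```python
import numpy as np

def centroid_smote(X_min, n_new, alpha_max=1.0, seed=0):
    """Generate synthetic minority-class samples that drift toward the
    class centroid, in the spirit of the C_SMOTE idea described above.

    X_min : (n, d) array of minority-class samples.
    n_new : number of synthetic samples to create.
    """
    rng = np.random.default_rng(seed)
    centroid = X_min.mean(axis=0)            # "center of gravity" of the class
    idx = rng.integers(0, len(X_min), n_new)  # pick base samples at random
    # Move each chosen sample toward the centroid by a random fraction,
    # so new points converge toward the class center instead of its margin.
    alphas = rng.uniform(0.0, alpha_max, size=(n_new, 1))
    return X_min[idx] + alphas * (centroid - X_min[idx])

# Toy usage: 4 minority points in 2-D, asking for 6 synthetic ones.
X_min = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
X_syn = centroid_smote(X_min, n_new=6)
print(X_syn.shape)  # (6, 2)
```

Because every synthetic point lies on the segment between an existing sample and the centroid, the oversampled set stays inside the region occupied by the original class.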
This not only preserves the information in the original data set but also largely resolves its imbalance, which substantially improves the random forest's classification performance on unbalanced data sets.

Second, random forests commonly use the C4.5 algorithm for node splitting, but when handling continuous variables C4.5 applies binary discretization, whose running time depends on the number of distinct values the continuous variable takes: the more values, the longer the random forest takes to execute. To address this, the thesis proposes a new algorithm that reduces the number of continuous-variable values, supplying C4.5 with a reduced data set and thereby improving its execution efficiency. The new algorithm builds on the CHI2 family of algorithms, which do not account for the deviation between the χ² statistic and its true value; it handles this deviation with a χ² correction formula. Using three common UCI data sets, the thesis compares the new algorithm with the uncorrected CHI2 family in terms of improving random forest performance. The empirical results show that, compared with the CHI2 family, the new algorithm removes redundant information from the data set more effectively and greatly reduces the number of continuous-variable values, improving the execution efficiency of the random forest.

In terms of optimizing the forest-construction process, the thesis analyzes the factors that affect random forest classification performance and, observing that different node-splitting algorithms yield different classification performance during forest generation, proposes a hybrid node-splitting algorithm based on a linear combination. The algorithm linearly combines the node-splitting criteria of C4.5 and CART; by varying the coefficients of the combined criterion, it exploits the strengths of both algorithms and optimizes the classification performance of the random forest.
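The abstract does not give the coefficients of the combined criterion; a minimal sketch of a linear combination of the C4.5 gain ratio and the CART Gini gain, with an assumed free weight `lam`, might look like this:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gain_ratio(y, left, right):
    """C4.5 criterion: information gain divided by split information."""
    n = len(y)
    w = np.array([len(left), len(right)]) / n
    gain = entropy(y) - (w[0] * entropy(left) + w[1] * entropy(right))
    split_info = -np.sum(w * np.log2(w))
    return gain / split_info if split_info > 0 else 0.0

def gini_gain(y, left, right):
    """CART criterion: decrease in Gini impurity."""
    n = len(y)
    return gini(y) - (len(left) / n * gini(left) + len(right) / n * gini(right))

def hybrid_score(y, left, right, lam=0.5):
    # Linear combination of the two node-splitting criteria; the weight
    # lam is the free coefficient varied in the hybrid-split idea.
    return lam * gain_ratio(y, left, right) + (1 - lam) * gini_gain(y, left, right)

y = np.array([0, 0, 0, 1, 1, 1])
left, right = y[:3], y[3:]              # a perfectly separating binary split
print(hybrid_score(y, left, right, lam=0.5))  # 0.75
```

For the perfect split the gain ratio is 1.0 and the Gini gain is 0.5, so the equally weighted combination scores 0.75; sweeping `lam` between 0 and 1 interpolates between pure CART and pure C4.5 behavior.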
The stability, correlation, and strength of the hybrid algorithm are also analyzed in detail. First, the stability of the hybrid algorithm is tested by constructing an F statistic for analysis of variance. The statistical results show that although the hybrid random forest exhibits some instability as the number of trees changes, it reaches a stable state once the forest contains 800 trees. The correlation and strength of the hybrid algorithm are then derived and discussed theoretically, and the average correlation and average strength of the random forest are computed. Empirical analysis verifies that average correlation is negatively related to classification accuracy while average strength is positively related to it, and that the hybrid algorithm has a clear advantage over other algorithms in raising the forest's average strength and lowering its average correlation, confirming its superiority from another angle.

In the practical task of selecting a pool of high-quality stocks, the data set contains a large number of continuous variables and the application demands high classification accuracy. The optimized random forest algorithms proposed in this thesis handle continuous variables well and improve the classification accuracy of the random forest. Building on a stock-screening indicator system for a value-growth investment strategy, the thesis preprocesses the data with wavelet analysis and the COR_CHI2 algorithm and then uses a random forest built with the hybrid node-splitting algorithm to select the pool of high-quality stocks, providing statistical support for investors constructing targeted portfolios.
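The thesis's χ² correction formula itself is not reproduced in the abstract. The sketch below illustrates only the general CHI2/ChiMerge idea it builds on: repeatedly merging the pair of adjacent intervals whose class distributions have the lowest χ² statistic, so a continuous variable ends up with fewer distinct split points for C4.5 to examine. Yates' continuity correction is used purely as a stand-in for the unspecified correction formula:

```python
import numpy as np

def chi2_stat(a, b, correction=True):
    """Chi-square statistic for the 2 x k table formed by the class counts
    of two adjacent intervals. correction=True applies Yates' continuity
    correction, a stand-in here for the thesis's own correction formula."""
    obs = np.array([a, b], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row @ col / obs.sum()
    mask = exp > 0                        # skip classes absent from both rows
    diff = np.abs(obs - exp)
    if correction:
        diff = np.maximum(diff - 0.5, 0.0)
    return float(np.sum(diff[mask] ** 2 / exp[mask]))

def chimerge(intervals, threshold):
    """Greedily merge the adjacent pair with the lowest chi-square until
    every adjacent pair exceeds the threshold. Each interval is a list of
    per-class counts; fewer intervals means fewer candidate split points."""
    intervals = [np.array(c, dtype=float) for c in intervals]
    while len(intervals) > 1:
        chis = [chi2_stat(intervals[i], intervals[i + 1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] > threshold:           # all remaining pairs differ enough
            break
        intervals[i] = intervals[i] + intervals.pop(i + 1)
    return intervals

# Toy usage: four intervals with counts for two classes; similar neighbors merge.
merged = chimerge([[4, 1], [3, 2], [0, 5], [1, 4]], threshold=2.706)
print(len(merged))  # 2
```

With the 90% significance threshold (χ² = 2.706 for one degree of freedom), the four toy intervals collapse into two whose class distributions are genuinely different.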
