
A Performance Comparison between Logistic Regression, Decision Trees, and Neural Networks in Predicting Peripheral Neuropathy in Type 2 Diabetes Mellitus

【Author】 李长平 (Li Changping)

【Supervisor】 胡良平 (Hu Liangping)

【Author Information】 Academy of Military Medical Sciences of the Chinese PLA, Epidemiology and Health Statistics, 2009, Doctoral dissertation

【Abstract】 In recent years, advances in mathematical methods and computer technology have made prediction with complex models feasible. Predictive models are built mainly with two families of methods, statistical methods and data mining methods, and both have gradually been applied in biomedical research; however, studies comparing their predictive performance, that is, their generalization ability, remain scarce. Comparing the generalization ability of data mining methods with that of statistical methods is therefore well worth investigating. Taking case-control data on Diabetic Peripheral Neuropathy (DPN) in type 2 diabetes mellitus (the data source is described in Chapter 2 of this thesis) as an example, this study uses Logistic Regression (LR), Decision Trees (DT), and Neural Networks (NN) to predict the probability of DPN, and proposes workable solutions to several difficulties in model building and in comparing predictive performance. The difficulties and the corresponding solutions are as follows:

(1) Discretizing continuous variables scientifically. In some studies, a one-unit change in a continuous variable is of no interest, or domain knowledge requires that the variable be discretized, so how to discretize continuous variables scientifically is a question worth studying. This thesis uses the chi-square partitioning method, which yields meaningful classes while making the distinction between adjacent classes as large as possible.

(2) Making full use of the data while preventing overfitting. When the amount of data is limited, it is important to use as much of its information as possible; for decision trees and neural networks built on small samples, doing so without overfitting is a key problem. This study combines the Classification and Regression Tree (CART) with the Chi-squared Automatic Interaction Detector (CHAID), using 100 repetitions of 5- to 7-fold stratified cross-validation, to build the decision tree model. For the neural network, the Schwarz Bayesian Criterion (SBC) is used to select the number of hidden layers and hidden units, and Levenberg-Marquardt optimization, weight decay, and pre-training are used during fitting. Together these measures make full use of the data while avoiding overfitting and poor local minima, yielding a comparatively accurate and reliable model.

(3) Building a logistic regression model quickly and effectively. Conventional variable-selection methods for logistic regression are forward selection, backward elimination, stepwise selection, and best-subset selection. The first three require choosing P-value thresholds for variable entry and/or removal, which is inevitably subjective; some studies, for example, regard a significance level for entry (SLE) of 0.05 as too stringent, often excluding important variables. Best-subset selection reports a chi-square statistic for every combination of candidate variables but cannot indicate which combination is best. This study therefore combines best-subset selection with the Akaike Information Criterion (AIC) to screen variables quickly and conveniently. The approach accounts for the model's generalization ability, avoids the "trouble" of choosing P-value cutoffs by hand, and yields models superior to those built with the conventional selection methods.

(4) Comparing generalization ability on small samples. The literature shows that, to date, comparative studies of prediction and classification techniques in biomedicine have either used large data sets (from several hundred to several hundred thousand observations) or assessed generalization ability with the holdout method (randomly splitting the data into a training part and a test part); they have not addressed how to use the data efficiently, or how to compare generalization ability, when the sample is small. In practice, many data sets are small (around 100 observations) and have many variables; the holdout method then loses information and makes the comparison unreliable or even invalid (this is confirmed in Chapter 5 of this thesis). How to build models effectively and evaluate generalization ability objectively on small samples is therefore well worth studying, and it is the focus of this thesis. Monte Carlo resampling (10-100 repetitions of 2- to 10-fold stratified cross-validation, the jackknife, and 100-1000 bootstrap replications, specifically the 0.632 bootstrap) is used to obtain reliable estimates of generalization error and to compare the generalization ability of the three methods (LR, DT, NN), remedying the shortcomings of the holdout method noted above. For these data, the results show that, overall, NN generalizes best, followed by LR, with DT worst.

(5) Adjusting for oversampling. When the sample is obtained by oversampling (that is, separate sampling), the probabilities estimated by the model are on the sample scale rather than the population scale, so predictions of disease probability for the whole population may be badly biased. This thesis uses the population prior probability to adjust the model's posterior probabilities, so that the adjusted results predict the probability of disease more objectively and accurately.

In summary, this study uses three methods (LR, DT, NN) to predict the probability of DPN and, under small-sample conditions, makes comparative studies and improvements in five respects (①scientific discretization of continuous variables, ②full use of the data while preventing overfitting, ③quick and effective model building, ④efficient use of the data to improve generalization ability, and ⑤effective adjustment for oversampling to obtain more objective and accurate predictions), all with satisfactory results. The modeling ideas and techniques can readily be transferred to other biomedical studies and to other research fields.
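The chi-square partitioning in point (1) can be sketched as a ChiMerge-style procedure (a common chi-square-based discretization; the function names below are illustrative, not the thesis's own code): start with one interval per distinct value and repeatedly merge the adjacent pair whose class distributions are least significantly different, so the surviving cut points separate the classes as sharply as possible.

```python
import numpy as np

def chi2_stat(a, b):
    """Chi-square statistic for the class-count vectors of two adjacent intervals."""
    table = np.array([a, b], dtype=float)
    total = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
    mask = expected > 0  # skip cells with zero expected count
    return float(((table - expected)[mask] ** 2 / expected[mask]).sum())

def chimerge(values, labels, threshold=3.841, min_intervals=2):
    """Discretize `values` against class `labels`.
    threshold=3.841 is the chi-square critical value for df=1, alpha=0.05."""
    classes = sorted(set(labels))
    # One initial interval per distinct value: [lower_bound, class_counts].
    intervals = []
    for v in sorted(set(values)):
        counts = [sum(1 for x, y in zip(values, labels) if x == v and y == c)
                  for c in classes]
        intervals.append([v, counts])
    while len(intervals) > min_intervals:
        stats = [chi2_stat(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = int(np.argmin(stats))
        if stats[i] >= threshold:
            break  # every adjacent pair differs significantly; stop merging
        # Merge interval i+1 into interval i.
        intervals[i][1] = [a + b for a, b in
                           zip(intervals[i][1], intervals[i + 1][1])]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]  # lower bounds of the final intervals
```

For example, `chimerge([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])` collapses the six values into two intervals with lower bounds `[1, 4]`, matching the class boundary.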
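The repeated stratified cross-validation used in point (2) can be sketched in plain Python (a simplified illustration, not the thesis's actual CART/CHAID pipeline; `fit` and `predict` are generic callables assumed for the sketch): each repetition splits the data into k folds that preserve the class proportions, and the misclassification rate is averaged over all folds of all repetitions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Assign each sample index to one of k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def repeated_cv_error(X, y, fit, predict, k=5, repeats=100, seed=0):
    """Mean misclassification rate over `repeats` stratified k-fold splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            wrong = sum(predict(model, X[i]) != y[i] for i in fold)
            errors.append(wrong / len(fold))
    return sum(errors) / len(errors)

# Example with a trivial majority-class "model" (illustrative only):
X = list(range(10))
y = [0] * 6 + [1] * 4
fit = lambda X_, y_: max(set(y_), key=y_.count)
predict = lambda m, x: m
err = repeated_cv_error(X, y, fit, predict, k=5, repeats=3)
```

Because the fold sizes are fixed by the stratification, the estimate here is deterministic even though the assignment of individual samples is shuffled.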
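The best-subset-plus-AIC screening in point (3) can be sketched with a small NumPy Newton-Raphson logistic fit (an illustrative re-implementation, not the thesis's SAS code; the function names are ours): every non-empty subset of candidate predictors is fitted by maximum likelihood, and the subset minimizing AIC = 2k − 2 ln L is kept, so no P-value cutoff needs to be chosen.

```python
import itertools
import numpy as np

def fit_logistic(X, y, iters=50):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Returns (coefficients incl. intercept, log-likelihood)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        W = p * (1 - p)
        # Tiny ridge term keeps the Hessian invertible.
        H = X1.T @ (X1 * W[:, None]) + 1e-9 * np.eye(X1.shape[1])
        beta = beta + np.linalg.solve(H, X1.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X1 @ beta)), 1e-12, 1 - 1e-12)
    logL = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return beta, logL

def best_subset_aic(X, y, names):
    """Enumerate all non-empty predictor subsets; return (AIC, names) of the best."""
    best = (np.inf, None)
    for r in range(1, len(names) + 1):
        for cols in itertools.combinations(range(len(names)), r):
            _, logL = fit_logistic(X[:, cols], y)
            aic = 2 * (len(cols) + 1) - 2 * logL  # k parameters incl. intercept
            if aic < best[0]:
                best = (aic, [names[c] for c in cols])
    return best
```

On simulated data where only the first predictor carries signal, the informative variable survives the screen while no significance level ever has to be specified.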
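The 0.632 bootstrap named in point (4) combines the optimistic apparent (resubstitution) error with the pessimistic out-of-bag error. A minimal sketch (generic `fit`/`predict` callables are our assumption, not the thesis's interface):

```python
import random

def bootstrap_632(X, y, fit, predict, B=200, seed=0):
    """0.632 bootstrap error: 0.368 * apparent error + 0.632 * out-of-bag error."""
    rng = random.Random(seed)
    n = len(y)
    # Apparent error: train and test on the full sample (optimistic).
    model = fit(X, y)
    app_err = sum(predict(model, X[i]) != y[i] for i in range(n)) / n
    # Out-of-bag error: test each bootstrap model on the samples it never saw.
    oob_errs = []
    for _ in range(B):
        sample = [rng.randrange(n) for _ in range(n)]
        oob = [i for i in range(n) if i not in set(sample)]
        if not oob:
            continue  # rare replicate that covered every sample
        m = fit([X[i] for i in sample], [y[i] for i in sample])
        oob_errs.append(sum(predict(m, X[i]) != y[i] for i in oob) / len(oob))
    eps0 = sum(oob_errs) / len(oob_errs)
    return 0.368 * app_err + 0.632 * eps0
```

The 0.632 weight is the asymptotic probability that a given observation appears in a bootstrap sample (1 − e⁻¹ ≈ 0.632), which is why this estimator balances the two error rates in exactly these proportions.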
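The prior-probability correction in point (5) is the standard case-control (separate-sampling) adjustment: each sample-scale posterior is reweighted by the ratio of population prior to sample prior for each class (the function name below is ours, chosen for illustration).

```python
def adjust_posterior(p, pop_prior, sample_prior):
    """Rescale a posterior probability estimated on oversampled data
    to the population scale.
    p            -- model posterior for the positive class (sample scale)
    pop_prior    -- true population prevalence of the positive class
    sample_prior -- proportion of positives in the training sample
    """
    w1 = pop_prior / sample_prior              # positive-class reweighting
    w0 = (1 - pop_prior) / (1 - sample_prior)  # negative-class reweighting
    return p * w1 / (p * w1 + (1 - p) * w0)
```

For instance, a model trained on a balanced case-control sample (sample prior 0.5) that outputs 0.5 for a subject corresponds, in a population with 10% prevalence, to an adjusted risk of 0.10; when sample and population priors coincide, the posterior is returned unchanged.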
