Research on Missing Data Imputation Methods Based on GMDH
The Research for Method of Missing Data Interpolation Based on GMDH
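【Author】 Zhang Zhiyong (张智勇);
【Advisor】 He Changzheng (贺昌政);
【Author information】 Sichuan University, Management Science and Engineering, 2007, Master's thesis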
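【Abstract, translated from the Chinese】 With the development of information technology, the steady growth in our capacity to collect data, and the spread of database, data warehouse, and Internet technologies, ever more data has been accumulated, and data mining has emerged and continued to develop in response. Most existing data mining algorithms assume an ideal data set, but in practice the collected data are often incomplete for a variety of reasons, with some values missing. The usual remedy is to estimate the missing values first and then mine the completed data set. The most widely used imputation methods include regression imputation, neural network imputation, and K-nearest-neighbor imputation. These methods have shortcomings on noisy data, however: regression and neural network imputation tend to overfit, and K-nearest-neighbor imputation is easily disturbed by noise when K is small. The Group Method of Data Handling (GMDH) handles noisy data effectively.
Building on missing-data theory, this thesis introduces the noise-oriented GMDH method and establishes a framework of GMDH-based imputation methods for missing data in the presence of noise. Depending on the missing-data pattern, different missing-data mechanisms are assumed, and GMDH is combined with a different method accordingly. For the univariate missing-data pattern under the missing at random (MAR) mechanism, the EM algorithm is combined with GMDH: a GMDH model is built among the variables and used to estimate the missing values. For the multivariate missing-data pattern, with the missing-data mechanism ignored, the K-nearest-neighbor algorithm is combined with GMDH: a GMDH model is built among similar samples and used to estimate the missing values.
The main work of the thesis is as follows:
1. For the univariate missing-data pattern under the MAR mechanism:
(1) A new method combining the EM algorithm with GMDH is proposed for imputing missing data; its basic assumptions are stated, its basic steps are designed, and a corresponding program is written.
(2) Through theoretical analysis, numerical experiments, and an empirical study of Chinese economic data, GMDH-based imputation is compared with regression imputation. The comparison shows that the method is effective for univariate missing data under noise and outperforms regression.
2. For the multivariate missing-data pattern under an ignorable missing-data mechanism:
(1) A new method combining the K-nearest-neighbor algorithm with GMDH is proposed for imputing missing data; its basic assumptions are stated, its basic steps are designed, and a corresponding program is written.
(2) Through theoretical analysis and an empirical study of the gross domestic product of China's provinces, GMDH-based imputation is compared with K-nearest-neighbor imputation. The comparison shows that the method is effective for multivariate missing data under noise and outperforms the K-nearest-neighbor algorithm.
On this basis, the innovations of the thesis are mainly the following:
1. The thesis studies missing data imputation in the presence of noise:
(1) For the univariate missing-data pattern under the MAR mechanism, GMDH is combined with the EM algorithm, and imputing the missing values iteratively reduces the influence of noise on the imputation. In the practical example, adding constraints on the range of the missing values speeds up the iteration and overcomes the problem that no model can be built when many values are missing and few are observed.
(2) For the multivariate missing-data pattern with the missing-data mechanism ignored, GMDH is combined with the nearest-neighbor algorithm, which eliminates the influence of noise on the imputation and makes the choice of K less critical. The internal and external criteria of GMDH improve the accuracy of the estimates.
2. The thesis also links missing-data patterns and mechanisms to imputation methods, providing a theoretical basis for choosing an imputation method according to the kind of missing data.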
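As a rough illustration of the univariate case summarized above, the following Python sketch alternates between fitting a model and re-estimating the missing values, in the spirit of EM. It is not the thesis's program, which the abstract does not reproduce. The single-layer GMDH-style selection (Ivakhnenko quadratic polynomials over pairs of inputs, chosen by validation error as the external criterion), the mean initialization, the random half split, and all function names (`quad_features`, `fit_gmdh_layer`, `predict`, `em_gmdh_impute`) are assumptions made for the example.

```python
import itertools
import numpy as np


def quad_features(a, b):
    """Ivakhnenko quadratic polynomial terms for one pair of inputs."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])


def fit_gmdh_layer(X, y, train_idx, valid_idx):
    """Fit one candidate per input pair on the training rows and keep the
    pair with the lowest validation error (the external criterion)."""
    best = None
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        F = quad_features(X[:, i], X[:, j])
        coef, *_ = np.linalg.lstsq(F[train_idx], y[train_idx], rcond=None)
        err = np.mean((F[valid_idx] @ coef - y[valid_idx]) ** 2)
        if best is None or err < best[0]:
            best = (err, i, j, coef)
    return best[1:]  # (i, j, coef)


def predict(model, X):
    """Evaluate a selected pair model on the rows of X."""
    i, j, coef = model
    return quad_features(X[:, i], X[:, j]) @ coef


def em_gmdh_impute(X, y, max_iter=50, tol=1e-6, seed=0):
    """Iteratively impute NaNs in y from a fully observed X (>= 2 columns).

    Assumed scheme: start from the observed mean, refit the model on the
    filled-in data, replace the missing entries with the new predictions,
    and stop once the imputed values no longer change.
    """
    missing = np.isnan(y)
    if not missing.any():
        return y.copy()
    y_filled = y.copy()
    y_filled[missing] = np.nanmean(y)  # initial guess for the missing values

    # Fixed random split into fitting rows and checking rows.
    idx = np.random.default_rng(seed).permutation(len(y))
    half = len(y) // 2
    train_idx, valid_idx = idx[:half], idx[half:]

    for _ in range(max_iter):
        model = fit_gmdh_layer(X, y_filled, train_idx, valid_idx)
        new_vals = predict(model, X[missing])
        change = np.max(np.abs(new_vals - y_filled[missing]))
        y_filled[missing] = new_vals  # observed entries are never touched
        if change < tol:
            break
    return y_filled
```

Calling `em_gmdh_impute(X, y)` with `np.nan` marking the missing entries of `y` returns a copy of `y` in which only those entries have been filled. A full GMDH implementation would stack several layers and might use a different external criterion; the single layer here only keeps the example short.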
【Abstract】 With the development of information technology, the steady improvement in our capacity to collect data, and the wider use of database, data warehouse, and Internet technologies, more and more data are being accumulated, and data mining technology has emerged and developed alongside them. Most data mining algorithms, however, assume an ideal data set, whereas in reality the collected data are often incomplete for various reasons, with some values missing. The usual way to handle this is to estimate the missing values first and then carry out data mining on the completed data set. The most widely used imputation methods for missing data are regression imputation, neural network imputation, and K-nearest-neighbor imputation. These methods have shortcomings on noisy data: regression imputation and neural network imputation are prone to overfitting, and K-nearest-neighbor imputation is vulnerable to noise when K is small. The GMDH method is well suited to small samples and noisy data. Based on the theory of missing data, this thesis introduces the noise-oriented GMDH method and establishes a system of GMDH-based imputation methods for missing data under noise. Depending on the missing-data pattern, different missing-data mechanisms are assumed, and different algorithms are combined with GMDH to estimate the missing values.
For the univariate missing-data pattern under the MAR (missing at random) mechanism, this thesis combines GMDH with the EM algorithm, building GMDH models from the relationships between variables to estimate the missing data. For the multivariate missing-data pattern, with the missing-data mechanism ignored, it combines GMDH with the K-nearest-neighbor algorithm, building GMDH models from the relationships between similar samples to estimate the missing data.
The main work of this thesis is as follows:
1. For the univariate missing-data pattern under the MAR mechanism:
(1) A new method based on GMDH and EM is presented; its basic assumptions are stated, the basic steps of the imputation algorithm are designed, and the corresponding program is written.
(2) Through theoretical analysis, numerical experiments, and an empirical study of Chinese economic data, GMDH-based imputation is compared with regression imputation. The comparison shows that the GMDH-based algorithm is effective for estimating missing values in noisy data and superior to regression.
2. For the multivariate missing-data pattern under an ignorable missing-data mechanism:
(1) A new method based on GMDH and the K-nearest-neighbor algorithm is presented; its basic assumptions are stated, the basic steps of the imputation algorithm are designed, and the corresponding program is written.
(2) Through theoretical analysis and an empirical study of the gross domestic product of China's provinces, GMDH-based imputation is compared with K-nearest-neighbor imputation. The comparison shows that the GMDH-based algorithm is effective for estimating missing values in noisy data and superior to the K-nearest-neighbor algorithm.
The main innovations of this thesis are as follows:
1. The thesis studies missing data imputation in the presence of noise.
(1) For the univariate missing-data pattern under the MAR mechanism, GMDH is combined with the EM algorithm to estimate the missing values. Iterating reduces the impact of noise on the estimates, and adding constraints on the range of the missing values in the practical example accelerates the iteration and overcomes the inability to build a model when many values are missing and relatively few are observed.
(2) For the multivariate missing-data pattern under an ignorable missing-data mechanism, GMDH is combined with the K-nearest-neighbor algorithm to estimate the missing values. This removes the impact of noise on the estimates, reduces the importance of the choice of K in the imputation process, and improves the accuracy of the estimates through the internal and external criteria of GMDH.
2. The thesis links the patterns and mechanisms of missing data to imputation methods, providing a theoretical basis for choosing an imputation algorithm under different missing-data patterns and mechanisms.
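The multivariate case can be sketched in the same hedged way. The abstract says only that a GMDH model is built "between similar samples"; the Python below is one possible reading, not the author's procedure. It treats the K nearest complete rows as candidate inputs, treats the target row's observed columns as the observations, and picks the pair of neighbors whose linear model has the lowest error on held-out columns (the external criterion). The names `pair_design` and `knn_gmdh_impute_row`, the alternating column split, and the single-layer linear model are all assumptions for illustration.

```python
import itertools
import numpy as np


def pair_design(neighbors, i, j, cols):
    """Design matrix [1, neighbor_i, neighbor_j] over the given columns."""
    return np.column_stack(
        [np.ones(len(cols)), neighbors[i, cols], neighbors[j, cols]]
    )


def knn_gmdh_impute_row(data, row, k=4):
    """Impute the NaNs in data[row] from its k nearest complete rows."""
    target = data[row]
    obs_cols = np.where(~np.isnan(target))[0]
    miss_cols = np.where(np.isnan(target))[0]
    complete = np.where(~np.isnan(data).any(axis=1))[0]
    k = min(k, len(complete))
    if k < 2 or len(obs_cols) < 4:
        raise ValueError("need >= 2 complete rows and >= 4 observed columns")

    # K nearest complete rows, measured on the target's observed columns only.
    dist = np.linalg.norm(data[complete][:, obs_cols] - target[obs_cols], axis=1)
    neighbors = data[complete[np.argsort(dist)[:k]]]

    # Alternate the observed columns between fitting and checking sets.
    train_cols, valid_cols = obs_cols[0::2], obs_cols[1::2]

    best = None
    for i, j in itertools.combinations(range(k), 2):
        coef, *_ = np.linalg.lstsq(
            pair_design(neighbors, i, j, train_cols), target[train_cols], rcond=None
        )
        resid = pair_design(neighbors, i, j, valid_cols) @ coef - target[valid_cols]
        err = np.mean(resid**2)  # external criterion on held-out columns
        if best is None or err < best[0]:
            best = (err, i, j, coef)

    # Predict the missing columns from the selected pair of neighbors.
    _, i, j, coef = best
    filled = target.copy()
    filled[miss_cols] = pair_design(neighbors, i, j, miss_cols) @ coef
    return filled
```

Because the final estimate comes from a fitted and validated model rather than a plain average of the neighbors, the result depends less directly on the exact value of `k`, which is the kind of effect the abstract attributes to combining GMDH with the K-nearest-neighbor algorithm.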
【Key words】 GMDH algorithm; EM algorithm; K-nearest algorithm; Missing data;
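- 【Online publication contributor】 Sichuan University 【Online publication year and issue】 2008, Issue 05
- 【Classification code】 TP183
- 【Citation count】 3
- 【Download count】 514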