
Graph-based Semi-supervised Learning with Missing Data

【Author】 王伟

【Supervisor】 朱锋峰

【Author Information】 South China University of Technology, Probability Theory and Mathematical Statistics, 2011, Master's thesis

【Abstract】 Missing data are frequently encountered in data analysis and machine learning. The usual practice is to impute the missing values first, for example by mean imputation, nearest-neighbor imputation, hot-deck imputation, cold-deck imputation, regression imputation, or multiple imputation, and then to build the model on the completed data set. Imputation, however, is laborious and time-consuming, and inappropriate imputation can bias the data away from the original values, distorting the entire subsequent modeling and analysis. Taking classification as an example, this thesis presents a preliminary study of how to handle missing data directly, with the aim of constructing a classification model that needs no imputation. The thesis combines missing data with graph-based semi-supervised learning for the first time: by constructing similarity weights on incomplete data, it proposes a graph-based semi-supervised algorithm that handles missing data automatically, and implements the algorithm in R. Experiments on the Letters, Spam, Diabetes, Wine, and Segment data sets from the UCI machine learning repository lead to the following conclusions. (1) When the features are first completed with classical statistical imputation methods (stochastic imputation, mean imputation, median imputation) and graph-based semi-supervised learning is then applied to the imputed data, the proposed imputation-free method performs slightly better than these classical approaches. (2) When part of a complete data set is deleted artificially to create missing values, the proposed method applied to the incomplete data performs only slightly worse than classical supervised learning applied to the original complete data; since the proposed method has to work with missing values, this shows that it is a reasonable approach to classification with incomplete data. (3) Compared with the traditional strategy of imputing first and then modeling on the completed data, the proposed method performs slightly better; because it never fills in the missing values, it also avoids the effort of imputation, which is a further practical advantage.
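The abstract does not spell out how the similarity weights are built when feature values are missing, so the following is only a minimal sketch of one plausible construction, written in base R (the implementation language named in the abstract): a Gaussian similarity computed over the features observed in both points and rescaled by the fraction of shared features, followed by standard label propagation with the labeled points clamped. The names similarity, label_propagation, sigma, and iters are illustrative and not taken from the thesis.

# A minimal, assumed sketch (not the thesis's exact construction):
# pairwise similarity uses only features observed in both points, rescaled to
# the full dimension, followed by a standard label-propagation step.

similarity <- function(x, y, sigma = 1) {
  obs <- !is.na(x) & !is.na(y)                          # features observed in both points
  if (!any(obs)) return(0)                              # no shared features -> no edge
  d2 <- sum((x[obs] - y[obs])^2) * length(x) / sum(obs) # rescaled squared distance (assumption)
  exp(-d2 / (2 * sigma^2))
}

# X: n x p numeric matrix, possibly containing NA
# y: factor of length n, with NA for unlabeled points
label_propagation <- function(X, y, sigma = 1, iters = 100) {
  n <- nrow(X)
  W <- matrix(0, n, n)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      W[i, j] <- W[j, i] <- similarity(X[i, ], X[j, ], sigma)
    }
  }
  P <- W / pmax(rowSums(W), .Machine$double.eps)        # row-normalized transition matrix
  classes <- levels(y)
  labeled <- !is.na(y)
  scores <- matrix(0, n, length(classes))
  scores[cbind(which(labeled), as.integer(y[labeled]))] <- 1
  for (t in seq_len(iters)) {                           # propagate, then clamp labeled rows
    scores <- P %*% scores
    scores[labeled, ] <- 0
    scores[cbind(which(labeled), as.integer(y[labeled]))] <- 1
  }
  factor(classes[max.col(scores)], levels = classes)    # predicted class for every point
}

# Example: iris with 20% of feature values removed and only 10 labeled rows
X <- as.matrix(iris[, 1:4])
X[sample(length(X), 0.2 * length(X))] <- NA
y <- iris$Species
y[-sample(nrow(X), 10)] <- NA
pred <- label_propagation(X, y, sigma = 1)

Rescaling the squared distance by the factor length(x)/sum(obs) keeps distances comparable between pairs that share many observed features and pairs that share few, in the spirit of pairwise deletion; the thesis may weight missing coordinates differently.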
