
基于特征和实例的海量数据约简方法研究

Research on Data Reduction for Massive Data Based on Instances and Characters

【Author】 彭涛

【Supervisors】 李磊; 陈晓苏

【Author Information】 华中科技大学 (Huazhong University of Science and Technology), Computer System Architecture, 2011, PhD dissertation

【Abstract (Chinese)】 With the continuous development of information and computer technology, massive data sets have appeared in large numbers, and performing complex analysis on such data usually takes a great deal of time. Although methods have been proposed that cut the processing time and storage space of massive data sets by reducing the number of features per instance and the number of instances in the data set, each of these methods has its limitations. Exploring reasonably general data reduction methods that reduce both the number of features and the number of instances is therefore of great theoretical and practical significance for improving the efficiency of massive data processing.

Addressing the shortcomings of existing instance-selection-based data reduction methods, and exploiting the advantages the Hausdorff distance has shown in measuring the similarity of two data sets, the Hausdorff distance is taken as the criterion for deciding whether an instance in the original data set is selected, so that "representative" instances are chosen from the original data. Because computing the Hausdorff distance is expensive, a K-NN (K-Nearest Neighbor) search algorithm is first used to split the data set evenly into smaller blocks; on this basis, an instance-oriented data reduction method based on the local Hausdorff distance is proposed.

For the efficient reduction of massive data sets with many features, and with the goals of increasing the uniformity of the data around a target instance, reducing the neighbor-representation error, and improving the dimensionality reduction result, a center-point-based K-NN search algorithm is proposed that keeps the neighbors found for an instance distributed around it as uniformly as possible. To make K track the characteristics of non-uniform data sets, a method is proposed that adjusts K dynamically according to the local uniformity of the data, and rules for varying K are formulated that satisfy the dimensionality reduction needs of the LLE (Locally Linear Embedding) algorithm on non-uniform data sets. On this basis, a variable-K data reduction method based on the LLE algorithm is proposed.

Because changes to the instances in a data set generally affect both classification results and the statistical characteristics of the data set, a comprehensive evaluation approach for reduction methods is established that combines classification techniques with spatial statistics. To measure classification performance, a method for computing an overall classification-accuracy score is given, based on an analysis of how class radius, inter-class distance, and the number of instances per class affect classification accuracy. By analyzing three statistical characteristics, namely the frequency distribution of the instances, the quantiles of the data set, and the distance between data sets, the similarity of the data sets before and after reduction is evaluated. In addition, based on an analysis of the role of Moran's I, the change in spatial autocorrelation between the original and reduced data sets is measured with Moran's I so as to evaluate the reduction result.

Through this research on feature-selection-based and instance-selection-based data reduction and on methods for evaluating reduction quality, several results have been obtained that help improve the efficiency of massive data processing and the rapid acquisition of useful information.
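The dissertation itself does not publish code, but the core building block of the instance-selection method described above, the Hausdorff distance between two point sets, has a standard definition. The Python sketch below implements that distance and shows one naive, randomized way a "representative" subset could be judged inside a block produced by K-NN splitting; the function names, the `candidate_rounds` parameter, and the random subset search are illustrative assumptions, not the author's actual selection procedure.

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between point sets A (n_a x d) and B (n_b x d)."""
    diff = A[:, None, :] - B[None, :, :]     # pairwise differences, shape (n_a, n_b, d)
    dist = np.linalg.norm(diff, axis=2)      # pairwise Euclidean distances
    h_ab = dist.min(axis=1).max()            # directed distance h(A, B)
    h_ba = dist.min(axis=0).max()            # directed distance h(B, A)
    return max(h_ab, h_ba)

def select_representatives(block, target_size, candidate_rounds=200, rng=None):
    """Illustrative selection rule (NOT the dissertation's algorithm): among randomly
    sampled subsets of one K-NN block, keep the subset whose Hausdorff distance to
    the full block is smallest, i.e. the most 'representative' subset found."""
    block = np.asarray(block, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    best_idx, best_d = None, np.inf
    for _ in range(candidate_rounds):
        idx = rng.choice(len(block), size=target_size, replace=False)
        d = hausdorff_distance(block[idx], block)
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx, best_d
```

The randomized search is only a stand-in: the method described in the abstract instead splits the data evenly with a K-NN search and applies a local Hausdorff criterion inside each block, which avoids evaluating the distance over the whole massive data set at once.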

【Abstract】 With the development of computer technology, recording and spreading information has become increasingly convenient and information-collecting technology increasingly advanced; as a result, massive data sets are being generated. Processing such massive data is time-consuming. Many methods have been proposed that reduce processing time and storage by reducing the number of instances and the number of features per instance, but each of these methods has its own limitations. Research on a general method for reducing features and instances, together with a reasonable way of evaluating the reduction, is therefore of great theoretical and practical significance.

By analyzing the limitations of current instance-selection methods and the advantage of the Hausdorff distance in measuring the similarity of two data sets, this dissertation proposes a data reduction method based on the local Hausdorff distance. The method uses the Hausdorff distance as the criterion for selecting representative instances. To reduce the cost of computing the Hausdorff distance, a K-NN search method first splits the original data set into smaller subsets, and the Hausdorff distance is then used to select representative instances within each subset.

To overcome the shortcomings of the traditional LLE algorithm in reducing the dimensionality of non-uniform data sets, the dissertation analyzes the error between an instance and its neighbor-based representation. To improve the dimensionality reduction result, it introduces a self-adapting algorithm for the parameter K that reduces this representation error, together with an LLE-based data reduction scheme using the variable parameter K. It also proposes a scheme that adjusts K dynamically with the variation of local uniformity, which reduces the representation error and meets the dimensionality reduction needs of non-uniform data sets. Finally, a center-point-based search algorithm is given that improves the uniformity of an instance's near neighbors and further reduces the representation error.

Because changes to the instances in a data set affect both classification results and the statistical properties of the data set, the dissertation proposes a scheme that uses classification and spatial statistics to evaluate the result of data reduction. By analyzing how class radius, inter-class distance, and the number of instances per class affect classification accuracy, it gives a method for computing an overall classification-accuracy measure with which a classification scheme can be evaluated. Based on the frequency distribution of the instances, the quantiles of the data set, and the distance between data sets, a scheme is proposed for evaluating the similarity between data sets. Since spatial autocorrelation reflects the distribution of instances in a data set, Moran's I is used to measure autocorrelation and to evaluate the change in autocorrelation between the original and the reduced data sets.

The research on data reduction based on feature selection and instance selection, and on the evaluation of reduction quality, has both theoretical and practical value; the results also help improve the efficiency of processing massive data and of obtaining valuable information.
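Moran's I, used above to compare the spatial autocorrelation of the original and reduced data sets, has a fixed closed form: I = (n / Σ_i Σ_j w_ij) · (Σ_i Σ_j w_ij z_i z_j) / (Σ_i z_i²), where z_i = x_i − x̄ and w_ij is a spatial weight with w_ii = 0. The Python sketch below implements this formula; the binary k-nearest-neighbor weight matrix is an assumed, illustrative choice of weights rather than the weighting scheme used in the dissertation.

```python
import numpy as np

def morans_i(x, W):
    """Moran's I for attribute values x (length n) and an n x n weight matrix W
    with a zero diagonal."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    n = x.size
    z = x - x.mean()
    numerator = (W * np.outer(z, z)).sum()   # sum_ij w_ij * z_i * z_j
    denominator = (z ** 2).sum()             # sum_i z_i^2
    return (n / W.sum()) * (numerator / denominator)

def knn_weight_matrix(points, k):
    """Binary k-nearest-neighbor spatial weights (an illustrative choice only)."""
    points = np.asarray(points, dtype=float)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # exclude self-neighbors
    W = np.zeros_like(d)
    nearest = np.argsort(d, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(points)), k)
    W[rows, nearest.ravel()] = 1.0
    return W

# Illustrative comparison: a small difference between the two values suggests that
# the reduced data set preserves the spatial autocorrelation of the original data.
# I_original = morans_i(values_full,    knn_weight_matrix(coords_full, k=5))
# I_reduced  = morans_i(values_reduced, knn_weight_matrix(coords_reduced, k=5))
```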

  • 【CLC Classification】TP274;TP18
  • 【Cited by】1
  • 【Downloads】379