Study on Spatial Outlier Mining (空间离群点挖掘技术的研究)

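【Author】 Xue Anrong (薛安荣)

【Advisors】 Sun Jiaguang (孙家广); Ju Shiguang (鞠时光)

【Author information】 Jiangsu University, Computer Application Technology, 2008, PhD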

【Abstract】 A spatial outlier is a spatial object whose non-spatial attribute values differ markedly from those of the other objects in its spatial neighborhood. Spatial outlier mining is an important branch of spatial data mining and can reveal significant phenomena in applications such as traffic control, remote sensing image analysis, weather forecasting, and demographic data analysis. As sensor technology advances, data acquisition devices grow in number and precision and record more items, so data sets become both larger and higher-dimensional. Existing spatial outlier mining algorithms, however, mainly target small and medium-sized data sets with one or a few dimensions and adapt poorly to large, high-dimensional data. They also do not fully account for the characteristics of spatial data, so what they find are global outliers rather than true spatial outliers. They suffer from heavy user dependency, low detection accuracy, and low mining efficiency. In addition, with the development of network, sensor, and wireless communication technology, the acquisition, collection, storage, and processing of data have become decentralized. Data mining in distributed environments has therefore drawn attention, yet no spatial outlier mining algorithm for distributed environments has been reported.

Starting from the characteristics of spatial data, this thesis studies attribute partitioning, attribute weight setting, and the measurement of the spatial outlier score, in order to obtain efficient spatial outlier mining algorithms with high accuracy and low user dependency. Because existing algorithms are mostly limited to numerical attributes, non-numerical data are converted into numerical data to obtain a unified algorithm for mixed attributes. For large, high-dimensional data, pruning strategies, subspace-based outlier mining, and ensemble learning are used. For distributed environments, a privacy-preserving spatial outlier mining algorithm is proposed. The main contributions of the thesis are as follows:

(1) An attribute partitioning method is proposed to solve the local outlier mining problem. Local outlier mining generally works on the full set of attribute dimensions, as in the LOF (Local Outlier Factor) method. As a result, determining the local neighborhood is very time-consuming, and because all dimensions are treated alike without distinction, the accuracy of the outlier score suffers, which lowers both the accuracy and the speed of mining. The thesis proposes dividing the attributes of a data object into ID attributes, context attributes, and inherent attributes. ID attributes identify the object, for example its name. Context attributes determine the environment of the object, for example geographic location, time, or sequence, and can be used to determine the neighborhood. Inherent attributes are specific to the object and comprise behavior attributes and status attributes. They determine the behavioral and status characteristics of the object and can be used to determine its outlier score.

(2) A new measure of the outlier degree of spatial data objects is proposed: the Spatial Local Outlier Factor (SLOF), which is based on the characteristics of spatial data. On top of it, the spatial outlier mining algorithm ASLOF (Algorithm based on SLOF) is proposed. The attributes of a data object are divided into ID attributes, spatial attributes, and non-spatial attributes. The spatial attributes are used to determine the spatial neighborhood and to build the spatial index, and the non-spatial attributes are used to determine the outlier score. Attribute weights are introduced into the score to improve its accuracy. Theoretical analysis and experimental results show that ASLOF outperforms existing algorithms in mining accuracy, user dependency, and performance.

(3) A unified spatial outlier score and mining algorithm for mixed attributes is proposed. Starting from the nature of outliers, categorical attributes are converted into numerical ones by counting the frequency of their values. After attribute weighting and attribute standardization, this yields a unified spatial outlier mining algorithm for mixed attributes. Experimental results show that the algorithm effectively computes a unified spatial outlier score over mixed attributes and mines outliers effectively.

(4) A fast spatial outlier mining algorithm for large, high-dimensional data is proposed, based on ensemble learning over subspace outliers: S2OEAHL (Subspace Spatial Outlier Ensemble Algorithm based High-dimensional Large data sets). Because the ID attributes of many spatial data objects contain an identifier of the region in which the object lies, a hierarchical coding tree of the objects is built from these regional identifiers. The tree supports partitioning of the data and fast retrieval of objects. By computing upper and lower bounds for each partition and applying a minimum bounding rectangle (MBR) test, partitions that clearly contain no outliers are pruned, and partitions that may contain outliers are kept as candidates. This fast pruning reduces the amount of data to be processed. The candidate partitions are then mined in subspaces. To avoid a search whose size grows exponentially with the number of attribute dimensions, the algorithm mines specified subspaces and fuses the results with an ensemble method based on subspace weights. The implementation computes outlier factors in one-dimensional subspaces and uses an optimization procedure to obtain the weight of each attribute for the object under test. A fusion function then combines the per-subspace outlier factors into the outlier score of the object, and the outliers are obtained by ranking the scores. Theoretical analysis and experimental results both show that the algorithm is effective and computationally efficient.

(5) A privacy-preserving spatial outlier mining algorithm for distributed environments is proposed: DPPASLOF (Distributed Privacy Preserving Algorithm based on SLOF). The algorithm exploits the locality of spatial data and the active participation of each data-holding party, and relies on spatial indexing and privacy-preserving protocols to improve both search capability and privacy protection. Theoretical analysis establishes the security of the algorithm, its computational efficiency, and its low communication cost.
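The idea shared by contributions (2) and (3) can be illustrated with a small sketch: spatial attributes are used only to find the neighborhood, while weighted non-spatial attributes (with categorical values replaced by their frequencies and all columns standardized) determine the score. The abstract does not give the SLOF formula, so the score below is a simplified stand-in, namely the weighted Euclidean deviation of an object from the mean of its k nearest spatial neighbors. All function names, the weights, and the sample data are hypothetical.

```python
import math
from collections import Counter


def frequency_encode(values):
    """Replace each categorical value by its relative frequency in the data set."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # constant column: avoid division by zero
    return [(x - mean) / std for x in column]


def k_nearest(coords, i, k):
    """Indices of the k objects spatially closest to object i (brute force, no index)."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:k]


def spatial_outlier_scores(coords, attrs, weights, k):
    """Stand-in score (an assumption, not the thesis's SLOF): weighted Euclidean
    distance between an object's non-spatial attributes and the mean of those
    attributes over its k nearest spatial neighbors."""
    scores = []
    for i in range(len(coords)):
        nbrs = k_nearest(coords, i, k)
        total = 0.0
        for column, weight in zip(attrs, weights):
            nbr_mean = sum(column[j] for j in nbrs) / len(nbrs)
            total += weight * (column[i] - nbr_mean) ** 2
        scores.append(math.sqrt(total))
    return scores


# Hypothetical data: a 3 x 3 grid of stations; the center one (index 4) is unusual.
coords = [(x, y) for x in range(3) for y in range(3)]           # spatial attributes
temps = [20, 21, 20, 21, 35, 20, 21, 20, 21]                    # numeric attribute
land = ["farm"] * 4 + ["industrial"] + ["farm"] * 4             # categorical attribute

attrs = [standardize(temps), standardize(frequency_encode(land))]
weights = [0.7, 0.3]                                            # assumed attribute weights
scores = spatial_outlier_scores(coords, attrs, weights, k=4)
top = max(range(len(scores)), key=scores.__getitem__)
print("most outlying object:", top, coords[top])
```

The brute-force neighbor search stands in for the spatial index mentioned in the abstract. With a real index the neighborhood lookup would change, but the separation between spatial attributes (neighborhood) and non-spatial attributes (score) would stay the same.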

  • 【Online publication contributor】 Jiangsu University
  • 【Online publication year/issue】 2009, Issue 09