

Research on Clustering Preprocessing of Data Resource and Its Application

【Author】 Xia Jiaoxiong

【Supervisor】 Wu Gengfeng

【Author Information】 Shanghai University, Control Theory and Control Engineering, 2007, Ph.D.

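【Summary】 Water, water, everywhere, and all the boards did shrink; water, water, everywhere, nor any drop to drink. Data, data, everywhere, yet users of every kind are at a loss; data, data, everywhere, yet not a single hint to help me decide. In his speech "The Digital Earth: Understanding Our Planet in the 21st Century" [Gore1998], delivered on January 31, 1998, former U.S. Vice President Al Gore observed that a new wave of technological innovation is allowing us to capture, store, process and display an unprecedented amount of data about the Earth and a wide variety of environmental and cultural information, and that the hard part of taking advantage of this flood of data is making sense of it, that is, turning raw data into understandable information. Today we often find that we have more data than we know what to do with. We hunger insatiably for knowledge, yet a great deal of material sits idle and untouched. Without materials, nothing exists; without energy, nothing happens; without information, nothing makes sense [Oet1965]. As one of the three major resources, information has an ever deeper influence on our lives. Faced with data this abundant and complex, the question of how to extract valuable information and knowledge from it gave rise to a new research direction: Knowledge Discovery in Database (KDD) and the associated theory and techniques of Data Mining (DM).
Data Resource, the basic object of study in the information field, is a fresh understanding and high-level generalization of data and its state of existence, viewed from the perspective of resources. Combining effective KDD and DM techniques to improve the quality of data resources and to raise the utilization efficiency of data objects has become the main research direction for the effective development and use of data resources. Preprocessing of data resources is an important stage of the KDD and DM process, and clustering analysis is a mature technique in the KDD and DM field; research combining the two is therefore of significant theoretical and practical value. This dissertation introduces clustering analysis into the preprocessing of data resources, studies it from several angles, and obtains the following main results:
1. Drawing on divisive hierarchical clustering, the Database Cluster Preprocessing on Analytic Hierarchy Process (DCP-AHP) method is constructed across three levels: plane, section, and space. It emphasizes hierarchical thinking to evaluate the target iteratively and eliminates data object sets with high dissimilarity, thereby achieving clustering-based cleaning of the data object sets and reducing the effect of errors introduced when qualitative problems are quantified.
2. Following the principle of minimum correlation, the Database Cluster Preprocessing on Principal Component Extraction (DCP-PCE) method is proposed for reducing the dimensionality of high-dimensional data systems. The projections of the data objects onto the directions of greatest variation are taken as the principal components of a given data object set, giving a level-by-level clustering extraction of principal components. DCP-PCE also verifies that the principal components fully cover the original information, solves the coverage of composite variables and dimensionality reduction at the same time, lowers the dissimilarity and dimensionality of the data object sets, and thus achieves clustering reduction of the data object sets.
3. Exploiting the "0/1" nature of the physical storage of data objects, the Numerical Cluster on Same Entity from Different Sources (NC-SEDS) algorithm is proposed for data objects that describe the same entity but come from different sources (SEDS). All data objects in the data resource are converted into a numerical state during preprocessing, and that numerical state is then used as the basis for grouping. Without considering any other attributes of the data objects, this increases the cohesion of SEDS, reduces the number of comparisons and the overall execution time, and achieves clustering integration of the data objects.
4. To carry through the idea of "complex problem solving", the Cluster Preprocessing on Ontic Kernel and Histogram (CPOKH) method is proposed. When data objects are preprocessed for clustering, the object data frequency of each Weak Ontic Kernel (WOK) is obtained first; then, according to the user's explicit requirements, all required WOKs are obtained and combined into a Strong Ontic Kernel (SOK); finally, a histogram is built and analyzed to determine the class membership of the data objects.
5. Borrowing the basic notions of "energy" and "collision", and taking the classes or clusters of data objects produced by data resource preprocessing as the main object of study, an energy-based "effective" dynamic threshold is constructed and the Clustering Optimized by Energy Hit (COEH) strategy is realized. A data space that already shows preliminary cluster structure is driven by energy derived from the user's topical requirements, and data objects inside clusters are handled together with outlier data objects on a unified footing, which ensures the clustering optimization of the data objects.
In addition, as an application of these theoretical results, the dissertation takes the higher-education evaluation system as its application subject, introduces clustering analysis into the preprocessing of university data resources, and gives worked application examples. This provides an effective analytical method for making good use of existing data resources, rationally assessing the results of a university's work in every area, and exploring models of student training in depth.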
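The second result above describes taking the projection of the data objects onto the direction of greatest variation as a principal component. The sketch below illustrates only that generic step, using ordinary principal component analysis with power iteration in plain Python. It is not the DCP-PCE method itself, whose details the abstract does not give. The function names, the start vector, and the sample data are assumptions made for illustration.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration for the unit direction of greatest variance.

    Returns (direction, variance along it). The uneven start vector is an
    arbitrary choice; a start orthogonal to the true direction would fail.
    """
    d = len(cov)
    v = [1.0 / (k + 1) for k in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate case: no variance in this direction
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

# Hypothetical data: four two-dimensional objects lying on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(rows)
cov = covariance(centered)
direction, variance = first_component(cov)

# Share of total variance retained by the first component.
total = sum(cov[i][i] for i in range(len(cov)))
share = variance / total

# One-dimensional scores: each object projected onto the component.
scores = [sum(c[j] * direction[j] for j in range(len(direction)))
          for c in centered]
print(direction, share, scores)
```

Because the sample points are exactly collinear, one component retains all of the variance here. With real data the retained share would indicate how much of the original information survives the reduction.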

【Abstract】 A new wave of technological innovation is allowing us to capture, store, process and display an unprecedented amount of information about our planet and a wide variety of environmental and cultural phenomena. The hard part of taking advantage of this flood of geospatial information will be making sense of it: turning raw data into understandable information. Today, we often find that we have more information than we know what to do with. Now we have an insatiable hunger for knowledge. Yet a great deal of data remains unused. (The Digital Earth: Understanding Our Planet in the 21st Century [Gore1998], by former U.S. Vice President Al Gore, January 31, 1998.) Without materials, nothing exists. Without energy, nothing happens. Without information, nothing makes sense [Oet1965]. As one of the three resources (materials, energy and information), information has a growing influence on our lives. Given the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge, Knowledge Discovery in Database (KDD) and Data Mining (DM) have emerged and attracted a great deal of attention.
As the fundamental object of study in the information field, Data Resource is a renewed understanding and generalization of data and its state, viewed as a resource. Improving the quality of data resources and the utilization efficiency of data objects through effective use of KDD and DM has naturally become the main goal. Preprocessing of data resources is a necessary stage of KDD and DM, and clustering analysis is a mature technique in KDD and DM. Research on preprocessing data resources with clustering analysis is therefore significant for both theory and practice. In this dissertation, clustering preprocessing of data resources is studied from several angles, and the main results are as follows.
First, drawing on divisive hierarchical clustering, a method of Database Cluster Preprocessing on Analytic Hierarchy Process (DCP-AHP) is constructed. Working at the levels of plane, section and space, DCP-AHP emphasizes hierarchical, iterative evaluation of the target. With DCP-AHP, data object sets with high dissimilarity can be eliminated, clustering cleaning of the data object sets can be achieved, and the error introduced in moving from qualitative to quantitative analysis can be reduced.
Second, following the principle of lowest correlation among data objects, a method of Database Cluster Preprocessing on Principal Component Extraction (DCP-PCE) is proposed to carry out clustering extraction of principal components level by level. The projection of the data objects onto the direction of greatest variation is defined as a principal component, and the principal components are shown to cover the original information of the data object sets. With DCP-PCE, completeness of information and dimensionality reduction are addressed together, the dissimilarity and dimension of the data object sets are decreased, and clustering reduction of the data object sets is achieved.
Third, making use of the "0" and "1" characteristic of the physical storage of data objects, an algorithm of Numerical Cluster on Same Entity from Different Sources (NC-SEDS) is put forward to turn all data objects into a numerical state. Without considering other attributes of the data objects, the numerical state serves as the basis of clustering to improve the cohesion of SEDS. As a result, the number of comparisons among data objects is reduced, the execution time drops, and clustering integration is achieved.
Fourth, in the spirit of "complex problem solving", a method of Cluster Preprocessing on Ontic Kernel and Histogram (CPOKH) is put forward for cluster preprocessing of data objects. In this method, the object data frequency of each Weak Ontic Kernel (WOK) is obtained first, and the WOKs required by the user's demands are combined into a Strong Ontic Kernel (SOK). Based on the SOK, a histogram is built and analyzed to determine the class membership of the data objects.
Fifth, drawing on the notions of "energy" and "hit", a strategy of Clustering Optimized by Energy Hit (COEH) is adopted to construct a valid, energy-based dynamic threshold among clusters. With COEH, energy driven by the user's demands takes effect across the whole data space, and all data objects, including outliers, are handled together on a unified cognition platform. The clustering optimization of the data objects is thereby ensured in a unified, global setting.
Finally, an evaluation system for college and university education is chosen as the application study. All the work in the dissertation is verified by real experiments. By introducing clustering analysis into the preprocessing of data resources in colleges and universities, it becomes possible to assess results in all areas soundly, in particular the gains and losses in student training.
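DCP-AHP builds on the Analytic Hierarchy Process, which turns qualitative pairwise judgments into numeric weights. The sketch below shows only the standard AHP weighting step, using the row geometric mean approximation and Saaty's consistency ratio. It is not the DCP-AHP method itself, which the abstract does not specify. The criteria, the comparison matrix, and the function names are hypothetical.

```python
import math

# Commonly cited random consistency index values (Saaty) for n = 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(matrix):
    """Priority weights from a reciprocal pairwise comparison matrix,
    using the row geometric mean approximation."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def consistency_ratio(matrix, weights):
    """Consistency ratio CR = CI / RI, where CI = (lambda_max - n) / (n - 1).

    lambda_max is estimated as the mean of (A w)_i / w_i. By convention a
    CR below about 0.1 is treated as acceptably consistent.
    """
    n = len(matrix)
    if RANDOM_INDEX[n] == 0.0:  # 1x1 and 2x2 matrices are always consistent
        return 0.0
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]

# Hypothetical judgments over three criteria: entry [i][j] states how much
# more important criterion i is than criterion j.
judgments = [
    [1.0, 2.0, 6.0],
    [1.0 / 2.0, 1.0, 3.0],
    [1.0 / 6.0, 1.0 / 3.0, 1.0],
]
weights = ahp_weights(judgments)
cr = consistency_ratio(judgments, weights)
print(weights, cr)
```

The sample matrix is perfectly consistent, so the consistency ratio comes out at essentially zero. Judgments collected from real evaluators would normally show some inconsistency, which is the quantification error that the first result aims to reduce.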

  • 【Online Publication Contributor】 Shanghai University
  • 【Online Publication Year/Issue】 2008, Issue 04