
Research on New Methods for Cluster Analysis in Distributed Environments

New Methods for Cluster Analysis in Distributed Environments

【Author】 Li Cheng'an (李成安)

【Supervisor】 Wu Tiejun (吴铁军)

【Author Information】 Zhejiang University, Control Science and Engineering, 2006, Ph.D.

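【Abstract (translated from the Chinese original)】 With the rapid development of computer and storage technology, people have accumulated large volumes of historical data and urgently need to turn these data into knowledge. Cluster analysis, based on the simple idea that "birds of a feather flock together", partitions a set of physical or abstract objects into classes of similar objects; it has been studied extensively in data mining and applied successfully in many fields. In recent years, databases have kept growing in size and have become ever more widely distributed, while most existing clustering methods need to load all data into memory at once and consume a great deal of computing time, so they cannot meet the needs of knowledge extraction from massive, distributed data. Research on clustering methods for distributed environments is therefore a challenging frontier topic in cluster analysis today. This dissertation is devoted to that topic. Taking large-scale, distributedly stored data sets as its object of study, it combines techniques such as machine learning, artificial intelligence, and hierarchical optimization with distributed computing to explore new clustering techniques for distributed environments, providing a theoretical and technical basis for the efficient and sensible use of distributed, large-scale data. The main research contents and contributions are as follows:

1. Cluster analysis in distributed environments is analyzed and summarized fairly comprehensively and systematically in terms of its background, algorithm research, and application research.

2. To address ease of implementation in distributed clustering, and exploiting the fact that weak clustering algorithms are easy to implement, a Boosting-based distributed clustering algorithm, DBCA, is proposed. In each iteration, DBCA assembles the local models that the sub-databases build with a weak clustering algorithm into a global model; each sub-database partitions its data according to the global model and then sets the sampling probabilities for the next iteration according to the quality of that partition. The partitions of the preceding iterations are integrated by weighted voting, and the partition obtained from the last integration is taken as the final clustering result. Analysis shows that DBCA can be computed in parallel, scales well, and has a low communication cost; it helps not only scientists studying cluster analysis in depth but also ordinary engineers who use distributed clustering to solve real-world problems. Experiments show that DBCA obtains results similar to those obtained on a centralized database.

3. To address integration scalability in distributed clustering, the OIKI DDM model is extended, using a hierarchical design and taking into account characteristics such as the network distribution of the databases and the network bandwidth, into a mobile-agent-based hierarchical optimization integrated mining model, the HOIKI DDM model. A corresponding distributed clustering algorithm, HOIKIDC, is proposed. Experiments and analysis show that HOIKIDC scales better in distributed environments, is more flexible to implement, is more efficient, and effectively reduces communication cost, which makes it particularly suitable for clustering large-scale, heterogeneous, distributed data.

4. The integration validity of distributed clustering is studied. The concepts of integration validity and inconsistency of local results are first introduced, and the causes of the inconsistency are analyzed. A coordination algorithm is proposed to reduce the inconsistency, together with a corresponding distributed clustering algorithm, CDCA, which improves global clustering quality through information exchange and coordination among local sites. Experimental results show that CDCA makes the integration of results more effective.

5. Because time series in application domains are large in scale and stored in a distributed manner, a distributed fuzzy short time-series clustering algorithm, DFSTS, is proposed to analyze the shape similarity of these series and thereby better reveal their structure; the convergence of the algorithm is also analyzed. Simulation results show that DFSTS scales well, achieves the same clustering quality as on a centralized data set, and is computationally more efficient.

6. Against the background of a National 863 Program project, and with quality prediction and operation optimization of metallurgical production processes as the object of study, the application of distributed clustering in the metallurgical industry is investigated. A prototype distributed data mining system is first designed. For large-scale, distributedly stored data from the continuous annealing process, the distributed clustering algorithms proposed in this dissertation are applied to two mining tasks: 1) modeling and prediction of strip breakage; 2) outlier detection. Experimental results show that the approach is effective for analyzing continuous annealing process data and has broad application prospects for the analysis of large-scale production process data in the metallurgical industry.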

【Abstract】 With the rapid development of computer and storage technologies, there is growing interest in clustering theory and its applications in data mining, owing to the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. Cluster analysis, based on the simple idea that things of one kind come together, divides data into groups of similar objects and is widely applied in many fields. In recent years, databases have grown persistently and have become distributed, physically or geographically, across more and more locations connected by computer networks. However, most existing clustering algorithms have difficulty extracting knowledge from huge amounts of distributed data, because they need to load all data into main memory and incur a large computational overhead. New knowledge discovery methods therefore need to be developed for large-scale, distributed environments, and distributed clustering is one of them. Distributed clustering is the application of cluster analysis in distributed computing environments and a challenging topic in data mining. This dissertation explores new clustering techniques for distributed environments so as to provide theoretical and technical foundations for using large-scale, distributed data efficiently and appropriately. Several novel distributed clustering methods are proposed for clustering large-scale, distributed data sets, drawing on techniques such as machine learning, artificial intelligence, and distributed computing. The main work and results of the dissertation are as follows:

1. Clustering methods in centralized and distributed environments are surveyed in three respects: background, algorithms, and applications.

2. For ease of implementation, a novel distributed clustering algorithm (DBCA) is proposed that combines simple, easily implemented algorithms such as K-means with boosting techniques. At each iteration of DBCA, a set of clustering models is first generated from the sub-databases at the sites using a weak clustering algorithm. These models are combined into a global model, which is transmitted to the sites and used to partition the sub-database at each site. Then, according to the quality of the partitions, the sampling probabilities for the next iteration are updated at the sites. Finally, the partitions are integrated into an aggregated partition by weighted voting. The final clustering result is the aggregated partition at the last iteration. DBCA can be computed in parallel, is scalable, and has a low communication overhead. It helps not only scientists investigating cluster analysis but also ordinary engineers solving real-world problems with distributed clustering techniques. Experimental results show that DBCA is effective and achieves results comparable to those of algorithms in which boosting techniques are applied to centralized databases.

3. Integration scalability across a large number of sites holding large-scale, distributed data sets is studied. First, a new hierarchical optimization mining model based on mobile agents (the HOIKI DDM model) is proposed. Based on a hierarchical design and a divide-and-conquer strategy, the proposed model extends the OIKI DDM model according to network topology and bandwidth, and integrates multiple local results among the sites using mobile agents and incremental optimization. Then, a novel distributed clustering algorithm (HOIKIDC) built on the proposed model is presented for clustering large-scale, distributed, heterogeneous data sets. The experimental results demonstrate that HOIKIDC is scalable, flexible, and efficient, and is particularly suited to large-scale distributed environments. In addition, by exploiting network characteristics, HOIKIDC can dramatically reduce communication cost.

4. The validity of knowledge integration in distributed clustering is studied. First, integration validity and inconsistency among local results from different sites are defined. Then, the inconsistency among local results is analyzed and a coordination algorithm is proposed to reduce it. Furthermore, based on the coordination algorithm, a novel distributed clustering algorithm (CDCA), in which information is exchanged among the sites, is presented to improve clustering quality and integration validity. Experimental results show that CDCA outperforms algorithms without coordination in terms of integration validity.

5. For large-scale, distributed short time-series data sets in fields such as industry and DNA databases, a distributed clustering algorithm (DFSTS) is proposed to cluster short time series in distributed environments, analyzing the shape similarity hidden in the data so as to reveal its structure. Based on fuzzy clustering, the proposed algorithm runs at multiple sites without transferring all data to a single data set. The simulation results demonstrate that the proposed algorithm is effective, efficient, and scalable, and provides the same clustering quality as clustering a single centralized data set.

6. The distributed algorithms proposed in the dissertation are applied to a steel plant in a real-world project (a National "863" Project) to solve real industrial problems. First, a prototype distributed data mining system is designed for applying distributed algorithms in the metallurgical industry. Then, for large-scale, distributed data from continuous annealing processes, two distributed mining tasks that employ the distributed clustering algorithms are performed: 1) modeling and prediction of strip rupture after data preprocessing; 2) detection of outliers. The results indicate that the distributed approaches are effective and need to transfer only models and knowledge rather than the original data. According to these results, the distributed clustering approaches proposed in this dissertation have promising application prospects for analyzing large-scale, distributed data from metallurgical process industries.
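
The DBCA iteration described in item 2 (local models, a merged global model, partitioning at each site, and updated sampling probabilities) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the dissertation's implementation: the abstract does not say how local models are merged or how sampling probabilities are updated, so the sketch assumes that the coordinator clusters the pooled local centroids and that each point's new sampling probability is proportional to its squared distance from its assigned global centroid. The weighted-voting aggregation across iterations is omitted, and all function names (`weak_kmeans`, `one_round`, and so on) are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: dist2(point, centroids[j]))

def weak_kmeans(points, k, rng, iters=10):
    """A few Lloyd iterations; stands in for the 'weak' clustering algorithm."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids

def one_round(sites, weights, k, rng, sample_size):
    """One iteration: local models -> global model -> partition -> re-weight."""
    # 1. Each site fits a local model on a weighted sample of its own data.
    local_models = [
        weak_kmeans(rng.choices(data, weights=w, k=sample_size), k, rng)
        for data, w in zip(sites, weights)
    ]
    # 2. Only centroids travel to the coordinator, which merges them
    #    (assumption: by clustering the pooled local centroids).
    pooled = [c for model in local_models for c in model]
    global_model = weak_kmeans(pooled, k, rng)
    # 3. Each site partitions its data with the global model and raises the
    #    sampling probability of poorly fitted points (assumed update rule).
    labels, new_weights = [], []
    for data in sites:
        site_labels = [nearest(p, global_model) for p in data]
        errors = [dist2(p, global_model[j]) + 1e-9
                  for p, j in zip(data, site_labels)]
        total = sum(errors)
        labels.append(site_labels)
        new_weights.append([e / total for e in errors])
    return global_model, labels, new_weights

# Two sites, each holding one well-separated group of 1-D points.
rng = random.Random(0)
sites = [[(0.0,), (0.1,), (0.2,), (0.3,)],
         [(10.0,), (10.1,), (10.2,), (10.3,)]]
weights = [[1 / len(s)] * len(s) for s in sites]
for _ in range(3):
    model, labels, weights = one_round(sites, weights, k=2, rng=rng, sample_size=4)
```

In this toy setup neither site can discover both groups on its own, yet only centroids are exchanged, which mirrors the abstract's point that models rather than original data are transferred.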

  • 【Online Publication Submitter】 Zhejiang University
  • 【Online Publication Issue】 2007, Issue 02
  • 【Classification Code】 TP311.132
  • 【Times Cited】 7
  • 【Downloads】 1152