节点文献
基于图模型的聚类算法研究
Study on Clustering Algorithm Based on Graph Model
【作者】 张绪青;
【作者基本信息】 浙江大学 , 计算机系统结构, 2008, 硕士
【摘要】 随着互联网技术的高速发展和网络资源信息的日益丰富,海量互联网网页同时涵盖文本、图像、视频和音频等媒体数据以及多种语言并存,呈现出跨媒体的特性。如果缺乏一套有效的检索机制,从海量的跨媒体资源中搜寻信息无疑是大海捞针,因此,研究海量信息的跨媒体检索机制至关重要。一般而言,用户需要的检索信息并不直接存在于被检索资源中,而是需要搜索引擎对潜在的检索结果作诸如摘要生成、分类、排重、聚类等智能处理,才能满足用户的检索需求。本文在广泛阅读相关文献、深入了解聚类算法的原理与应用的基础上,主要针对基于图模型的聚类算法,在算法的改进、应用上做了如下工作:1.结合基于因数图的仿射传播聚类算法和k-中值多数投票算法的优点,提出了使用松弛多根最小生成树分配算法的投票分割式仿射传播聚类算法,实验结果证实了该算法的有效性。2.提出了一种基于随机分块和投票聚类融合策略的聚类大规模数据集合的算法框架,并使用该算法框架对仿射传播聚类算法进行了扩展,使其能够处理任意形状的更大规模的数据集合,并验证了该扩展的可行性和有效性。3.对聚类分析在图像搜索领域的应用进行了探索,提出了一个基于投票分割式仿射传播聚类算法的图片搜集模型。本文的贡献和创新主要体现在算法的改进和应用上:1.提出了分割式仿射传播聚类算法,它在实际聚类个数大于本质聚类个数时能在本质聚类上产生随机的划分,从而使其满足投票策略对聚类生成器的随机性的要求。2.提出了松弛多根最小生成树分配算法,在基本保持聚类结果的随机性的基础上,减小了误分配的概率。3.将使用松弛多根最小生成树算法的分割式仿射传播聚类算法和投票策略结合在一起,并结合划分一致性索引指标讨论了如何选取合适的域值参数的问题。4.提出了一种将时间复杂度和空间复杂度较高的聚类算法扩展到大规模数据集合应用上的算法框架——随机分块再融合,并用其对仿射传播聚类算法进行了扩展,使其能处理更大规模的、任意形状分布的数据集合。5.提出了一个基于投票分割式仿射传播聚类算法的图片搜集模型,使开发人员能够基于此模型构造一个应用程序来帮助用户从现有图片搜索引擎上方便的获取相关主题的图片资料集合。
【Abstract】 With the fast development of internet technique, more and more kinds of meta-data are displayed in mass web pages such as text, pictures, video and audio, which indicates a trend of cross-media. If there is not an effective search mechanism, it is very hard for us to gain useful information from such a large scaled web based database. Therefore, it is essential for us to study methods for searching in large scale of cross-media information. Generally, the information which user searches for is not stored in the database in a straightforward form. To meet the search demand of customers, some intelligent process is needed for the search engine, such as abstract generating, classifying, duplication removing and clustering.Based on survey of related papers and further study of clustering algorithms, the following work has been done in improvement and application of clustering algorithm based on graph model.1. We present an algorithm called voting partition affinity propagation which combines voting scheme with affinity propagation clustering algorithm. It is composed of three parts: relaxed multi-root minimum spanning tree assign dicipline, partition affinity propagation clustering algorithm and voting clustering ensemble scheme. It has advantages of both affinity propagation and voting k-means. Experiments on both simulated and real data set approve its validity.2. We display a scheme that can deal with problem of clustering large scaled data set. It was based on voting clustering ensemble algorithm and random partition. And this framework is used to extend affinity propagation clustering algorithm to deal with larger scaled and arbitrary shaped data set. Experiment results validate the feasibility and effectiveness of this extension.3. We propose a model for image collection which is based on voting partition affinity propagation.The main contribution and innovation of this paper are displayed on the improvement and application of the existing algorithms: 1. Proposing partition affinity propagation clustering algorithm. The algorithm can generate random partition within the substantial cluster when output number of clusters is larger than the intrinsic one, thus to meet the demand for randomicity of voting clustering ensemble scheme.2. Presenting relaxed multi-root minimum spanning tree data point assign algorithm. The discipline can reduce the possibility of mis-assign while mostly keep the randomicity of clustering result.3. Proposing to combine affinity propagation clustering algorithm with voting clustering ensemble scheme through partition affinity propagation using relaxed multi-root minimum spanning tree assign discipline. The problem of how to choose an appropriate threshold for the voting scheme to get a good consistent clustering result is also discussed under the measurement of partition consistent index.4. Displaying a framework that can extend clustering algorithms of relative high time and space complexity so that they can deal with larger scaled data set. Firstly, it divides the data set into several sub-blocks randomly and clusters these sub-blocks. Such operation is done for certain times. Then, the above clustering results are ensembled using voting scheme. Affinity propagation clustering algorithm is extended by this framework to deal with larger scaled and arbitrary shaped data set.5. Introducing a model for image collection which is based on voting partition affinity propagation. Developers can build an application based on this proposed framework to facilitate people to get image resources of certain subject from existing image search engines more conveniently.
【Key words】 clustering; graph model; affinity propagation; clustering ensemble; voting;
- 【网络出版投稿人】 浙江大学 【网络出版年期】2008年 07期
- 【分类号】TP301.6
- 【被引频次】6
- 【下载频次】669