

Research on Clustering Algorithm for Web Document by Incorporating Distribution Information

【Author】 孙春红

【Advisor】 杨明

【Author information】 Nanjing Normal University, Computer Application Technology, 2008, Master's degree

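【Abstract (translated from the Chinese)】 With the rapid development of the Internet, Web information resources now cover every aspect of social life, and the problem of online information overload is increasingly prominent, which has driven the rapid development of Web mining techniques. From the perspective of Web document clustering, this thesis studies three topics: the representation of document distribution information and the corresponding similarity measure, multi-view clustering, and the application of kernel theory to multi-view learning. The main work is as follows: 1. A document similarity measure that embeds distribution information is proposed. Most existing Web mining techniques are based on the traditional VSM (Vector Space Model); although they achieve a certain effect, they ignore other useful information in Web documents. To address this, the thesis introduces the distribution information of the words in a document and proposes a new similarity measure. Experimental results show that the new similarity measure improves clustering performance. 2. A multi-view learning algorithm is proposed. Building on the traditional multi-view Kmeans algorithm, the method uses both classical and new similarity measures and tries different learning algorithms on different views, which better reflects the distribution characteristics of the documents in the data set. Experimental results show that the proposed multi-view learning algorithm achieves good results. 3. A kernel-based multi-view clustering algorithm is proposed. Kernelization works mainly by using different kernel functions to induce different distances in the original space. The thesis uses polynomial and Gaussian kernels in extensive experiments, and the results show that the performance of the kernelized multi-view clustering algorithm improves markedly.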

【Abstract】 With the rapid development of the Internet, the information resources on the Web have come to cover every field of society, and the problem of information overload grows more serious by the day, which drives the development of Web data mining techniques. In this thesis, from the viewpoint of Web document clustering, we study the representation of a document's distribution information and the corresponding similarity measure, multi-view clustering, and kernel-based multi-view learning. The main contributions are as follows: 1. We propose a similarity measure that incorporates distribution information. Most existing Web data mining techniques are based on the VSM (Vector Space Model), which is effective to a degree but ignores other useful information contained in a Web document. We introduce a new similarity measure that uses the distribution information of the words in a document and extends the traditional similarity measure. Experiments show that the new similarity measure yields better clustering performance than the traditional one. 2. We propose a new multi-view algorithm. In this method, different algorithms are applied to different views, which expresses the distributional features of the documents in the data set more clearly. Experimental results show that accuracy is improved. 3. We propose a kernel-based co-training clustering algorithm. Different kernel functions induce different distances between the original samples in the original space. Extensive experiments with the polynomial kernel and the Gaussian kernel show that the kernel methods clearly improve the performance of the multi-view clustering algorithm.
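The statement that a kernel function induces a distance between samples can be illustrated with the standard identity d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y). The Python sketch below is only an illustration of that identity, not the thesis's implementation: the function names, the parameters `degree`, `coef0` and `sigma`, and the two toy term-frequency vectors are all assumptions introduced here.

```python
import math

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree; parameter names are illustrative."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_distance(x, y, kernel):
    """Distance induced by a kernel: sqrt(k(x,x) - 2*k(x,y) + k(y,y))."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))

# Hypothetical term-frequency vectors for two documents (toy data).
doc_a = [1.0, 0.0, 2.0]
doc_b = [0.0, 1.0, 2.0]

d_poly = kernel_distance(doc_a, doc_b, polynomial_kernel)
d_gauss = kernel_distance(doc_a, doc_b, gaussian_kernel)
```

With `degree=1` and `coef0=0.0` the polynomial kernel reduces to the plain dot product, and the induced distance is the ordinary Euclidean distance. A clustering routine such as K-means could, under these assumptions, use `kernel_distance` wherever it would otherwise compare a document with another sample.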

  • 【Classification code】TP301.6
  • 【Download count】50

