Information Theoretic Feature Weighting and Topic-Driven Co-Clustering for Text Dataset (基于信息论的特征加权和主题驱动协同聚类算法研究)

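【Author】 Wu Biao (吴彪)

【Advisor】 Ye Yunming (叶允明)

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2008, Master's thesis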

【Abstract】 Text data are usually represented as a two-dimensional document-word co-occurrence matrix. Most traditional clustering algorithms are one-way: they cluster either the samples or the features and ignore the natural relationship between the two. For high-dimensional, sparse, and noisy data in particular, one-way clustering rarely reaches the accuracy that real applications require. Information-theoretic co-clustering captures the natural relationship between rows and columns through mutual information and clusters rows and columns simultaneously. At every stage the row cluster prototypes incorporate column clustering information and vice versa, so the two clusterings assist and constrain each other, and the method clusters high-dimensional, sparse data effectively. It nevertheless has shortcomings: it ignores the importance of individual features, it is unsupervised so the resulting clusters are hard to interpret, and its accuracy leaves room for improvement, for example when prior knowledge is available. Building on information-theoretic co-clustering and on earlier research, this thesis makes two exploratory improvements to the original unsupervised method: it introduces topic knowledge and it weights the features. The result is two algorithms: an unsupervised method, Feature Weighting Information Theoretic Co-clustering (FWITCC), and a semi-supervised method, Topic-Driven Information Theoretic Co-clustering (TDITCC). FWITCC computes feature weights from mutual information to emphasize informative features, which improves both clustering accuracy and running time. For TDITCC, we first build a topic corpus (a topic thesaurus repository) from Wikipedia and the Open Directory Project that defines the description and hierarchy of each topic. We then use co-clustering as a bridge to propagate this topic knowledge into the text clustering process, with the goal of grouping documents on the same topic together. Experiments show that both proposed algorithms produce better clustering results in terms of accuracy.
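The abstract does not state the weighting formula used by FWITCC, so the following is only an illustrative sketch of one way mutual information can score features. It assumes a small document-word count matrix and weights each word by its contribution to the mutual information I(D;W) between documents and words. The function name `word_mi_weights` and the toy matrix are hypothetical and are not taken from the thesis.

```python
import numpy as np

def word_mi_weights(counts):
    """Per-word contribution to I(D;W) for a document-word count matrix.

    Illustrative assumption: the weight of word w is
    sum_d p(d, w) * log(p(d, w) / (p(d) * p(w))), in nats.
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()              # p(d, w)
    p_doc = joint.sum(axis=1, keepdims=True)   # p(d), one row per document
    p_word = joint.sum(axis=0, keepdims=True)  # p(w), one column per word
    indep = p_doc * p_word                     # p(d) * p(w)
    # Treat 0 * log(0) as 0: the ratio is only computed on non-zero cells,
    # and stays at 1 (log = 0) everywhere else.
    ratio = np.ones_like(joint)
    np.divide(joint, indep, out=ratio, where=joint > 0)
    contrib = joint * np.log(ratio)
    return contrib.sum(axis=0)                 # one weight per word

# Toy data: rows are documents, columns are words.
counts = np.array([[4, 0, 2],
                   [0, 4, 2]])
weights = word_mi_weights(counts)
print(weights)
```

In this toy matrix the third word occurs equally often in both documents, so it carries no information about which document it came from and receives zero weight, while the first two words each occur in only one document and receive equal positive weights. The weights sum to I(D;W), the quantity that information-theoretic co-clustering tries to preserve.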

  • 【Classification Code】TP18
  • 【Downloads】113