Study on Several Issues of Text Clustering (文本聚类分析若干问题研究)

【Author】 Gao Maoting (高茂庭)

【Advisor】 Wang Zheng'ou (王正欧)

【Author information】 Tianjin University, Management Science and Engineering, 2007, PhD

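【Abstract, translated from the Chinese】 For large-scale, high-dimensional text data, building effective and scalable text clustering algorithms is an active research topic in data mining. Addressing these problems, this dissertation studies several issues in text clustering analysis in some depth, mainly in the following respects. A new text clustering algorithm based on projection pursuit is proposed; it uses a genetic algorithm to search for the optimal projection direction and projects the text feature space onto a one-dimensional space, so that the structure of the data is displayed intuitively and text clustering analysis can be visualized. To address the high dimensionality of text feature vectors and the need of k-means and similar methods to fix the number of clusters in advance, RPCL text clustering algorithms based on LSA, CI, RP, and NMF are proposed; they first reduce the dimension of the text feature matrix with LSA or a similar method and then cluster the texts with the RPCL algorithm. These new methods not only reduce the dimension effectively but also avoid the difficulty of predetermining the number of clusters. Based on the vector space model, a new text feature selection model based on double-word association is proposed; it adds double-word association information from the text to the vector space model, so that the text feature information in the model is richer and more accurate. After dimension reduction with latent semantic analysis, the model not only lowers the dimension effectively but also further reduces noise and brings out the semantic features of the text, thereby improving the quality of text mining. Based on the document index graph feature model, a new phrase-based similarity calculation method is proposed, and a transformation function is used to adjust the document similarity values so that they are easier to distinguish, which benefits text clustering, classification, and related processing. Suffix-tree-based clustering is applied to Chinese text clustering; this method treats a text as a set of phrases and expresses the similarity between texts through suffixes. It can handle multi-topic text clustering and, unlike hard clustering algorithms such as k-means that assign each text strictly to one class, it achieves soft clustering of texts.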

【Abstract】 For massive, high-dimensional text data, building effective and scalable text clustering algorithms is an active research topic in data mining. Addressing these issues, this dissertation studies several basic problems of text clustering in depth, as follows. A new projection pursuit based text clustering algorithm is proposed. It uses a genetic algorithm to search for the optimal projection direction and projects the high-dimensional text feature vectors into a low-dimensional space, so that the structural features of the texts can be shown intuitively and the results of text clustering can be visualized. To address the high dimensionality of text feature vectors and the need to predetermine the number of clusters, several RPCL text clustering algorithms based on LSA, CI, RP, and NMF are also proposed, which reduce the dimension with LSA or a similar method and then cluster the texts with RPCL. They not only reduce the dimension effectively but also remove the need to fix the number of clusters in advance. Based on the Vector Space Model, a new text feature selection model based on double-word relations is proposed. This model adds double-word relation information from the texts to the Vector Space Model so that it contains richer and more accurate text feature information. Combined with Latent Semantic Analysis, it not only reduces the dimension effectively but also reduces noise and brings out the semantic features of the text, so it can greatly improve the quality of text mining. Based on the Document Index Graph feature expression model, a new phrase-based text similarity calculation method is proposed, in which a suitable transformation function adjusts the similarity values to make them more distinguishable, which benefits text clustering analysis and classification. Suffix Tree Clustering is applied to Chinese text clustering; it treats a text as a set of phrases and expresses the similarity of texts through a suffix tree. This can solve the multi-topic text clustering problem, overcome the problem of a predefined number of clusters, and realize soft text clustering.
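
The abstract does not give the details of the RPCL step, so the following is only a minimal sketch of generic rival penalized competitive learning in its commonly described frequency-sensitive form, not the dissertation's implementation. It assumes the document vectors have already been reduced to a few dimensions (for example by LSA). The function names `rpcl_step`, `rpcl`, and `assign`, and the learning rates `alpha_c` and `alpha_r`, are hypothetical choices made for illustration. For each input, the winning unit moves toward the input and the runner-up (the rival) is pushed slightly away, so that surplus units tend to drift away from the data and the number of units that still attract documents suggests the number of clusters.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rpcl_step(x, weights, wins, alpha_c=0.05, alpha_r=0.002):
    """Apply one RPCL update in place and return the winner's index.

    weights: list of unit vectors (at least two units).
    wins:    how often each unit has won so far (frequency sensitivity).
    alpha_c, alpha_r: assumed learning / de-learning rates, alpha_r << alpha_c.
    """
    total = sum(wins)
    # Frequent winners are handicapped by their relative winning frequency.
    scores = [(wins[j] / total) * sq_dist(x, w) for j, w in enumerate(weights)]
    order = sorted(range(len(weights)), key=scores.__getitem__)
    c, r = order[0], order[1]  # winner and rival
    # Winner learns: move toward the input.
    weights[c] = [wc + alpha_c * (xi - wc) for wc, xi in zip(weights[c], x)]
    # Rival is penalized: move away from the input.
    weights[r] = [wr - alpha_r * (xi - wr) for wr, xi in zip(weights[r], x)]
    wins[c] += 1
    return c


def rpcl(data, k, epochs=50, seed=0, alpha_c=0.05, alpha_r=0.002):
    """Run RPCL with k initial units (k >= 2) and return the unit vectors."""
    rng = random.Random(seed)
    weights = [list(p) for p in rng.sample(data, k)]  # start from data points
    wins = [1] * k
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            rpcl_step(x, weights, wins, alpha_c, alpha_r)
    return weights


def assign(data, weights):
    """Label each vector with the index of its nearest unit."""
    return [min(range(len(weights)), key=lambda j: sq_dist(x, weights[j]))
            for x in data]


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy stand-in for LSA-reduced document vectors: two groups in 2-D.
    docs = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(30)]
    docs += [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(30)]
    units = rpcl(docs, k=4)  # deliberately more units than groups
    labels = assign(docs, units)
    print("units that still attract documents:", len(set(labels)))
```

Starting with more units than expected clusters is the point of the exercise: k-means would keep all `k` centers inside the data, whereas here the penalty on the rival is what allows redundant units to be driven out. How cleanly that happens depends on the assumed rates and on the data.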

  • 【Online publication contributor】 Tianjin University
  • 【Online publication year/issue】 2009, Issue 04