基于词共现的文本主题挖掘模型和算法研究

Research on Terms Co-occurrence Based Models and Algorithms for Text Mining

【Author】 Chang Peng (常鹏)

【Advisor】 Li Minqiang (李敏强)

【Author information】 Tianjin University, Management Science and Engineering, 2010, PhD

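【Chinese Abstract】 With the development of information technology and the accelerating informatization of society, digital information is growing explosively and has far exceeded the human capacity to understand and summarize it. Using computers to automatically discover valuable knowledge and information in large volumes of text is an effective way to address this problem. Grounded in data mining theory, this thesis focuses on models and algorithms for text topic mining. The main research content is as follows. First, the thesis studies text representation models. By analyzing the phenomenon of term co-occurrence, it proves theoretically that term co-occurrence is correlated with topics, and on that basis proposes a document representation model built on co-occurring term combinations, the Co-occurrence Term Vector Space Model (CTVSM). Association rule mining is used to extract the set of co-occurring term combinations from a text collection, and this set is then used to define the CTVSM-based text representation vector and a measure of text similarity. Second, the thesis studies text clustering on the basis of CTVSM and proposes a CTVSM-based hierarchical document clustering method. Documents and document clusters are both represented as vectors of co-occurring term combinations, and the text similarity measure is used to design a similarity measure between document clusters. To identify the optimal partition level quickly during hierarchical clustering, the center of a document cluster is defined and a criterion based on clustering entropy is proposed for selecting the optimal partition level. Experiments show that CTVSM-based document clustering achieves good results. Third, the thesis studies term clustering in the text space. From the set of co-occurring term combinations extracted from the text collection, a term co-occurrence graph is defined over the collection: terms are mapped to vertices, and the degree of co-occurrence between two terms is mapped to the edge connecting them, so that term clustering becomes the problem of partitioning the graph's vertices into clusters. A term clustering method based on graph density is proposed. During clustering, a term joins a term cluster only if adding it significantly increases the graph density of that cluster, and the process continues until every term has been assigned to a cluster. Experimental results show that, compared with conventional methods, the proposed method improves significantly in both algorithmic complexity (measured by experiment running time) and clustering quality. Finally, the thesis studies how the topics mined from a text collection can be applied to information recommendation and information retrieval. Taking topic word extraction as an example, it uses the topic information in the text space to improve extraction quality. The topic of a document is predicted to determine the topic domain the document belongs to, which in turn determines the range of domain terms considered for topic word extraction. The weights of the terms in the document are adjusted accordingly, so that topic-domain vocabulary receives higher weight and the extracted topic words are topically accurate. Experiments show that the algorithm improves the quality of topic word extraction, with a particularly marked improvement on short texts, where term-frequency weights discriminate poorly.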
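The CTVSM idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's formal definitions, so everything here is an assumption made for illustration: co-occurring term combinations are approximated by term pairs that appear together in at least `min_support` documents (a simple frequent-itemset count standing in for full association rule mining), each document becomes a binary vector over those pairs, and similarity is plain cosine similarity. The function names `mine_cooccurrence_sets`, `ctvsm_vector`, and `cosine` are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def mine_cooccurrence_sets(docs, min_support=2, size=2):
    """Return term combinations that co-occur in at least min_support documents.

    Assumption: a plain support count over fixed-size combinations,
    not the thesis's actual association rule mining procedure.
    """
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))  # each combination counted once per document
        counts.update(combinations(terms, size))
    return sorted(combo for combo, n in counts.items() if n >= min_support)


def ctvsm_vector(doc, combos):
    """Binary vector: 1.0 where the document contains every term of a combination."""
    terms = set(doc)
    return [1.0 if terms.issuperset(combo) else 0.0 for combo in combos]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented for illustration), already tokenized.
docs = [
    ["text", "mining", "topic", "model"],
    ["text", "mining", "cluster"],
    ["graph", "density", "cluster"],
    ["graph", "density", "word"],
]
combos = mine_cooccurrence_sets(docs)
vectors = [ctvsm_vector(doc, combos) for doc in docs]
```

On this toy corpus only the pairs ("density", "graph") and ("mining", "text") reach the support threshold, so the first two documents share a vector and the last two share a different one. A weighted vector (for example, by the support of each combination) would be a natural variation, but the abstract does not say which weighting the thesis uses.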

【Abstract】 The volume of information has grown phenomenally over the past decades, and making sense of it manually has become hopeless. Obtaining information automatically from text has therefore become a key problem in information research. This thesis builds on statistical machine learning methods that exploit term co-occurrence, focusing on text mining models and algorithms. The main contents are as follows. First, a novel document model built from co-occurring terms is presented, named the co-occurrence term vector space model (CTVSM). An association rule mining algorithm is employed to extract the co-occurring terms in the document space. The document model is then defined in terms of these co-occurring terms, and a measure of similarity between two documents is defined on top of it. Experimental results show that, compared with Euclidean distance in the standard VSM, less similar documents lie farther apart under CTVSM and more similar documents lie closer together. Second, on the basis of CTVSM, a novel document clustering algorithm is proposed. In this algorithm both documents and clusters are represented by CTVSM, and the similarity measure between clusters is derived from the measure between documents. To decide the optimal number of clusters, clustering gain is put forward as a measure of clustering optimality. Experimental results show that the algorithm performs well, producing intuitively reasonable clustering configurations. Third, the thesis uses CTVSM to cluster large-scale terms in the document space. A co-occurrence term graph is defined, in which words are mapped to vertices and co-occurrence relationships between words are mapped to edges. A word clustering algorithm is proposed based on this graph: it joins a word to a cluster on the basis of the change in the cluster's density. Experiments show that this algorithm is better than conventional word clustering methods in both clustering quality and efficiency. Finally, an application of the topic map extracted from the document space is proposed. A subject word extraction algorithm is improved by using the topic map. The topics of a document are identified by estimating a statistical topic model, which in turn identifies the document's topic term fields. Term weights are then adjusted according to the topic term fields. Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques.
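The density-driven word clustering step can be sketched as follows. This is a hedged illustration, not the thesis's algorithm: the abstract does not define graph density or the threshold for a "significant" change, so the sketch assumes density is the total co-occurrence weight inside a cluster divided by the number of words in it, seeds each cluster with the alphabetically first unassigned word, and admits a word only if it raises density by at least a hypothetical `min_gain`. The names `density` and `cluster_words` and the edge weights are invented for the example.

```python
def density(cluster, weights):
    """Assumed density: total internal edge weight divided by cluster size."""
    members = sorted(cluster)
    total = sum(
        weights.get((a, b), 0.0)
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )
    return total / len(members)


def cluster_words(words, weights, min_gain=0.1):
    """Greedily grow clusters, adding the word with the largest density gain.

    weights maps alphabetically sorted word pairs to co-occurrence strength.
    A cluster stops growing when no remaining word raises its density
    by at least min_gain (an assumed stand-in for "significantly").
    """
    unassigned = sorted(words)
    clusters = []
    while unassigned:
        cluster = {unassigned.pop(0)}  # assumption: alphabetical seeding
        while True:
            base = density(cluster, weights)
            best_gain, best_word = max(
                ((density(cluster | {w}, weights) - base, w) for w in unassigned),
                default=(0.0, None),
            )
            if best_word is None or best_gain < min_gain:
                break
            cluster.add(best_word)
            unassigned.remove(best_word)
        clusters.append(cluster)
    return clusters


# Toy co-occurrence graph (weights invented for illustration).
weights = {
    ("mining", "text"): 3.0,
    ("mining", "topic"): 2.0,
    ("text", "topic"): 2.0,
    ("density", "graph"): 3.0,
    ("graph", "topic"): 0.1,  # weak link between the two groups
}
words = ["text", "mining", "topic", "graph", "density"]
clusters = cluster_words(words, weights)
```

With these weights the weak ("graph", "topic") edge is not enough to merge the two groups, because adding "topic" to {"density", "graph"} would lower the assumed density rather than raise it. Every word still ends up in some cluster, matching the abstract's statement that clustering continues until all words are assigned.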

  • 【Online publication contributor】 Tianjin University
  • 【Online publication year and issue】 2010, Issue 11