Research on the Application of the Projection Pursuit Model in Text Clustering Algorithms

【Author】 Lu Peng (陆鹏)

【Advisor】 Gao Maoting (高茂庭)

【Author information】 Shanghai Maritime University, Computer Application Technology, 2007, Master's thesis

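【Summary】 Fast and efficient text clustering algorithms help to discover and mine the large body of latent knowledge contained in large volumes of unstructured text. Text data represented as feature vectors under the vector space model are typically high-dimensional. Using the projection pursuit model to reduce the dimensionality of text features, by projecting the high-dimensional features into a two- or three-dimensional space that can be visualized, not only reveals the structural characteristics of the texts but also greatly reduces the computational complexity of text clustering and improves its efficiency and precision. The key step in this reduction is the search for the optimal projection direction. This thesis proposes two improved genetic-algorithm-based projection pursuit text clustering algorithms. They use a genetic algorithm to determine the optimal projection direction and project the high-dimensional text feature vectors into two- and three-dimensional spaces, so that the structure of the text collection stands out in a visible space. The distribution of the collection can then be inspected directly, and the number of text clusters can be read off by inspection. Experiments show that this approach yields good clustering results.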

【Abstract】 Efficient, high-quality text clustering algorithms help to discover and mine the large amount of valuable latent knowledge in large volumes of unstructured text. The vector space model is usually used to represent text features, and the resulting feature vectors are high-dimensional. The projection pursuit model can be applied to text feature dimension reduction, projecting high-dimensional feature vectors into a two- or three-dimensional space for visualization. This not only reveals the structural features of the text but also reduces computational complexity and improves the efficiency and precision of text clustering algorithms. The key to this process is finding the globally optimal projection directions. This thesis proposes two improved genetic-algorithm-based projection pursuit text clustering algorithms, which use an accelerating immune genetic algorithm to determine the optimal projection direction and project the high-dimensional text feature vectors into two- or three-dimensional space. The structural features of the text then emerge in a space that can be visualized, and the number of text clusters can be determined intuitively. Experiments demonstrate that the algorithm achieves good clustering results.
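
The search for a projection direction described above can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not state the projection index or the details of the accelerating immune genetic algorithm, so the sketch assumes a Friedman-Tukey style index (spread of the projected values multiplied by their local density) and a plain genetic algorithm with truncation selection, blend crossover, and Gaussian mutation. It searches for a single direction only. The names `project`, `projection_index`, `search_direction`, and all parameter values are hypothetical.

```python
import math
import random


def project(data, direction):
    """Project each row of `data` onto `direction` (a list of weights)."""
    return [sum(x * a for x, a in zip(row, direction)) for row in data]


def projection_index(data, direction, radius_factor=0.1):
    """Assumed index: spread of the projections times their local density."""
    z = project(data, direction)
    n = len(z)
    mean = sum(z) / n
    spread = math.sqrt(sum((v - mean) ** 2 for v in z) / (n - 1))
    radius = radius_factor * spread  # assumed window radius
    density = 0.0
    for i in range(n):
        for j in range(n):
            gap = radius - abs(z[i] - z[j])
            if gap > 0:  # only pairs closer than the window contribute
                density += gap
    return spread * density


def normalize(vec):
    """Scale a vector to unit length; fall back to the first axis if it is zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        return [1.0] + [0.0] * (len(vec) - 1)
    return [v / norm for v in vec]


def search_direction(data, pop_size=30, generations=40,
                     mutation_scale=0.2, seed=0):
    """Plain genetic search for a unit direction with a high projection index."""
    rng = random.Random(seed)
    dim = len(data[0])
    population = [normalize([rng.gauss(0, 1) for _ in range(dim)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population,
                        key=lambda a: projection_index(data, a),
                        reverse=True)
        next_gen = scored[:2]              # elitism: keep the two best
        parents = scored[:pop_size // 2]   # truncation selection
        while len(next_gen) < pop_size:
            p, q = rng.sample(parents, 2)
            w = rng.random()
            # Blend crossover followed by Gaussian mutation on every gene.
            child = [w * x + (1 - w) * y + rng.gauss(0, mutation_scale)
                     for x, y in zip(p, q)]
            next_gen.append(normalize(child))
        population = next_gen
    return max(population, key=lambda a: projection_index(data, a))
```

Calling `search_direction(data)` on a list of equal-length feature vectors returns a unit vector, and `project(data, direction)` then gives the one-dimensional coordinates. A two- or three-dimensional view would need additional directions, which this sketch does not attempt.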

  • 【Classification code】 TP18; TP391.1
  • 【Times cited】 4
  • 【Downloads】 240