A Rough-Set-Model-Based Clustering Method and Its Application in a Document Filtering System

【Author】 Gu Bo (谷波)

【Advisor】 Zhang Yongkui (张永奎)

【Author information】 Shanxi University, Computer Application Technology, 2004, Master's thesis

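【Abstract (translated from the Chinese)】 Information filtering is a personalized, proactive information service mechanism and a useful complement to traditional information retrieval services. Information filtering covers many kinds of content, such as audio, images, and text; in this thesis it refers mainly to the filtering of documents. Clustering groups the objects of a problem space by similarity, placing similar objects in the same class so that the average distance between objects within a class is as small as possible and the distance between classes is as large as possible. Clustering is in essence a form of unsupervised learning. Applying clustering to information filtering can improve the system's filtering efficiency to some extent and also has a positive effect on the precision and recall of filtering. Applying clustering to text information filtering falls, in essence, within the scope of text mining.

The uncertainty and vagueness of natural language make natural language processing difficult for computers. Because rough sets can both describe imprecise concepts and provide a measure of that imprecision, it is reasonable to apply rough set theory to the description of natural language.

Against the background of rough set theory, this thesis studies in depth a rough set representation model for text and the application of clustering based on that model in an information filtering system. The work and contributions are summarized as follows:

1. A new text representation model is proposed. The model is based on the rough set idea of partitioning knowledge into equivalence classes and attempts to preserve the conceptual information of a text. A rough similarity is defined under this model, and a method for computing text similarity based on the model is proposed.

2. Text clustering is applied to information filtering. The documents are clustered; at retrieval time, the user's query terms are first compared with the centroid of each cluster to find the nearest cluster, and only the documents in that cluster are then matched against the query terms. This narrows the search scope, improves retrieval efficiency, and to some extent overcomes deviations in the retrieval results.

3. Text clustering is applied to information filtering. Borrowing the idea of collaborative filtering, users are no longer treated as independent individuals but as groups linked by similar interests. The user models are clustered so that, when documents are delivered, the computation is performed against user interest classes rather than individual user models, and documents are likewise recommended to user interest classes, with the aim of improving filtering efficiency and accuracy.

Experimental results show that the information filtering system performs better after the rough-set-based clustering method proposed in this thesis is introduced than the original system did.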
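The abstract does not give the definitions behind the rough set representation, so the following is only a minimal sketch of the general idea under stated assumptions. The equivalence classes over terms, the function names, and the Jaccard-style `rough_similarity` are all hypothetical stand-ins, not the thesis's own model or similarity measure. The lower and upper approximations follow the standard rough set definitions: the classes fully contained in a document's term set, and the classes that intersect it.

```python
from typing import FrozenSet, List, Set

# Hypothetical equivalence classes over the vocabulary: terms in one class
# are treated as indiscernible (for example, near-synonyms). Illustrative only.
EQUIVALENCE_CLASSES: List[FrozenSet[str]] = [
    frozenset({"cluster", "clustering", "grouping"}),
    frozenset({"filter", "filtering"}),
    frozenset({"retrieval", "search"}),
    frozenset({"rough", "approximation"}),
]


def lower_approximation(terms: Set[str]) -> Set[int]:
    """Indices of classes whose terms all occur in the document."""
    return {i for i, c in enumerate(EQUIVALENCE_CLASSES) if c <= terms}


def upper_approximation(terms: Set[str]) -> Set[int]:
    """Indices of classes that share at least one term with the document."""
    return {i for i, c in enumerate(EQUIVALENCE_CLASSES) if c & terms}


def approximation_accuracy(terms: Set[str]) -> float:
    """Pawlak-style accuracy: size of lower approximation over size of upper."""
    lower = sum(len(EQUIVALENCE_CLASSES[i]) for i in lower_approximation(terms))
    upper = sum(len(EQUIVALENCE_CLASSES[i]) for i in upper_approximation(terms))
    return lower / upper if upper else 0.0


def rough_similarity(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity: Jaccard overlap of the two upper approximations."""
    ua, ub = upper_approximation(a), upper_approximation(b)
    union = ua | ub
    return len(ua & ub) / len(union) if union else 0.0


doc_a = {"clustering", "filter", "filtering"}
doc_b = {"grouping", "search"}
print(rough_similarity(doc_a, doc_b))   # class-level overlap
print(len(doc_a & doc_b))               # exact-term overlap
```

In this toy data the two documents share no literal term, yet they overlap at the class level because "clustering" and "grouping" fall in the same equivalence class. That is one way a class-based representation could retain conceptual information that exact term matching loses.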

【Abstract】 Information filtering is a personalized, proactive information service mechanism and a useful supplement to traditional information retrieval services. Clustering groups the objects of a problem space by similarity and places similar objects in the same cluster, so that the average distance between objects within a cluster is as small as possible and the distance between clusters is as large as possible. Applying clustering to information filtering improves the filtering efficiency of the system to a certain degree and has a positive effect on precision and recall.

The uncertainty and vagueness of natural language make natural language processing (NLP) difficult. Rough sets can describe vague concepts and measure the extent of that vagueness, so it is reasonable to describe natural language with rough sets. With rough set theory as background, this thesis studies in depth a rough set representation model for text and clustering based on this model. The main contributions of the thesis are as follows.

(1) It puts forward a new text representation model that originates from the equivalence partitioning of rough set theory, defines a similarity under this model, and proposes an approach to calculating text similarity under the model.

(2) It applies text clustering to information filtering. After the documents are clustered, the user's query terms are compared during retrieval with the cluster centers of the documents to find the cluster most similar to the query. Only the documents in the selected cluster are then matched against the query terms, so the retrieval range is reduced, retrieval efficiency is increased, and retrieval deviation is overcome to a certain extent.

(3) It applies text clustering to information filtering. Drawing on collaborative filtering, the thesis no longer regards users as separate individuals but as groups of people who share common interests. It clusters the user profiles, so that when documents are sent out, the unit of computation is no longer the individual user profile but the user interest cluster, which also serves as the recommendation target, in order to improve filtering efficiency and precision.

The experimental results show that the information filtering system with the rough-set-based clustering method performs better than the previous system.
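The cluster-then-search step in item (2) can be sketched as follows. This is an illustrative sketch only: it assumes documents and queries are sparse term-weight dictionaries, uses cosine similarity and mean-vector centroids, and takes the clusters as given. The abstract names none of these choices, and all function names and data here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse term -> weight mapping (assumed format)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Mean vector of a non-empty list of sparse vectors."""
    total: Vector = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def search(query: Vector,
           clusters: Dict[str, Dict[str, Vector]]) -> Tuple[str, List[str]]:
    """Pick the cluster nearest to the query, then rank only its documents."""
    # A real system would precompute centroids once, not on every query.
    centroids = {name: centroid(list(docs.values()))
                 for name, docs in clusters.items()}
    best = max(centroids, key=lambda name: cosine(query, centroids[name]))
    docs = clusters[best]
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return best, ranked


# Hypothetical pre-clustered collection.
clusters = {
    "text": {
        "d1": {"rough": 1.0, "set": 1.0, "text": 1.0},
        "d2": {"text": 1.0, "clustering": 1.0},
    },
    "image": {
        "d3": {"image": 1.0, "filter": 1.0},
        "d4": {"image": 1.0, "edge": 1.0},
    },
}
print(search({"text": 1.0, "clustering": 1.0}, clusters))
```

Only the documents of the selected cluster are scored, which is where the efficiency gain described in item (2) would come from. The same nearest-centroid comparison could plausibly be applied to clusters of user profiles as in item (3), with an incoming document scored once per interest group rather than once per user.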

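  • 【Online publication submitter】 Shanxi University
  • 【Online publication year/issue】 2004, Issue 03
  • 【Classification number】 TP391.1
  • 【Citation count】 2
  • 【Download count】 216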