基于Internet的中文文本过滤系统的研究与实践

Research and Practice of Chinese Text Filtering System Based on Internet

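【Author】 Sun Yanguo (孙岩国)

【Advisor】 Yu Dongmei (余冬梅)

【Author Information】 Lanzhou University of Technology (兰州理工大学), Communication and Information Systems, 2004, Master's thesis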

【Abstract】 This thesis briefly introduces the background of text filtering and systematically examines its close ties to text retrieval, machine learning, and related fields. Taking a typical logical model of Chinese text filtering as an example, it studies in depth the theory and techniques involved in building a Chinese text filtering system, including concept expansion, text structure analysis and feature extraction, latent semantic indexing, and adaptive learning. Drawing on the strengths of other text filtering systems and weighing recall, precision, runtime efficiency, and feasibility, it proposes an improved architecture for a Chinese text filtering system that adds a class-matching module and a user-interest feedback module. It also describes a hybrid Chinese text filtering method in detail and gives the mathematical models and associated algorithms for the system's main modules. The functional modules of the system were implemented on a trial basis in Java. This implementation builds the reverse term frequency database automatically, improves the keyword weight calculation, adds a weight calculation for topic sentences, tunes the coefficients of the mathematical models, and adds synonym expansion and query modification, functions that the author reports are absent from other traditional filtering engines; it achieved a certain filtering effect. Finally, because the filtering precision of the system is not yet satisfactory, the thesis summarizes the directions for further research on this topic and presents the author's own views.

  • 【Classification Number】TP391.1
  • 【Citations】9
  • 【Downloads】193