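An Implementation of Automatic Classification of Chinese Web Pages

【Author】 Wang Kunlun (王崑崙)

【Advisor】 Huang Degen (黄德根)

【Author information】 Dalian University of Technology, Computer Applications, 2002, Master's thesis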

【Abstract】 Search engines are an important tool for retrieving information on the Internet, and automatic classification of Chinese web pages is an important research direction in building a Chinese search engine. Automatic classification makes it possible to store web pages in separate databases by category, which improves the recall and precision of a Chinese search engine. It also produces an automatically built classified resource that gives users a category directory. In addition, good classification benefits the subsequent relevance-ranking stage to some degree.

This thesis analyzes the structural components of a web page that contribute to classification. In view of the characteristics of Chinese web pages and the word-segmentation quality that page analysis requires, it simplifies and adjusts the existing longest/second-longest word segmentation algorithm to make it better suited to automatic classification. The IDF (Inverse Document Frequency) formula, used in information retrieval to compute the relevance weight between keywords and documents, is applied to the classification process and combined with the results of the web page analysis to obtain a weight formula with adjustable parameters. A category weight vector library is designed and built, as the formula requires, to store the results of classification training. Using the results of corpus training together with the VSM (vector space model), a practical method for automatically classifying Chinese web pages is implemented.

In closed and open tests, the method, after training on a large corpus, recognizes relevant web pages with an accuracy above 90%, a clear improvement over the earlier probability distribution algorithm, while its efficiency remains roughly comparable to that of the earlier algorithm. This indicates considerable practical value.
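
The general idea of IDF-weighted category vectors matched under a vector space model can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the weight formula or its adjustable parameters, so the sketch assumes a plain term-frequency times smoothed-IDF weight, computes IDF over categories rather than documents, uses cosine similarity for matching, and expects pages that have already been segmented into words. The names `train`, `cosine`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build one weight vector per category.

    labeled_docs: iterable of (category, tokens) pairs, where tokens is a
    list of already-segmented words (segmentation is assumed, not shown).
    Returns {category: {term: weight}}.
    """
    term_counts = defaultdict(Counter)   # term frequency per category
    for category, tokens in labeled_docs:
        term_counts[category].update(tokens)

    n_categories = len(term_counts)
    category_freq = Counter()            # number of categories containing a term
    for counts in term_counts.values():
        category_freq.update(counts.keys())

    vectors = {}
    for category, counts in term_counts.items():
        # Assumed weight: tf * log((N + 1) / df). The +1 keeps terms that
        # occur in every category at a small positive weight.
        vectors[category] = {
            term: tf * math.log((n_categories + 1) / category_freq[term])
            for term, tf in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, vectors):
    """Return the category whose weight vector is closest to the page."""
    page = {term: float(count) for term, count in Counter(tokens).items()}
    return max(vectors, key=lambda category: cosine(page, vectors[category]))
```

To try it, call `train` with a few `(category, tokens)` pairs and pass the resulting vectors to `classify` together with the tokens of an unseen page. A weighting scheme closer to the one the abstract describes would presumably also scale terms by where they occur in the page (for example, title versus body), which this sketch leaves out.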

【Key words】 Automatic categorization; Search engine; IDF; VSM model
  • 【Classification number】 TP393.092
  • 【Times cited】 2
  • 【Downloads】 240