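Research on a Web Page Authority Ranking and Classification Algorithm Based on Link Reputation Analysis
(Original English title: Web Authority Sort Classification Algorithm Based on the Analysis of Link Credibility)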
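【Author】 Zhao Hang (赵航)
【Advisor】 Yang Tianqi (杨天奇)
【Author information】 Jinan University (暨南大学), Computer Application Technology, 2012, Master's thesis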
【Abstract】 With the spread of the Internet, the number of web pages has grown exponentially, and users have considerable difficulty finding information through existing search engines. There are two main reasons. First, search results mix topics: they are not classified by topic, which makes it harder for users to find information of the topic type they need. Second, the quality of the returned pages is uneven (results include spam pages and spam advertising), which makes it harder for users to filter out high-quality information. To address these problems, this thesis does the following work.
First, to resolve the mixing of topics in search results, the thesis labels web pages with topic categories, so that users can restrict a search to the category they need and locate the desired information faster and more accurately.
Second, to improve the accuracy of web page text classification, it proposes a feature weighting algorithm based on feature noise weighting. By reducing the influence of noise from non-standard word usage on classification, the method improves the accuracy and robustness of web page text classification.
Third, to address the uneven quality of retrieved pages, it introduces the merchant reputation model from the market economy into the evaluation and ranking of web page authority. By mining historical link reputation evaluations, it builds an evaluation model combined with the PageRank algorithm to adjust the ranking of pages, which raises the quality of the top-ranked search results and gives page producers an incentive to focus on creating high-quality pages.
Finally, a system model is built on these ideas to demonstrate their feasibility.
【Key words】 text classification; link analysis; link reputation; PageRank; category search
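The abstract describes combining a link reputation evaluation with PageRank to adjust page ranking, but it does not give the formula. The following is only a minimal sketch of one way such a combination could look, not the thesis's model. It assumes a reputation score in [0, 1] per page and a linear blend controlled by a weight `alpha`. The names `rerank`, `reputation`, and `alpha`, and the neutral default of 0.5 for pages without a score, are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


def rerank(links, reputation, alpha=0.5):
    """Order pages by a blend of normalized PageRank and reputation.

    ASSUMPTION: the linear blend, `alpha`, and the 0.5 default for pages
    with no reputation score are illustrative choices, not the thesis's.
    """
    pr = pagerank(links)
    top = max(pr.values())
    scores = {
        p: alpha * (pr[p] / top) + (1.0 - alpha) * reputation.get(p, 0.5)
        for p in pr
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Two pages that link only to each other have equal PageRank,
    # so the (hypothetical) reputation scores decide the order.
    links = {"good": ["spam"], "spam": ["good"]}
    print(rerank(links, {"good": 0.9, "spam": 0.1}))
```

In this sketch a page with strong incoming links but a poor reputation history is pushed down the ranking, which is the effect the abstract attributes to the combined model. How reputation is actually mined from historical links is left to the thesis itself.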