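Research on Web Page Ranking Algorithms Based on Link Similarity

【Author】 Fang Xu

【Advisor】 Wang Shumei;

【Author Information】 Nanjing University of Science and Technology, Computer Application Technology, 2008, Master's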

【Abstract】 This thesis examines web page ranking algorithms, with an emphasis on link analysis. It first introduces the basic principles of page ranking and compares several commonly used ranking techniques. It then analyzes two representative link analysis algorithms, PageRank and HITS, and discusses their respective strengths and weaknesses. The main shortcoming of PageRank is that it divides a page's PageRank value equally among all of its out-links; because it largely ignores semantic information, it is easily influenced by irrelevant links and suffers from topic drift. To address this, the thesis designs a simple computational model that improves PageRank: building on PageRank's equal distribution, the model takes link similarity into account and uses a naive Bayes model to evaluate it. Because the similarity between an out-link and its target page is considered, pages of little value (such as advertisement pages) receive a smaller share of PageRank, while genuinely valuable pages receive a larger share. Finally, the thesis applies the model in a simulated search engine that includes almost all the functions of a real search engine, and invites a number of users to test it in a real Internet environment to validate the algorithm. The small-scale user test results indicate that incorporating link similarity improves user satisfaction with search results.
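The abstract does not give the thesis's exact formula, so the following is only a minimal sketch of the general idea, under stated assumptions: each page's PageRank is split among its out-links in proportion to a link similarity score instead of equally. The function name `weighted_pagerank`, the `similarity` dictionary, the damping factor of 0.85, and the handling of pages without out-links are all illustrative assumptions. In the thesis the similarity scores are estimated with a naive Bayes model; here they are simply supplied as inputs.

```python
def weighted_pagerank(links, similarity, damping=0.85, iterations=50):
    """Sketch of PageRank with similarity-weighted out-link shares.

    links:      dict mapping a page to the list of pages it links to.
    similarity: dict mapping (source, target) to a non-negative score
                (assumed input; missing pairs default to 1.0, which
                reduces to the usual equal split).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the teleportation share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            weights = [similarity.get((source, t), 1.0) for t in targets]
            total = sum(weights)
            if not targets or total == 0:
                # No usable out-links: spread this page's rank evenly
                # (an assumption of this sketch, not from the thesis).
                for p in pages:
                    new_rank[p] += damping * rank[source] / n
                continue
            # Split the rank in proportion to link similarity.
            for t, w in zip(targets, weights):
                new_rank[t] += damping * rank[source] * w / total
        rank = new_rank
    return rank


# Hypothetical three-page graph: A links to a relevant page B
# and to an advertisement page Ad.
links = {"A": ["B", "Ad"], "B": ["A"], "Ad": ["A"]}
similarity = {("A", "B"): 0.9, ("A", "Ad"): 0.1}

uniform = weighted_pagerank(links, {})
weighted = weighted_pagerank(links, similarity)
print("equal split:   ", uniform)
print("weighted split:", weighted)
```

With an empty `similarity` dictionary the sketch behaves like equal-split PageRank, so B and Ad receive the same score; with the example scores, the low-similarity advertisement link passes on a smaller share than the relevant link.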

  • 【Classification Code】TP391.3
  • 【Times Cited】3
  • 【Downloads】320