WWW上链接分析算法的若干研究

Studies of the Hyperlink Analysis Algorithms in WWW

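【Author】 Liu Yue

【Advisor】 Li Guojie

【Author information】 Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology), Computer Architecture, 2004, Ph.D.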

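【Summary】 The emergence of the WWW has challenged traditional information retrieval technology. With no breakthrough in traditional retrieval techniques in sight, one way forward is to start from the characteristics of Web data itself: fully exploit hyperlinks, the most abundant resource on the Web, search by means of them, and build effective models of Web information retrieval that find the information we need. On this premise, this dissertation studies link analysis algorithms for Web pages in depth and examines the role of hyperlinks in Web retrieval at three levels: theory, algorithms, and applications. The main contributions are as follows.

First, in analyzing and implementing existing link analysis algorithms, we found that, depending on the data environment and the retrieval requirements, the preprocessing applied to different types of links, the iteration rule, and the termination condition of the iteration all affect query results. We propose constraints for link analysis algorithms on closed data sets and, by comparing the hyperlink distributions of closed data sets and of the real Web, extend these constraints to the real Web environment so that the effect of link analysis algorithms can be predicted more accurately. Experiments show that under these constraints link analysis algorithms effectively improve retrieval efficiency.

Second, we optimize the query-independent (a priori) link analysis algorithm and obtain the optimized algorithm Modilink(). It specifies a preprocessing method for hyperlinks, an adjusted normalization method, and a complete rule for deciding when the iteration terminates. Experiments show that it improves the overall iteration efficiency.

We also propose QHA1 (quality based hyperlink analysis algorithm), a query-dependent (a posteriori) link analysis algorithm extended with a page quality factor. It introduces the result of Modilink() into the assignment of hyperlink weights as a measure of page quality, so that a hyperlink reflects fairly objectively how strongly the pages it connects influence each other. In addition, by taking the source of a hyperlink into account when assigning its weight and combining this with the page quality factor, we propose a second optimized query-dependent algorithm, QHA2. We prove the correctness and feasibility of these optimized algorithms theoretically and verify them experimentally.

Finally, borrowing from latent semantic analysis, we introduce matrix singular value decomposition (SVD) into query-dependent link analysis and propose an SVD-based noise filtering algorithm that filters out irrelevant pages and hyperlinks. We apply it to the construction of the initial base set for query-dependent link analysis and propose the optimized algorithms QHA3 and QHA4, which effectively control topic drift and offer a good route to accurate search.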
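For orientation, the three ingredients named above for query-independent link analysis (hyperlink preprocessing, normalization, and an iteration termination rule) can be illustrated with the standard PageRank power iteration. This is a minimal sketch of the well-known baseline, not the dissertation's Modilink() algorithm, whose details are not given here. The function name `pagerank`, the damping factor 0.85, the tolerance, and the choice to drop self-links and duplicate links are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Standard PageRank by power iteration (illustrative baseline only).

    links: dict mapping a page to the list of pages it links to.
    Returns (rank dict, number of iterations performed).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)

    # Preprocessing (assumed policy): drop self-links and duplicate links.
    out = {p: sorted({t for t in links.get(p, []) if t != p}) for p in pages}

    rank = {p: 1.0 / n for p in pages}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Rank held by pages without out-links is spread uniformly,
        # which keeps the vector normalized (it always sums to 1).
        dangling = sum(rank[p] for p in pages if not out[p])
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes an equal share of its rank along every out-link.
        for p in pages:
            if out[p]:
                share = damping * rank[p] / len(out[p])
                for t in out[p]:
                    new[t] += share

        # Termination rule: stop when the L1 change falls below tol.
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank, iteration


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks, iterations = pagerank(toy_web)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 4))
    print("iterations:", iterations)
```

In this sketch the preprocessing policy, the way rank is redistributed to keep the vector normalized, and the stopping threshold are each isolated in one step, which is where a variant such as the one summarized above would differ from the baseline.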

【Abstract】 The emergence of the WWW introduced new challenges to traditional information retrieval (IR) technologies. Web searching draws on the theories and techniques of applied mathematics (such as graph theory, matrix theory and analysis), data mining, AI, NLP, etc. The core of search engine technology is finding a better search algorithm. Given the characteristics of Web data, the hyperlinks among web pages can be mined for more useful information, and searching with hyperlinks can yield a more effective Web information retrieval model. This dissertation studies how hyperlinks affect Web IR theories, algorithms and applications.

First, by comparing hyperlink analysis algorithms across different data environments and retrieval requirements, I analyzed how the search results are affected by the methods used to process different types of links and by the choice of iteration rules and termination conditions. I then proposed restricting conditions for hyperlink analysis algorithms on closed data sets. By comparing the hyperlink distributions of closed data sets and real Web environments, I extended the restricting conditions to real Web environments. In this way the effect of an algorithm can be predicted quantitatively, and the experimental results show that retrieval efficiency can be improved greatly.

Then, new optimized hyperlink analysis algorithms are proposed. One of them is Modilink. This query-independent approach introduces new preprocessing algorithms, an adjusted normalization method, and iteration termination conditions. It also modifies the iterative formula of the PageRank algorithm to improve the overall iteration efficiency. The experimental results show that Modilink converges faster than the PageRank algorithm and that, under the restricting conditions, retrieval efficiency can be improved.

The other optimized hyperlink analysis algorithms are query-dependent. Considering the relationship between web page quality and the characteristics of the hyperlink analysis algorithm, I proposed QHA1, a quality based hyperlink analysis algorithm. The core of this algorithm is to take the value from Modilink as the web page quality factor in the
