Analysis and Improvement of the HITS Algorithm in Web Hyperlink-Structure Mining

【Author】 Zhang Ahong

【Advisor】 Wang Zhihe

【Author information】 Northwest Normal University, Computer Application Technology, 2009, Master's thesis

【Abstract】 In recent years, the rapid spread and development of Internet/Web technology has given people a wealth of information resources. At the same time, the Web's massive data volume, complexity, highly dynamic nature, and diverse user population have made those resources considerably harder to exploit. Combining data mining techniques with the Web, that is, Web data mining, has therefore become an important way to address this problem. With traditional information retrieval technology already mature, one can start from the characteristics of Web data itself: mine the Web's vast hyperlink resources, search by way of hyperlinks, and build an effective Web information retrieval model to find the information needed. However, traditional hyperlink-based page ranking algorithms discover authoritative pages purely through link analysis (that is, Web structure mining) and do not consider the actual content of the pages. They therefore suffer from the so-called "topic drift" problem: the results often include pages that are densely linked to one another but whose content deviates from the query topic. This thesis studies HITS, the classical Web structure mining algorithm. Because HITS considers only the hyperlinks between Web pages and ignores page content, its results exhibit topic drift and multiple mutual-reinforcement relationships between topics. To address these shortcomings, the thesis proposes G-HITS, an improved HITS algorithm that combines hyperlink analysis with content relevance analysis. G-HITS analyzes the content of the individual Web pages and assigns different weights to the links between them, which mitigates the shortcomings of HITS to some extent and finds authoritative pages more reliably. Finally, experiments demonstrate the effectiveness of G-HITS.
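The abstract does not state the G-HITS weighting formula, so the following is only a minimal sketch of the general idea it describes: running the HITS hub/authority iteration with a weight on each link. The function name `weighted_hits`, the `relevance` scores, and the choice of weighting a link by the product of its endpoints' relevance are all assumptions made for illustration, not the thesis's actual method.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def weighted_hits(links, weights=None, iterations=30):
    """HITS iteration with optional per-link weights.

    links:   iterable of (source, target) page pairs.
    weights: dict mapping (source, target) to a non-negative weight;
             missing links default to 1.0, which gives plain HITS.
    Returns (authority, hub) dicts keyed by page.
    """
    links = list(links)
    weights = weights or {}
    pages = sorted({page for link in links for page in link})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in links:
            new_auth[dst] += weights.get((src, dst), 1.0) * hub[src]
        auth = _normalize(new_auth)
        # Hub update: weighted sum of authority scores of pages linked to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += weights.get((src, dst), 1.0) * auth[dst]
        hub = _normalize(new_hub)
    return auth, hub


# Toy graph: page "a" is on topic and has two in-links; page "b" is off topic
# but sits in a denser cluster with three in-links.
links = [("h1", "a"), ("h2", "a"), ("s1", "b"), ("s2", "b"), ("s3", "b")]

# Hypothetical content-relevance scores in [0, 1] (assumed, e.g. from text
# similarity between each page and the query).
relevance = {"h1": 1.0, "h2": 1.0, "a": 1.0,
             "s1": 0.1, "s2": 0.1, "s3": 0.1, "b": 0.1}

# Assumed weighting rule: product of the two endpoints' relevance scores.
link_weights = {(s, d): relevance[s] * relevance[d] for s, d in links}

plain_auth, _ = weighted_hits(links)
weighted_auth, _ = weighted_hits(links, link_weights)
```

In this toy graph, unweighted HITS gives the off-topic page `b` the higher authority score because its cluster is denser, which is the topic drift the abstract describes; with the assumed relevance-based weights the on-topic page `a` ranks higher.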

【Key words】 Web structure mining; hyperlink; HITS; G-HITS