

Customizable Focused Crawler

【Author】 邹海亮 (Zou Hailiang)

【Advisor】 孙莉 (Sun Li)

【Author information】 Donghua University (东华大学), Computer Software and Theory, 2009, Master's degree

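【Summary】 On the Internet, users' information needs are usually tied to a particular domain and a specific topic, and in these cases neither the recall nor the precision of traditional search engines is satisfactory. Topic-oriented vertical search engines aim to provide search services that are precisely classified, comprehensive in coverage, and promptly updated, which gives them a distinct advantage in meeting users' individual needs. Behind every high-performing search engine stands a powerful web crawler, whose performance directly affects the engine's recall and precision. A focused crawler extends a traditional crawler by computing the topic relevance of web pages and evaluating the topic relevance of links. Focused crawling is an active research area, but the vagueness and polysemy of human language and the semi-structured nature of web information resources leave several recognized open problems in topic judgment and evaluation, natural language understanding, and tunneling. This thesis proposes a Customizable Focused Crawler (CFC). Its main contributions are: (1) A topic customization algorithm is studied and implemented. Through interaction between the user and the computer, the user's topic is described with a vector space model, so that the computer can better understand and represent the user's interests. (2) Parsing of Ajax pages is implemented. Web 2.0 has become a mainstream Internet technology and more and more pages use Ajax. For such pages, much of the text shown in the browser does not appear in the HTML source file, so parsing Ajax pages should improve the crawler's recall. This thesis mainly handles Ajax operations that occur in the page load function. (3) For tunneling, a simple and effective tolerance algorithm is proposed. Imitating human behavior, the algorithm does not immediately discard a page or link that is irrelevant to the topic. Instead, it tentatively keeps the currently irrelevant link, depending on the size of a tolerance threshold. (4) A search strategy based on link value is studied and implemented. It uses evaluation methods based on both link structure and content, and considers the topic relevance and the authority of each link together to decide the link's rank in the queue.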
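The vector space model named in item (1) can be illustrated with a minimal sketch. It assumes raw term-frequency vectors over whitespace-separated tokens and cosine similarity as the relevance score. The thesis does not specify its weighting scheme or tokenizer here, so the function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumed weighting, not the thesis's)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical user topic and two hypothetical page texts.
topic = term_vector("focused crawler topic relevance crawler")
page_on_topic = term_vector("a focused crawler scores topic relevance")
page_off_topic = term_vector("recipes for summer salads")

print(cosine_similarity(topic, page_on_topic))   # shares terms with the topic
print(cosine_similarity(topic, page_off_topic))  # shares no terms with the topic
```

A page whose vector shares no terms with the topic vector scores zero, and the score grows as the shared terms carry more of the weight in both vectors.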

【Abstract】 Users' information needs on the Internet usually concern a particular field and a specific topic, and in these cases neither the recall nor the precision of traditional search engines is satisfactory. Topic-oriented vertical search engines aim to provide a search service with precise classification, comprehensive data, and timely updates, which gives them a distinct advantage in meeting users' individual needs. Behind every powerful search engine there is a powerful crawler, whose performance directly affects the engine's recall and precision. Building on the traditional crawler, a focused crawler evaluates the topic relevance of web page content and of URLs. Focused crawling is a current research focus, but the ambiguity and polysemy of human language and the semi-structured nature of network information resources hinder further progress, leaving recognized difficulties in topic judgment and evaluation, natural language understanding, and tunneling. This thesis presents a Customizable Focused Crawler (CFC). Its main contributions are as follows. (1) Customization algorithms are studied and implemented. Based on interaction between the user and the computer, a topic model is built with the vector space model, which expresses the user's interest more explicitly and allows the computer to understand it better. (2) An Ajax interpreter is implemented. Web 2.0 has become a mainstream technology and more and more pages use Ajax. For such pages, much of the information seen in the browser cannot be found in the HTML source file, so the Ajax interpreter should improve recall. This thesis handles the Ajax operations that occur in the page load function. (3) For tunneling, a simple and effective algorithm called tolerance is presented. Imitating human behavior, it does not immediately abandon a page or link that is unrelated to the topic. Instead, the link is provisionally kept, depending on the size of a tolerance threshold. (4) A search strategy based on link value is implemented. It uses both link-structure and content-based evaluation methods, considering the topic relevance and the authority of each link together so that more valuable links receive higher priority.
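The tolerance algorithm and the link-value search strategy can be sketched together as below. This is only an illustration under stated assumptions, not the thesis's implementation: the web is a small in-memory graph, relevance and authority scores are given rather than computed, link value is assumed to be a weighted sum with a hypothetical weight `alpha`, and tolerance is modeled as the number of consecutive irrelevant pages a path may pass through.

```python
import heapq


def link_value(relevance, authority, alpha=0.7):
    """Assumed combination: weighted sum of topic relevance and authority."""
    return alpha * relevance + (1 - alpha) * authority


def crawl(seed, links, relevance, authority, threshold=0.5, tolerance=1):
    """Return the relevant pages found, visiting higher-value links first."""
    # Frontier entries: (negated link value, page, consecutive irrelevant pages so far).
    frontier = [(-link_value(relevance[seed], authority[seed]), seed, 0)]
    seen = {seed}  # simplification: a page is queued only via the first path that finds it
    collected = []
    while frontier:
        _, page, misses = heapq.heappop(frontier)
        if relevance[page] >= threshold:
            collected.append(page)
            misses = 0  # a relevant page resets the tolerance budget
        else:
            misses += 1
            if misses > tolerance:
                continue  # tolerance exhausted: do not follow this path further
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                value = link_value(relevance[target], authority[target])
                heapq.heappush(frontier, (-value, target, misses))
    return collected


# Hypothetical graph: "b" lies behind one irrelevant page, "e" behind two.
LINKS = {"seed": ["a", "c"], "a": ["b"], "c": ["d"], "d": ["e"]}
REL = {"seed": 0.9, "a": 0.1, "b": 0.8, "c": 0.2, "d": 0.1, "e": 0.9}
AUTH = {page: 0.5 for page in REL}

print(crawl("seed", LINKS, REL, AUTH, tolerance=1))
```

With `tolerance=0` the crawler behaves like a strict focused crawler and never reaches `b`. Raising the tolerance lets it tunnel through one or more irrelevant pages at the cost of fetching more off-topic content.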

  • 【Online publication contributor】 Donghua University (东华大学)
  • 【Online publication year and issue】 2009, Issue 10

