Research on Web Crawlers in Search Engines
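【Author】 Gong Yong (龚勇);
【Advisor】 Liu Dongfei (刘东飞);
【Author information】 Wuhan University of Technology, Computer Application Technology, 2010, Master's thesis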
【Abstract】 Search engines apply information retrieval technology to the Internet and let people obtain online resources more effectively. As the Internet has grown, however, traditional general-purpose search engines have become less able to meet the increasing demand for information retrieval services, and topic-oriented (focused) search engines have emerged in recent years. This thesis studies and discusses the techniques behind the focused crawler, which plays a central role in a focused search engine.
A web crawler downloads pages from the Internet. A general crawler starts from a set of seed links and aims to fetch every page on the Internet. A focused crawler aims to fetch only pages relevant to a specific topic, so in addition to the basic functions of a general crawler it must analyze page content and links in order to guide and predict its crawling path. The crawling strategy a focused crawler uses to visit the Internet directly affects its efficiency. This thesis studies and improves the focused crawling algorithm based on the Context Graph. The main work is as follows:
(1) The technical principles and workflow of general and focused web crawlers in search engines are studied, with particular attention to the crawling strategies of focused crawlers. The two strategies focused crawlers commonly use, one based on link analysis and one based on content analysis, are analyzed and compared.
(2) Traditional focused crawling algorithms do not handle the "tunnel phenomenon" well. This thesis describes in detail a focused crawling algorithm based on the Context Graph. By predicting the layer of the Context Graph to which each newly fetched page belongs, the algorithm guides the crawler along the path most likely to lead to target pages, at a low cost in irrelevant pages crawled, and so handles the tunnel phenomenon better.
(3) The Context Graph crawling algorithm is improved with a feature selection method based on word-frequency differences and a modified TF-IDF formula, which adds each word's category weight as an adjustment, in order to improve the quality of feature selection and evaluation.
(4) A focused crawler prototype was implemented, and the algorithms were analyzed and compared experimentally. The results show that the improved algorithm yields more accurate document-set features and weights, and thereby improves the performance of the focused crawler.
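As an illustration of item (2), the following is a minimal sketch of a layer-ordered crawl frontier. It is not the thesis's implementation: the cosine-similarity classifier, the `threshold` value, and the names `predict_layer` and `LayeredFrontier` are assumptions made for this example. It assumes each Context Graph layer is summarized by a term-weight vector, that layer 0 is closest to the target pages, and that URLs found on a page are queued under that page's predicted layer.

```python
import heapq
import itertools
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_layer(page_vec, layer_vecs, threshold=0.1):
    """Return the index of the most similar layer.

    If no layer reaches `threshold` (an assumed cut-off), return
    len(layer_vecs), which acts as a catch-all "other" layer.
    """
    scores = [cosine(page_vec, v) for v in layer_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else len(layer_vecs)

class LayeredFrontier:
    """Crawl frontier that always yields a URL from the lowest layer queued."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: first in, first out
        self._seen = set()

    def push(self, url, layer):
        # Each URL is queued at most once.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (layer, next(self._order), url))

    def pop(self):
        layer, _, url = heapq.heappop(self._heap)
        return url, layer

    def __len__(self):
        return len(self._heap)
```

A crawler loop built on this sketch would pop a URL, fetch the page, call `predict_layer` on its term vector, and push the page's outgoing links with that layer, so that links found on pages predicted to be near the target are followed first.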
【Key words】 Search engine; Focused crawler; Context Graph; Feature selection;
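The category-weighted TF-IDF mentioned in item (3) of the abstract can be sketched as follows. The abstract does not give the modified formula, so the category weight used here (the share of a term's total occurrences that fall inside the chosen category) is an assumption for illustration only, as are the function name `category_weighted_tfidf` and the smoothed IDF form.

```python
import math
from collections import Counter

def category_weighted_tfidf(docs_by_category, category):
    """Return {term: weight} for the document set of one category.

    docs_by_category maps a category name to a list of token lists.
    weight = tf * idf * cw, where cw is an ASSUMED category weight:
    the fraction of the term's occurrences that fall in `category`.
    """
    all_docs = [d for docs in docs_by_category.values() for d in docs]
    n_docs = len(all_docs)
    # Document frequency over the whole collection.
    df = Counter(t for d in all_docs for t in set(d))
    # Raw term counts over the whole collection and inside the category.
    total = Counter(t for d in all_docs for t in d)
    in_cat = Counter(t for d in docs_by_category[category] for t in d)
    n_tokens = sum(in_cat.values())

    weights = {}
    for term, count in in_cat.items():
        tf = count / n_tokens
        idf = math.log(n_docs / df[term]) + 1.0
        cw = count / total[term]
        weights[term] = tf * idf * cw
    return weights

docs = {
    "target": [["crawler", "web", "page"], ["crawler", "topic"]],
    "other": [["web", "news"], ["page", "news", "web"]],
}
weights = category_weighted_tfidf(docs, "target")
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this toy collection, terms that occur only in the target category ("crawler", "topic") rank above terms shared with the other category ("page", "web"), which is the kind of effect a category-weight adjustment is meant to produce.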