

Research on a Personalized Web Crawler Based on a Rule Engine

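【Author】 Zhao Sijia (赵思佳)

【Advisors】 Wu Min (吴敏); Fang Sheng (方胜)

【Author information】 Central South University (中南大学), Computer Technology, 2010, Master's thesis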

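【Summary】 The Internet has become a necessity of everyday life: people depend on it to find information for work and for daily living, and search engines play a very important role in that process. General-purpose search engines, led by Google, help users find information on the Internet, but their results can only point to the address where the information resides. This works well for static pages, but dynamic pages are increasingly common, and what users need is the structured information embedded in unstructured web pages, such as ticketing, real-estate, and product information from different websites. Today this information can be obtained through the focused crawlers of vertical search engines, which generally follow one of two extraction strategies: either the focused crawler first fetches pages and the fetched pages are then analyzed and extracted, or the crawler extracts while it fetches. The first strategy covers more pages, but analysis is slow, many irrelevant pages are fetched, and efficiency is low. The second is now the usual choice: it is precise, fetches accurately, and extracts page information faster. Whichever strategy is used, extraction is highly targeted, yet current focused crawlers commonly suffer from inflexible configuration and insufficient user involvement. By studying search engine and rule engine technology, this thesis proposes using a rule engine to build the configuration mechanism of the search engine, with the aim of a focused crawler that can be configured individually by each user. The crawling process of the personalized focused crawler is designed as three parts: a rule editor module, a rule engine module, and a crawler fetching module. The rule editor module first defines the rule base needed for crawling; while a crawl task runs, the fact data and the rule base are both submitted to the rule engine module; the rule engine module then directs the crawler fetching module according to the rules. To simplify setting up the rule base, the crawler fetching module is split into five small tasks: pre-fetch processing, fetch processing, content extraction processing, write and index processing, and post-processing. For each task, the corresponding common algorithms are converted to the rule engine's processing model, so that users can flexibly adjust how the crawler works by editing rule base files. Finally, user control is added to the whole crawler, so that each user can configure a crawler of their own without affecting other users and can share the rule bases they have set up. Replacing the traditional configuration model in this way makes configuration more flexible and lowers the difficulty for users; a worked example demonstrates the feasibility of the approach.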

【Abstract】 The Internet is now a necessity of public life. People need it to look up information for their work and daily lives, and search engines are central to that task. Comprehensive search engines such as Google help users locate information, but a search result is only the address of the page that holds it. That suits static pages, but dynamic pages keep growing in number, and users increasingly want the structured information contained in unstructured web pages, for example ticketing, real-estate, or product listings from different websites. Such information can currently be obtained with the focused crawler of a vertical search engine. Vertical search engines generally extract it in one of two ways: the focused crawler fetches pages first and the fetched pages are analyzed and extracted afterwards, or the crawler extracts while it is fetching. The first approach crawls more widely, but its analysis is slow, it collects many irrelevant pages, and its efficiency is low. The second approach is the one generally used today: it is more precise, fetches more accurately, and extracts page information faster. In either case, extraction is highly targeted, yet existing focused crawlers are widely inflexible to configure and give users too little involvement. Drawing on search engine and rule engine technology, this thesis proposes a configuration mechanism for the search engine that is built on a rule engine, in order to obtain a focused crawler that each user can configure individually. The crawling process of the personalized focused crawler is designed around three parts: a rule editor module, a rule engine module, and a crawler fetching module.
The rule editor module first produces the rule base required for crawling. During execution of a crawl task, the fact data and the rule base are both submitted to the rule engine module, which then governs the operation of the crawler fetching module according to the rules. To simplify the rule base, the crawler fetching module is divided into five small tasks: pre-fetch processing, fetch processing, content extraction processing, write and index processing, and post-processing. The common algorithms for each small task are converted to the rule engine's processing model, so users can adjust the crawler's behavior flexibly by setting up rule base files. Finally, user control is added to the personalized focused crawler, so that every user can set up their own crawler without affecting other users and can also share their own rule bases. This replaces the traditional configuration model, giving greater configuration flexibility and making the crawler easier to use. An example is used to show that the approach is feasible.
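The interaction described above (a rule base and fact data handed to a rule engine, which then drives five crawl tasks) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class names `Rule`, `RuleBase`, and `RuleEngine`, the stage identifiers, the `FAKE_WEB` dictionary standing in for the network, and the example host `tickets.example.com` are all assumptions made for illustration, and a real system would likely use a dedicated rule engine rather than plain Python callables.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.parse import urlparse

# Hypothetical identifiers for the five small tasks named in the abstract.
STAGES = ("pre_fetch", "fetch", "extract", "write_index", "post_process")

# Stand-in for the network so the sketch runs offline (assumed data).
FAKE_WEB = {
    "http://tickets.example.com/show/1":
        "<h1>Concert</h1><span class='price'>120</span>",
}
PRICE_RE = re.compile(r"class='price'>(\d+)<")


@dataclass
class Rule:
    name: str
    stage: str
    condition: Callable[[dict], bool]  # tested against the fact data
    action: Callable[[dict], None]     # runs when the condition holds


@dataclass
class RuleBase:
    owner: str  # each user keeps a separate rule base
    rules: List[Rule] = field(default_factory=list)


class RuleEngine:
    def run(self, rule_base: RuleBase, fact: dict) -> List[str]:
        """Apply the rules stage by stage; return the names of fired rules."""
        fired = []
        for stage in STAGES:
            for rule in rule_base.rules:
                if rule.stage == stage and rule.condition(fact):
                    rule.action(fact)
                    fired.append(rule.name)
                    if fact.get("rejected"):
                        return fired  # a rule vetoed this URL; stop early
        return fired


def build_ticket_rules(owner: str, index: Dict[str, int],
                       allowed_host: str) -> RuleBase:
    """Assemble an example rule base; in the thesis this is the editor's job."""
    def store(fact: dict) -> None:
        index[fact["url"]] = fact["price"]

    def extract_price(fact: dict) -> None:
        fact["price"] = int(PRICE_RE.search(fact["html"]).group(1))

    return RuleBase(owner, [
        Rule("reject-foreign-host", "pre_fetch",
             lambda f: urlparse(f["url"]).hostname != allowed_host,
             lambda f: f.update(rejected=True)),
        Rule("download", "fetch",
             lambda f: f["url"] in FAKE_WEB,
             lambda f: f.update(html=FAKE_WEB[f["url"]])),
        Rule("extract-price", "extract",
             lambda f: "html" in f and PRICE_RE.search(f["html"]) is not None,
             extract_price),
        Rule("store", "write_index", lambda f: "price" in f, store),
        Rule("mark-done", "post_process",
             lambda f: True,
             lambda f: f.update(done=True)),
    ])


if __name__ == "__main__":
    alice_index: Dict[str, int] = {}
    rules = build_ticket_rules("alice", alice_index, "tickets.example.com")
    fact = {"url": "http://tickets.example.com/show/1"}
    print(RuleEngine().run(rules, fact))
    print(alice_index)
```

In this sketch, changing crawler behavior means editing the list of rules rather than the engine, which mirrors the abstract's goal of configuring the crawler through rule base files. Because each `RuleBase` has an owner and writes to its own index, one user's rules do not affect another user's crawl.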

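  • 【Online publication contributor】 Central South University (中南大学)
  • 【Online publication year and issue】 2012, Issue 02
  • 【Classification number】 TP393.092; TP391.3
  • 【Times cited】 4
  • 【Downloads】 169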