

Research of Focused Crawler about Group of University Website Based on RSS

【Author】 Zhang Ruihan (张睿涵)

【Advisor】 Lin Zhenrong (林振荣)

【Author Information】 Nanchang University, Computer Application Technology, 2012, Master's degree

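【Abstract, translated from Chinese】 The Internet is growing rapidly and the number of web pages keeps increasing. To find the information they need, people often have to browse a large number of pages, which costs time and effort and still does not guarantee that they obtain the latest and most complete information. Publishers of online information, for their part, want more users to read their content in real time. Much research has emerged to address this need, for example search engines supported by web crawlers and RSS information push. Each approach has its limitations, however. Suppose we need the latest notices from all the websites of a university, organized by category, such as all of that university's latest research-related notices. Searching with a search engine gives unsatisfactory results. RSS can push the latest information by category, but only for websites that provide an RSS feed. For targets such as university website groups, which were built early on without RSS push functionality, it is of no help. This thesis therefore studies an RSS-based focused web crawler to solve these problems and applies it to a university website group, with good results. The principle is to use a focused web crawler to fetch, analyze, and process the data of the target website group and then provide RSS push. In this way, users can subscribe by category through an RSS reader to the latest information even from websites that do not provide an RSS feed. This removes the burden of browsing many pages to find information and avoids missing information through oversight. The main research contents of this thesis are: (1) An RSS-based focused web crawler is proposed, so that users can use an RSS reader to subscribe to and read the latest information from websites that do not provide an RSS feed. Useless content such as advertisements and other junk is filtered out, removing the trouble of searching for information. (2) Crawled web page text is classified using the TF-IDF algorithm, and the step that uses TF-IDF to extract the feature vectors of the different categories is improved according to the characteristics of web pages, so that the extracted feature vectors represent their categories better and the classification results are more accurate. (3) Incremental crawling is improved: building on the traditional incremental crawling algorithm, a new algorithm for computing the predicted update time is proposed, which brings the predicted time closer to the actual update time, reduces system overhead, and improves efficiency. (4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is improved according to the characteristics of university website groups to raise the crawler's recall.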
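
The last step of the principle described above, offering crawled notices as an RSS feed, can be sketched as follows. This is only a minimal illustration using the Python standard library, not the thesis's implementation: the function name `build_rss`, the item keys (`title`, `link`, `category`, `published`), and the example URLs are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(channel_title, channel_link, items):
    """Serialize crawled notices as an RSS 2.0 document (newest first).

    `items` is a list of dicts with the hypothetical keys
    title, link, category, and published (a timezone-aware datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Notices gathered by a focused crawler"

    for entry in sorted(items, key=lambda e: e["published"], reverse=True):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        # The category lets a reader subscribe to or filter by class.
        ET.SubElement(item, "category").text = entry["category"]
        # pubDate uses the RFC 822 style date format.
        ET.SubElement(item, "pubDate").text = format_datetime(entry["published"])

    return ET.tostring(rss, encoding="unicode")


# Illustrative data only; these notices and URLs are made up.
notices = [
    {"title": "Older notice", "link": "http://example.edu/a", "category": "research",
     "published": datetime(2012, 3, 1, tzinfo=timezone.utc)},
    {"title": "Newer notice", "link": "http://example.edu/b", "category": "research",
     "published": datetime(2012, 4, 1, tzinfo=timezone.utc)},
]
feed_xml = build_rss("Research notices", "http://example.edu/", notices)
```

An RSS reader pointed at a URL that serves `feed_xml` would then list the notices by category, even though the original sites expose no feed of their own.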

【Abstract】 The Internet is developing rapidly and the number of web pages keeps growing, so people who want to find the information they need have to read a large number of pages. This wastes time and energy, and still does not guarantee the latest and most complete information. Publishers of online information also hope that more users can read their information in real time. To meet this demand, much research has emerged, such as search engines supported by web crawlers and RSS information push technology. But these approaches have limitations. For example, suppose we need the latest notices from all the sites of a university by category, such as the latest notices in the research category. A typical search engine cannot return satisfactory results. RSS can push the latest information by category, but only from websites that provide an RSS feed, so it cannot help with websites that provide no RSS feed at all, such as a university website group. Therefore, this thesis studies a focused crawler based on RSS, applies it to the above problem, and extends it to a university website group, where it achieved good results. Its principle is to use a focused web crawler to crawl, analyze, and process the data of the site group, and then offer an RSS feed. In this way, even for websites without RSS feeds, people can use an RSS reader to subscribe to their latest information by category. This saves much of the time spent flipping through pages to find the latest information and reduces the chance of overlooking information. The main contents of the study are as follows: (1) An RSS-based focused crawler is proposed, so that users can use an RSS reader to subscribe to and read the latest information from sites that do not provide an RSS feed. It filters out unwanted ads and spam and eliminates the trouble of searching for information. (2) The TF-IDF algorithm is used to classify page text, and its extraction of category feature vectors is improved based on the characteristics of web pages, so that the feature vectors represent their categories better and classification is more accurate. (3) Incremental crawling is improved: a new algorithm for computing the predicted update time, based on the traditional incremental crawling algorithm, brings the prediction closer to the actual update time, reducing system overhead and improving efficiency. (4) The RSS-based focused crawler is applied to a university website group, and the PageRank algorithm is improved based on the characteristics of the university website group to raise the crawler's recall.
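
Item (2) relies on TF-IDF category feature vectors. The abstract does not describe the web-page-specific improvement, so the sketch below shows only a plain baseline under stated assumptions: smoothed IDF, cosine similarity as the matching rule, and pre-tokenized input. All function names and the sample categories are hypothetical.

```python
import math
from collections import Counter


def idf_table(category_docs):
    """Smoothed inverse document frequency over the category documents."""
    n = len(category_docs)
    df = Counter()
    for tokens in category_docs.values():
        df.update(set(tokens))  # count each term once per category
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    """TF-IDF vector of a token list, ignoring terms unknown to `idf`."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, category_docs):
    """Return the category whose feature vector is most similar to the page."""
    idf = idf_table(category_docs)
    page_vector = tfidf(tokens, idf)
    scores = {label: cosine(page_vector, tfidf(doc, idf))
              for label, doc in category_docs.items()}
    return max(scores, key=scores.get)


# Toy training text per category; real input would be tokenized page text.
categories = {
    "research": ["grant", "project", "research", "funding", "research"],
    "teaching": ["course", "exam", "schedule", "teaching", "course"],
}
label = classify(["research", "grant", "deadline"], categories)
```

A page-aware variant could, for instance, weight terms by where they occur in the page, but that is a guess at the direction of the improvement rather than something the abstract states.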

  • 【Online Publication Contributor】 Nanchang University
  • 【Online Publication Year and Issue】 2012, Issue 12