
面向专业搜索引擎的主题爬行技术研究

Research on Topical Web Crawling Technique for Topic-Specific Search Engine

【Author】 Peng Tao (彭涛)

【Supervisor】 Zuo Wanli (左万利)

【Author Information】 Jilin University, Computer Application Technology, 2007, PhD

【Abstract (Chinese)】 Addressing the problem of acquiring topic-specific web page information for a topic-specific search engine, this dissertation studies topical web crawling in depth and proposes a self-adaptive topical crawling method based on link contexts. The method employs the (?)ζ-IDOM link-context technique and continually refines the topic feature set during crawling. Experimental results show that the method strengthens its self-adaptability without incurring topic drift, and is therefore reasonably robust. The original particle swarm optimization (PSO) algorithm is adapted to this work, yielding BWPSO. Tests show that BWPSO needs fewer iterations than standard PSO to reach the same results, so solving optimization problems with BWPSO is feasible and more efficient. BWPSO is then used to optimally combine the web page classifiers produced by the iterations of the training process into a final classifier; experiments show that this combination greatly improves classification performance. Because web pages on the Internet are constantly added, modified, and deleted, an incremental topical crawling method is proposed, comprising algorithmic increments and data increments: the algorithmic increment lets the crawler improve itself through training when the initial training set is incomplete, while the data-increment research seeks to find and identify the patterns by which pages change, so that crawled pages stay fresh. Experiments verify the method's effectiveness. Tunneling is divided into Grey Tunneling and Black Tunneling, and a traversal method is proposed for each; experiments show that both achieve the expected effect. Finally, a topic-specific search engine, LookClearTSSE, is built: the multi-strategy topical crawler LciSpider developed in this dissertation collects pages for the target domain, and the proposed incremental index structure supports the retrieval interface and the ranking of query results. Experiments verify the advantages of this approach.
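As a rough illustration of the self-adaptive, link-context-based crawling described above, the sketch below scores uncrawled URLs by the similarity of their link context to a topic feature set, and folds terms from pages confirmed on-topic back into that set. All names, the cosine measure, and the absorption rule are illustrative assumptions, not the dissertation's exact formulation:

```python
# Hypothetical sketch of a self-adaptive link-context scoring loop.
from collections import Counter
import math

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class AdaptiveTopicModel:
    def __init__(self, seed_terms):
        self.features = Counter(seed_terms)  # initial topic feature set

    def score_link(self, link_context_terms) -> float:
        # Priority of an uncrawled URL, judged from its link context alone.
        return cosine_sim(Counter(link_context_terms), self.features)

    def absorb(self, page_terms, top_k=20):
        # Self-adaptation: fold the strongest terms of a confirmed
        # on-topic page back into the feature set.
        for term, freq in Counter(page_terms).most_common(top_k):
            self.features[term] += freq
```

Keeping the feature set updated from confirmed pages is what gives the crawler its adaptivity; the risk the dissertation reports avoiding is topic drift, i.e., the feature set wandering away from the original topic.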

【Abstract】 With the rapid expansion of the Web, it becomes ever harder to retrieve information and knowledge relevant to a specific domain. Topical web crawling, which retrieves domain-specific information, has therefore received growing attention in recent years. It is applied not only in topic-specific search engines but also in other fields such as digital libraries, so research on topical web crawling has both academic significance and broad application prospects. The major contributions can be summarized as follows:

(1) This dissertation surveys research on topical web crawling and related techniques, analyzing its background and course of development. After reviewing the development of search engines and of text classification, it presents the advantages and necessity of topic-specific search engines and discusses the future of search engines. The basic theory and strategies of topical web crawling and text classification, which ground the later work, are also introduced and analyzed.

(2) Evaluating the relevance of a target page from information on the pages that link to it is an effective approach to topical crawling. A knowledgeable reader can usually judge a target page before it is displayed; making a computer imitate this judgment is a challenge. Anchor text alone is often too short to be informative, so predicting target pages from anchor text alone is unreliable. We therefore expand anchor text to link contexts. After analyzing and comparing existing link-context extraction methods, this dissertation presents a new method, (?)ζ-IDOM. Experimental results show that, with appropriate parameters, it improves topical crawling performance. To obtain the crawler's initial feature set, we extract the contexts of backward links of the seed URLs, which ordinarily summarize the seeds' content; this yields a small but precise topic feature set to guide crawling. Building on link-context-based crawling, the dissertation presents a self-adaptive method that evolves the topic feature set during crawling. Experiments show that it strengthens the crawler's self-adaptability while avoiding topic drift.

(3) Particle Swarm Optimization (PSO), a computational-intelligence method motivated by the social behavior of organisms, is a population-based, iterative optimization strategy that is effective, understandable, and easy to implement, and it is applied in many fields. Based on the original PSO algorithm, this dissertation presents an improved algorithm named BWPSO. Experiments show that BWPSO requires fewer iterations than standard PSO to reach the same results, so BWPSO is a feasible and more efficient way to solve optimization problems.
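The abstract does not spell out BWPSO's specific modification, so the sketch below shows only the standard PSO iteration that BWPSO builds on: each particle's velocity is pulled toward its personal best and the swarm's global best. Function names and parameter values are illustrative assumptions:

```python
# Minimal standard PSO (the baseline BWPSO improves on), minimizing f.
import random

def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, bound=5.0):
    pos = [[random.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # each particle's best position
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm's best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Example: minimize the sphere function in 3 dimensions.
best, best_val = pso(lambda x: sum(v * v for v in x), dim=3)
```

The dissertation's claim is that BWPSO reaches the same optima in fewer such iterations than this standard loop.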
(4) Text classification, used to guide the topical crawler, is a key technique in web information retrieval. This dissertation assigns different weights to different parts of a page according to its structure, and discovers and exploits regularities common to web pages to compute feature weights. Because web pages are diverse and the training set is rarely representative enough, a set of classifiers is built by iteratively applying the SVM algorithm to the training set; since PSO is itself an iterative optimization strategy, BWPSO is used to combine the classifiers produced during training into a final classifier. Experiments show that combining the classifiers, partitioning the page structure, and using common regularities to compute feature weights greatly improve classification performance.

(5) Since web pages are frequently added, deleted, and modified, research on incremental topical crawling is of great importance. In this dissertation the technique comprises two parts: incremental learning in the algorithm, and incremental updating of web pages. Incremental learning provides self-adaptability and self-improvement: the complete sample set for a topic cannot be predefined, so the learner must select favorable training data to improve itself during training. The dissertation applies an improved 1-DNF algorithm to extract reliable negative examples from the training set of the PU (positive-unlabeled) classification problem, and then builds the classifier by applying SVM iteratively. In SVM, the decisive data are the support vectors (SVs); partitioning the SV set is equivalent to partitioning the whole data set. Accordingly, each iteration's training set consists of the current SV set, the misclassified samples, and the correctly classified but low-confidence samples from the previous iteration, which saves training time and space without hurting classification accuracy. The research on incremental page updating concentrates on deducing and identifying which pages have changed (been added or modified) before a later crawl. The key problem is to discover the rules by which pages change; with these rules the crawler fetches only changed pages, saving bandwidth while keeping the collection up to date. After defining incremental page updating, the dissertation presents rules for deciding whether a page has changed, uses a DOM-tree-based method to estimate page content, and analyzes the randomness of page changes. Through experiment-based sampling estimation, the change frequency of pages is evaluated against relevant mathematical models, and a crawling architecture and algorithm with incremental updating are presented. Experiments show that, by self-adjusting its parameters, the method can learn the rules by which pages change.
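The abstract does not name its "relevant mathematical models"; a common choice for modeling page changes is a Poisson process, sketched here purely as an assumed illustration of how an estimated change rate could drive recrawl scheduling:

```python
# Assumed Poisson-style change model; not the dissertation's exact model.
import math

def estimate_change_rate(accesses: int, changes_seen: int, interval_days: float) -> float:
    """Estimate changes per day from repeated visits at a fixed interval.
    The +0.5 terms keep the estimate finite when a change was seen on
    every visit (or on none)."""
    return -math.log((accesses - changes_seen + 0.5) / (accesses + 0.5)) / interval_days

def next_recrawl_interval(rate: float, freshness_target: float = 0.8) -> float:
    """Revisit often enough that P(page unchanged since last crawl)
    stays at the target: exp(-rate * t) >= freshness_target."""
    return -math.log(freshness_target) / rate if rate > 0 else float("inf")

# Example: a change detected on 6 of 10 weekly visits.
rate = estimate_change_rate(accesses=10, changes_seen=6, interval_days=7.0)
print(next_recrawl_interval(rate))  # days until the next visit
```

Whatever the exact model, the payoff described in (5) is the same: recrawl only pages likely to have changed, saving bandwidth while keeping the collection fresh.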
(6) Due to the complexity of the web environment and the topic multiplicity of page content, it is quite difficult to collect all pages relevant to a specific topic. An irrelevant page may link to a relevant one, so the crawler must traverse irrelevant pages to reach more relevant ones; this procedure is called Tunneling. This dissertation partitions Tunneling into Grey Tunneling and Black Tunneling. Grey Tunneling addresses the problem that topic multiplicity within a page weakens the measured relevance of a highly relevant page: to avoid penalizing a page that is irrelevant as a whole but relevant in part, a multi-topic page is divided into blocks that are processed individually, so the crawler can traverse such a page, expand its reach, and collect more relevant pages. For Black Tunneling, a probing method is presented: all irrelevant pages are stored temporarily and their out-links extracted for crawling; during crawling, each irrelevant page is assigned a depth value, derived from the relevance of its parent page, that determines whether the page is kept, thereby broadening the crawler's scope. Experiments show that both tunneling methods achieve the expected effect.

(7) Like a general search engine, a topical search engine consists of three parts: collecting topic-relevant information from the Internet, building the index, and providing retrieval service. Based on the topical crawling techniques above, this dissertation constructs a topic-specific search engine, LookClearTSSE (LookClear Topic-Specific Search Engine). A topical web crawler, LciSpider (LookClear Intelligent Spider), collects topic-relevant information using the crawling strategies introduced in the dissertation; during crawling, weights are computed for the uncrawled URLs extracted from downloaded pages to determine their priorities. LciSpider implements several strategies, including Breadth-First, Best-First, Link-Context, and Content Block Partition. An incremental index structure is then built for the crawled pages after preprocessing, and an information retrieval interface is provided. Before the incremental index is constructed, the pages are preprocessed to obtain the set of original pages, and the inverted index is built from the forward index. The dissertation presents a block-based linked structure for storing the index, which makes index-update time depend only on the number and sizes of added pages. This space-for-time approach updates faster than contiguous index storage and answers queries far more efficiently than a naive linked-list structure, while still supporting real-time update.
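A minimal sketch of a block-based linked posting list in this spirit: fixed-capacity blocks are chained from head to tail, so adding newly crawled pages only touches tail blocks and never rewrites existing postings. The structure and names are hypothetical, not the dissertation's implementation:

```python
# Hypothetical block-based linked posting list for an incremental index.
class PostingBlock:
    def __init__(self, capacity: int = 128):
        self.doc_ids = []
        self.capacity = capacity
        self.next = None  # link to the next block in the chain

class BlockedPostingList:
    def __init__(self):
        self.head = self.tail = PostingBlock()

    def append(self, doc_id: int):
        # Update cost is bounded by the tail block, not the list length.
        if len(self.tail.doc_ids) >= self.tail.capacity:
            self.tail.next = PostingBlock()
            self.tail = self.tail.next
        self.tail.doc_ids.append(doc_id)

    def __iter__(self):
        block = self.head
        while block:
            yield from block.doc_ids
            block = block.next

class IncrementalIndex:
    def __init__(self):
        self.index = {}  # term -> BlockedPostingList

    def add_document(self, doc_id: int, terms):
        for term in set(terms):
            self.index.setdefault(term, BlockedPostingList()).append(doc_id)

    def lookup(self, term):
        return list(self.index.get(term, []))
```

The space overhead of partially filled blocks is the "space" in the space-for-time trade-off: appends stay cheap and queries scan contiguous runs of IDs rather than chasing one pointer per posting.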

  • 【Online Publication Contributor】 Jilin University
  • 【Online Publication Year/Issue】 2007, No. 03