

Research on Focused Crawling Combining Synthetic Web-Page Information and Domain Ontology

【Author】 Guan Xin (关鑫)

【Advisor】 Ouyang Dantong (欧阳丹彤)

【Author information】 Jilin University, Computer Software and Theory, 2010, Master's

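【Abstract (translated from Chinese)】 Focused crawling, guided by background knowledge, uses a web-page analysis algorithm to filter out off-topic pages and to predict and fetch topic-relevant ones. It is important both for extracting needed information from massive data and for searching within a specific domain. The main work of this paper is to study how an ontology can serve as background knowledge to guide the focused-crawling strategy, combining the comprehensive information of a URL with the ontology in order to improve crawling efficiency. Building on the traditional crawling framework, the paper analyzes web-page content in detail and points out that information at certain positions in a page is highly significant for revealing the page's topic. The algorithm extracts a feature vector from the web document, applies document-location weight factors to it, and matches it against the concepts of the ontology to obtain the page's topic relevance; it uses extended anchor text to predict the topic relevance of hyperlinks. A crawling strategy is designed by combining the computed page relevance with the predicted link relevance, and it is compared with existing ontology-based crawling strategies. Experiments show that the harvest rate of the proposed strategy is clearly better than that of the other strategies in the comparison. Comparative analysis of extensive experimental data shows that combining comprehensive web-page information with a domain ontology to guide focused crawling effectively improves the harvest rate.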

【Abstract】 Since 1994, web search engines have developed significantly. They solve the problem of indexing and quickly locating the mass of resources on the web, and they play an increasingly important role in people's lives. However, as data grows ever more rapidly, traditional search engines can no longer meet user requests, and search engines with poor semantic processing ability cannot meet users' demands for accuracy. Focused crawling is a technique for improving search engines; it is an intelligent search application within the search domain. The aim of focused crawling is to find web pages on a previously defined topic. It classifies web pages using text categorization and hyperlink prediction to obtain good search results. If focused crawling is integrated with semantic techniques, the crawler behaves during the search as if it were guided by a domain specialist: the search engine returns not only search results but also resources related to the topic. A focused-crawling strategy is designed on top of a general-purpose search engine; in effect, it is an extension of the traditional search engine. Under the direction of background knowledge, the crawler fetches as many relevant web pages as possible. The range of a focused crawler is smaller than that of a general crawler, but its results are more precise. A focused crawler filters out irrelevant web pages so that, with limited web resources, it fetches and saves many relevant pages. The central questions of focused crawling are how to filter off-topic pages and how to fetch more on-topic pages. Marc Ehrig proposes an approach to document discovery built on a framework for ontology-focused crawling of web documents. An ontology is a description of concepts and their properties. It can describe background knowledge precisely and is a tool for knowledge representation. Ontology-focused crawling yields more satisfactory search results. 
Our study of web-page relevance computation shows that combining the location of terms within a web document with an ontology yields a more precise page relevance. Traditional methods pay little attention to links. Our approach predicts the topic relevance of a link using extended anchor text and the relationships between links. The whole algorithm centers on these two points. The main work of this paper builds on ontology-focused crawling. First, we extend that approach by analyzing the text of web pages, and point out that information at specific locations in a page plays an important role in identifying the page's topic. Second, the approach analyzes the topic relevance of the links contained in the page. Anchor text is the text of a hyperlink; it summarizes the information behind the hyperlink. Because anchor text usually appears on other web pages, it represents the intention of those pages' authors, who want to tell users the subject of the target page and lead them to visit the URL using brief information. Compared with text from a randomly selected web page, anchor text has a stronger ability to describe the target page. Predicting the topic relevance of a web page from anchor text is therefore a hot topic for researchers. The idea of the algorithm is as follows. When a web page is fetched, unimportant tags are deleted. The system then extracts the text from the page, counts high-frequency terms, and converts the text into a vector. When computing the topic score of the page, it checks whether each term of the vector belongs to a concept of the ontology, mapping terms to the ontology's concepts, properties, and instances. It assigns the vector a topic score by combining web-location weights with the ontology. If the topic score of the page exceeds the threshold, all hyperlinks on the page are extracted, and each hyperlink is checked to see whether it has already been crawled. For each hyperlink that has not been crawled, the algorithm predicts its topic score. 
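The page-scoring step described above can be illustrated with a minimal sketch. It is not the implementation from the paper: the location weights, the toy finance vocabulary (a flat term-to-weight table standing in for the ontology's concepts, properties, and instances), the threshold value, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical location weights: terms in the title count more than body terms.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

# Hypothetical stand-in for a finance domain ontology: term -> match weight.
ONTOLOGY_TERMS = {"stock": 1.0, "bond": 1.0, "interest": 0.8, "bank": 0.8}

# Hypothetical relevance threshold; the source does not state a value.
THRESHOLD = 0.3


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def page_topic_score(sections):
    """Score a page given a dict mapping location name -> extracted text.

    Each term occurrence is weighted by its location; the score is the
    ontology-matched weight divided by the total weight, so it lies in [0, 1].
    """
    matched = 0.0
    total = 0.0
    for location, text in sections.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        for term, freq in Counter(tokenize(text)).items():
            total += weight * freq
            matched += weight * freq * ONTOLOGY_TERMS.get(term, 0.0)
    return matched / total if total else 0.0


def is_relevant(sections, threshold=THRESHOLD):
    """Links are extracted only from pages whose score exceeds the threshold."""
    return page_topic_score(sections) > threshold
```

With these assumed weights, a page titled "Stock market" scores higher than a page that mentions "stock" only once in a long body, which is the effect the location weighting is meant to capture.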
Hyperlinks are placed into different queues according to their scores. We study this problem in depth and propose a strategy based on ontology background knowledge and anchor-text information. It improves accuracy, and this paper integrates the search engine with semantic web technologies such as the Resource Description Framework, ontologies, and reasoning techniques. I construct a finance domain ontology to implement the search strategy, and to test the advantage of the ontology-based strategy I run experiments on finance information. The most important criterion for measuring the effectiveness of focused crawling is how well it selects relevant web pages and filters out off-topic ones. Harvest rate is the fraction of crawled web pages that satisfy the target. The paper designs three groups of experiments. The first computes the topic score of a web page by combining term-location weights with the ontology. The second predicts the relevance scores of hyperlinks. The third combines the first two and compares four strategies. From the experimental results we conclude that our approach has higher efficiency and a higher harvest rate. The strategy uses a domain ontology as background knowledge, combined with term-location weights, to compute the topic score of web pages. It also uses anchor text and dependencies in the HTML to predict the relevance of hyperlinks. This strategy can be put to effective use in focused-crawling research. Research on focused crawling has not only theoretical value but also broad application prospects. Several issues of focused crawling are discussed in the paper. The future of the web is promising, and our work is only the beginning of the research that remains to be done. How to turn focused-crawling research into web applications and how to provide services tailored to different users' demands are the directions of our future research.
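The link scheduling and the harvest-rate measure mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: a single priority queue stands in for the several score-based queues, predicted link scores are assumed to be supplied by a separate link predictor, and the names `Frontier` and `harvest_rate` are hypothetical.

```python
import heapq


class Frontier:
    """Crawl frontier that returns the highest-scored uncrawled URL first."""

    def __init__(self):
        self._heap = []      # entries are (-score, insertion order, url)
        self._seen = set()   # URLs that have already been queued
        self._counter = 0    # tie-breaker that preserves insertion order

    def push(self, url, score):
        """Queue a URL with its predicted topic score; skip known URLs."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return (url, score) for the best remaining link."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def harvest_rate(page_scores, threshold):
    """Fraction of crawled pages whose topic score meets the threshold."""
    if not page_scores:
        return 0.0
    relevant = sum(1 for score in page_scores if score >= threshold)
    return relevant / len(page_scores)
```

For example, if four pages were crawled with topic scores 0.9, 0.1, 0.5 and 0.2 and the assumed threshold is 0.5, two of the four pages count as relevant, so the harvest rate is 0.5.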

【Keywords (translated from Chinese)】 focused crawling; ontology; anchor text; feature-term location
【Key words】 Focused Crawling; Ontology; Anchor Text; Term Location
  • 【Online publication contributor】 Jilin University
  • 【Online publication issue】 2010, Issue 09
  • 【Classification number】 TP393.092
  • 【Times cited】 1
  • 【Downloads】 68

