基于本体的主题爬行技术研究

Research of Ontology-based Focused Crawling Technique

【Author】 Luo Na (罗娜)

【Advisor】 Zuo Wanli (左万利)

【Author Information】 Jilin University, Computer Application Technology, 2009, Ph.D.

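【Chinese Abstract】 With the rapid growth of information on the Web and the increasing complexity of the information environment, existing search engines that aim to cover all web pages face severe challenges. First, the number of web pages is growing exponentially and no search engine can index all of them; even Google, currently the world's largest search engine, indexes only about 40% of the Web. Second, Web resources change dynamically, so a considerable share of the results returned to users are outdated or even unreachable pages. Third, information on the Internet is so vast and heterogeneous that users are often overwhelmed by it and do not know how to obtain what they need, falling into the predicament of "information overload" and "resource disorientation". To address these problems, the author comprehensively reviews the research history of focused crawling and ontology, systematically analyzes focused crawling algorithms and the principles of ontology, summarizes the shortcomings of existing focused crawlers, and on this basis studies ontology-based focused crawling and the problems that arise in implementing it. First, the dissertation proposes an ontology-based focused crawling framework. Its advantage is that the crawling algorithm uses not only keywords but also high-level background knowledge such as concepts and relations, which are compared against the text of the crawled pages. This makes it easy to keep the crawl directly on topic. Second, it studies web page classification, one of the key techniques of focused crawling, in depth and proposes a PU classification method based on ontology-based feature extraction. The method scans the documents twice to reduce dimensionality and build text vectors, and then uses Co-Training and the Affinity Propagation clustering algorithm to improve PU classifier performance when positive examples are scarce; this is verified experimentally. Third, web pages are segmented into content blocks using visual, tag, link, and ontology concept information, and heuristic rules are proposed to control the accuracy and granularity of the segmentation. Experiments show that block-based focused crawling handles multi-topic pages, effectively avoids topic drift, and to some extent solves the grey tunneling problem. In addition, we propose, for the first time, using association rules for black tunneling; experiments confirm the feasibility of the idea. Finally, we apply these ideas to scientific literature retrieval and propose focused crawling for a specific user's interests, built on cognitive psychology, information dissemination, and the laws of forgetting. Based on users' search habits, we track their behavior patterns and train user-specific models with machine learning to deliver personalized services such as recommendation and filtering. The work is put into practice in projects funded by the National Natural Science Foundation of China and the Jilin Province Science and Technology Development Program. Theoretical analysis and experiments demonstrate the practicality and reliability of the methods.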

【Abstract】 With the rapid growth of information on the World Wide Web, it is becoming harder to retrieve information and knowledge relevant to a specific domain. Therefore, focused crawling techniques for retrieving domain-specific information have received increasing attention in recent years. While crawling the World Wide Web, a focused crawler aims to collect as many pages relevant to a predefined topic, and as few irrelevant ones, as possible. The fundamental technical difficulty of focused crawling lies in the need to predict a web page's topical relevance before downloading it.

Ontology, as a new concept for describing the semantic hierarchy of knowledge, has been widely used in fields such as computer information processing, artificial intelligence, and knowledge engineering. Information retrieval methods combined with ontology not only bring out the advantages of knowledge-based retrieval but also handle the relationships between concepts. Although ontology research is still at an early stage and there are as yet no uniform standards or established applications, ontology in the Semantic Web will certainly become a hot spot, and its application to information retrieval and the Semantic Web will be a focus of the field. An ontology can represent the meaning of information through a hierarchical structure and supports reasoning, which makes ontology-based information retrieval a promising method. An ontology includes definitions of concepts, so that a machine can understand the concepts of a domain and the relationships between them within a unified framework. The system can understand a user's query by analyzing the query expression and mapping it to information resources, and retrieval of this kind achieves much higher performance than traditional methods.

The main contributions and results of this dissertation are as follows:

1. This dissertation gives a general summary of research on web information retrieval and related techniques, and analyzes its background and course of development. After introducing and analyzing the development of search engines and ontology, it presents the advantages of and need for a topic-specific search engine. The future of search engines is also discussed. The basic theory and strategies of topical web crawling and text classification are introduced and analyzed as the groundwork for the further research.

2. A focused crawling algorithm loads a page and extracts its links. By rating the links based on keywords, the crawler decides which page to retrieve next, and the Web is traversed link by link. Our crawling framework builds on and extends existing work on focused document crawling. We do not only use keywords for the crawl but also rely on high-level background knowledge with concepts and relations, which are compared with the text of the crawled page. With this ontology-based method the crawl can easily be kept directly on topic. The method makes the following main contributions: an ontology structure extended for the purposes of the focused crawler; several new approaches to relevance computation based on conceptual and linguistic means that reflect the underlying ontology structure; management of both the focused crawling process and the ontology; and an empirical evaluation showing that ontology-based crawling clearly outperforms standard focused crawling techniques.

3. Evaluating the relevance of a target web page from page information is an effective approach to topical web crawling. However, a common problem in constructing the classifier is that a large number of training examples must be labeled manually. Positive examples are easier to obtain than negative ones. Moreover, the negative examples that are found are biased by subjective factors, which degrades classifier performance. Researchers have therefore proposed building a classifier from a few positive and many unlabeled examples, which is called the PU problem. This dissertation puts forward ontology-based feature selection for PU classification, which scans the documents twice. In the first pass, we obtain the semantic meanings of the documents with WordNet. In the second pass, we filter out terms that have no synsets. We then reduce the dimensionality and obtain the text vectors. Combined with Co-Training and Affinity Propagation, ontology-based feature selection is shown to greatly improve classifier performance when positive examples are few. An empirical evaluation shows that, compared with the document frequency method, our algorithm increases the F1 of the One-Class classifier by 10.183% when positive examples are fewer and by 1.941% when they are more numerous, and increases the F1 of the PEBL classifier by 2.781%.

4. Because of the complexity of the web environment and the multiple topics within web pages, it is quite difficult to retrieve all the pages relevant to a specific topic. An irrelevant page may link to a relevant one, so the crawler needs to pass through irrelevant pages to reach more relevant pages. This procedure is called tunneling. There are two types of tunneling: grey tunneling and black tunneling. Our main work here is a new page segmentation method and a grey tunneling system based on it. The method makes use of the visual, tag, link, and ontology information in web pages. The visual information includes background color, font size, font color, and so on; the tag information uses an ordered tag collection {<table>, <tr>, <p>, <hr>} to segment the page recursively; the link information makes use of the "pagelet" concept and anchor text; and the ontology information provides hierarchical concepts. Finally, we put forward a number of heuristic rules to control the accuracy and granularity of the blocks when segmenting a page. For black tunneling, we use association rules to solve the problem.

5. Respect for users and the study of their behavior and interests are fundamental to user-oriented personalized service, which gives users a better guarantee of making use of resources. The aim of such a service is to satisfy users' requests, and everything starts from the user's requirements. Users can not only customize their interface but also freely select the content of the services they need and define their own preference profiles. Information services are delivered over the network according to the specific user's interests, habits, and so on, to meet individual needs. Personalized service has become an inevitable trend in the development of search engines. Building on the focused crawling ideas proposed above, we build a focused crawling model for a specific user's interests, based on cognitive psychology, information dissemination, and the laws of forgetting. Following users' search habits and tracking their behavior patterns, we realize user-specific recommendation, filtering, and other personalized services through machine learning and trained user models. We also note that users with similar behavior can be grouped into user groups, within which information can be shared and disseminated. Typical users and field experts can also be identified. The research is characterized by semantics, personalization, intelligence, and decision support.

To sum up, research on semantic information retrieval is of important theoretical value and has wide application in the search engine area. This dissertation has done some research on its modeling and application. Our further research will emphasize the application, evaluation, and deployment of ontology-based focused crawling in web search engines.
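
The keyword-rated crawl loop that the abstract describes as the starting point (load a page, extract its links, rate the links, retrieve the best-rated one next) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the toy in-memory web `TOY_WEB`, the `keyword_score` rating, and the `focused_crawl` and `toy_fetch` functions are all hypothetical names introduced here, and the ontology-based relevance computation is not modeled at all. The sketch only shows the best-first frontier, assuming links are rated by how many topic keywords occur in their anchor text.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, [(anchor text, target URL), ...]).
TOY_WEB = {
    "a": ("ontology crawler home", [("ontology tutorial", "b"), ("cooking recipes", "c")]),
    "b": ("ontology tutorial on semantic web", [("focused crawler paper", "d")]),
    "c": ("cooking recipes for pasta", [("more recipes", "e")]),
    "d": ("focused crawler with ontology", []),
    "e": ("pasta sauces", []),
}


def keyword_score(text, keywords):
    """Fraction of topic keywords that occur in the text (a deliberately simple rating)."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def focused_crawl(seed, keywords, fetch, max_pages=10):
    """Best-first crawl: always retrieve the highest-rated unvisited link next."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        _text, links = fetch(url)  # "load the page and extract the links"
        for anchor, target in links:
            if target not in visited:
                # Rate each link by its anchor text before downloading the target.
                heapq.heappush(frontier, (-keyword_score(anchor, keywords), target))
    return order


def toy_fetch(url):
    """Stand-in for an HTTP fetch plus link extraction (assumption for this sketch)."""
    return TOY_WEB[url]


crawl_order = focused_crawl("a", ["ontology", "crawler"], toy_fetch)
print(crawl_order)
```

In this toy graph the on-topic pages are visited before the off-topic ones because their anchor texts score higher. An ontology-based variant would presumably replace `keyword_score` with a relevance measure over concepts and relations, which is beyond what this sketch attempts.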

  • 【Online Publication Contributor】 Jilin University
  • 【Online Publication Issue】 2009, Issue 08