基于本体的旅游领域Web信息抽取

Ontology-Based Web Information Extraction in Tourism Domain

【Author】 陈立娜 (Chen Lina)

【Advisor】 王驹 (Wang Ju)

【Author Information】 Guangxi Normal University, Computer Software and Theory, 2009, Master's degree

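【Abstract (translated from the Chinese)】 With the development of the Internet and Web technology, the WWW has become a huge information repository, yet with traditional search engines users often find it very difficult to locate exactly the information they need. Web information extraction emerged against this background. There is now a large body of research on Web information extraction, and the main approaches are based on natural language processing, wrapper induction, HTML structure, and ontologies. Ontology-based information extraction relies mainly on descriptive information about the data itself and depends little on the Web page; in addition, an ontology supplies machine-readable domain concepts and their relations and has simple reasoning capability. Using an ontology in information extraction has further advantages. First, an ontology provides a rich, predefined vocabulary that serves as a stable conceptual interface to the data sources and is independent of the data schema. Second, the knowledge represented in an ontology is sufficient to support the transformation of all relevant information sources. Third, an ontology supports consistency management and the identification of inconsistent data. Based on this analysis and the practical needs of the project, this thesis proposes an ontology-based method for Web information extraction in the tourism domain, and designs and implements a prototype system for extracting Guangxi tourism information. The main work and contributions are: (1) Several major ontology construction methods are analyzed and compared. On balance, the thesis adopts the method proposed by Mike Uschold and Michael Gruninger to build the tourism domain ontology. During construction, it studies the relations between ontology concepts, the concept hierarchy, concept equivalence, property restrictions, and instance equivalence. (2) The Pellet reasoner is introduced and the SHOIQ(D)-Tableaux reasoning algorithm is described. The thesis studies how to reason over the tourism domain ontology with this algorithm, including ontology consistency checking, concept subsumption checking, concept satisfiability checking, property restriction checking, and instance checking. It then describes how Jena is used to parse the ontology, extracting its concepts, keywords, relations, and instances and storing them in a database. (3) Building on ontology reasoning and parsing, the thesis first converts each Web page into a DOM tree and describes an algorithm that extracts the main text of the page by locating it with tourism ontology keywords. It then describes Chinese word segmentation that combines the ICTCLAS segmentation tool with a tourism domain vocabulary, and analyzes stop-word filtering. Finally it describes the extraction rules, which are built by combining the semantic features of properties with triples. Based on these key techniques, the thesis implements a prototype platform for Guangxi tourism information extraction, Tourism_IESystem, and evaluates the performance of the extraction system on Web pages from tourism sites. The results indicate that the method is technically feasible and has practical application value.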

【Abstract】 With the development of the Internet and Web technology, the WWW has become a tremendous information repository. However, with traditional search engines, people cannot easily find the precise information they need. Web information extraction technology emerged against this background.

At present, there is a great deal of research on Web information extraction. The main methods are based on natural language processing, wrapper induction, HTML structure, and ontologies. Ontology-based information extraction mainly uses descriptive information about the data itself and relies less on the Web page. An ontology can also provide machine-understandable domain concepts and their relations, and it has simple reasoning ability. Using an ontology in information extraction has further advantages. First, an ontology provides a rich, predefined lexicon, which can serve as a stable conceptual interface to the data sources and is independent of the data schema. Second, the knowledge represented by the ontology is sufficient to support the conversion of all relevant information sources. Third, an ontology supports consistency management and the identification of inconsistent data.

Based on the analysis above and the actual needs of our project, this thesis proposes an ontology-based method of Web information extraction in the tourism domain, and designs and implements a prototype platform for extracting tourism information about Guangxi, Tourism_IESystem. The main work of this thesis is as follows:

(1) The main methods of domain ontology construction are analyzed and compared. All things considered, the tourism ontology in this thesis is constructed using the method proposed by Mike Uschold and Michael Gruninger. During construction, the thesis studies the relations between concepts, the hierarchical structure of concepts, the equivalence of concepts, the restrictions on properties, and the equivalence of individuals.

(2) The Pellet reasoner is introduced and the SHOIQ(D)-Tableaux reasoning algorithm is described. The thesis studies reasoning over the tourism domain ontology with this algorithm, including checks of ontology consistency, concept subsumption, concept satisfiability, property restrictions, and instances. It then describes parsing the ontology with Jena, which extracts the ontology's concepts, keywords, relations, and instances and stores them in a database.

(3) On the basis of ontology reasoning and parsing, the thesis first converts each Web page into a DOM tree and describes an algorithm that extracts the main text of the page by using the keywords of the tourism ontology to locate the content region. Second, it describes Chinese word segmentation that combines the ICTCLAS word segmentation tool with a tourism domain vocabulary, and analyzes the filtering of stop words. Finally, it describes the extraction rules, which are constructed by combining the semantic features of properties with triples.

Finally, based on the key technologies studied in this thesis, a prototype platform for extracting tourism information about Guangxi, Tourism_IESystem, is implemented. The performance of the information extraction system is validated using Web pages from tourism sites as the experimental objects. The results show that the proposed method is technically feasible and has practical application value.
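
The pipeline in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, whose details the abstract does not give. Everything here is an assumption made for illustration: the keyword, trigger-word, and stop-word sets are invented sample data (the thesis reads such data from a database filled by the Jena parser), a simple keyword count stands in for the DOM-based localization algorithm, and dictionary-based forward maximum matching stands in for ICTCLAS.

```python
from html.parser import HTMLParser

# Hypothetical sample data, not taken from the thesis.
ONTOLOGY_KEYWORDS = {"象鼻山", "桂林市", "门票"}   # concept/instance keywords
PROPERTY_TRIGGERS = {"位于": "locatedIn"}         # trigger word -> ontology property
STOP_WORDS = {"的", "，", "。"}
LEXICON = ONTOLOGY_KEYWORDS | set(PROPERTY_TRIGGERS) | STOP_WORDS


class BlockCollector(HTMLParser):
    """Collect the text of each innermost <p>/<div>/<td> block."""
    BLOCK_TAGS = {"p", "div", "td"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def locate_body(html):
    """Return the block containing the most ontology keyword occurrences."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    score = lambda block: sum(block.count(k) for k in ONTOLOGY_KEYWORDS)
    return max(parser.blocks, key=score, default="")


def segment(text, max_len=4):
    """Forward maximum matching against LEXICON (a stand-in for ICTCLAS)."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in LEXICON:   # fall back to a single character
                tokens.append(word)
                i += n
                break
    return tokens


def extract_triples(tokens):
    """Emit (subject, property, object) when a trigger sits between two keywords."""
    words = [t for t in tokens if t not in STOP_WORDS]   # stop-word filtering
    triples = []
    for j in range(1, len(words) - 1):
        prop = PROPERTY_TRIGGERS.get(words[j])
        if prop and words[j - 1] in ONTOLOGY_KEYWORDS and words[j + 1] in ONTOLOGY_KEYWORDS:
            triples.append((words[j - 1], prop, words[j + 1]))
    return triples


SAMPLE_HTML = (
    "<html><body><div>首页 | 登录 | 注册</div>"
    "<div><p>象鼻山位于桂林市。</p></div></body></html>"
)
triples = extract_triples(segment(locate_body(SAMPLE_HTML)))
```

On the sample page, the navigation block contains no ontology keywords, so the sentence block is chosen as the main text and yields the single triple (象鼻山, locatedIn, 桂林市). A real system would need a far larger lexicon and rules that tolerate words between the trigger and its arguments.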
