

Research on Semantic Annotation for Domain-Specific Web Pages

【Author】 Jing Tao

【Supervisor】 Zuo Wanli

【Author Information】 Jilin University, Computer Application Technology, 2011, PhD

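【Summary】 Adding semantic metadata to web pages, and thereby transforming them into machine-understandable semantic descriptions, falls within the scope of semantic annotation research. This research is essential to the early realization of the semantic web vision and also plays an important role in improving the performance of the automated applications on today's web. Building on an in-depth analysis of prior work, the author draws on knowledge and methods from several fields, including the semantic web, ontology construction, natural language processing, machine learning and web mining, to carry out research on "semantic annotation for domain-specific web pages". The main research content is as follows:
1. A comprehensive analysis and summary of semantic annotation research and its related techniques.
2. Building on existing ontology construction methods, a four-phase ontology construction method that is driven by research requirements and supports research groups working in a distributed environment.
3. In view of the missing programming interface of HowNet 2000 free edition (HowNet for short) and the needs of project development, a technical solution that uses reverse engineering to obtain the HowNet programming interface; the interface obtained is applied in the experiments.
4. A methodological framework, guided by the domain ontology, that combines statistical methods and natural language processing (NLP) techniques to semantically annotate Chinese natural-language web documents. The framework is divided into a data preparation phase, an identification phase and a grouping phase. In the data preparation phase, a feature extraction method is used to build the domain lexicon and form the type tagging gazetteer. In the identification phase, an explicit type tagging algorithm is proposed to recognize the instances and properties in the text. In the grouping phase, a relation extraction algorithm based on the dependency tree and a relation extraction algorithm based on the dependency forest are proposed to complete relation extraction. In addition, an active learning method based on an influence function is given, which improves annotation performance through interactive questioning.
5. A semantic annotation framework based on mining the frequent feature patterns of sentences, comprising three phases: data preprocessing, pattern mining and rule processing. In the data preprocessing phase, a feature sentence extraction algorithm and a feature sequence generation algorithm are proposed. In the pattern mining phase, a suffix-array-based algorithm for mining the frequent feature patterns of sentences is proposed. In the rule processing phase, the mined feature patterns are used to write annotation rules, which are then applied in the semantic annotation process.
This research is supported by the open project "Research on a Second-Generation Browser Prototype" (60496321) under the National Natural Science Foundation of China major project "Basic Theory and Core Techniques of Non-Canonical Knowledge". The research results have been applied in the prototype system CRAB.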

【Abstract】 The flourishing development of web technology has brought about explosive growth in web resources and made the World Wide Web the largest information repository in the world. Although the web provides people with vast amounts of information, it has increasingly exposed a serious problem, information overload: information is abundant while the means of acquiring it are relatively scarce, which makes it difficult for people to obtain valid knowledge. To tackle this problem, people use web information retrieval technology (for example, search engines) and automated agent technology based on information extraction. However, the lack of machine-understandable semantics in web content makes it difficult for such software to be highly efficient. The vision of the semantic web is to make web content machine-understandable. Achieving this vision will enable machines to make full use of the semantic information in web pages and to meet users' demands for knowledge effectively. Realizing the vision requires a large amount of web content that contains semantic metadata, but existing web pages contain little of it. Adding semantic metadata to web pages is the subject of semantic annotation research. This research helps narrow the gap between the current web and the semantic web and realize the vision of the semantic web as early as possible; it improves the performance of search engines and bridges the knowledge gap between users and search engines during search; and it decreases the development cost of automated agents while increasing their robustness and intelligence.

The thesis is financially supported by the Major Research Program of the National Natural Science Foundation of China under grant No. 60496321.
Based on a deep analysis of related research and existing methods, this thesis draws comprehensively on theories and methods from several areas of computer science, such as the semantic web, ontology engineering, natural language processing, machine learning and web mining, to carry out research on semantic annotation for domain-specific web pages. The results have been used in the prototype system CRAB. The main research results and technical contributions of this thesis are as follows.

The thesis has introduced and analyzed the state of the art of semantic annotation research and its related techniques. By comparing the current web with the vision of the semantic web, the thesis has pointed out the urgency and importance of research on semantic annotation. Based on the analysis and definition of the concepts of semantics, annotation and semantic annotation, the thesis has introduced the categories and development of annotation and has reviewed the work related to semantic annotation. In addition, ontology and ontology engineering, which are closely related to semantic annotation, are introduced in depth. All of the above lays the groundwork for the further research.

Based on existing ontology engineering methods, this thesis has presented a four-phase method for constructing the domain ontology, which is driven by research requirements and supports each research group working in a decentralized environment. The building process is divided into four phases: 1. building together; 2. local adaptation; 3. analysis and revision; 4. release and update. Apart from the first phase, the phases are performed in iterative cycles. After each cycle, a newer version of the domain ontology is released and the prototype of the domain ontology evolves. This method is suited to scenarios where users' needs change frequently and facilitates the rapid development of ontologies.

HowNet is an important knowledge base of common sense.
However, the lack of a programming interface for HowNet (free edition) makes it hard for researchers to use it efficiently. Hence, this thesis has given a technical solution for obtaining the interface, which is a valuable exploration into the reverse engineering of binary code. By analysing the assembly code statically and tracing it dynamically, the thesis has successfully extracted the function interface of HowNet and has generated the header files and libraries according to the function calling conventions. The work makes two contributions: first, it provides the programming interface of the HowNet software and facilitates HowNet-related research; second, it is a good reference example of making full use of legacy binary code in research, and especially of reusing binary code that comes without documentation of its programming interface.

Noting the similarity between two forms of knowledge representation, natural language sentences and RDF representations, the thesis has proposed a methodological framework for semantic annotation of Chinese web pages, which is guided by the domain ontology and employs statistical methods and natural language processing (NLP) technology. The framework comprises three phases: the data preparation phase, the identification phase and the grouping phase.

In the data preparation phase, a focused crawler is employed to build the repository of domain-specific web pages. The domain lexicon is constructed by a feature selection technique, which is used to obtain the high-frequency words relevant to the domain from the repository. After the words of the domain lexicon are manually labeled with types that correspond to the concepts or properties of the domain ontology, the type tagging gazetteer is generated.

In the identification phase, the thesis has proposed an explicit property type tagging algorithm (EPTT).
The tagging types are divided into two kinds: ontology types and general types. The algorithm uses both rules and gazetteers to recognize the instances and properties in the text. Compared with conventional named entity recognition methods, this method makes further processing easier by explicitly tagging words of property type.

In the grouping phase, the thesis has grouped the words of sentences by their dependency relationships, has proposed the concepts of dependency tree and dependency forest, and has given two algorithms: the relation extraction algorithm based on the dependency tree (DTRE) and the relation extraction algorithm based on the dependency forest (DFRE). The DTRE algorithm first uses natural language processing (NLP) techniques to parse a given sentence and obtain the dependency relationships of its words, then constructs the dependency tree from these relationships, from which the grammar relation triples (grt for short) are generated. By combining the domain ontology and the type tagging results, the algorithm validates the grts. Each valid grt is transformed into a knowledge triple (RDF statement) that corresponds to the domain ontology. Thus, the mapping from the natural language sentence to the RDF representation is completed. The DFRE algorithm is an improvement on DTRE, designed mainly to handle long Chinese sentences. It decomposes a long sentence into clauses and constructs the dependency tree of each clause. After all the dependency trees are merged into a dependency forest, the DTRE algorithm is called to accomplish the relation extraction. The experimental results show that both methods are significantly more effective than the semantic annotation method based on the subject-verb-object grammatical relationship.

In addition, an active learning idea based on the influence formula has been presented to increase the performance of the annotation.
The influence formula is defined in terms of two aspects: one is the difficulty of annotating the triple, and the other is the influence on the other triples in the collection when this triple is annotated.

Noting that some sentence patterns occur frequently in domain articles, the thesis has presented a method of semantic annotation based on mining the frequent feature patterns of sentences. Following the theory of sequential pattern mining, the thesis has given the definitions of the feature itemset, the feature item and the feature sequence, which are used in mining the frequent feature patterns of sentences. By defining the feature items as word types and the feature sequences as type identifier strings, a semantic abstraction of the original sentences can be attained. On the basis of these definitions, a methodological framework has been proposed, which is composed of three phases: the data preprocessing phase, the pattern mining phase and the rule processing phase.

In the data preprocessing phase, the thesis first extracts the words of property type from the type tagging gazetteer to build the feature word list. Based on the defined formula for calculating the feature strengths of sentences, the feature sentences whose feature strengths are higher than a predefined threshold are extracted from the whole sentence space. From the feature sentences, the corresponding feature sequence database is constructed by the feature sequence generation algorithm.

In the pattern mining phase, the feature sequence database is processed by the proposed sequential pattern mining algorithm based on the suffix array, and the frequent feature patterns are obtained. This mining algorithm makes full use of the advantage of suffix arrays in processing long sequences.
The core idea is to transform the calculation of the supports of feature patterns in the feature sequence database into the calculation of the document frequencies of the feature patterns in the sequence documents.

In the rule processing phase, the thesis has written annotation rules according to the mined feature patterns and has applied them to semantic annotation. The experimental results show that this method can handle some domain-specific sentences effectively and avoid the errors caused by the parser, thereby improving the precision of the annotation. By combining this method with the DFRE method, the performance of semantic annotation has been significantly improved.
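
The pattern mining idea described above, computing a pattern's support as its document frequency over a suffix array, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: it assumes that patterns are contiguous runs of type identifiers, it builds the suffix array by naive sorting, and the names `build_suffix_array` and `mine_frequent_patterns`, the `max_len` limit, and the type identifiers `INST`, `PROP`, `VAL` and `PUNCT` are all hypothetical.

```python
SENTINEL = "\x00"  # assumption: no real type identifier starts with this character


def build_suffix_array(sequences):
    """Concatenate feature sequences and sort all suffix start positions.

    Each sequence is followed by its own sentinel token, so a pattern
    can never span two sentences. Naive O(n^2 log n) construction,
    adequate only for illustration.
    """
    text, doc_ids = [], []
    for doc, seq in enumerate(sequences):
        text.extend(seq)
        text.append(f"{SENTINEL}{doc}")
        doc_ids.extend([doc] * (len(seq) + 1))
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, doc_ids, suffix_array


def mine_frequent_patterns(sequences, min_support, max_len=4):
    """Return {pattern: document frequency} for contiguous patterns.

    Suffixes sharing the same first k tokens are adjacent in the suffix
    array, so one linear scan per length k groups every occurrence of a
    pattern; its support is the number of distinct sequences in the group.
    """
    text, doc_ids, suffix_array = build_suffix_array(sequences)
    frequent = {}
    for k in range(1, max_len + 1):
        current, docs = None, set()
        for pos in suffix_array:
            prefix = tuple(text[pos:pos + k])
            # Skip prefixes that are too short or cross a sentence boundary.
            if len(prefix) < k or any(t.startswith(SENTINEL) for t in prefix):
                continue
            if prefix != current:
                if current is not None and len(docs) >= min_support:
                    frequent[current] = len(docs)
                current, docs = prefix, set()
            docs.add(doc_ids[pos])
        if current is not None and len(docs) >= min_support:
            frequent[current] = len(docs)
    return frequent


# Hypothetical feature sequences (type identifier strings), one per sentence.
sequences = [
    ["INST", "PROP", "VAL"],
    ["INST", "PROP", "VAL", "PUNCT"],
    ["PROP", "INST"],
]
frequent = mine_frequent_patterns(sequences, min_support=2)
```

With these inputs the pattern `("INST", "PROP", "VAL")` is kept with a document frequency of 2, while `("VAL", "PUNCT")` occurs in only one sequence and is dropped. Counting distinct sequences rather than raw occurrences is what makes the result a support value in the sequential pattern mining sense.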

  • 【Online Publication Contributor】 Jilin University
  • 【Online Publication Year/Issue】 2011, Issue 10