Deep Web Query Results Extraction and Annotation (original title: Deep Web查询结果抽取及注释)
【Author】 Xie Ying (谢莹);
【Supervisor】 Zuo Wanli (左万利);
【Author information】 Jilin University, Computer Software and Theory, 2010, Master's degree
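【Summary】 This thesis studies Deep Web data integration systems, focusing on two units of such a system, query results extraction and query results annotation, and proposes an implementation method for each. Query results extraction means automatically extracting data records from the page returned for a query; query results annotation means adding a semantic label to each data item in the extracted records. For query results extraction, the thesis uses an approach based on the HTML tag tree and mines data records top-down in the tag tree through a recursive procedure. Data records are identified by computing the similarity between tag trees, and that similarity is computed from edit distance. The thesis proposes a definition of a data record that differs from the one used by traditional methods; extraction based on this definition is simpler than traditional methods because it does not need to mine data regions first and instead extracts data records directly. For query results annotation, the thesis adds semantic labels to data items by combining an ontology with heuristic rules: the ontology ensures the consistency of the annotations, and the heuristic rules improve their completeness. This unit is divided into an ontology management module and a semantic annotation module. The ontology management module builds a book-domain ontology library and maintains the ontology with a sub-concept table and a candidate-concept table; the semantic annotation module defines the heuristic rules and specifies the procedure for annotating a data item. Experiments were run on query result pages from several Chinese book-domain Deep Web sites, and the results show that the proposed method is accurate and effective.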
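The annotation unit summarized above, and detailed in the English abstract, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the table contents, the heuristic rules, the function names and the use of a colon to separate the described text from the value are all assumptions made for illustration, and the set of unmatched labels stands in for the candidate-concept table.

```python
import re

# Hypothetical sub-concept table: main-concept -> labels treated as its sub-concepts.
SUB_CONCEPTS = {
    "author": {"writer", "by"},
    "publisher": {"press", "published by"},
    "price": {"list price"},
}
# Hypothetical ontology instances: main-concept -> known values (lower-cased).
INSTANCES = {"publisher": {"tsinghua university press"}}


def looks_like_isbn(text):
    # Assumed heuristic: 10 or 13 characters once hyphens and spaces are removed.
    digits = re.sub(r"[- ]", "", text)
    return re.fullmatch(r"\d{9}[\dXx]|\d{13}", digits) is not None


def looks_like_price(text):
    # Assumed heuristic: a currency sign followed by a number.
    return re.fullmatch(r"[¥$]\s?\d+(\.\d+)?", text) is not None


HEURISTIC_RULES = [("isbn", looks_like_isbn), ("price", looks_like_price)]


def find_main_concept(label):
    """Map a described text to a main-concept via the sub-concept table."""
    key = label.strip().lower()
    for main, subs in SUB_CONCEPTS.items():
        if key == main or key in subs:
            return main
    return None


def annotate(item, candidate_table):
    """Return (label or None, value) for one pre-processed data item."""
    parts = re.split(r"[:：]", item, maxsplit=1)
    value = parts[-1].strip()
    if len(parts) == 2:
        # Item with a described text: try the ontology concepts first.
        main = find_main_concept(parts[0])
        if main is not None:
            return main, value
        # Unmatched described text is kept for review by a domain expert;
        # the remaining text is then handled as a content-type item.
        candidate_table.add(parts[0].strip())
    # Content-type item: match ontology instances, then heuristic rules.
    for main, values in INSTANCES.items():
        if value.lower() in values:
            return main, value
    for main, rule in HEURISTIC_RULES:
        if rule(value):
            return main, value
    return None, value
```

For example, `annotate("Writer: Zhang San", set())` yields the label `author` because "writer" is listed as a sub-concept, while a bare string such as `"978-7-302-12345-6"` is only labeled through the fallback ISBN rule.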
【Abstract】 [...] carry this information, Web databases appeared. The information is stored in Web databases; when users want to find it, they only need to fill in the entry forms of the Web databases, which are also called query interfaces. Web sites that contain Web databases are called the Deep Web. The Deep Web is rich in information, so it is attracting more and more research. Current research on the Deep Web falls into three parts: Query Interfaces Integration, Query Processing and Query Results Processing. Query Interfaces Integration includes four units: Web Databases Discovery, Query Interfaces Schema Extraction, Web Databases Classification and Query Interfaces Integration. Query Processing includes two units: Web Databases Selection and Query Transformation. Query Results Processing contains three units: Query Results Extraction, Query Results Annotation and Query Results Combination. Together, these three parts form the Deep Web Data Integration System. This paper focuses on the Query Results Extraction and Query Results Annotation units. Query Results Extraction means mining and extracting data records from the returned query results page. Query Results Annotation means adding a semantic label to each data item of the data records. In the Query Results Extraction unit, this paper uses a method based on the HTML tag tree. Data records in the same result page are highly similar in structure, and this similarity shows up in the tag trees that form them; so, by converting pages to tag trees, data records can be identified based on the similarity of tag trees. After observing a large number of result pages returned by Deep Web sites, together with their source code, we summarized the structural characteristics of data records in tag trees and put forward a definition of a data record that differs from earlier ones.
This method is divided into two steps: (1) building the tag trees of Web pages; (2) mining data records. In step (1), this paper uses HtmlParser to parse the page, and the result is saved in a parse tree. The nodes of the parse tree are of type Node, an interface defined by HtmlParser. Node has three implementation classes, TagNode, TextNode and RemarkNode, of which TagNode represents the tag nodes of the HTML code. The parse tree is traversed top-down to find the TagNode nodes, which become the nodes of the tag tree, and useless nodes that only express style are deleted. In step (2), mining data records is a recursive procedure. Starting from the root of the tag tree, the root node is passed as the actual argument of the procedure, which checks whether the node is a data record. If so, the recursion exit is reached: the content of the node is extracted and the procedure ends. If not, all child nodes of the node are found and the procedure is called recursively on each of them in turn. When checking whether a node is a data record, the most important step is calculating the similarity of sub-trees of the tag tree, which is done with the edit distance algorithm: the two sub-trees are first traversed to turn them into two tag node sequences, and these sequences are then passed to the edit distance algorithm. As can be seen from the above, and unlike many methods based on tag trees, the biggest feature of our method is that it does not need to mine data regions but mines data records directly. In the Query Results Annotation unit, this paper adds a semantic label to each data item by using an Ontology combined with heuristic rules. The Query Results Annotation unit contains the Ontology Management Module and the Semantic Annotation Module.
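The two-step extraction method described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis code: Python's standard `html.parser` stands in for the Java HtmlParser library, the list of style-only tags is a guess, and because the abstract does not give its definition of a data record, the sketch assumes that a node is a data record when its sub-tree has at least `min_size` tags and is similar to at least one sibling sub-tree. The default threshold of 0.9 echoes the value reported for S, on the assumption that S is the similarity threshold.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
STYLE_TAGS = {"b", "i", "u", "font", "span", "strong", "em"}  # assumed style-only tags


class Node:
    def __init__(self, tag):
        self.tag, self.children, self.text = tag, [], []


class TagTreeBuilder(HTMLParser):
    """Step 1: keep tag nodes only, dropping style and void tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS or tag in VOID_TAGS:
            return  # text inside a style tag stays with the enclosing node
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text.append(data.strip())


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_sequence(node):
    """Pre-order traversal of a sub-tree into a sequence of tag names."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similarity(n1, n2):
    s1, s2 = tag_sequence(n1), tag_sequence(n2)
    return 1 - edit_distance(s1, s2) / max(len(s1), len(s2))


def content(node):
    parts = list(node.text)
    for child in node.children:
        parts.extend(content(child))
    return parts


def mine(node, siblings=(), threshold=0.9, min_size=2):
    """Step 2: top-down recursion that stops at nodes judged to be records."""
    is_record = len(tag_sequence(node)) >= min_size and any(
        s is not node and similarity(node, s) >= threshold for s in siblings
    )
    if is_record:
        return [content(node)]  # recursion exit: extract the record content
    records = []
    for child in node.children:
        records.extend(mine(child, node.children, threshold, min_size))
    return records


page = """<html><body>
<div id="nav"><a href="#">Home</a></div>
<ul>
<li><a href="/b1">Book One</a><p>Author A</p></li>
<li><a href="/b2">Book Two</a><p>Author B</p></li>
<li><a href="/b3">Book Three</a><p>Author C</p></li>
</ul>
</body></html>"""
records = mine(build_tag_tree(page))
```

On this small page the three `li` sub-trees have identical tag sequences, so each is returned as one record, while the navigation `div` is skipped. No data region is located beforehand; the recursion reaches the records directly.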
In the Ontology Management Module, concepts, comprising main-concepts and sub-concepts, are extracted according to the characteristics of many Deep Web query interface schemas; an Ontology library is established; and a sub-concept table and a candidate-concept table are maintained in the Ontology Manager so that the Ontology can be modified automatically. The sub-concept table stores all the sub-concepts of each main-concept in the Ontology library, which ensures the consistency of Ontology concepts; the candidate-concept table stores concepts that the system cannot match. These concepts need to be confirmed by domain experts as domain relevant or domain irrelevant. Domain-relevant concepts are put into the Ontology library as main-concepts and the sub-concept table is updated; domain-irrelevant concepts are simply discarded. Enriching the sub-concept table and candidate-concept table also enhances the Ontology's ability to distinguish domain semantics. In the Semantic Annotation Module, the data records extracted in the Query Results Extraction unit are first pre-processed: each data item of the data records is standardized, keeping the text content and removing the image content. Then, for a data item to be labeled, it is first determined whether it is a semantic-based data item or a content-type data item. For a semantic-based data item, the main-concepts in the Ontology library, or their sub-concepts, are matched against its described text. If the match succeeds, the main-concept of the matching concept replaces the described text as the label of the data item; if the match fails, the described text is put into the under-judge-concept table and the text after the described text is treated as a content-type data item. For a content-type data item, the instances of the Ontology in the Ontology library are matched against it.
If the match succeeds, the main-concept of the matching instance is used as the described text to label the data item; if the match fails, heuristic rules are applied to it. The Query Results Extraction experiment first uses query result pages from many book-domain Deep Web sites in a test experiment for setting the thresholds. This test shows that when the thresholds H, L and S are set to 2, 10 and 0.9, the F index reaches its maximum of 96.8%. Then, with these threshold settings, an experiment on query result pages from some Chinese book-domain Deep Web sites gives a precision of 100% and a recall of 98.2%, a very good result. The data records extracted in this experiment are used in the Query Results Annotation experiment. In that experiment, the precision and recall of Query Results Annotation are 98.1% and 90.4% and the F index is 94.1%; this meets the requirements of real applications but still needs improvement.
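The reported annotation figures can be cross-checked with a few lines of Python. The sketch assumes that the "F index" is the balanced F-measure, the harmonic mean of precision and recall; the abstract does not define it, but the annotation figures are consistent with that reading. The function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Query Results Annotation: precision 98.1%, recall 90.4%.
annotation_f = round(f_measure(0.981, 0.904) * 100, 1)
print(annotation_f)  # 94.1, matching the F index given in the abstract

# Query Results Extraction on the Chinese sites: precision 100%, recall 98.2%.
# The abstract reports no F index for this run; the value below is derived.
extraction_f = round(f_measure(1.0, 0.982) * 100, 1)
print(extraction_f)  # 99.1
```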
【Key words】 Deep Web; extraction; annotation; tag tree; Ontology; heuristic rules;