Research on the Web Structure Data Extraction Based on the Browser and Its Implementation
【Author】 Fu Liangliang (付亮亮);
【Advisor】 Zuo Wanli (左万利);
【Author Information】 Jilin University, Computer Software and Theory, 2010, Master's degree
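【Summary】 The rapid development of Internet technology has given people access to a vast amount of information and resources. Much of this information is queried from databases and then rendered into web pages through templates; such data is called structured data, or records. Extracting structured data can provide value-added services for information integration, vertical search, and many other fields, so it is of great practical use. Many researchers have studied the problem, for example with approaches based on natural language processing or on the DOM tree structure of web pages, but these approaches all extract from a single page, which has several drawbacks: 1) the complete information on one topic may have to be extracted from multiple pages, which challenges both extraction and the later data integration; 2) a crawler must fetch pages for the extractor, and page crawling has limited ability to handle the deep web; 3) page data may be generated by JavaScript or obtained through asynchronous AJAX requests, which traditional extraction approaches handle poorly. This thesis presents a browser-based information extraction approach, with a visual extraction rule generation tool and a back-end extraction runtime, that can solve these problems. It proposes the following ideas: 1) Provide a visual, interactive extraction rule generation tool. With very little interaction, the user can generate extraction rules that apply to all information on the same topic across an entire site, and several alternative extraction methods are offered so that the most suitable one can be chosen for each situation. 2) Locate the information to extract by combining DOM tree path information, visual information, and invariant text. The thesis proposes EPath (Extraction Path) to describe DOM tree path information, together with an algorithm that resolves an EPath to locate nodes. EPath improves on traditional XPath: it contains not only a node's position and attribute information but also visual information. Resolution combines these properties to locate nodes, scores each node's degree of match, and selects the best node, instead of the fail-fast approach of XPath, which applies only one strategy at a time. This solves the problem of structural differences caused by optional data items in data generated from the same template. 3) Browser-based navigation, form submission, repeated-substructure identification, and pagination device identification remove the limitations in deep web extraction and in handling JavaScript and AJAX. 4) Define complex extraction instructions, which amount to a DSL (Domain Specific Language) for information extraction and can handle complex extraction tasks. Building on these ideas, the thesis applies them in a practical system and builds a tool for acquiring Web information that serves as a data source for information integration and vertical search.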
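The scoring idea behind EPath resolution can be illustrated with a minimal sketch. It is not the thesis's algorithm: the `Node` and `EPathStep` classes, the feature names, and the `WEIGHTS` values are all hypothetical, chosen only to show how position, attribute, and visual evidence might be combined into one score instead of failing on the first mismatch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A simplified DOM node (hypothetical model, not a real browser API)."""
    tag: str
    index: int                                   # position among siblings
    attrs: dict = field(default_factory=dict)
    visual: dict = field(default_factory=dict)   # e.g. {"bold": True}

@dataclass
class EPathStep:
    """What one step of an extraction path expects to find (assumed shape)."""
    tag: str
    index: int
    attrs: dict = field(default_factory=dict)
    visual: dict = field(default_factory=dict)

# Assumed weights; the thesis does not state how the features are weighted.
WEIGHTS = {"index": 1.0, "attrs": 2.0, "visual": 1.5}

def overlap(expected, actual):
    """Fraction of expected key/value pairs present in actual (1.0 if none expected)."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)

def score(step, node):
    """Combine position, attribute, and visual agreement into one number."""
    if node.tag != step.tag:
        return 0.0
    total = WEIGHTS["index"] * (1.0 if node.index == step.index else 0.0)
    total += WEIGHTS["attrs"] * overlap(step.attrs, node.attrs)
    total += WEIGHTS["visual"] * overlap(step.visual, node.visual)
    return total

def best_match(step, candidates):
    """Return the highest-scoring candidate, or None if nothing matches at all."""
    scored = [(score(step, node), node) for node in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return best if best_score > 0 else None

# An optional cell was inserted by the template, shifting the price cell
# from position 2 to position 3.
step = EPathStep("td", 2, {"class": "price"}, {"bold": True})
row = [Node("td", 2, {"class": "label"}),
       Node("td", 3, {"class": "price"}, {"bold": True})]
print(best_match(step, row).index)
```

In this example a purely positional lookup would select the cell at position 2, while the scored lookup prefers the cell whose attributes and visual features agree with the recorded step.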
【Abstract】 With the development of Internet technology, the amount of information on the Internet is growing rapidly. The Internet has become a dynamic, globally distributed information server that holds all kinds of information and resources and provides a variety of services to users and enterprises. Large amounts of data are queried from databases and then displayed on web pages through templates; this kind of data is generally referred to as structured data, or records. Many researchers have studied how to extract structured data from the Web, for example with extraction based on Natural Language Processing or on the DOM tree of the web page, but these approaches extract from a single page and have several shortcomings: 1) information on the same topic may be spread over multiple pages, so extraction must cover multiple pages and the results must be integrated afterwards to produce a complete record; 2) web crawlers have limited capabilities on the deep web; 3) these methods have limited capabilities for handling JavaScript and AJAX. This thesis presents a browser-based extraction method that helps solve these problems. The method combines more kinds of information and offers different positioning strategies for the user to select according to the context of the extraction. The thesis proposes the following ideas to solve the extraction problems: 1) Provide an interactive, visual tool for generating extraction rules. The user needs only a few interactions to generate extraction rules that can be applied to information on the same theme across an entire site. The tool provides a variety of alternative methods so that the user can select the appropriate one for the context. 2) Provide an information location method that combines the DOM tree path, visual information, and immutable text information.
The thesis provides a description of the DOM tree path called EPath (Extraction Path), which contains the position of the node, the attributes of the node, and visual information. We also give the parsing algorithm. Unlike an XPath parser, which is a fail-fast method, it scores each candidate node according to its degree of match with the EPath information mentioned above. The algorithm solves the problem of structured data that shares the same template but includes optional data items. 3) Provide a technique based on browser navigation that supports form submission, repeated-structure identification, and next-page (pagination) device identification, which removes the restrictions on deep web extraction and on JavaScript and AJAX handling. 4) Define complex extraction instructions, which provide a DSL (Domain Specific Language) for the extraction field and make complex tasks easy to solve. 5) Build an extraction system that can be used as an information location tool in vertical search engines and information integration systems. The extraction process consists of the following phases: 1) The user demonstrates how to extract a record with the extraction rule generation tool, the extraction rules are generated automatically, and finally they are saved in XML format. 2) The extractor executes the extraction using the rules generated in the first phase and saves the extracted results to a file or database. The information location algorithm is a key information extraction technology, so the thesis describes it first. EPath gives a way to describe the location of the information to extract, combining location, attribute, and visual information. The extraction rule generation tool uses Firefox extension technology, and the EPath is generated automatically when the tool generates the extraction rules.
At runtime, the extractor interprets the EPath and locates the target node. EPath handles structured data that shares the same template but includes optional data items, providing a robust method for locating web information. The interactive extraction rule generation tool is an important part of the extraction system. We customize the Firefox browser through Firefox extension technology to make it suitable for interactive extraction, so that operating the tool feels like surfing the Internet, which makes the tool easy to use. Form identification and submission, identification of repeated structures and pagination devices, and embedded browser technology form the foundation of extraction based on browser navigation. We use the information generated by the interactive extraction tool to identify and submit forms. The thesis uses a structural similarity algorithm to identify repeated structures; the algorithm defines similarity in terms of string edit distance. Pagination device identification uses a heuristic, rule-based approach: we define four rules to identify different kinds of pagination devices. The extraction runtime uses an embedded Firefox browser to navigate and extract information, which keeps the use of extraction rules consistent with their generation and also gives it the ability to deal with JavaScript and AJAX. We define the extraction instructions and the logic instructions using EMF technology and an XML format, which makes the instructions extensible and flexible. The extraction instructions define a domain language for extraction, which gives strong support for extraction tasks. After introducing the algorithms and principles, we describe the structure of the extraction system and its key modules in chapters 5 and 6, and we also give commented code for the important parts of the system to make the description easier to follow. The extraction system contains two parts: the visual extraction rule generation tool and the extraction runtime.
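The use of string edit distance to measure structural similarity, mentioned above for repeated-structure identification, can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's algorithm: each sibling subtree is assumed to be serialized as a sequence of tag names, similarity is assumed to be one minus the normalized Levenshtein distance, and the 0.7 threshold and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or keep if equal
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1.0 for identical sequences, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def repeated_runs(children, threshold=0.7):
    """Indexes of consecutive siblings that resemble the first sibling of their run.

    The threshold is an assumed value, not one taken from the thesis.
    """
    runs, current = [], []
    for i, tags in enumerate(children):
        if current and similarity(children[current[0]], tags) >= threshold:
            current.append(i)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [i]
    if len(current) > 1:
        runs.append(current)
    return runs

# Tag sequences of five sibling subtrees; the middle three are record-like,
# and one of them carries an optional <em> element.
siblings = [["div", "h1"],
            ["li", "a", "span"],
            ["li", "a", "span", "em"],
            ["li", "a", "span"],
            ["p"]]
print(repeated_runs(siblings))
```

Comparing each sibling with the first member of its run, rather than with its immediate predecessor, keeps small differences from accumulating across a long run; either choice is a design decision that this abstract does not specify.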
The extraction rule generation tool uses Firefox extension technology and provides an interactive way to generate extraction rules. The tool is divided into a basic services layer and an interactive UI layer: the basic services layer defines several XPCOM components, which provide basic services to the UI layer. The interactive UI layer provides a toolbar and a popup menu for extraction rule generation. The toolbar has several buttons, such as load schema, watch model, and save model; the popup menu is the primary means of interactive extraction, and all extraction operations are defined in it. The extraction runtime uses the embedded browser, but we define an abstraction layer for browser and DOM operations, which uses the adapter design pattern to keep the specific browser API from polluting the extraction code. We can replace the browser with HtmlUnit for extraction from a database of web pages, which avoids the rendering phase and improves performance. The contributions and innovations of the thesis are mainly in the following areas: 1) Provide an interactive, visual extraction rule generation tool based on Firefox extension technology. It reduces complexity and lowers the threshold for using the tool. 2) Provide an extraction algorithm based on EPath and immutable text. EPath provides a robust method for positioning web nodes that combines a node's attributes, position information, and visual information. 3) Provide a browser-based method of extraction through web page navigation, with algorithms for identifying and submitting forms and for identifying repeated structures and pagination devices. 4) Define complex extraction instructions, which describe the operations of web information extraction, together with some logic instructions.
The instructions define a domain language for web information extraction. 5) Build a practical extraction system that solves problems in the extraction field such as handling JavaScript and AJAX, information on the same theme spread across multiple pages, and deep web extraction. In summary, the thesis presents a browser-based Web information extraction method and makes an attempt at building a practical web information extraction tool, whose recall and precision are good enough for practical use. The system presented in this thesis can be used as the web information location tool for vertical search engines and information integration. Although some advances have been achieved in the domain of web information extraction, much work remains to be done because of the limited research time and the author’s ability. Specifically, the following points need improvement and are worth further research: 1) The EPath location method depends heavily on the structural information of the web page, so it could first partition the page into blocks, then locate the record block, and finally use a relative EPath to locate the node within the block. 2) A more convenient extraction rule generation tool with more ways of interacting. 3) Pagination recognition is based on heuristic rules, but the rules cannot cover all cases, so selecting appropriate features and using a machine learning approach may help. 4) The similarity algorithm used to identify repeated structures could incorporate visual information such as block size and alignment, which may yield a more efficient and accurate method.
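The abstraction layer between the extraction code and the browser, described in the abstract as an application of the adapter design pattern, might look roughly like the sketch below. The interface and class names (`BrowserAdapter`, `StaticPageAdapter`, `extract_record`) and the locator strings are hypothetical, and the in-memory backend only stands in for a real embedded browser or a non-rendering backend; no actual Firefox or HtmlUnit API is used here.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

class BrowserAdapter(ABC):
    """What the extraction code is allowed to know about a browser (assumed interface)."""

    @abstractmethod
    def open(self, url: str) -> None:
        """Navigate to a page."""

    @abstractmethod
    def text_of(self, locator: str) -> Optional[str]:
        """Return the text of the node found by a locator, or None if absent."""

class StaticPageAdapter(BrowserAdapter):
    """A backend without rendering, serving pages already stored in memory."""

    def __init__(self, pages: Dict[str, Dict[str, str]]):
        self._pages = pages
        self._current: Dict[str, str] = {}

    def open(self, url: str) -> None:
        self._current = self._pages[url]

    def text_of(self, locator: str) -> Optional[str]:
        return self._current.get(locator)

def extract_record(browser: BrowserAdapter, url: str, fields: Dict[str, str]) -> dict:
    """Extraction logic written only against the adapter, never a concrete browser."""
    browser.open(url)
    return {name: browser.text_of(locator) for name, locator in fields.items()}

# Hypothetical stored page, keyed by illustrative locator strings.
pages = {"http://example.com/item/1": {"/html/body/h1": "Blue Kettle",
                                       "/html/body/span": "19.90"}}
record = extract_record(StaticPageAdapter(pages),
                        "http://example.com/item/1",
                        {"title": "/html/body/h1", "price": "/html/body/span"})
print(record)
```

Because `extract_record` depends only on the abstract interface, a second adapter that wraps an embedded browser could be swapped in without changing the extraction logic, which is the benefit the abstract attributes to this layer.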
【Key words】 Information Extraction; Web mining; Web Structure Data; Deep Web; Browser Navigator; Firefox Extension;