Research on Deep Web Data Acquisition Method
【Author】 Cai Xinbao
【Supervisor】 Cui Zhiming
【Author Information】 Soochow University, Computer Application Technology, 2010, Master's thesis
【Abstract】 With the rapid development of the Internet, the scale of information on the Web keeps growing, providing people with all kinds of usable information. A large portion of this information is stored in Web databases and can only be accessed through the query interfaces embedded in web pages. Because such pages are reached by submitting queries rather than by following hyperlinks, traditional search engines cannot index them; this part of the Web is therefore called the Deep Web. The rapidly growing Deep Web has become an important source of information, but the heterogeneity and dynamicity of Deep Web data pose great challenges for large-scale Deep Web data integration. Acquiring Deep Web data and integrating Web databases locally is thus becoming increasingly important. This thesis studies the techniques involved in Deep Web data acquisition and proposes corresponding algorithms and models. The main contributions are as follows: (1) The characteristics of Deep Web sites and query interfaces are studied, and, for deciding which form attributes to fill when submitting queries, a method is proposed for computing the validity of an attribute combination based on attribute correlation. (2) The characteristics of attributes in query interfaces are analyzed, and a machine-learning method is proposed for identifying each typed text attribute in a query interface. (3) Attributes are classified, and different methods are used to generate query words for different attribute types: for generic text attributes, candidate keywords are extracted from query result pages and suitable query words are selected with an adaptive strategy; for typed text attributes, a manually built knowledge base is used. (4) The update characteristics of pages in Deep Web data sources are analyzed, page update events are modeled with a Poisson model, and Deep Web data are acquired incrementally; the architecture of an incremental Deep Web crawler is also designed. In addition, the proposed methods and techniques are evaluated experimentally, and analysis of the results confirms their effectiveness.
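The attribute-correlation idea in contribution (1) can be illustrated with a minimal sketch. The functions, the co-occurrence proxy for correlation, and the mean-pairwise aggregation below are illustrative assumptions, not the measure actually defined in the thesis:

```python
from itertools import combinations


def attribute_correlation(records, a, b):
    """Correlation of two form attributes, measured here as the
    fraction of sample records in which both attributes are non-empty
    (an illustrative proxy, not the thesis's definition)."""
    both = sum(1 for r in records if r.get(a) and r.get(b))
    either = sum(1 for r in records if r.get(a) or r.get(b))
    return both / either if either else 0.0


def combination_validity(records, attrs):
    """Validity of an attribute combination, sketched as the mean
    pairwise correlation over all attribute pairs in the combination."""
    pairs = list(combinations(attrs, 2))
    if not pairs:
        return 0.0
    return sum(attribute_correlation(records, a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, a crawler would enumerate candidate attribute combinations for a form and submit queries only through combinations whose validity exceeds a threshold.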
【Key words】 Deep Web Crawler; Attribute Correlation; Attribute Compounding; Query Selection; Incremental Crawler
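The Poisson update model behind contribution (4) can be sketched as follows. If a page's changes follow a Poisson process with rate λ, the probability that it has changed at least once within elapsed time t is 1 − e^(−λt); an incremental crawler can revisit pages in decreasing order of that probability. The function names and the simple rate estimator are assumptions for illustration, not the thesis's crawler design:

```python
import math


def change_probability(lmbda, elapsed):
    """P(page changed at least once in `elapsed` time units),
    assuming changes arrive as a Poisson process with rate `lmbda`."""
    return 1.0 - math.exp(-lmbda * elapsed)


def estimate_rate(observed_changes, observation_time):
    """Simple estimate of the change rate: observed changes per unit time."""
    return observed_changes / observation_time


def schedule(pages, now):
    """Order pages for revisiting by decreasing probability of change.
    Each page is a dict with its estimated `rate` and `last_crawl` time."""
    return sorted(
        pages,
        key=lambda p: change_probability(p["rate"], now - p["last_crawl"]),
        reverse=True,
    )
```

In an incremental crawler, frequently changing data sources would thus be re-queried sooner, while stable ones are revisited rarely, reducing redundant query submissions.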