
Deep Web数据集成关键问题研究

Research on Key Issues in Deep Web Data Integration

【Author】 Dong Yongquan (董永权)

【Advisor】 Li Qingzhong (李庆忠)

【Author Information】 Shandong University, Computer Software and Theory, 2010, Ph.D.

【摘要】 With the rapid development of Internet technology, the Web has become a huge information source holding massive amounts of data. These data are highly valuable: many application domains, such as market intelligence analysis, urgently need to analyze and mine them to extract useful knowledge and support decision making. However, Web data are large-scale, heterogeneous, autonomous, and distributed, which makes their analysis and mining particularly difficult; the pressing task is to integrate them and thereby supply high-quality data for analysis and mining. By the "depth" of the information it contains, the Web can be divided into the Surface Web and the Deep Web. Deep Web data far exceed Surface Web data in both quantity and quality and therefore have higher application value, so how to integrate Deep Web data for more effective analysis and mining is of great practical significance and has broad application prospects.

Current research on the Deep Web focuses mainly on query-oriented Deep Web data integration. This style of integration obtains a limited amount of data: it suits users' ad hoc query needs but cannot support applications whose goal is analysis and mining. This thesis is devoted to analysis-oriented Deep Web data integration, whose goal is to acquire Deep Web pages as completely as possible and to apply extraction and deduplication techniques to obtain well-structured, high-quality data that support further analysis and mining. Analysis-oriented Deep Web data integration must solve the following problems: (1) analysis and mining require large volumes of data, which in the Deep Web come from pages dynamically generated by multiple Web databases within a domain, so these pages must be acquired automatically and as completely as possible; (2) analysis and mining require well-structured, semantically rich data, which reside in complex, semi-structured Deep Web pages, so structured data must be extracted accurately from the pages and understood semantically; (3) analysis and mining require unified, high-quality data, which are stored redundantly across multiple Web databases in the same domain, so duplicate records must be detected across those Web databases.

Taking analysis-oriented Deep Web data integration as its goal, this thesis studies these key problems. The main work and contributions are summarized as follows.

1. A Deep Web query interface matching method based on extended evidence theory is proposed, which effectively solves the problem of semantically understanding query interfaces when crawling different Web databases in the same domain. A domain contains a large number of Web databases whose query interface schemas are heterogeneous, making it difficult to recognize, in a uniform way, the interface attributes into which query terms should be submitted, and thus hindering the acquisition of Deep Web pages. The proposed method builds a matching between the query interface of the Web database to be crawled and the corresponding domain query interface, and uses it to understand the semantics of the interface's attributes. The method makes full use of multiple features of query interfaces to construct different matchers; it extends existing evidence theory by dynamically predicting each matcher's credibility and combines the matchers' results, improving the adaptability of the combination; it then makes matching decisions through a top-k global optimal strategy and heuristic rules over the tree structure, and uses the resulting matches to interpret the query interface of the Web database to be crawled. Experimental results show that the method achieves high matching accuracy and effectively overcomes the low accuracy caused by the poor adaptability of existing interface matching methods.

2. A Web database crawling method based on a query-term harvest rate model is proposed, which effectively solves the large-scale acquisition of Deep Web pages. Applications aimed at analysis and mining need large amounts of Deep Web data, which come from pages dynamically generated by multiple Web databases within a domain; but because Web databases are accessed only through their query interfaces, traditional search engine crawlers cannot crawl their content. The proposed method samples the Web database and uses the sampled data, with multiple kinds of features, to construct training examples automatically, avoiding manual labeling; it then builds a query-term harvest rate model from the training examples by multiple linear regression and, guided by the model, iteratively selects query terms to submit, thereby crawling the Web database. Experimental results show that the method crawls Web databases with high coverage, effectively overcoming the one-sided, empirical nature of the heuristic rules used by existing crawling methods to select query terms; the learned harvest rate model can also be applied effectively to crawling other Web databases in the same domain.

3. A Deep Web data extraction method based on hierarchical clustering is proposed, which effectively solves the automatic extraction of structured data from Deep Web pages. Deep Web pages are semi-structured, which makes the structured data inside them hard to process automatically. The proposed method uses information from the query result list pages to help identify the content blocks in Deep Web pages and so determine the regions for data extraction; it then jointly exploits the structural and content features of multiple Deep Web pages and hierarchically clusters the feature vectors of the content nodes within corresponding content blocks, thereby extracting the Web data records. Experimental results show that the method achieves high extraction accuracy and effectively overcomes the lower accuracy of most existing methods, which use only the structural information of a page itself.

4. A semantic annotation method for Deep Web data based on constrained conditional random fields is proposed, which effectively solves the problems of missing semantics in Deep Web data and of schema heterogeneity among the data records of multiple Web sites. If the extracted Web data records are annotated solely with the semantic labels already present in Deep Web pages, elements whose labels are missing cannot be handled; moreover, different sites usually use different labels, causing schema heterogeneity among their records. The proposed method builds confidence constraints from existing Web database information and logical constraints from the logical relationships among the data elements of Web data records, introduces both kinds of constraints into the traditional conditional random field model to form a constrained conditional random field, and performs inference by integer linear programming. Using the global attribute label set of the domain Web database schema, it assigns a semantic label to every data element in the Web data records, thereby annotating the Deep Web data semantically and, at the same time, unifying the schemas of data records across Web sites. Experimental results show that the method achieves high annotation accuracy and effectively overcomes the lower accuracy of traditional conditional random fields, which cannot jointly exploit existing Web database information and the logical relationships among Web data elements.

5. A duplicate record detection method based on unsupervised learning is proposed, which effectively solves large-scale duplicate record detection in the Deep Web. A domain contains many Web databases with highly redundant data, making it hard to supply high-quality data for analysis and mining. The proposed method uses cluster ensembles to select the initial training examples automatically, improving their accuracy; it builds a classification model by iterative support vector machine classification, improving the model's accuracy; and it integrates the results of multiple classification models with the extended evidence theory to construct a domain-level duplicate record detection model, thereby detecting duplicate records across the many Web databases of a domain. Experimental results show that the method achieves high detection accuracy, that the resulting domain-level model performs well within its domain, and that it effectively overcomes the inability of traditional methods to perform large-scale duplicate record detection.

【Abstract】 With the rapid development of network technology, the Web has become a huge information source holding massive amounts of valuable data. In many application domains, such as market intelligence analysis, there is an urgent need to analyze and mine these data for knowledge that can aid decision making. However, Web data are heterogeneous, autonomous, and distributed, which makes such analysis and mining difficult; integrating Web data has therefore become an urgent problem. According to the depth at which data are stored, the Web can be divided into two parts, the Surface Web and the Deep Web. The quantity and quality of Deep Web data far exceed those of the Surface Web, so integrating Deep Web data to facilitate analysis and mining has strong practical value and broad prospects.

Recent research efforts have focused on query-oriented Deep Web data integration, which obtains a limited amount of data and suits on-the-fly user queries, but is not fit for applications whose goal is analysis and mining. This thesis studies analysis-oriented Deep Web data integration, whose goal is to obtain as many Deep Web pages as possible and to use extraction and deduplication techniques to produce structured, high-quality data as the basis for analysis and mining. Analysis-oriented Deep Web data integration must resolve the following issues: (1) analyses require plenty of data, which come from Deep Web pages dynamically generated by multiple Web databases in the same domain, so the maximum number of pages must be acquired automatically; (2) analyses require well-formed, semantically rich data, which reside in complex, semi-structured Deep Web pages, so structured data must be extracted accurately and understood semantically; (3) analyses require consistent, high-quality data, which exist with a high repetition rate across multiple Web databases in the same domain, so duplicate records must be detected among those databases.

This dissertation aims at analysis-oriented Deep Web data integration and focuses on these issues. The main research work and contributions are as follows.

1. A query interface matching approach based on extended evidence theory is proposed to effectively solve the problem of semantically understanding query interfaces when crawling different Web databases. There are a large number of Web databases in the same domain, and the heterogeneity among their query interfaces makes it very difficult to recognize, in a unified way, the interface attributes used to submit query terms. To solve this issue, the proposed approach constructs matches between the query interface of the Web database to be crawled and its domain query interface in order to understand the interface's semantics. The approach fully utilizes multiple features of query interfaces to construct different matchers; it extends traditional evidence theory with dynamically predicted matcher credibilities to combine the matchers' results; finally, it performs one-to-one matching decisions with a top-k global optimal policy and one-to-many decisions with heuristic rules over the tree structure. Experimental results show that the approach improves matching accuracy and overcomes the poor adaptability of traditional approaches.
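To make the evidence-combination step of contribution 1 concrete, here is a minimal Python sketch that fuses several matchers' similarity scores with credibility-discounted Dempster-Shafer combination over the frame {match, nonmatch}. The function names, the discounting scheme, and the example values are illustrative assumptions, not the thesis's implementation.

```python
# Minimal sketch: credibility-discounted Dempster-Shafer fusion of
# several interface matchers for one attribute pair. Illustrative only.

def matcher_to_mass(similarity, credibility):
    """Turn one matcher's similarity score into a mass function over
    {match, nonmatch}; the mass not backed by the matcher's credibility
    goes to the whole frame (residual ignorance)."""
    return {
        "match": credibility * similarity,
        "nonmatch": credibility * (1.0 - similarity),
        "theta": 1.0 - credibility,
    }

def dempster_combine(m1, m2):
    """Dempster's rule on the two-hypothesis frame {match, nonmatch}."""
    conflict = m1["match"] * m2["nonmatch"] + m1["nonmatch"] * m2["match"]
    k = 1.0 - conflict                       # normalization factor
    return {
        "match": (m1["match"] * m2["match"]
                  + m1["match"] * m2["theta"]
                  + m1["theta"] * m2["match"]) / k,
        "nonmatch": (m1["nonmatch"] * m2["nonmatch"]
                     + m1["nonmatch"] * m2["theta"]
                     + m1["theta"] * m2["nonmatch"]) / k,
        "theta": m1["theta"] * m2["theta"] / k,
    }

def combined_match_belief(scores_and_credibilities):
    """Fuse (similarity, credibility) pairs from several matchers and
    return the combined belief that the attribute pair matches."""
    masses = [matcher_to_mass(s, c) for s, c in scores_and_credibilities]
    fused = masses[0]
    for m in masses[1:]:
        fused = dempster_combine(fused, m)
    return fused["match"]

# e.g. name matcher 0.9 (credibility 0.8), type matcher 0.6 (0.5), ...
print(combined_match_belief([(0.9, 0.8), (0.6, 0.5), (0.7, 0.7)]))
```

Discounting routes an unreliable matcher's mass to the whole frame, so weak evidence drifts the fused result toward ignorance rather than toward a wrong match decision.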
2. A Web database crawling approach based on a query harvest rate model is proposed to effectively solve the large-scale acquisition of Deep Web pages (a sketch follows after item 3). Analysis and mining applications need large amounts of Deep Web data, which come from pages generated dynamically by multiple Web databases in the same domain; because a Web database can only be accessed through its query interface, traditional search engine crawlers cannot reach its content. To solve this issue, the proposed approach first samples the Web database and uses the sample to select multiple kinds of features and construct training instances automatically, avoiding manual labeling. It then learns a query harvest rate model from the training instances and, in each crawling round, uses the model to select the most promising query term to submit, crawling as much of the Web database as possible. Experimental results show that the approach achieves high coverage of the Web database and overcomes the simplistic, empirical limitations of traditional heuristic rules; the learned harvest rate model can also be used effectively to crawl other Web databases in the same domain.

3. A Deep Web data extraction approach based on hierarchical clustering is proposed to effectively solve the extraction of structured data from massive Deep Web pages. The structure of a Deep Web page is so complex that the structured data inside it are difficult to process automatically. To solve this issue, the proposed approach uses information from the query result list page to recognize the content blocks in a Deep Web page, which determines the region for data extraction. It then combines structural and content features from multiple Deep Web pages and clusters the content feature vectors within corresponding content blocks of these pages to extract the Web data records. Experimental results show that the approach significantly improves extraction accuracy and overcomes the limitations of traditional approaches that use only the structural information of the page itself.
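For the crawling approach in contribution 2, the following sketch fits a harvest rate model by ordinary least squares and uses it to pick the next query term in each round. The feature set, the `web_db_query` callback, and the loop structure are assumptions for illustration, not the thesis code.

```python
# Minimal sketch: multiple linear regression harvest rate model
# guiding iterative query-term selection. Illustrative only.

import numpy as np

def fit_harvest_model(X, y):
    """Least-squares fit of harvest rate (the share of previously
    unseen records a term returns) against term features drawn from
    the sample database, e.g. document frequency and term length.
    X: (n_terms, n_features); y: observed harvest rates."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])       # intercept
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict_harvest(w, x):
    return float(np.dot(np.append(x, 1.0), w))

def crawl(web_db_query, features, w, rounds=100):
    """Each round, submit the unsubmitted term with the highest
    predicted harvest rate; `web_db_query(term)` is assumed to return
    the set of record ids the Web database answers with."""
    seen, submitted = set(), set()
    for _ in range(rounds):
        candidates = [t for t in features if t not in submitted]
        if not candidates:
            break
        best = max(candidates, key=lambda t: predict_harvest(w, features[t]))
        seen |= web_db_query(best)
        submitted.add(best)
    return seen
```

Training instances come from the sampled database, so the model can be fitted once per domain and reused on other Web databases in that domain, as the abstract notes.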
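For the extraction approach in contribution 3, this sketch hierarchically clusters per-node feature vectors with average linkage; nodes that land in the same cluster across records are treated as instances of the same field. The node features and the cut distance are illustrative assumptions.

```python
# Minimal sketch: hierarchical clustering of content-node feature
# vectors inside a detected data region. Illustrative only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def node_features(node):
    """Toy structural/content features for a DOM content node:
    depth in the tree, a crude tag identity, text length, digit ratio."""
    text = node["text"]
    digits = sum(ch.isdigit() for ch in text)
    return [node["depth"],
            hash(node["tag"]) % 97,          # stable within one run
            len(text),
            digits / max(len(text), 1)]

def cluster_nodes(nodes, cut_distance=2.0):
    """Agglomerative (average-linkage) clustering of content nodes;
    returns one cluster label per node."""
    X = np.array([node_features(n) for n in nodes], dtype=float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # normalize
    Z = linkage(X, method="average")
    return fcluster(Z, t=cut_distance, criterion="distance")
```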
4. A semantic annotation approach for Deep Web data based on constrained conditional random fields is proposed to effectively solve the problems of labeling attributes that lack semantic labels and of schema heterogeneity among data records from multiple Web sites (a sketch follows after item 5). Extracted Web data records need to be annotated, but relying only on the semantic labels present in Deep Web pages cannot annotate data elements without labels, and different sites often use different labels, resulting in schema heterogeneity among their records. To solve this issue, the proposed approach incorporates confidence constraints and logical constraints to exploit existing Web database information and the logical relationships among Web data elements; it then extends traditional conditional random fields to support both kinds of constraints naturally and efficiently, with an inference procedure based on integer linear programming. Using the global attribute labels of the domain Web database schema, it annotates every data element in the Web data records. Experimental results show that the approach significantly improves annotation accuracy and overcomes the limitation of traditional conditional random fields, which cannot simultaneously exploit existing Web database information and the logical relationships among Web data elements.

5. A duplicate record detection approach based on unsupervised learning is proposed to effectively solve massive duplicate record detection in the Deep Web, whose large scale and high redundancy make it hard to supply high-quality data for analysis. The approach first uses cluster ensembles to select initial training instances, avoiding manual labeling. It then trains a support vector machine classifier iteratively, which improves the model's accuracy. Finally, it combines the results of multiple classification models with the extended evidence theory to construct a domain-level duplicate record detection model. Experimental results show that the approach achieves high detection accuracy and that the domain-level model performs well in its domain, overcoming the inability of traditional approaches to carry out massive duplicate record detection.
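For the annotation approach in contribution 4, this sketch shows constraint-aware decoding: choose the label assignment that maximizes per-element scores (standing in for CRF potentials) subject to hard constraints. It uses brute-force enumeration rather than the integer linear programming inference of the thesis; the labels, scores, and constraints are illustrative assumptions.

```python
# Minimal sketch: constrained decoding for semantic annotation of one
# Web data record. Illustrative only.

from itertools import product

LABELS = ["title", "author", "price", "publisher"]

def decode(scores, constraints):
    """scores[i][label]: confidence that element i carries `label`
    (in the thesis these come from a trained CRF). constraints:
    predicates over a full assignment, e.g. 'at most one price'."""
    best, best_score = None, float("-inf")
    for assignment in product(LABELS, repeat=len(scores)):
        if not all(ok(assignment) for ok in constraints):
            continue
        s = sum(scores[i][lab] for i, lab in enumerate(assignment))
        if s > best_score:
            best, best_score = assignment, s
    return best

# Logical constraint: a record has at most one price field.
at_most_one_price = lambda a: a.count("price") <= 1
# Confidence constraint (from known Web database values): element 2
# looks like a price, so forbid labeling it as an author.
elem2_not_author = lambda a: a[2] != "author"

scores = [{"title": .8, "author": .1, "price": .05, "publisher": .05},
          {"title": .1, "author": .7, "price": .1, "publisher": .1},
          {"title": .05, "author": .25, "price": .6, "publisher": .1}]
print(decode(scores, [at_most_one_price, elem2_not_author]))
```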
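For the detection approach in contribution 5, this sketch bootstraps confident duplicate/non-duplicate training pairs by thresholding mean field similarity (a simple stand-in for the thesis's cluster-ensemble seed selection) and then grows an SVM iteratively on its own most confident predictions. The thresholds, margin, and features are assumptions.

```python
# Minimal sketch: unsupervised duplicate-record detection via seeded,
# iteratively self-trained SVM. Illustrative only.

import numpy as np
from sklearn.svm import SVC

def seed_labels(sim_vectors, hi=0.9, lo=0.1):
    """Label the most clearly duplicate / non-duplicate record pairs
    by mean field similarity; leave the ambiguous middle unlabeled
    (-1). A stand-in for cluster-ensemble seed selection."""
    means = sim_vectors.mean(axis=1)
    return np.where(means >= hi, 1, np.where(means <= lo, 0, -1))

def iterative_svm(sim_vectors, labels, rounds=5, margin=1.0):
    """Repeatedly train an SVM on the labeled pairs and absorb the
    unlabeled pairs that fall far from the decision boundary."""
    labels, clf = labels.copy(), None
    for _ in range(rounds):
        mask = labels != -1
        if len(np.unique(labels[mask])) < 2:
            break                            # need both classes
        clf = SVC(kernel="linear").fit(sim_vectors[mask], labels[mask])
        dist = clf.decision_function(sim_vectors)
        confident = (np.abs(dist) >= margin) & (labels == -1)
        if not confident.any():
            break
        labels[confident] = (dist[confident] > 0).astype(int)
    return clf, labels
```

Per-database classifiers trained this way could then be combined, as the abstract describes, with the extended evidence theory into one domain-level detection model.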

  • 【Online Publication Contributor】 Shandong University
  • 【Online Publication Issue】 2010, No. 08