非合作结构化深网数据源选择技术研究

Research of Data Source Selection of Non-cooperative Structured Deep Web

【Author】 邓松 (Deng Song)

【Advisor】 万常选 (Wan Changxuan)

【Author information】 江西财经大学 (Jiangxi University of Finance and Economics), Information Management and Information Systems, 2013, Ph.D.

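【摘要 (Abstract)】 As the Web keeps growing, it is very difficult for users to accurately locate and query the Web data sources they need. Web data integration systems emerged to give effective access to these sources. Because the Deep Web, the collection of resources that cannot be reached through hyperlinks, occupies an important place on the Web, effective integrated retrieval of Deep Web data has been a frontier problem in information retrieval and databases in recent years. Deep Web data integration involves numerous autonomous data sources whose data change dynamically and are less regular, and these characteristics pose new challenges to the effective use of Deep Web data. Every domain contains a large number of accessible Deep Web data sources, but their interfaces differ, so an integrated retrieval system must integrate their query interfaces. Once a unified interface exists, simply converting each user query and submitting it to every individual data source is clearly not workable: the query cost is too high and the quality of the results is hard to guarantee. For these reasons, data source selection has become a key problem in Deep Web data integration; its goal is to obtain results that satisfy the user’s query by querying only a small number of data sources. Deep Web data sources fall into two main types: text data sources, and structured and semi-structured data sources. A text data source can usually be viewed as a “file set” made up of many Web pages. Structured and semi-structured data sources store real-world entities described by multiple attributes, and semi-structured sources mainly store XML data. Most existing work addresses selection for these two types. For text sources, mature information retrieval techniques are brought into the selection process, and the relevance of a source is judged from its terms and document ranking information. For structured and semi-structured sources, a source is evaluated by mining the structured feature information it contains. Research on text data source selection started earlier and has produced many encouraging results. In recent years the commercial Deep Web has grown rapidly, and selection of structured and semi-structured Deep Web data sources has attracted increasing attention. Overall, this research is still at an early stage, and the following problems remain: (1) Selecting data sources by relevance without considering their own quality tends to place a heavy burden on subsequent data integration work such as entity recognition and data fusion. (2) The existing high-quality results on structured and semi-structured Deep Web data source selection all assume that data sources are cooperative, that is, that they can provide users with their index structures and all their data so that data source summaries can be built easily; in practice this assumption rarely holds. Further research is therefore needed on how to capture the subject semantic information contained in sample data, namely the associations between subject headings, between subject headings and sub-subject headings, and between subject headings and feature words, in order to build summaries of non-cooperative structured Deep Web data sources that better meet users’ query needs. (3) Deep Web data sources are updated in real time, and once the content of a source changes its summary must be adjusted accordingly, yet existing work has not addressed dynamic summary updating for non-cooperative structured Deep Web data sources. (4) Users often submit mixed-type keyword queries containing both search-type and constraint-type keywords. Search-type keywords express the user’s primary query intent, while constraint-type keywords express constraints on that intent and are usually given as discrete values. The summaries built by existing structured Deep Web data source selection methods do not yet take such queries into account. Because the structured Deep Web is now widely used, this thesis focuses on the selection of non-cooperative structured Deep Web data sources and, around the four aspects above, studies the following: (1) Evaluation of data source quality. The key to quality evaluation is establishing a suitable evaluation model. The thesis first obtains the sets of recommended and rejected data sources from user feedback; it then computes and analyzes the scores of the sources in the two sets on each objective dimension and designs an evaluation model for the core quality dimensions based on the difference degree and the overlap degree; it builds the quality evaluation model by training a support vector machine (SVM); finally, it evaluates the method on data from multiple domains. (2) Data source selection for search-type keyword queries. First, an unbiased sampling method based on backtracking drill-down is used to obtain representative sample data from each source, and a method for extracting subject headings from the sample data is designed based on factors such as part of speech, term frequency, position, and coverage; subject semantic information is analyzed to obtain the feature words corresponding to each subject heading in each source’s sample data; and, to serve search-type keyword queries, a data source summary is built from the associations between subject headings and between subject headings and feature words, together with a selection strategy based on this summary. Second, a subject space selection method and a data source evaluation strategy based on the summary are given. Finally, based on the correlation among subject heading updates across data sources in a domain, combined with sampling techniques, a sampling-based dynamic summary update algorithm is given. (3) Data source selection for mixed-type keyword queries. Once a summary serving search-type keyword queries has been built, supporting mixed-type keyword queries effectively requires adding to the summary information that relates feature words to the discrete values of constraint-type attributes. The thesis builds a mixed summary of each data source from the associations between subject headings and feature words, the histograms of record distribution of feature words over the discrete values of constraint-type attributes, and the associations between these histograms, so that attributes of every type in the source are summarized effectively. In light of the characteristics of histogram associations, it gives a method for computing the constraint correlation score between histograms and a data source evaluation strategy based on the mixed summary. The innovations of this thesis are mainly the following: (1) Taking user feedback as an important means, it proposes a domain-oriented method for selecting high-quality data sources. Existing quality-based selection methods usually choose a uniform set of quality dimensions based on experience, so selection accuracy varies considerably across domains. From the characteristic data of the recommended and rejected data source sets obtained through user feedback, the thesis derives the credibility of user recommendations and, combining it with the number of times each source has been selected, accurately determines the members of the recommended and rejected sets. By introducing two indicators, the overlap degree and the difference degree, to analyze the quality dimension features of recommended and rejected sources, it builds a dimension importance evaluation model that dynamically selects different core quality dimensions for the data sources of each domain, and from it establishes the corresponding domain data source quality evaluation model. (2) It builds a hierarchical, subject-semantics-based summary of non-cooperative structured Deep Web data sources and proposes a sampling-based dynamic summary update method. Taking full account of subject semantic information and of the correlation among subject updates across sources in the same domain, it establishes the associations between subject headings, between subject headings and feature words, and between subject headings and sub-subject headings, and constructs a hierarchical data source summary that not only characterizes the data content of a source effectively but also reflects the query semantics of multiple keywords in combination; on the basis of this summary it gives a data source selection strategy for search-type keyword queries. Based on the correlation among subject updates across sources in the same domain, it designs a method for computing the change rate of a subject space, which can effectively discover updated subject headings in a domain and accurately measure how much a subject has changed in a data source, and from this it proposes a sampling-based dynamic summary update method. (3) A mixed summary based on multi-type attributes meets the needs of mixed-type keyword queries. By establishing the associations between subject headings and feature words, the associations between subject headings, and the constraint associations between the histograms of every pair of feature words on the same constraint-type attribute, it builds a mixed summary that effectively characterizes the multi-type attributes of a data source; on the basis of the mixed summary, and according to the degree to which a source’s mixed summary matches the search-type keywords in a query and the degree to which it satisfies the constraints of the constraint-type keywords, it gives the corresponding data source selection strategy for mixed-type keyword queries.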

【Abstract】 With the constant expansion of the Web, it is very difficult for users to find and query exactly the Web data sources they need. Web data integration systems came into being to provide efficient access to these data sources. The Deep Web, the collection of resources that cannot be reached through hyperlinks, occupies an important place on the Web, so how to integrate and retrieve Deep Web data effectively has been a frontier issue in recent years, attracting sustained attention from researchers in the information retrieval and database fields. Deep Web data integration has the following characteristics: the data sources are numerous and autonomous, and their data are dynamic and irregular. These features present new challenges to the effective use of Deep Web data. Each domain contains a large number of accessible data sources whose interfaces differ, so an integrated retrieval system needs to integrate their query interfaces. Once a unified interface exists, it is clearly infeasible to submit each user query, after only a simple conversion, to every individual data source: doing so not only makes queries too costly but also makes it hard to guarantee the quality of the query results. For these reasons, data source selection has become a key issue in Deep Web data integration. Its goal is to obtain retrieval results that meet users’ requirements by querying only a small number of data sources. Deep Web data sources fall into two types: text data sources, and structured and semi-structured data sources. Generally speaking, the former can be viewed as a file set consisting of many Web pages, while the latter store real-world entities described by multiple attributes; semi-structured data sources mainly store XML data. Most current research on data source selection addresses these two types.
For text data sources, mature information retrieval techniques are brought into the selection process, and the relevance of a data source is judged from its terms and document ranking information. For structured and semi-structured data sources, a source is evaluated by mining the structured feature information contained in it. Research on text data source selection started earlier and has produced many promising results. In recent years, with the rapid development of the commercial Deep Web, the selection of structured and semi-structured Deep Web data sources has attracted growing attention. Overall, this research is still in its infancy, and the following issues remain to be resolved: (1) Selecting data sources by relevance alone, without considering their own quality, tends to place a heavy burden on subsequent data integration tasks such as entity recognition and data fusion. (2) The existing high-quality results on structured and semi-structured Deep Web data source selection all assume that data sources are cooperative, that is, that they can provide users with their index structures and all their data so that data source summaries can be built easily. In practice this assumption rarely holds. Further research is therefore needed on how to capture the subject semantic information contained in sample data in order to build summaries of non-cooperative structured Deep Web data sources that better satisfy users’ query needs. Subject semantic information comprises the associations between subject headings, between subject headings and sub-subject headings, and between subject headings and feature words. (3) Deep Web data sources are updated in real time, and after a data source’s content is updated, its summary must be adjusted accordingly.
However, existing studies have not yet addressed dynamic summary updating for non-cooperative structured Deep Web data sources. (4) Users often submit mixed-type keyword queries that contain both search-type keywords and constraint-type keywords. Search-type keywords express the user’s primary query intent, while constraint-type keywords express constraints on that intent and are commonly given as discrete values. The summaries built by existing methods for structured and semi-structured Deep Web data source selection do not yet take such queries into account. Because structured Deep Web data sources are now widely used, this thesis focuses on the selection of non-cooperative structured Deep Web data sources and, around the four aspects above, studies the following: (1) Evaluation of data source quality. The key to evaluating data source quality is to establish a suitable evaluation model. First, we use user feedback to obtain the sets of recommended and rejected data sources. Second, we compute and analyze the scores of the two sets on each objective dimension and design an evaluation model for the core quality dimensions of data sources based on the difference degree and the overlap degree. Third, we build the quality evaluation model by training a support vector machine (SVM). Finally, we evaluate the method’s performance on data from multiple domains. (2) Data source selection for search-type keyword queries. First, we obtain representative sample data from each data source with an unbiased sampling method based on backtracking drill-down; we design a method for extracting subject headings from the sample data based on factors such as part of speech, term frequency, position, and coverage; we obtain the feature words of each subject heading by analyzing subject semantic information; and, to serve search-type keyword queries, we build a data source summary from the associations between subject headings and between subject headings and feature words, and give a data source selection strategy based on this summary.
Second, we propose a subject space selection method and a data source evaluation strategy based on the summary. Finally, based on the correlation among subject heading updates across data sources in the same domain, combined with sampling techniques, we design a sampling-based dynamic summary update algorithm. (3) Data source selection for mixed-type keyword queries. Once a summary serving search-type keyword queries has been built, supporting mixed-type keyword queries requires adding to the summary information that relates feature words to the discrete values of constraint-type attributes. We build a mixed summary of each data source from the associations between subject headings and feature words, the histograms of record distribution of feature words over the discrete values of constraint-type attributes, and the associations between these histograms, so that attributes of every type in the data source are summarized effectively. In addition, in light of the characteristics of histogram associations, we give a method for computing the constraint correlation score between histograms and a data source evaluation strategy based on the mixed summary. The innovations of this thesis are mainly the following: (1) Taking user feedback as an important means, we propose a domain-oriented method for selecting high-quality data sources. Existing quality-based data source selection methods usually choose a uniform set of quality dimensions based on the researchers’ experience, so the accuracy of data source selection varies considerably across domains. From the characteristic data of the recommended and rejected data source sets obtained through user feedback, we derive the credibility of user recommendations and, combining it with the number of times each data source has been selected, accurately determine the members of the recommended set and the rejected set.
By introducing two indicators, the overlap degree and the difference degree, to analyze the quality dimension features of the recommended and rejected data sources, we build a dimension importance evaluation model that dynamically selects different core quality dimensions for the data sources of each domain, and from it establish the corresponding domain-specific data source quality evaluation model. (2) We build a subject-semantics-based hierarchical summary of non-cooperative structured Deep Web data sources and propose a sampling-based dynamic summary update method. Taking full account of subject semantic information, we establish the associations between subject headings, between subject headings and feature words, and between subject headings and sub-subject headings, and construct a hierarchical data source summary. This summary not only characterizes the content of a data source effectively but also reflects the query semantics of multiple keywords in combination. On the basis of this summary, we give a data source selection strategy for search-type keyword queries. In addition, we design a method for computing the change rate of a subject space, which can effectively discover updated subject headings and accurately measure how much a subject has changed in a data source. On this basis we propose, for the first time, a sampling-based dynamic summary update method. (3) A mixed summary based on multi-type attributes meets the needs of mixed-type keyword queries. By establishing the associations between subject headings, the associations between subject headings and feature words, and the constraint associations between the histograms of every pair of feature words on the same constraint-type attribute, we build a mixed summary that characterizes multi-type attributes effectively.
Finally, we give a data source selection strategy for mixed-type keyword queries, based on the degree to which a data source’s mixed summary matches the search-type keywords in the query and the degree to which it satisfies the constraints expressed by the constraint-type keywords.
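
The abstract gives no formulas, so the following Python sketch is only one possible reading of the mixed-type selection strategy, not the thesis's actual method. It assumes a toy summary layout (`MixedSummary`), a search-match score taken as the mean of the strongest subject heading/feature word association per keyword, a constraint score taken as the fraction of sampled records falling on the requested discrete values, and a simple product of the two. All names and scoring rules here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MixedSummary:
    """Toy mixed summary of one data source (layout is an assumption)."""
    # (subject heading, feature word) -> association weight in [0, 1]
    associations: dict = field(default_factory=dict)
    # feature word -> {discrete constraint value -> sampled record count}
    histograms: dict = field(default_factory=dict)


def search_match(summary, search_keywords):
    """Mean, over keywords, of the strongest association mentioning the keyword."""
    if not search_keywords:
        return 0.0
    total = 0.0
    for kw in search_keywords:
        weights = [w for pair, w in summary.associations.items() if kw in pair]
        total += max(weights, default=0.0)
    return total / len(search_keywords)


def constraint_match(summary, search_keywords, constraint_values):
    """Fraction of sampled records (for the queried feature words) that
    fall on one of the requested discrete constraint values."""
    matched = total = 0
    for kw in search_keywords:
        hist = summary.histograms.get(kw, {})
        total += sum(hist.values())
        matched += sum(hist.get(v, 0) for v in constraint_values)
    return matched / total if total else 0.0


def score_source(summary, search_keywords, constraint_values):
    """Combine both degrees; the product is an assumed combination rule."""
    s = search_match(summary, search_keywords)
    if not constraint_values:
        return s
    return s * constraint_match(summary, search_keywords, constraint_values)


def select_sources(summaries, search_keywords, constraint_values, k):
    """Return the names of the k highest-scoring data sources."""
    ranked = sorted(
        summaries,
        key=lambda name: score_source(
            summaries[name], search_keywords, constraint_values
        ),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical usage: two sources summarized from sampled records.
example = {
    "A": MixedSummary({("laptop", "gaming"): 0.8},
                      {"gaming": {"Dell": 30, "HP": 10}}),
    "B": MixedSummary({("laptop", "gaming"): 0.5},
                      {"gaming": {"Dell": 5, "HP": 45}}),
}
print(select_sources(example, ["laptop", "gaming"], ["Dell"], k=1))
```

The product rule makes a source with strong subject matches but almost no records on the requested constraint value rank low, which is the behavior the strategy appears to aim for; a weighted sum would be an equally plausible choice under these assumptions.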
