基于语义标注的元数据自动构建及其相关技术研究

Research on the Issues of Semantic Annotation Based Automatic Metadata Construction

【Author】 刘海学 (Liu Haixue)

【Advisor】 顾君忠 (Gu Junzhong)

【Author information】 East China Normal University (华东师范大学), Systems Analysis and Integration, 2010, PhD

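【Abstract (translated from the Chinese)】 Metadata has become an important response to the many problems of the network information "explosion" and is widely used in services such as information retrieval, information integration, and information sharing. The quality of the metadata itself undoubtedly determines whether a metadata application service ultimately succeeds. To improve metadata quality, academia and industry have carried out extensive research along three lines. The first is standards for metadata quality: a unified metadata standard effectively ensures consistency and completeness and enables standardized interoperation, a point on which researchers broadly agree. The second is better methods for constructing and managing metadata, where a large body of work already addresses schema discovery, schema transformation, control strategies, and management mechanisms. The third is metadata quality assessment, where discussion centers on indicator systems, assessment methods, and assessment use cases. Our survey of the literature shows that existing work mostly assumes that creators build metadata manually and concentrates on the effectiveness and convenience of authoring tools. Seen from both the creator's and the user's side, this causes problems. Creators facing large numbers of data sets in varied forms must invest effort until they understand each data set's content thoroughly, which is tedious and heavy work; moreover, differences in understanding among creators introduce ambiguity into the metadata. Users, in turn, must understand the predefined metadata correctly; otherwise a cognitive "gap" opens between creators and users, and users cannot query effectively for the information they need.

To address these problems and build high-quality metadata services, this thesis first proposes a method for constructing metadata from semantic annotation, which automatically generates metadata from the semantic annotation information already present in a data set. Besides construction efficiency, the method draws on the idea of knowledge sharing: it explores whether the multi-perspective information conveyed by semantic annotations can close the subjective cognitive "gap", and it studies metadata identification strategies for different structural views. Building on this, the thesis studies semantic heterogeneity among metadata schemas and proposes a schema matching method that supports their semantic integration. To verify the applicability of the approach and assess metadata quality, it further proposes a metadata query method that effectively raises precision while limiting the loss of targets caused by low recall. Given the particular value of archival information resources and their important place among basic information resources [1], both the experimental design and the test data sets target this domain. The main results are as follows.

(1) Building on an analysis of the two main families of metadata extraction methods, template-based and machine-learning-based, the thesis proposes an automatic metadata construction method (SAMC). SAMC overcomes shortcomings of both families: it makes full use of existing semantic annotation information to identify and locate metadata, and it combines statistical theory, structural features of the information, and visual layout features, which underpins its performance. The metadata it builds are therefore more accurate and more expressive, and meet the requirements for high-quality metadata.

(2) The thesis proposes algorithms for identifying metadata under different layout patterns. To improve the feasibility of generating metadata with this method, it takes into account differences in the structural views of semantic annotation information, examines the distinguishing features that annotations show under general-to-specific, progressive, and mixed-distribution sequence patterns, and designs an identification algorithm for each. The algorithms exploit the hierarchy of tree structures, the ordering of linear structures, and the frequency with which information occurs, which benefits both identification quality and performance.

(3) The thesis proposes a schema matching method, PISMatching, that supports attribute-level semantic integration of metadata. Compared with related work, it addresses a new problem whose purpose is to enrich the semantic information of metadata schemas and whose task is to merge metadata schemas from multiple data sources. It combines an ontology, a thesaurus, and concept similarity computation so as to integrate their respective strengths, and performs better in ease of implementation, complexity, and semantic strength. The ontology supplies strong domain context that improves matching accuracy. The concept similarity method, based on association among related information and on probability statistics, gives schema matching a new measure that can discover positively correlated attributes to obtain potential attribute groups and can also retain attribute groups linked by synonymy. In its design, PISMatching emphasizes ranking candidates by degree of match rather than computing the differences between scores, which is more meaningful in practice; it also emphasizes capturing whatever information is available for matching and reduces reliance on any specific matching pattern, which makes the results more flexible, extensible, and widely usable.

(4) The thesis proposes a metadata query method, MFCQuery, that measures relevance using field context information. To improve on the precision and recall of conventional metadata querying, MFCQuery extends it in two ways. First, it uses the vector space model (Vector Space Model) to build a relevance matrix between the user's query and the context information of the metadata fields, and infers the user's real query intent from how relevant each field context is to the query, which improves recall. Second, because some users lack the background knowledge needed to specify metadata fields in a query, it matches the most relevant target field as a restriction for them, which improves precision. The method keeps the high precision of conventional querying while further raising recall.

(5) The thesis refines the evaluation criteria for metadata. The main aim of the whole work is to improve metadata quality so that metadata can play a greater role in specific application domains; archival information resources were therefore chosen as the target domain for the experiments. Because final metadata quality cannot be captured by recall and precision alone, the classic evaluation indicators of information technology, the thesis refines the individual indicators and evaluates objects with different characteristics separately, which shows at a finer level how different methods affect metadata quality.

In summary, the thesis studies metadata technologies in depth from each of these aspects using rule-based, statistical, and probabilistic methods. It resolves key problems in metadata construction and improves the precision and recall of the generated metadata; it strengthens the ability to integrate metadata schemas that differ in format and change over time; and it improves the performance of user-initiated queries, raising precision as well as recall. This work produced a series of related research results.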

【Abstract】 Metadata is an important means of addressing the many problems of the network information "explosion" and is widely used in information retrieval, information integration, information sharing, and related services. The quality of the metadata itself undoubtedly determines the ultimate success or failure of metadata application services. To improve metadata quality, academia and industry have carried out a great deal of research, mainly in three areas. First, standards: establishing a unified metadata standard effectively ensures consistency and integrity and enables standardized interoperation, a point widely accepted by researchers. Second, construction and management methods: improving them is another way to raise metadata quality, and much work has been done on metadata schema discovery, schema transformation, control strategies, administration mechanisms, and other aspects. Third, quality assessment: academic discussion focuses on evaluation indicators, evaluation methods, and evaluation use cases. From the current literature we found that existing work mostly starts from the perspective of creating metadata manually and considers the effectiveness and convenience of authoring tools. For both creators and users of metadata, however, this gives rise to problems. Creators, facing a large number of data sets in diverse forms, must make an effort to understand the contents of each data set thoroughly. This is cumbersome, heavy work; in addition, different creators understand the data differently, which leads to ambiguity in the metadata. Users, for their part, need a correct understanding of the predefined metadata; otherwise a knowledge "gap" arises between creators and users, and users cannot effectively query for the information they need.

Therefore, to solve these problems and build high-quality metadata services, this thesis presents a method that builds metadata from semantic annotation, using the semantic annotations already present in data sets to construct metadata automatically. The method aims at efficient construction and borrows the idea of knowledge sharing: it explores the feasibility of eliminating the subjective perception "gap" through the multiple perspectives that semantic annotations convey, and studies strategies for identifying metadata in different structural views. On this basis, the thesis further studies the heterogeneity of metadata schemas and proposes a schema matching method for their semantic integration. To validate the applicability of the approach, it proposes a metadata query method that effectively improves precision and limits the loss of results caused by low recall. The experimental design and test data sets are situated in the field of archive information resources, in view of their unique value and their important position among basic information resources [1]. Specifically, the work covers the following aspects.

(1) A method for automatically constructing metadata, SAMC, is proposed on the basis of an analysis of the two main kinds of metadata extraction methods, template-based and machine-learning-based. It overcomes shortcomings of both: it effectively identifies metadata from existing semantic annotations and also combines statistical theory with structural features of the information and visual layout characteristics, which supports SAMC's performance. The resulting metadata has higher precision and greater expressive power and meets the requirements for building high-quality metadata.

(2) Algorithms are proposed for identifying metadata under different layout patterns. To improve the feasibility of the method, the thesis considers differences in structural views, focuses on the different characteristics shown by summary-detail, iterative, and integrated sequence patterns, and designs a corresponding identification algorithm for each. The algorithms use the hierarchy of tree structures, the order of linear structures, and information characteristics such as frequency distribution, and thereby achieve good results in metadata identification.

(3) A schema matching method for attribute-level integration of metadata schemas, PISMatching, is put forward. Compared with related work, this research faces a new problem: its purpose is to enrich the semantics of metadata schemas, and its task is to merge metadata schemas from multiple data sources. It combines an ontology with a thesaurus and concept similarity to integrate their respective advantages, and performs better in ease of implementation, complexity, semantic richness, and so on. The ontology provides strong domain context support for improving matching accuracy, and concept similarity based on related information and probability provides a new metric for schema matching, which can uncover positively correlated properties to obtain potential property groups and can also retain property groups that are synonymous. In its concrete design, the method pays more attention to the ranking of matches than to the gaps between calculated values, which is more meaningful in practical applications; it also pays more attention to capturing available information and reduces dependence on any specific matching pattern, which makes the research more flexible, scalable, and widely usable.

(4) A metadata query method that measures relevance through field context, MFCQuery, is proposed. To further improve precision and recall over traditional methods, MFCQuery extends them in two respects. First, it establishes a similarity matrix between the user query and metadata field contexts using the vector space model, and determines the real query intent from the similarity between field context and user query, to improve recall. Second, considering that some users cannot specify the necessary metadata fields in a query, perhaps for lack of sufficient background knowledge, it matches the most relevant target field to restrict the query, to improve precision. The method retains the high precision of traditional querying while further enhancing recall.

(5) The evaluation of metadata is refined. All the work in the thesis aims to improve metadata quality so that metadata can play a greater role in specific applications, and the archive information domain was selected as the target application for the experiments. Metadata quality cannot be reflected simply by classic evaluation indicators of information technology such as recall and precision; the thesis therefore refines the evaluation indicators and evaluates objects with different characteristics separately, which reflects the impact of different methods on metadata quality at a more detailed level.

In short, the thesis studies metadata-related technologies in depth from the above aspects using rules, statistics, probability, and other methods. It addresses key issues in metadata construction and improves the precision and recall of the generated metadata; enhances the ability to integrate different metadata schemas; and improves the performance of users' active queries, improving precision as well as recall. These efforts yielded a series of research results.
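
The vector space model step that the abstract attributes to MFCQuery can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no term weighting, tokenization, or field definitions, so the raw term-frequency vectors, the whitespace tokenizer, the function names (`tf_vector`, `cosine`, `rank_fields`), and the sample field contexts below are all assumptions made for illustration. The sketch only shows the general idea of scoring each metadata field's context against a user query and picking the most relevant field as a restriction.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector over whitespace tokens (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_fields(query, field_contexts):
    """Return (field, score) pairs sorted by similarity to the query, best first."""
    q = tf_vector(query)
    scores = {name: cosine(q, tf_vector(ctx)) for name, ctx in field_contexts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical field contexts; a real system would derive them from the data.
FIELD_CONTEXTS = {
    "title": "title name heading caption",
    "creator": "creator author writer organisation",
    "date": "date year month day",
}

ranking = rank_fields("author organisation", FIELD_CONTEXTS)
best_field = ranking[0][0]  # field proposed as the query restriction
```

With these sample contexts, a query that names no field explicitly is matched to the `creator` field, which is the kind of target-field restriction the abstract describes for users who lack the background knowledge to choose a field themselves.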
