Semantic Based Scientific Literature Metadata Retrieval System
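【Author】 Chu Fan (褚帆);
【Advisor】 Zou Deqing (邹德清);
【Author information】 Huazhong University of Science and Technology, Computer Software and Theory, 2007, Master's thesis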
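【Abstract (translated from the Chinese)】 Lacking semantic information, traditional metadata retrieval struggles to describe the intrinsic characteristics of scientific literature metadata accurately. Metadata imported from heterogeneous data sources is inconsistent and redundant, and semantically associated information is hard to obtain from it, so results are prone to semantic deviation. The metadata also contains many duplicate records, which then show up in retrieval results; duplicate records must therefore be cleansed to improve retrieval quality. To reduce the shortcomings of relying purely on database and statistical retrieval methods for domain resources, the metadata retrieval of SemreX, a semantics-based platform for sharing scientific literature, draws on semantic ideas and proposes a metadata retrieval model for scientific literature. A method for recognizing English names and Chinese pinyin names, together with a Chinese pinyin segmentation algorithm, is used to establish the associations among metadata. A metadata retrieval portal is provided, and semantic reasoning rules, an author association algorithm, and three semantic association retrieval methods (association by concept, by instance, and by semantic relationship, the last further divided into concept-concept, concept-instance, and instance-instance subtypes) are used to implement metadata retrieval based on semantic association, so that results are more accurate, richer, and closer to the user's intuitive semantic needs. Duplicate records in the retrieval results are cleansed, and the algorithms used in each step of duplicate record cleansing are improved to address their shortcomings. In duplicate record detection, a field matching algorithm based on edit distance is adopted to suit the characteristics of field values; an optimized algorithm using valid weights and length filtering is used for record matching; and for clustering duplicate records at the database level, the basic sorted neighborhood algorithm is improved to address two of its shortcomings. Built on the metadata retrieval framework, the SemreX metadata retrieval system uses semantic association retrieval and related techniques, combined with duplicate record cleansing, to achieve efficient retrieval of scientific literature metadata.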
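The duplicate-detection steps named in the abstract (field matching by edit distance, then record matching with weights and length filtering) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the 0.8 threshold, and the reading of "valid weight" as "only fields present in both records contribute their weight" are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed with a rolling one-row table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str, threshold: float = 0.8) -> float:
    """Similarity in [0, 1] derived from edit distance, with a length filter.

    The edit distance is at least the difference in length, so if even that
    lower bound pushes the similarity below `threshold`, the pair is treated
    as a non-match (0.0) without running the quadratic computation.
    """
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    if 1 - abs(len(a) - len(b)) / longest < threshold:
        return 0.0  # length filter: cannot reach the threshold
    return 1 - edit_distance(a, b) / longest


def record_similarity(r1: dict, r2: dict, weights: dict,
                      threshold: float = 0.8) -> float:
    """Weighted average of field similarities.

    Assumption: a field's weight is "valid" only when both records have a
    non-empty value for it; other fields are left out of the average.
    """
    total = 0.0
    score = 0.0
    for field, weight in weights.items():
        v1, v2 = r1.get(field, ""), r2.get(field, "")
        if not v1 or not v2:
            continue
        total += weight
        score += weight * field_similarity(v1, v2, threshold)
    return score / total if total else 0.0


# Hypothetical records: a misspelled title and a missing author field.
rec_a = {"title": "Semantic Retrieval", "author": "Chu Fan"}
rec_b = {"title": "Semantic Retreival", "author": ""}
sim = record_similarity(rec_a, rec_b, {"title": 0.7, "author": 0.3})
```

Two records would then be flagged as duplicates when `sim` reaches a chosen cutoff; the cutoff and the field weights would need tuning on real metadata.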
【Abstract】 For resource retrieval, traditional statistical strategies use keyword-based algorithms efficiently, but because they lack semantic information, both search queries and results are prone to misinterpretation. Meanwhile, data from heterogeneous sources may have various quality problems, and retrieval results contain many duplicate records. A cleansing process is therefore needed to improve data quality.

To overcome these disadvantages, we adopt a semantic approach and describe a metadata retrieval model for scientific literature. For semantic retrieval, we provide a semantic search portal and use semantic reasoning rules to improve search results. We also propose semantic search over metadata that covers concepts, instances, and relationships. Relationships are further divided into three types: between concepts, between instances, and between a concept and an instance.

We summarize and describe the theories, methods, evaluation standards, and basic workflow of data cleansing. Our research emphasis is on the techniques and algorithms of duplicate record cleansing: we introduce its basic concepts and workflow, describe the main techniques and algorithms of each step in detail, and give improved algorithms that address the limitations of the original ones at each step. For sorting, we give an improved method that uses a sort key to order the dataset. For duplicate record detection, we propose a field matching algorithm and an abbreviation discovery algorithm based on edit distance. For record matching, we present an optimized method that uses valid weight values and length filtering to reduce the runtime of the original algorithm. For clustering duplicate records at the database level, we amend two limitations of the traditional sorted neighborhood method and give an improved sorted neighborhood method.

Finally, based on the metadata management model framework and the preceding work on duplicate record cleansing, we apply these semantic retrieval strategies to the SemreX system.
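For orientation, the traditional sorted neighborhood method that the abstract takes as its baseline can be sketched as follows. This is only the basic method: the abstract does not say which two limitations the thesis amends, so no improvement is attempted here. The function name `sorted_neighborhood`, the window size, the sample titles, and the use of `difflib.SequenceMatcher` as a stand-in matcher are assumptions for illustration.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, is_duplicate, window=3):
    """Basic sorted neighborhood method.

    1. Sort the records by a key.
    2. Slide a fixed-size window over the sorted order and compare each
       record only with the (window - 1) records just before it.
    3. Merge matching pairs into clusters (union-find gives transitivity).

    Returns clusters as lists of indices into `records`.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    parent = list(range(len(records)))

    def find(i):
        # Find the cluster representative, halving paths along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_duplicate(records[i], records[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def similar(a: str, b: str) -> bool:
    # Stand-in matcher; a real system would use its own record matching.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.9


# Hypothetical titles; the third is a misspelled copy of the first.
titles = [
    "Semantic Metadata Retrieval",
    "Data Cleansing Survey",
    "Semantic Metadata Retreival",
    "Peer-to-Peer Search",
]
clusters = sorted_neighborhood(titles, key=str.lower,
                               is_duplicate=similar, window=2)
```

Because only records that fall inside the same window are compared, the number of comparisons grows roughly with the number of records times the window size rather than quadratically. The cost is that duplicates whose sort keys place them far apart are never compared.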
【Key words】 Scientific Literature; Metadata Retrieval; Semantic Association; Semantic Reasoning; Duplicate Records Cleansing;