A Semantic Web Model of the Gene Ontology and Its Annotation Data

An RDF Model of Gene Ontology and Its Associations

【Author】 许庆炜

【Advisors】 骆清铭; 李亦学

【Author Information】 Huazhong University of Science and Technology, Biomedical Engineering, 2008, Ph.D.

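【Abstract (translated from the Chinese)】 The Gene Ontology is currently the most widely used bio-ontology. As of August 2007, it contained approximately 23,700 terms, which annotate more than 16 million genes and gene products from about 20 biological databases. For Semantic Web applications, the Gene Ontology Consortium provides a file in RDF-XML format (http://archive.geneontology.org/latest-full/go_200708-assocdb.rdf-xml.gz). However, this file has the following shortcomings, which prevent it from supporting complex semantic query and inference services: 1) the three GO sub-ontologies are isolated from one another and lack the necessary cross-ontology semantic links; 2) the file is organized around GO terms, and all information is stored in a single file; 3) the file lacks support for GOSlim. In this thesis we present GORouter, a Semantic Web model. The model mainly demonstrates how to use several RDF-based Semantic Web technologies and tools to reorganize the original resources and to offer users complex semantic query and inference services over the Gene Ontology and its annotation data. We re-encoded the heterogeneous original data provided by the Gene Ontology Consortium and built a series of RDF data modules. Each RDF module in GORouter consists of two parts: a metadata part, identified using RSS, and a data part, given a globally unique name using LSID. Using the GLUE system, we established one-to-one ontology mappings among the three independent GO sub-ontologies. To improve mapping accuracy, GLUE uses "relaxation labeling" to obtain the best mapping configuration under the given domain constraints and prior knowledge. We use Oracle NDM as the RDF store and call the SDO_RDF_MATCH table function to combine RDF query results seamlessly with traditional relational data. As a result, the size of the GORouter model is minimized, and data not directly related to semantic inference is stored in traditional relational tables. We believe this solution can partly overcome the performance bottleneck of traditional Semantic Web applications. The GORouter model and its applications are released under the Apache 2.0 open license; researchers can obtain the latest data and services at http://www.scbit.org/gorouter/.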

【Abstract】 Gene Ontology (GO, http://www.geneontology.org) is by far the most widely used bio-ontology. As of August 2007, it contained approximately 23,700 terms, linked to a database of more than 16 million annotations of genes and gene products, originating from about 20 organisms. For Semantic Web applications, the Gene Ontology Consortium provides an RDF-XML data file (http://archive.geneontology.org/latest-full/go_200708-assocdb.rdf-xml.gz). It is an export of the database, containing both the GO vocabulary and the associations between GO terms and gene products. However, this file has three drawbacks that make it unsuitable for providing complex semantic query and inference services. The first drawback is the lack of relationships between concepts in different GO sub-ontologies, which limits the power of inference based on them. The second drawback is that the RDF-XML data file is organized with a term-centric view of GO annotation data. The third drawback is the lack of support for GOSlim. In this paper, we present GORouter, an RDF model that demonstrates how to use multiple Semantic Web tools and techniques to integrate heterogeneous resources and to provide a mixture of semantic query and inference solutions for GO and its associations. Most of the original files come from the Gene Ontology Consortium. We encoded these heterogeneous resources in a uniform RDF format and created a set of RDF datasets. Each dataset consists of two RDF files: a metadata file and a data file. The metadata RDF files are encoded with RSS 1.0. Each metadata RDF file has one data RDF file associated with it, and each data RDF file URL is assigned exactly one unique LSID. Using the GLUE system, we create ontology mappings between pairs of terms from the three independent GO sub-ontologies. To improve the match accuracy, the GLUE system uses a Relaxation Labeler, which searches for the match configuration that best satisfies the given domain constraints and heuristic knowledge. We use the Oracle Network Data Model (NDM) as the native RDF data repository and the table function SDO_RDF_MATCH to seamlessly combine the results of RDF queries with traditional relational data. As a result, the scale of GORouter is minimized; information not directly involved in semantic inference is put into relational tables. We believe that this is an effective way to partly overcome the performance bottleneck of conventional Semantic Web applications. GORouter is licensed under the Apache License, Version 2.0, and is accessible via the website http://www.scbit.org/gorouter/.
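
The one-LSID-per-data-file rule mentioned in the abstract can be illustrated with a minimal Python sketch. It relies only on the general LSID URN layout (urn:lsid:authority:namespace:object, with an optional revision). The authority "example.org", the namespace "gorouter", the sequential object identifiers, and the names Lsid, LsidRegistry, and parse_lsid are all hypothetical choices made for illustration; the record does not state how GORouter actually forms its identifiers.

```python
from typing import Dict, NamedTuple, Optional


class Lsid(NamedTuple):
    """An LSID split into its colon-separated components."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text: str) -> Lsid:
    """Parse 'urn:lsid:authority:namespace:object[:revision]'."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty LSID component in {text!r}")
    return Lsid(*parts[2:])


class LsidRegistry:
    """Hypothetical registry that gives each data-file URL exactly one LSID."""

    def __init__(self, authority: str, namespace: str) -> None:
        self._authority = authority
        self._namespace = namespace
        self._by_url: Dict[str, Lsid] = {}

    def assign(self, url: str) -> Lsid:
        # Re-registering a known URL returns the LSID it already has.
        existing = self._by_url.get(url)
        if existing is not None:
            return existing
        # Sequential object ids are an assumption made for this sketch.
        lsid = Lsid(self._authority, self._namespace, str(len(self._by_url) + 1))
        self._by_url[url] = lsid
        return lsid


if __name__ == "__main__":
    registry = LsidRegistry("example.org", "gorouter")
    first = registry.assign("http://example.org/data/a.rdf")
    again = registry.assign("http://example.org/data/a.rdf")
    print(first, first == again)
```

Making assign idempotent means a dataset that is registered twice still resolves to a single name, which is one simple way to keep the URL-to-LSID relationship unambiguous.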
