

An Approach for Measuring Semantic Similarity between GO Terms

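【Author】 Liu Dan (刘丹)

【Advisor】 Wei Jinmao (卫金茂)

【Author information】 Northeast Normal University (东北师范大学), Circuits and Systems, 2008, Master's thesis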

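【Abstract, translated from the Chinese】 Research on similarity plays a key role in many fields. It mainly covers structural similarity and semantic similarity. Structural similarity has received most of the attention in the past, but in recent years semantic similarity has drawn growing interest. For historical reasons, the sources of biological data are highly heterogeneous. To reduce or eliminate confusion among concepts and terms, the Gene Ontology Consortium developed a large semantic dictionary for biological data, the Gene Ontology (GO). One important application of GO is measuring the semantic similarity between GO terms. It is generally assumed that if two gene products have similar functions, their gene expression is similar and the GO terms annotating them are similar as well. Therefore, once the similarity of a pair of GO terms is known, the similarity of the two genes' expression can be estimated approximately, and from it the degree of functional similarity between the two gene products. Measuring semantic similarity between GO terms is thus an important way to resolve semantic heterogeneity in biological data integration. This thesis first introduces the background of GO and research on semantic similarity. It then analyzes several commonly used methods for measuring semantic similarity between GO terms and proposes an improvement to the most commonly used one: a method that computes semantic similarity between GO terms based on semantic subgraphs. The algorithm is studied using a small part of the GO graph as an example. Finally, the method is summarized and its broader applications are discussed. The proposed method combines the information content-based and concept distance-based approaches and can further improve the accuracy of semantic similarity measurement. If it can be applied to the large GO database, it will make it possible to find functionally similar or related proteins more accurately, laying a good foundation for related research and applications.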

【Abstract】 The study of similarity plays an important role in many fields. It mainly covers structural similarity and semantic similarity. Structural similarity was studied far more widely in the past; semantic similarity has attracted increasing attention only in recent years. For historical reasons, the sources of biological data are very complicated. To reduce or eliminate confusion between concepts and terms, the Gene Ontology Consortium developed a large semantic dictionary, GO (Gene Ontology). One important aspect of GO application is measuring semantic similarity between GO terms. It is generally believed that if two gene products are functionally similar, we would expect their gene expression to be similar and their GO annotations to be similar. Thus, we may compare the functional similarity of two gene products against the similarity of their annotations in GO. Measuring semantic similarity between GO terms is therefore an important approach to resolving semantic heterogeneity in biological data integration. We first present the background of GO and the state of research on semantic similarity. We then analyze several available approaches for measuring semantic similarity between GO terms and propose a subgraph-based approach that improves on one of the most commonly used of them. Next, we design an algorithm and test it on a part of the GO graph. Finally, we summarize the approach and discuss its broader applications. The new approach combines information content-based and semantic distance-based methods, which makes semantic similarity measurement between GO terms more accurate. If this approach can be applied to the GO database, it promises more accurate searches for similar or related proteins and will lay a good foundation for related research and applications in bioinformatics.
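
The abstract names two families of measures, information content-based and semantic distance-based, but does not give the formulas of the proposed subgraph-based method. The sketch below only illustrates the two generic ingredients on a toy graph and is not the thesis's algorithm. It assumes a Resnik-style measure (the information content of the most informative common ancestor) and a shortest-path edge distance through a common ancestor. The term names, the `PARENTS` graph, and the `DIRECT_COUNTS` annotation counts are all hypothetical.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents (a DAG, like GO).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],  # a term with two parents, as GO allows
}

# Hypothetical number of gene products annotated directly to each term.
# Simplifying assumption: no gene product is annotated to more than one term.
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 4, "A1": 3, "A2": 1, "AB": 2}


def ancestors(term):
    """Map the term and each of its ancestors to the shortest edge distance from the term."""
    dist = {term: 0}
    frontier = [term]
    while frontier:  # breadth-first walk up the parent links
        next_frontier = []
        for t in frontier:
            for parent in PARENTS[t]:
                if parent not in dist:
                    dist[parent] = dist[t] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def information_content(term):
    """IC(t) = -log p(t), where p(t) is the share of annotations at t or below it."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(c for t, c in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def resnik_similarity(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1).keys() & ancestors(t2).keys()
    return max(information_content(a) for a in common)


def edge_distance(t1, t2):
    """Fewest edges on a path between t1 and t2 that passes through a common ancestor."""
    d1, d2 = ancestors(t1), ancestors(t2)
    return min(d1[a] + d2[a] for a in d1.keys() & d2.keys())


if __name__ == "__main__":
    for pair in [("A1", "A2"), ("A1", "B"), ("AB", "B")]:
        print(pair, round(resnik_similarity(*pair), 3), edge_distance(*pair))
```

In this toy graph the siblings `A1` and `A2` share the fairly specific ancestor `A`, whereas `A1` and `B` share only the root, whose information content is zero. A method that combines both views, as the abstract describes, would draw on the information content and the path structure together rather than on either alone.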

【Key words】 GO; semantic similarity; information content; semantic distance
  • 【Classification number】 TP391.1
  • 【Times cited】 6
  • 【Downloads】 196

