节点文献

复杂生物网络聚类分析方法

Clustering Analysis of Complex Biological Networks

【作者】 梅娟

【导师】 李炜疆;

【作者基本信息】 江南大学 , 发酵工程, 2010, 博士

【摘要】 基因组学和高通量技术提供了大量生命系统组成元件(如蛋白质)以及它们之间相互关系的数据,由这些关系数据构成的复杂生物网络蕴含着丰富的生命系统运行机制的知识,挖掘这些隐蔽的知识成为后基因组时代的主要任务之一。作为知识发现重要手段的图聚类方法,在复杂生物网络分析上受到了普遍关注。本文开发了高性能的图聚类算法CD(contraction-dilation),以复杂生物网络为研究对象,研究了基于蛋白质相似性网络的远同源性探测,分析了从蛋白质相互作用网络中探测到的功能模块。为了对图聚类算法进行评价,提出了一种融入研究对象重要特征的接近现实的网络模型。主要研究内容包括:(1)针对模块性(modularity)这个描述复杂网络集团结构强弱的全局指标,提出了一种简单而高效的基于模块性优化的图聚类算法CD。将此算法应用于计算机的生成基准测试网络,生物网络和社会网络,并将性能表现与已有的同是基于模块性优化的聚类算法进行了比较分析,实验结果表明CD算法能在较短的CPU时间内找到较大数值的模块性,探测到稳定且较精确的集团结构,且对内存需求非常低,可用于分析大型网络。该方法是本文后面分析的基础。(2)远同源蛋白间的序列相似性很低,处于随机涨落区域边缘(twilight zone),很难区分通过比对获得的序列特征是进化过程中功能约束还是随机突变导致的结果。为了从具有高度噪音的比对分数中提取关于同源性的微弱信号,本文将图聚类算法CD应用于蛋白质相似性网络来探测远同源性。将序列间的Smith-Waterman分数通过S型曲线变换作为蛋白质关联图的连接权重,然后使用CD算法对此关联图进行聚类分析。将该方法应用于蛋白质结构分类数据库SCOP超家族层次的几个数据集,且与谱聚类方法和MCL方法进行比较,实验结果表明,该方法能较好地探测到蛋白质远同源性,因为输出的集团在很大程度上对应着蛋白质超家族;该方法输出的集团数目接近数据集中超家族的个数;该方法得到的结果明显优于其他方法。研究结果表明,序列相似性确实携带了关于远同源性的显著信息,使用CD算法挖掘这些信息能够较准确地探测到蛋白质远同源性。(3)蛋白质相互作用网络是典型的复杂生物网络,它的节点多且连接分布不均匀,不易被划分为有统计意义的集团。为此,我们从文献提供的数据集中挑选具有中度和高度置信度的酵母蛋白质相互作用数据来构建酵母蛋白质相互作用网络,采用CD算法来将此网络划分为集团。对于得到的集团,我们从拓扑结构的角度以及生物学意义的角度进行了分析。实验结果表明,通过CD算法得到的集团是内部连接紧密的子图,且MIPS数据库的ComplexCat中已知的蛋白质复合体很大程度上包含于这些集团中,且有许多蛋白质复合体完全包含于这些集团中。此外,我们使用超几何聚集分布的P值来分析一个集团对某个特定功能的富集程度,将最小的P值对应的功能作为该集团的主要功能。分析集团中成员的功能发现,集团中大部分的蛋白质具有相同的功能,且与集团的主要功能相一致。(4)在通常情况下,我们缺乏对现实网络背景知识的了解。为了评价图聚类算法在现实网络上的性能表现,本文构建了一种接近现实的网络模型(near-realistic model,简称NR模型),通过算法在模型网络上的性能表现来推断其分析现实网络的能力。为了确保此推断的合理性,构建的模型网络具有与所研究网络完全相同的一阶统计特征,即模型网络中每个节点的度与所研究网络中相应节点的度完全一致。同时,构建的模型网络可具有任意设定的集团结构,这就相当于给定了背景知识,即真实的分类信息是已知的。构建的NR模型为客观评价图聚类算法提供了一条途径。

【Abstract】 Genomics and high-throughput technologies have yielded large amount of relational data about the components (like proteins) of living systems; such data form various complex biological networks that carry rich knowledge about the systems’mechanisms. In the post-genome era, a current challenge is to mine the hidden knowledge stored in the networks. As an important means for knowledge discovery, graph clustering attracts much attention in the analysis of complex biological networks.We develop a high-performance graph clustering algorithm CD (contraction-dilation) in this paper, investigate the detection of protein remote homology based on protein similarity network and analyze functional modules in protein-protein interaction network. To evaluate graph clustering algorithms, a near-realistic network model with key characteristic of the research subject is proposed. The main contents include:A novel graph clustering algorithm CD based on modularity optimization is proposed. The modularity is a global index describes the strength of community structures of complex networks. The proposed algorithm is extensively tested on computer-generated benchmark networks, biological networks and social networks. Compared with existing algorithms, CD algorithm is efficient to discover more stable community structures with higher modularity scores and accuracies at lower expenses of both CPU time and memory. Thus, it can be applied to analyze large networks. CD algorithm is the basis for this paper.Remote homologues are hidden in the twilight zone, where their sequence similarity is too low to distinguish them from pairs of sequences with equal or even higher sequence similarity due to chance. To extract weak signals about homology from sequence similarity with high noises, we apply CD algorithm to a protein similarity network to detect remote homology. The similarity matrix is formed by the sigmoidal transformation of the Smith-Waterman alignment scores, and then fed into the CD algorithm. The method is tested on several datasets from superfamily level of SCOP database. Numerical experiments show that CD algorithm manages to detect protein remote homology, for the communities it outputs are largely corresponding to protein superfamilies and the number of communities is close to the number of superfamilies in the given dataset. Also, the performance of CD algorithm is superior to spectral clustering algorithm and MCL algorithm. The results show that sequence similarity carry significant information on remote homology, which can be mined by using CD algorithm.Protein-protein interaction network is a typical complex biological network with many nodes and unevenly distributed links, thus is not easily divided into communities with statistically significant. We use CD algorithm to partition a yeast protein-protein interaction network, which is constructed by selecting yeast protein interactions with medium and high confidence from the literature. For the communities outputs by CD algorithm, we first assess the validity from the topology perspective and find that they are densely connected local subgraphs; then we find known protein complexes annotated by MIPS database are largely contained in them in their entirety. This indicates that these communities may have biological significance. Moreover, we annotate these communities based on P-values. And we investigate the functions of individual proteins in communities and find that most of them usually share common functions, which are in accordance with the main functions of communities.In most cases, we know little background knowledge of real networks. To evaluate the performance of graph clustering algorithms on real networks, a near-realistic network model (NR model) is constructed. Inferring the ability of a graph clustering algorithm to analyze real networks through its performance on model networks. To ensure rationality of this inference, the model network constructed has the same first-order statistical properties as the network under investigated, that is, degree of every node in the model network is in accordance with the corresponding node in the network under investigated. Meanwhile, the model network constructed can have any given community structure. It is equivalent to give the background knowledge, that is, the real classification information is known. NR model constructed here provides a way to evaluate graph clustering algorithms objectively.

  • 【网络出版投稿人】 江南大学
  • 【网络出版年期】2010年 08期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络