节点文献

蛋白质网络中复合物和功能模块挖掘算法研究

Identifying Protein Complexes and Functional Modules in Protein Interaction Networks

【作者】 李敏

【导师】 陈建二; 王建新;

【作者基本信息】 中南大学 , 计算机应用技术, 2008, 博士

【摘要】 在后基因组时代,一个重要的挑战就是系统地分析和全面理解蛋白质之间是如何通过相互作用来完成生命活动的。从拓扑结构上分析蛋白质网络的特性,进而探寻蛋白质复合物和功能模块、注释未知蛋白质功能正成为当前国内外研究的重要焦点。本文从蛋白质网络拓扑特性分析出发,利用各物种蛋白质网络所具有的一些共性特征设计了有效的蛋白质复合物和功能模块识别算法,主要研究工作包括:应用复杂网络理论和图论技术对不同物种的蛋白质网络进行拓扑特性分析,包括节点的度分布、度与度的相关性、网络直径、网络的特征路径长度、边介数、边间隔以及网络的可靠性等,发现了不同物种的蛋白质网络的一些共性特征,为设计合理的蛋白质复合物和功能模块挖掘算法提供依据。针对目前能够获得的蛋白质相互作用数据还不完全,直接从蛋白质网络中挖掘完备的全连通图(极大团)来预测蛋白质复合物具有很大局限性这一事实,提出了一种基于极大团扩展的蛋白质复合物识别算法IPC-MCE。该算法不需要其它任何辅助信息,简单有效。将算法IPC-MCE应用于酵母蛋白质网络,实验结果表明其能够识别比较多的具有生物意义的蛋白质复合物,且对输入参数不敏感。基于对已知蛋白质复合物内蛋白质之间的最短距离一般不超过2的发现,提出了一种基于距离测定的蛋白质复合物识别算法IPC-DM。实验结果表明,算法IPC-DM较其它识别蛋白质复合物的聚类方法更能有效地标识已知蛋白质复合物,并且具有较高敏感度、特异性和综合评价。特别地,算法IPC-DM对蛋白质相互作用大规模数据中普遍存在的比例较高的假阳性和假阴性具有很好的健壮性,能够在蛋白质相互作用数据还不完善且具有较高噪声的情况下有效地识别蛋白质复合物,可以为生物学家进行蛋白质复合物识别的实验和进一步研究提供有价值的参考信息。针对基于介数的层次化聚类算法计算复杂度高,很难应用于大规模蛋白质网络的不足,引入了局部变量边聚集系数,提出了一种基于边聚集系数的快速层次聚类算法FAG-EC。为降低算法对噪声的敏感性,本文应用logistic回归模型对蛋白质相互作用的可靠性进行评估进而建立加权蛋白质网络,并定义了加权的边聚集系数和功能模块,提出了应用于加权网络的层次聚类算法HC-Wpin。基于GO数据库中生物过程、分子功能和细胞成分全部三种注释信息的验证评估结果表明,算法FAG-EC和HC-Wpin不仅能够有效识别蛋白质网络中具有生物意义的功能模块,并且可以通过修改参数取值来展示蛋白质网络中功能模块的层次化组织结构。此外,算法FAG-EC和HC-Wpin的运行效率非常高,随着大规模蛋白质相互作用数据的不断增加,可以应用于更大规模的蛋白质网络。针对蛋白质网络中普遍存在的“中心性-致死性”法则,提出了一个图分裂.规约模型,并在该模型基础上设计了一种新的交叠功能模块识别算法OMFinder。实验结果表明算法OMFinder能够有效地识别彼此交叠的功能模块,不同功能模块之间的重叠率约为2。与其他识别交叠功能模块的算法比较,算法OMFinder具有更好的识别性能,且具有更低的丢弃率。本文提出的几个聚类算法从不同角度出发,有效地解决了蛋白质网络聚类过程中存在的一些问题。本文提出的聚类算法不仅运行效率很高,而且具有很好的聚类效果,识别的蛋白质复合物或功能模块都从统计意义上被证明是有生物意义的,有效地预测了一定数量的未知蛋白质的功能,将会对生物实验有指导意义。此外,本文提出的聚类算法对其它具有相似结构的复杂网络也具有普遍意义。

【Abstract】 In the post-genome era, one of the most important challenges is to systematically analyze and comprehensively understand how the proteins accomplish the life activities by interacting with each other. Analyzing the characters of protein interaction networks based on the topology structure, identifiying protein complexes and functional modules, and predicting the functions of unknown proteins are becoming the most improtant issues in the domestic and overseas researches.The characters of topology structures in protein interaction networks are studied firstly. Based on the common characters of different specie protein interaction networks, several effective algorithms for detecting protein complexes or functional modules are proposed. The main original works include:Complex network theory and graph technology are applied to the analysis of the topology structure characters in different specie protein interaction networks, such as the dgree distribution, the degree-degree correlation, the network diameter, characteristic path lenghth, edge betweenness, range, and the reliability. Some common characters are detected from these protein interaction networks of different species, which can provide foundation for develping reasonable algorithms of mining protein complexes and functional modules.At present, the available protein-protein interactions are not complete. Only mining maximal cliques are too limited to be used for predicting protein complexes since it is unlikely that all proteins in a large complex can interact with each other. To avoid of the limitation, a new algorithm of identifying protein complexes based on maximal clique extension (IPC-MCE) is proposed, which is easy to be implemented and effective. The algorithm IPC-MCE is applied to the protein interaction network of Sacchromyces cerevisiae and identifies many well known protein complexes. Moreover, algorithm IPC-MCE is not sensitive to the input parameter.Based on our discovery that most of the shortest paths between proteins in complexes are no more than two, we propose a new algorithm IPC-DM for identifying protein complexes on the basis of distance measure. The experiment results show that the algorithm IPC-DM recalls more known complexes than other previously proposed clustering algorithms and has a relatively higher sensitivity, specificity and F-measure. Moreover, the algorithm IPC-DM is robust to the known high rate of false positives and false negatives in data from high-throughout interaction techniques. Thus, the algorithm IPC-DM can be used in protein interaction network even with high false positives and high false nagatives to identify new protein complexes and to provide references for biologists in their research on protein complexes.The hierarchical clustering algorithms based on betweenness are not suitable to be used in large protein interaction networks because they are time consuming. A new local variable of edge clustering coefficient is introduced and a new fast hierarchical clustering algorithm FAG-EC based on it is proposed. To decrease the effect of noisy data on the clustering results, a new algorithm HC-Wpin is proposed for hierarchically clustering in the weighted protein interaction network. The logistic regression-based scheme is used to assign each edge a weight. The edge clustering coeffcient and the functional module in weighted graph are redefined. All the identified functional modules are validated by the three types of annotations of Gene Ontology (GO): Biological Process, Molecular Function, and Cellular Component. The experiment results show that algorithm FAG-EC and algorithm HC-Wpin can not only detect the significant functional modules in protein interaction network but also accurately identify functional modules in hierarchy by changing the values of parameter. Moreover, algorithm FAG-EC and algorithm HC-Wpin are extremely fast, which can be used in even larger protein interaction networks of other higher-level organisms as the protein-protein interactions accumulating sharply.According to the "centrality-lethality rule" generally existing in protein interaction networks, a graph split and reduction model is proposed and a new algorithm OMFinder for identifying overlapping functional modules based on the proposed model is developed. The experiment results show that algorithm OMFinder detect many significant overlapping functional modules. The overlapping rate between different functional modules is about 2. Compared to other algorithms for detecting overlapping functional modules, algorithm OMFinder has better performance and lower discard rate.The clustering algorithms proposed in this paper start off from different sights and solve some problems effectively in the processes of clustering in protein interaction networks. The proposed clustering algorithms not only can be implemented efficiently and have good clustering performances. The identified protein complexes or functional modules are proved to be statistically significant. A number of unknown protein functions are predicted, which can provide some references for biologists in their biochemical experiments. Moreover, the proposed clustering algorithms can be generalized to other complex networks with the similar structures.

  • 【网络出版投稿人】 中南大学
  • 【网络出版年期】2010年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络