
Research on Symmetric Prediction of Protein-Protein Interactions Based on Pairwise Kernels (基于pairwise核的蛋白质相互作用对称预测研究)

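【Author】 Yu Jiantao (于建涛)

【Advisor】 Guo Maozu (郭茂祖)

【Author information】 Harbin Institute of Technology, Artificial Intelligence and Information Processing, 2011, PhD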

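【Abstract (translated from the Chinese)】 Proteins are the direct executors of life activities, and interactions between proteins are one of the main ways in which they carry out their functions. Constructing protein-protein interaction (PPI) networks is therefore a prerequisite for understanding molecular biological function and the regularities of cellular life. It is also key to studying how diseases arise and progress and, in turn, to identifying drug targets. Methods for predicting protein-protein interactions have become a focus of bioinformatics in recent years, because they can overcome the drawbacks of experimental detection: long cycles, high cost, and high false-positive rates. Symmetric prediction and the choice of kernel function are two key factors in kernel-based machine-learning prediction of protein-protein interactions; they bear directly on the validity and accuracy of the predictive model. Taking the symmetry of protein-protein interactions as its starting point, this thesis studies the necessity of pairwise kernels for guaranteeing symmetric prediction, reveals the biases that traditional kernel methods and traditional negative datasets introduce into PPI prediction, and proposes schemes and algorithms that remove them. On this basis, the unbiased predictive model is applied to predicting protein-protein interactions in soybean, with good results.

First, the thesis reveals the bias of traditional kernel methods toward protein order during PPI prediction. After a thorough analysis of how existing pairwise kernel functions are constructed, it proposes a new pairwise kernel function that guarantees symmetric prediction and uses it to build a multiple-kernel combination model, which achieves higher prediction accuracy than existing methods. Protein-protein interactions are inherently symmetric: "protein A interacts with B" is equivalent to "protein B interacts with A". In traditional machine-learning methods, when training and test examples are formed by concatenating two proteins in sequence, an ordinary kernel cannot recognize that one example consists of two proteins, so it becomes sensitive to their order and prediction bias results. The bias shows up as a classifier that may reach the contradictory conclusion that "protein A interacts with B" while "protein B does not interact with A". Pairwise kernels overcome the limitation of traditional kernels, which measure similarity at the level of whole examples; by measuring similarity at the level of individual proteins, they guarantee symmetric prediction. The thesis stresses the necessity of pairwise kernels for symmetric prediction, summarizes the general properties of several existing pairwise kernel functions with respect to symmetry, positive definiteness, and balance, and distills the general patterns by which they improve predictive performance. On this basis it proposes a new pairwise kernel function, AMPK (Arcsin Maximum Pairwise Kernel), and builds a multiple-kernel combination model from AMPK based on the Cosine kernel and AMPK based on the Laplacian kernel. The model achieves better performance than existing kernel methods in predicting interactions within protein complexes.

Second, the thesis reveals that on traditional datasets with simple sequence features (amino-acid triplets), PPI prediction with pairwise kernel methods is severely biased. It proposes a method for constructing a rational negative set so that classifier performance can be evaluated fairly and objectively. The positive and negative datasets used by traditional methods have the properties of a scale-free network and a random network, respectively. As a result, some proteins, called hub nodes, occur with very different frequencies in the positive and negative sets and form so-called "dominant examples". Under the influence of the dominant examples in the training set, a pairwise kernel classifier tends to predict test examples that contain hub nodes as positive and test examples that contain non-hub proteins as negative. This bias is especially pronounced on data based on simple sequence features (amino-acid triplets) and leads to overly optimistic estimates of classifier performance. The thesis therefore proposes a method, tailored to the scale-free structure of the positive set, that constructs a rational negative set by "balanced random sampling". The structural difference between the positive and negative datasets is eliminated by ensuring that each protein occurs roughly the same number of times in both. On the rational negative set, classifier performance can be evaluated fairly and objectively. Finally, the thesis demonstrates the extent to which complex sequence features (Pfam domains) affect the prediction bias, and their positive contribution to PPI prediction.

Third, working for the first time from the newly sequenced soybean genome, the thesis combines the traditional homologous-PPI inference method with its own unbiased pairwise kernel predictive model to infer and predict 10 426 soybean protein-protein interactions. Constructing the soybean PPI network is an important task now that sequencing of the soybean genome is complete. The thesis is the first to use soybean genome data as its source, combining homologous-PPI (interolog) inference with domain-based pairwise kernel prediction to obtain more than ten thousand soybean protein-protein interactions. First, PPIs from three source species (Arabidopsis, yeast, and human) are used as source data, and their homologous PPIs in soybean are sought, giving a candidate set of soybean PPIs. Then a cross-species training/testing scheme is proposed: exploiting the conservation of domains and their interactions across species, an unbiased pairwise kernel predictive model over InterPro domains is built on the source-species data and applied to the soybean candidate set to filter out false positives. Cross-validation results show that the predictions are highly credible, which indicates that the approach is a valuable reference for PPI prediction in newly sequenced species. Finally, the resistance functions of soybean protein complexes are analyzed, and patterns of interaction among soybean resistance genes/proteins are identified.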
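The order-sensitivity argument in the abstract can be illustrated with a small sketch. It does not implement AMPK, whose formula is not given here. Instead it uses the well-known tensor product pairwise kernel, K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c), as a stand-in symmetric pairwise kernel and compares it with an ordinary kernel applied to concatenated feature vectors. The function names and the toy feature vectors are assumptions made for illustration only.

```python
import math

def cosine_kernel(x, y):
    """Cosine similarity between two feature vectors (used as the base kernel)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

def concat_kernel(pair1, pair2):
    """Ordinary kernel on concatenated examples; it sees one long vector,
    so swapping the two proteins of a pair can change the value."""
    (a, b), (c, d) = pair1, pair2
    return cosine_kernel(list(a) + list(b), list(c) + list(d))

def tensor_product_pairwise_kernel(pair1, pair2, base=cosine_kernel):
    """Stand-in symmetric pairwise kernel (not AMPK): compares proteins in
    both the 'normal' and the 'reverse' order, so protein order cannot matter."""
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)

# Hypothetical toy feature vectors for four proteins.
a, b = [1.0, 0.0, 2.0], [0.0, 3.0, 1.0]
c, d = [2.0, 1.0, 0.0], [0.0, 1.0, 1.0]

print(concat_kernel((a, b), (c, d)), concat_kernel((b, a), (c, d)))
print(tensor_product_pairwise_kernel((a, b), (c, d)),
      tensor_product_pairwise_kernel((b, a), (c, d)))
```

With these toy vectors the concatenation kernel gives different values for (A, B) and (B, A), while the pairwise kernel gives the same value for both, which is the property the thesis relies on for symmetric prediction.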

【Abstract】 Proteins are directly involved in biological processes, often exerting their function via protein-protein interactions. Constructing protein-protein interaction networks is therefore very beneficial for investigating molecular functions and discerning where groups of proteins may be located, as well as for furthering our understanding of disease associations in order to identify drug targets. In silico methods of predicting protein-protein interactions have recently emerged as an important area of bioinformatics, because they often overcome the drawbacks of wet-lab experiments, such as expense (in both time and money) and high false-positive rates. Of the available machine-learning approaches for predicting interaction data, kernel-based methods are popular due to their robustness and high performance. However, methods for maintaining the symmetry of predictions made by kernel functions, i.e. ‘A is predicted as interacting with B’ should be equivalent to ‘B is predicted as interacting with A’, have not been well studied, and the symmetry problem appears to directly affect the effectiveness and the performance of these predictive models.

This thesis therefore focuses on how to retain the symmetry of protein-protein interactions by using pairwise kernels, which adopt symmetric calculations when measuring the similarity between pairs of proteins. The biases that originate from traditional kernel-based predictors and training datasets are revealed, and methods for removing these biases are proposed. As an application of these methods, unbiased predictive models are created and used to predict a large number of protein-protein interactions in soybean for the first time. More specifically, the thesis focuses on three main aspects.

Firstly, the prediction bias towards protein order is revealed when traditional kernel-based methods are used. The pairwise kernel is then introduced to fix the problem, and a new pairwise kernel is proposed that utilizes important properties already shown to be useful when predicting protein-protein interactions. Protein-protein interactions are symmetric in character. However, when examples are formed by simply joining two proteins sequentially, where one protein serves as the first half of the example and the other as the second half, traditional kernel functions are of little use. This is because they cannot ‘split’ one example into two proteins and are sensitive to the order of the proteins, which results in inconsistent predictions, such as ‘A interacts with B’ whilst ‘B does not interact with A’. Pairwise kernels are introduced to remove the asymmetry resulting from the traditional kernels. Pairwise kernel functions regard proteins, rather than examples, as the minimal ‘unit’, and consider both the ‘normal’ and the ‘reverse’ order when measuring the similarity between two pairs of proteins. The necessity of pairwise kernels for symmetric prediction is underlined. Furthermore, the principles of creating pairwise kernel functions, such as symmetry, positive (semi-)definiteness, and balance between variables, are summarized. Based on these principles, a novel pairwise kernel, AMPK (Arcsin Maximum Pairwise Kernel), is created, which performs on par with the current best pairwise kernel. A novel combination model of pairwise kernels, ‘AMPK based on Cosine plus AMPK based on Laplace’, is also proposed, and is shown to outperform the current kernel and kernel-combination methods in predicting interactions of protein complexes.

Secondly, the performance of pairwise kernel-based classifiers is found to be artificially inflated when simple sequence features (three neighboring residues, or 3mers) are used on traditional datasets, in which negative datasets are made by the ‘simple random sampling’ method. The novel ‘balanced random sampling’ method is proposed to overcome the bias by constructing a rational negative dataset, on which classifier performance can be evaluated objectively and without bias. The traditional PPI positive dataset is shown to be a scale-free network, and the traditional PPI negative dataset a random network. This causes hub nodes, which are highly connected with other nodes in the positive dataset, to appear less frequently in the traditional negative dataset. The difference in the number of times each protein appears in the positive and negative datasets results in prediction bias. When 3mers are used as sequence features, the bias becomes even more serious. In this case, pairwise kernels are prone to labeling examples that involve hub proteins as ‘positives’ and those that do not involve hub proteins as ‘negatives’. This kind of prediction is based purely on the number of times each protein appears in the dataset and does not aid in making predictions, but it can still cause prediction performance to appear artificially high. In order to remove these biases, ‘balanced random sampling’ is proposed, aimed at creating a rational negative dataset that is scale-free like the positive dataset. During balanced random sampling, each protein has an equal opportunity to appear in the positive or the negative dataset, and the bias towards the number of occurrences of each protein per dataset is therefore removed. Rational datasets form a basis for objective evaluation of the performance of pairwise kernel-based classifiers, and show that previous estimates of prediction performance using 3mer features were over-optimistic. However, complex sequence features, i.e. Pfam domains, are shown to be less sensitive to the traditional datasets than 3mer features, and to contribute positively to the prediction of protein-protein interactions.

Thirdly, we use the newly sequenced Glycine max (soybean) genome to infer a large number of soybean protein-protein interactions for the first time. To make these novel inferences we combine the conventional method of homologous protein-protein interactions (interologs) with the kernel-based predictive model described above, resulting in 10 426 high-confidence soybean protein-protein interactions. Predicting soybean protein-protein interactions was one of the main tasks following the sequencing of the soybean genome. More than ten thousand soybean protein-protein interactions have been successfully predicted with our in silico method. Soybean interologs are first inferred from protein-protein interactions in other species via homology, and then filtered by pairwise kernel-based methods that use domains as the classifier feature. More specifically, the candidate dataset of soybean interactions is obtained by looking for soybean interologs of protein-protein interactions in Arabidopsis thaliana, Saccharomyces cerevisiae, and Homo sapiens. Domain-based pairwise kernel methods then act as unbiased predictive classifiers to filter the interologs, using a cross-species strategy: training on data from the source species (Arabidopsis, Saccharomyces, or Homo sapiens) and testing on data from soybean. This transfer of methods between species is proposed on the basis of conserved domain-domain interactions that are present in both the ‘source’ and the ‘target’ species. This is the first time that a large number of soybean PPIs have been predicted using computational methods, and prediction performance is assessed using cross-validation. The combination of homologous PPIs and domain-based pairwise kernels used in this thesis is concluded to be an effective method for predicting protein-protein interactions in organisms whose genomes are newly sequenced. Finally, soybean protein complexes in the predicted protein-protein interaction network are identified, and interactions between plant resistance genes/proteins within these complexes are investigated in order to infer related biological functions.
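The idea behind ‘balanced random sampling’ described above can be sketched as follows. The abstract does not give the exact procedure, so this is only one plausible reading under stated assumptions: each protein receives one "slot" per occurrence in the positive set, the slots are shuffled and paired, and pairs that are self-pairs, known positives, or duplicates are rejected and retried. The function names, the retry limit, and the toy network are hypothetical.

```python
import random
from collections import Counter

def occurrences(pairs):
    """Count how many times each protein appears across a list of pairs."""
    return Counter(protein for pair in pairs for protein in pair)

def balanced_negative_sample(positives, seed=0, max_tries=1000):
    """Draw negative pairs so that each protein occurs at most as often as it
    does in `positives` (an assumed procedure, not the thesis's exact algorithm)."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    # One slot per occurrence of a protein in the positive set, so hubs get many slots.
    slots = [protein for pair in positives for protein in pair]
    negatives = set()
    for _ in range(max_tries):
        rng.shuffle(slots)
        leftover = []
        for a, b in zip(slots[0::2], slots[1::2]):
            pair = frozenset((a, b))
            if a != b and pair not in known and pair not in negatives:
                negatives.add(pair)
            else:
                leftover += [a, b]  # put both slots back and retry next round
        slots = leftover
        if not slots:
            break
    return sorted(tuple(sorted(p)) for p in negatives)

# Hypothetical toy positive set with one hub protein.
positives = [("hub", "p0"), ("hub", "p1"), ("hub", "p2"), ("hub", "p3"),
             ("p4", "p5"), ("p5", "p6"), ("p6", "p7")]
negatives = balanced_negative_sample(positives)
print(occurrences(positives)["hub"], occurrences(negatives)["hub"])
```

Because slots that cannot be paired validly are simply left over after the retry limit, the negative counts match the positive counts only approximately. This is consistent with the abstract's wording that each protein appears roughly as often in both sets, but a real implementation may handle leftovers differently.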
