Machine Learning Based Protein Subcellular Localization Prediction
【Author】 Mei Suyu (梅素玉);
【Author information】 Fudan University, Computer Software and Theory, 2010, PhD
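【Abstract (translated from the Chinese)】 Protein subcellular localization is an important research topic in molecular cell biology and proteomics. It is closely tied to protein function, metabolism, signal transduction and biological processes, and it matters for both basic biological research and biomedical research. Computational prediction of protein subcellular localization is inexpensive, efficient and widely applicable; by analyzing large amounts of protein data it can identify effective protein features and infer the statistical regularities that link those features to subcellular localization. Although the field has made considerable progress in recent years, existing prediction methods have several shortcomings. First, protein feature information is not mined deeply enough, and some important features are ignored. Second, when multiple heterogeneous data sources are integrated, the usual approach is to concatenate heterogeneous feature spaces or to use majority-vote ensemble learning, which accounts for neither the relative importance of the feature data nor the problem of data unavailability. Third, existing models perform poorly on unbalanced protein data, sub-organelle localization and large-scale subcellular localization. This thesis studies new methods for protein subcellular localization prediction from a machine learning perspective, aiming to improve predictive performance and to give the models practical biological meaning and a reasonable biological interpretation. The main contributions are as follows:
1. Multi-instance learning is introduced to mine the domain composition, domain sequence, domain boundary and domain order information of protein sequences. A multi-instance learning model captures the local structural information of the protein sequence, and multi-label learning handles proteins with multiple subcellular locations, which offers a new approach to protein subcellular localization prediction. This multi-instance multi-label model represents the whole-part relation between a protein and its domains in bag-instance form and mines the local structural information of the sequence effectively. In experiments on Gram-positive bacterial proteins it achieves predictive performance comparable to a gene ontology based k-nearest-neighbor ensemble learning model.
2. A spectrum kernel, SpectrumKernel+, is proposed. It embeds multiple amino acid classifications into the k-mer feature representation and on that basis simulates multiple possible motif evolution patterns of protein sequences. SpectrumKernel+ explains the biological meaning of embedding amino acid classification information in k-mers from the viewpoint of biochemical constraints on protein sequence evolution, and it is related to the traditional spectrum kernel and the (k,l) mismatch kernel, so it has a more reasonable biological meaning and an intuitive biological interpretation. By considering multiple amino acid classifications together, SpectrumKernel+ measures the differences in motif evolution patterns and motif distributions between two protein sequences, and thereby measures sequence similarity more precisely. Protein subnuclear localization prediction is more challenging than general subcellular localization prediction; experiments on two subnuclear protein datasets show that SpectrumKernel+ performs significantly better than the baseline models.
3. A fused multi-instance kernel, HoMIKernel+, is proposed to embed fine-grained information from homologous protein sequences. The evolutionary conservation and divergence of homologous sequences make homologous sequence information an ambiguous descriptor of the target protein's subcellular localization pattern. This ambiguity matches the ambiguity with which positive instances describe the class in multi-instance learning; it is where biological meaning and multi-instance learning meet, and it is the starting point for HoMIKernel+. HoMIKernel+ uses the k-mer feature representations of the set of homologous sequences to describe the target protein jointly, which strengthens the target protein's motif distribution information and suppresses possible noise in the target protein. Experiments on one prokaryotic and three eukaryotic protein datasets show that HoMIKernel+ outperforms the baseline models, that embedding homologous sequences helps improve predictive performance, and that fusing multiple multi-instance kernels significantly improves predictive ability.
4. Two prediction methods are proposed, homologous gene ontology knowledge transfer learning and statistically correlated gene ontology knowledge transfer learning, and a simple non-parametric cross-validation method is designed to estimate the weights of the linear kernel combination. This enables knowledge sharing among homologous and correlated proteins and reduces the time and space complexity of kernel weight estimation. Links between the target protein and auxiliary proteins are established through intuitive biological meaning, and the gene ontology knowledge of homologous proteins, together with statistically correlated gene ontology knowledge from the gene ontology database, is transferred to the target protein; a multiple kernel learning model for protein subcellular localization prediction is built on this basis. Homologous gene ontology knowledge transfer enriches the target protein's gene ontology knowledge and overcomes missing gene ontology knowledge for new proteins and for proteins with little experimental evidence. Statistically correlated gene ontology knowledge transfer enriches protein gene ontology knowledge, adjusts the weight distribution among the three aspects of gene ontology, embeds gene ontology semantic distance, adjusts the coverage of protein gene ontology annotation, lowers the miss rate of test gene ontology annotations, and avoids retraining the model at prediction time. Kernel weight estimation takes into account the Matthews correlation coefficient (MCC), a measure of performance bias, and so adapts well to large-scale unbalanced protein data. Experiments on 8 protein datasets show that the knowledge transfer learning models for homologous and correlated proteins significantly improve prediction performance, suppress to some extent the noise and outliers that gene ontology knowledge transfer may introduce, largely overcome the bias towards large classes, and handle large-scale unbalanced protein data well.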
【Abstract】 Protein subcellular localization is an important research topic in molecular cell biology and proteomics. It is closely related to protein function, metabolic pathways, signal transduction and biological processes, and plays an important role in basic biological and biomedical research. Computational prediction of protein subcellular localization is cheap, fast and widely applicable. Through statistical analysis of large amounts of protein data, computational models can identify effective protein features and infer the statistical relationship between those features and subcellular localization patterns. In recent years the field has made great progress. However, existing predictive models have several disadvantages. First, protein feature information is not fully mined, so some important aspects of protein information are ignored. Second, data integration models generally concatenate heterogeneous feature spaces or adopt majority-vote ensemble learning, so the importance of each kind of feature data is not evaluated explicitly and the problem of data unavailability is not handled. Finally, existing models perform relatively poorly on unbalanced protein data, protein sub-organelle localization and large-scale protein subcellular localization. This thesis studies novel predictive methods for protein subcellular localization from the standpoint of machine learning, with the aim of improving predictive performance and giving the models a reasonable biological interpretation. The contributions are summarized as follows:
1.
Introducing multi-instance learning into protein subcellular localization prediction, in order to exploit previously ignored protein domain information: domain composition, domain boundary information and the order of domains along the protein sequence. On the one hand, multi-instance learning is introduced to capture the local structural information of a protein sequence in terms of its domains; on the other hand, multi-label learning is introduced to handle proteins with multiple subcellular locations, which offers a new approach to protein subcellular localization prediction. The proposed method uses a bag-instance representation to describe the whole-part relation between a protein sequence and its domains, thus effectively exploiting the local structural information of the sequence. The experiment on Gram-positive bacterial protein data shows that the sequence-based multi-instance learning method achieves performance comparable to the gene ontology based k-NN ensemble learning model.
2. Proposing a spectrum kernel, SpectrumKernel+, that incorporates multiple amino acid classifications into the k-mer feature representation and, on that basis, simulates multiple evolution patterns of sequence motifs. SpectrumKernel+ interprets the biological implication of incorporating amino acid classification information into the k-mer feature representation in terms of physicochemical constraints on protein sequence evolution, and is connected to the classical spectrum kernel and the (k,l) mismatch kernel, which gives the model a more reasonable biological meaning and an intuitive biological interpretation. By combining multiple amino acid classifications, SpectrumKernel+ measures the differences between two sequences' motif evolution patterns and motif distributions, and on that basis measures the similarity between two protein sequences more accurately.
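The idea of embedding amino acid classifications into k-mers can be illustrated with a minimal Python sketch. It is not the thesis's SpectrumKernel+ formulation, which the abstract does not specify. The three-class grouping `EXAMPLE_CLASSES`, the helper names and the unweighted sum in `combined_kernel` are all assumptions made for illustration. The sketch computes a plain spectrum kernel (the inner product of k-mer counts) and the same kernel after each residue is replaced by its class label, so that sequences sharing no exact k-mer can still match at the class level.

```python
from collections import Counter

# Hypothetical three-class grouping of the 20 standard amino acids, used only
# for illustration; the abstract does not list the classifications it uses.
EXAMPLE_CLASSES = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar, uncharged
    "c": "DEKRH",     # charged
}


def build_mapping(classes):
    """Invert {class_label: residues} into {residue: class_label}."""
    return {aa: label for label, residues in classes.items() for aa in residues}


def kmer_counts(seq, k, mapping=None):
    """Count k-mers of seq, optionally after mapping residues to class labels."""
    if mapping is None:
        symbols = seq
    else:
        # Residues outside the mapping (e.g. 'X') share one placeholder symbol.
        symbols = "".join(mapping.get(aa, "?") for aa in seq)
    return Counter(symbols[i:i + k] for i in range(len(symbols) - k + 1))


def spectrum_kernel(a, b, k, mapping=None):
    """Inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k, mapping), kmer_counts(b, k, mapping)
    return sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())


def combined_kernel(a, b, k, mappings):
    """Unweighted sum of the plain kernel and one kernel per class mapping
    (an assumed combination rule, chosen for simplicity)."""
    return spectrum_kernel(a, b, k) + sum(
        spectrum_kernel(a, b, k, m) for m in mappings
    )


CLASS_MAP = build_mapping(EXAMPLE_CLASSES)
print(spectrum_kernel("AVK", "LIR", 2))             # exact 2-mers only
print(spectrum_kernel("AVK", "LIR", 2, CLASS_MAP))  # class-level 2-mers
```

In this toy case the two peptides share no exact 2-mer, yet both map to the class string `hhc`, so only the class-level kernel sees them as similar. A sum of kernels is itself a valid kernel, which is why the simple combination above is safe to pass to a kernel classifier.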
Compared with general protein subcellular localization prediction, protein subnuclear localization prediction is more challenging. Experiments on two subnuclear protein datasets show that SpectrumKernel+ outperforms the baseline models.
3. Proposing a fused multi-instance kernel, HoMIKernel+, that incorporates fine-grained information from homologous sequences. Because of evolutionary conservation and divergence, homologous sequences describe the target protein's subcellular localization pattern only vaguely. This vagueness is consistent with the vagueness with which positive instances describe the bag label in multi-instance learning; it is the point where biological meaning and multi-instance learning meet, and the starting point for HoMIKernel+. HoMIKernel+ uses the k-mer feature representations of the set of homologous sequences to describe the target protein jointly, so that the motif distribution of the target protein is enhanced and possible noise is suppressed. Experiments on one prokaryotic dataset and three eukaryotic datasets show that HoMIKernel+ outperforms the baseline models, that incorporating homologous sequences improves predictive performance, and that fusing multiple multi-instance kernels significantly increases predictive accuracy.
4. Proposing two transfer learning models, a homology-based knowledge transfer learning model and a statistical-correlation-based knowledge transfer learning model, together with a simple non-parametric cross-validation method for estimating the weights of the linear kernel combination. This achieves knowledge sharing between homologous and statistically correlated proteins and reduces the time and space complexity of kernel weight estimation.
The relatedness between the target protein and the auxiliary proteins is derived from intuitive biological meaning, and on that basis the gene ontology knowledge of homologous proteins and of statistically correlated proteins is transferred to the target protein. A multiple kernel learning system is then built on the transferred knowledge for protein subcellular localization prediction. Homology-based knowledge transfer has the following advantages: it enriches the gene ontology knowledge of the target protein and overcomes the unavailability of gene ontology data for novel proteins and for proteins with little experimental evidence. Statistical-correlation-based knowledge transfer has the following advantages: it enriches protein gene ontology knowledge, tunes the weight distribution among the three aspects of gene ontology, incorporates gene ontology semantic distance, adjusts gene ontology term coverage, reduces the miss rate of test gene ontology terms, and avoids retraining the model when predicting novel proteins. Kernel weight estimation takes into account the Matthews correlation coefficient (MCC), a measure of performance bias, and therefore adapts better to large-scale unbalanced protein data. Experiments on 8 benchmark datasets show that the two knowledge transfer learning models significantly improve the performance of protein subcellular localization prediction, reduce to some degree the unfavorable impact of noise and outliers that gene ontology knowledge transfer may introduce, overcome the performance bias towards large subcellular locations, and perform well on large-scale unbalanced protein data.
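A small Python sketch can show why the MCC is a useful signal on unbalanced data. It is not the thesis's weight-estimation procedure, which the abstract does not detail. The function name `mcc_one_vs_rest`, the toy labels and the convention of returning 0 when the denominator vanishes are assumptions made for illustration. The sketch computes a per-location, one-vs-rest MCC from the confusion counts.

```python
import math


def mcc_one_vs_rest(y_true, y_pred, location):
    """Matthews correlation coefficient for one location against all others."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == location:
            if t == location:
                tp += 1
            else:
                fp += 1
        elif t == location:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal count is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy unbalanced data: a predictor that always answers the majority location.
y_true = ["cytoplasm"] * 9 + ["nucleus"]
y_pred = ["cytoplasm"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(mcc_one_vs_rest(y_true, y_pred, "nucleus"))
```

Here the majority-class predictor reaches 90% accuracy while its MCC for the minority location is 0, so a weight-selection criterion that includes MCC would not reward this kind of bias towards large classes.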
【Key words】 protein subcellular localization; machine learning; multiple kernel learning; multi-instance learning; transfer learning;