
蛋白质翻译后修饰及其相互作用预测方法研究

The Research of Protein Post-translational Modifications and Protein-protein Interactions Prediction Methods

【Author】 Zhao Xiaowei (赵晓威)

【Advisor】 Ma Zhiqiang (马志强)

【Author information】 Northeast Normal University, Cell Biology, 2013, Ph.D.

【Abstract】 Protein post-translational modifications (PTMs) and protein-protein interactions (PPIs) underlie the normal biological functions of proteins and play an essential role in living organisms. Because experimental techniques are limited and the relevant data are scattered and incomplete, only a few of the more than 350 experimentally confirmed kinds of PTM have been well characterized. Identifying PTM sites by conventional experiments is laborious and expensive, and optimizing the enzymatic reactions is itself very time-consuming; these factors severely limit progress in the field. Computational methods have therefore been proposed and applied with varying success. They can predict PTM sites efficiently and accurately, and they provide clues for subsequent in vivo or in vitro validation. Research on PPIs, in turn, helps in understanding biological processes at the system level, provides a reliable data source for investigating disease mechanisms, and opens paths toward new drug targets and drug development. This thesis studies prediction methods for PTM sites and for PPIs. The main results are as follows:

(1) We propose an ensemble-learning method for predicting lysine ubiquitylation sites. First, four types of features are used to encode each lysine site and its neighboring residues. Second, to reduce computational complexity and improve prediction accuracy, a feature selection method is used to obtain an optimal feature subset. Finally, an ensemble classifier is built on the selected features, and the features in the optimal subset are analyzed. Comparisons with other predictors on public datasets indicate that the ensemble classifier performs well.

(2) We construct a new pupylation site predictor from informative features of pupylation substrates. First, for each protein sequence in the training set, five types of features are extracted and used to encode the pupylation site and its neighboring residues. Then the maximum relevance minimum redundancy (mRMR) method and incremental feature selection (IFS) are applied to the combined feature set to find the optimal feature subset. Finally, a model based on the nearest neighbor algorithm (NNA) is built on that subset; it reaches an accuracy of 70.93% in the jackknife (leave-one-out) test. Biological analysis of the optimal feature subset shows that evolutionary information and physicochemical/biochemical properties play a very important role in recognizing pupylation sites, and that positions 7, 10 and 11 contribute the most. These results indicate that combining mRMR with IFS is an effective way to select features from biological datasets, and that a model built on the selected subset both achieves satisfactory prediction performance and makes the biological significance of the selected features easier to identify.

(3) We apply the composition of k-spaced amino acid pairs (CKSAAP) encoding to phosphorylation site prediction for the first time, and it improves prediction accuracy. When benchmarked against PPRED, DISPHOS and NetPhos, the resulting tool, CKSAAP_PhSite, predicts phosphorylation sites more accurately. CKSAAP_PhSite achieves a sensitivity of 84.81%, a specificity of 86.07% and an accuracy of 85.43% for serine; a sensitivity of 78.59%, a specificity of 82.26% and an accuracy of 80.31% for threonine; and a sensitivity of 74.44%, a specificity of 78.03% and an accuracy of 76.21% for tyrosine. The results indicate that the approach is effective and practical, and the accompanying feature analysis shows that CKSAAP encoding captures sequence patterns around phosphorylation sites. An online prediction server based on this model has been established.

(4) We propose a PPI prediction method based on an augmented form of Chou's pseudo amino acid composition. First, three groups of descriptors are used to encode each interacting protein pair, so that each pair is represented by 930 sequence features. Principal component analysis (PCA) is then used for dimensionality reduction; the resulting feature subset contains few features while retaining as much of the information in the original feature set as possible. Finally, a support vector machine model for PPI prediction is built with the reduced features as input and compared with other predictors on the Drosophila melanogaster and Helicobacter pylori datasets. The results indicate that the proposed model predicts protein-protein interactions more accurately.
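To make the CKSAAP encoding in result (3) concrete, the following is a minimal sketch of the general technique, not the thesis's implementation. The abstract does not state the window length, the maximum gap, or how gaps and non-standard residues are handled, so the function name `cksaap`, the default `k_max = 2`, and the choice to ignore pairs containing non-standard symbols are all assumptions made for illustration. For each gap k, the encoding counts every ordered residue pair separated by k positions in the sequence window and divides by the number of pair positions, giving 400 features per value of k.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order: "AA", "AC", ..., "YY".
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(window: str, k_max: int = 2) -> List[float]:
    """Encode a sequence window as CKSAAP features for k = 0..k_max.

    k_max and the handling of non-standard symbols are illustrative
    assumptions; the thesis abstract does not specify them.
    """
    features: List[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # Number of positions i at which both i and i + k + 1 lie in the window.
        n_positions = max(len(window) - k - 1, 0)
        for i in range(n_positions):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # pairs with gap or non-standard symbols are skipped
                counts[pair] += 1
        denom = n_positions if n_positions > 0 else 1
        features.extend(counts[p] / denom for p in PAIRS)
    return features


if __name__ == "__main__":
    # Hypothetical window centred on a serine residue.
    vec = cksaap("AKRSPTAKE", k_max=1)
    print(len(vec))
```

With `k_max = 1` the vector has 2 × 400 entries. In a site predictor, vectors like this, computed for windows centred on candidate serine, threonine or tyrosine residues, would be the input to a classifier.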
