Method Development of Protein Functional Site Prediction Based on Sequence Information

【Author】 Chen Zhen (陈震)

【Supervisor】 Zhang Ziding (张子丁)

【Author information】 China Agricultural University, Bioinformatics, 2014, PhD

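【Summary】 Identifying protein functional sites is important for understanding the biological functions of proteins, and predicting such sites computationally is a major topic in bioinformatics. In this thesis, the author studies the prediction of two kinds of protein functional sites: ubiquitination sites and zinc-binding sites. First, based on the sequence features of ubiquitination sites in yeast and human, the author developed two prediction tools, CKSAAP_UbSite for yeast and hCKSAAP_UbSite for human. Next, using datasets from four species, the author systematically evaluated the performance of existing ubiquitination site prediction tools. Finally, after analyzing the sequence features of protein zinc-binding sites, the author integrated multiple prediction methods and features into a new sequence-based zinc-binding site prediction tool.

As an important reversible post-translational modification, ubiquitination is involved in many biological processes and is closely linked to a variety of diseases. Identifying ubiquitination sites is the first, and an important, step toward understanding ubiquitination-related biological processes and molecular mechanisms. The author therefore developed CKSAAP_UbSite, a yeast-specific ubiquitination site prediction tool based on the sequence features surrounding yeast ubiquitination sites. CKSAAP_UbSite is the first tool to apply the CKSAAP encoding to ubiquitination site prediction, and it builds its prediction model with a support vector machine. For the convenience of the academic community, an online server (http://protein.cau.edu.cn/cksaap_ubsite/) and accompanying software were developed to run the CKSAAP_UbSite algorithm. CKSAAP_UbSite can also be used to predict ubiquitination sites across a whole proteome.

With advances in mass spectrometry-based proteomics, tens of thousands of human ubiquitination sites have been determined experimentally. To handle the complex sequence features around human ubiquitination sites, the author developed a human-specific prediction tool by integrating several complementary prediction methods. First, a prediction model was built with the CKSAAP encoding and a support vector machine. Then, to mine the sequence features around human ubiquitination sites further, three more support vector machine models were built using the orthogonal (binary) encoding, the physicochemical property encoding and the protein aggregation propensity encoding, respectively. Finally, the outputs of the four models were combined by logistic regression to give hCKSAAP_UbSite. In 5-fold cross-validation, hCKSAAP_UbSite reached an AUC (area under the ROC curve) of 0.770. For the convenience of users, the hCKSAAP_UbSite algorithm was integrated into the CKSAAP_UbSite online server.

In recent years, many ubiquitination site prediction tools have been developed. These tools differ considerably in the classification algorithms they adopt, the features they use and the species their datasets come from, which leaves users unsure which tool to choose. To address this problem, the author collected datasets from four species and comprehensively compared the prediction performance of five tools. The author then evaluated the ease of use of each tool from the user's perspective, to help users choose a prediction tool quickly and efficiently. Finally, the author tested the predictive power of several commonly used encoding features for ubiquitination sites and ranked them, to find which features predict well in a given species.

As an important trace element, zinc is closely linked to many biological processes and diseases, and it plays an important role in enabling proteins to perform their functions. Motivated by this biological importance, the author proposed ZincExplorer, a new sequence-based method for predicting zinc-binding sites. ZincExplorer is an ensemble algorithm that integrates the results of three prediction methods (an SVM-based predictor, a cluster-based predictor and a template-based predictor) and can make predictions for four residue types (CYS, HIS, ASP and GLU). In 5-fold cross-validation, ZincExplorer reached an AURPC (area under the recall-precision curve) of 0.851, and at a recall of 70% its precision reached 85.6% (specificity = 98.4%, MCC = 0.747). ZincExplorer can also identify the interdependent relationships (IRs) among multiple residues bound to the same zinc ion. Finally, the author set up an online server (http://protein.cau.edu.cn/ZincExplorer/) that runs the ZincExplorer algorithm and is free for academic use.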

【Abstract】 Identification of protein functional sites is of great importance for understanding the biological function of protein molecules. In silico prediction of protein functional sites has become an important topic in the field of bioinformatics. In this thesis, the author focused on the prediction of two different types of protein functional sites (ubiquitination sites and zinc-binding sites). Firstly, according to the ubiquitination site characteristics of yeast and human, the author developed two species-specific ubiquitination site prediction tools (CKSAAP_UbSite and hCKSAAP_UbSite). Then, the author conducted a comprehensive evaluation of the existing ubiquitination site prediction tools based on four datasets from different species. Finally, after an intensive feature analysis of zinc-binding sites versus non-zinc-binding sites, multiple prediction methods and features were integrated into a prediction tool named ZincExplorer.

As one of the most important reversible protein post-translational modifications (PTMs), ubiquitination has been reported to be involved in many biological processes and closely implicated in various diseases. To fully decipher the molecular mechanisms of ubiquitination-related biological processes, an initial but crucial step is the recognition of ubiquitylated substrates and the corresponding ubiquitination sites. First, a new bioinformatics tool named CKSAAP_UbSite was developed to predict ubiquitination sites from protein sequences in yeast. The key feature of CKSAAP_UbSite is that it uses the composition of k-spaced amino acid pairs (CKSAAP) surrounding a query site (i.e. any lysine in a query sequence) as input to a Support Vector Machine (SVM).
To facilitate the community's research, a web server for CKSAAP_UbSite was constructed and is freely available at http://protein.cau.edu.cn/cksaap_ubsite/; it can also be used for proteome-wide ubiquitination site identification.

Recent developments in mass spectrometry (MS)-based proteomics have greatly expedited proteome-wide analysis of PTMs, and more than ten thousand ubiquitination sites in human have been determined. To address the complicated sequence context of human ubiquitination sites, the author developed a novel human-specific ubiquitination site predictor through the integration of multiple complementary classifiers. Firstly, an SVM classifier was constructed based on the CKSAAP encoding, which had been utilized in the author's previous yeast ubiquitination site predictor. To further exploit the patterns and properties of the ubiquitination sites and their flanking residues, three additional SVM classifiers were constructed using the binary amino acid encoding, the AAindex physicochemical property encoding and the protein aggregation propensity encoding, respectively. Through an integration that relied on logistic regression, the resulting predictor, termed hCKSAAP_UbSite, achieved an area under the ROC curve (AUC) of 0.770 in a 5-fold cross-validation test on a class-balanced training dataset. For the convenience of users, hCKSAAP_UbSite has been integrated into the existing CKSAAP_UbSite server.

In the past several years, a few tools have been developed for the prediction of ubiquitination sites, but users are frequently confused by the differences in the prediction algorithms adopted and the selected features, as well as in the performance across species. To address this problem, the author first compared and analyzed five popular standalone/web-server tools on four large datasets from different species. Then, the author summarized the ease of use of the tools under investigation in order to guide users in choosing among them more efficiently.
Finally, the author tested most of the features used in previous prediction tools and ranked them according to their performance, to find out which features contribute significantly to predicting ubiquitination sites for a specific species.

As one of the most important trace elements within an organism, zinc has been shown to be involved in numerous biological processes and closely implicated in various diseases. The zinc ion is important for proteins to perform their functional roles. Motivated by the biological importance of zinc, the author proposed a new method called ZincExplorer to predict zinc-binding sites from protein sequences. ZincExplorer is a hybrid method that integrates the outputs of three different types of predictors, namely SVM-, cluster- and template-based predictors. Four types of zinc-binding amino acids, the CHEDs (i.e. CYS, HIS, ASP and GLU), can be predicted using ZincExplorer. It achieved a high AURPC (Area Under Recall-Precision Curve) of 0.851, and a precision of 85.6% (specificity = 98.4%, MCC = 0.747) at 70.0% recall for the CHEDs in the 5-fold cross-validation test. Moreover, ZincExplorer can also identify the interdependent relationships (IRs) of the predicted zinc-binding sites bound to the same zinc ion, which makes it a useful tool for providing in-depth zinc-binding site annotation. For the benefit of the research community, an online web server for ZincExplorer was constructed, which is freely accessible at http://protein.cau.edu.cn/ZincExplorer/.
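
As an illustration of the CKSAAP encoding mentioned in the abstract, the following is a minimal sketch of how the composition of k-spaced amino acid pairs might be computed for the sequence window around each lysine. It is not the thesis's implementation: the function names, the flank size of 13, the maximum spacing k_max = 4, the padding character "X" and the choice to skip pairs containing non-standard residues are all assumptions made for illustration, since the abstract does not state them.

```python
from itertools import product

# The 20 standard amino acids; every ordered pair gives one feature per spacing k.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def cksaap_encode(window, k_max=4):
    """Return the CKSAAP feature vector of a sequence window.

    For each spacing k in 0..k_max, count residue pairs (a, b) where b
    follows a with exactly k residues in between, then divide by the number
    of positions at which such a pair can start. k_max is an assumed default.
    """
    features = []
    for k in range(k_max + 1):
        counts = [0] * len(PAIRS)
        n_positions = len(window) - k - 1
        for i in range(max(n_positions, 0)):
            idx = PAIR_INDEX.get(window[i] + window[i + k + 1])
            if idx is not None:  # pairs with padding or unknown residues are skipped
                counts[idx] += 1
        denom = n_positions if n_positions > 0 else 1
        features.extend(c / denom for c in counts)
    return features


def lysine_windows(sequence, flank=13, pad="X"):
    """Yield (1-based position, window) for every lysine in the sequence.

    The window holds `flank` residues on each side of the lysine; the
    sequence ends are padded with `pad`. Both defaults are assumptions.
    """
    padded = pad * flank + sequence + pad * flank
    for pos, residue in enumerate(sequence):
        if residue == "K":
            yield pos + 1, padded[pos:pos + 2 * flank + 1]
```

Each window yields 400 × (k_max + 1) features, which could then be passed to a classifier such as an SVM; the classifier settings are likewise not given in the abstract.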
