Sequence-based Structure and Function Prediction for Membrane Proteins
【Author】 Wang Chengqi (王诚祺);
【Supervisor】 Yao Xiaojun (姚小军);
【Author information】 Lanzhou University, Chemoinformatics, 2012, Ph.D.
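【Abstract (Chinese version, translated)】 Membrane proteins play an extremely important role in the life activities of the whole cell. They are responsible for many processes, including ion transport, small-molecule transport and complex cellular signal transduction. Membrane proteins are also the targets of many drugs: an estimated 60% or so of drugs act directly on membrane proteins. However, the structural and functional information on membrane proteins currently available to biologists is still limited, mainly because determining membrane protein structures and studying their functions experimentally is complicated, and researchers find it hard to isolate stable membrane protein samples for electron microscopy or X-ray crystallography. The study of membrane protein structure and function remains one of the most challenging research areas in biology. This dissertation takes the structure and function of membrane proteins as its subject and combines a range of statistical and bioinformatics methods to explore new approaches to sequence-structure and sequence-function relationships. The aim is to develop structure and function prediction models based on membrane protein sequence information and to address important problems in membrane protein research such as structure prediction, subcellular localization prediction and function prediction. Chapter 1 introduces the structure, biogenesis mechanism, folding and functional classification of membrane proteins. It then introduces bioinformatics-based models for predicting membrane protein structure and function, and finally describes the membrane protein databases, sequence representations and modelling methods used in this work. In Chapter 2, guided by the principles of concise input, a simple prediction method and high prediction accuracy, we applied least squares support vector machines to build an efficient model for predicting the burial status of transmembrane residues of α-helical membrane proteins (whether a residue is exposed to the phospholipid layer or buried within the helical structure). The method uses a sliding window to extract the sequence information around the target residue (the residue being predicted). The windowed sequence is then characterized using structural features, physicochemical features and a conservation index, and recursive feature elimination (RFE) is used to select the sequence features highly correlated with burial status. Finally, the selected descriptors are fed to the least squares support vector machine to build the burial status prediction model. The training set comprises 43 membrane proteins, and the model's predictive ability was externally validated with 10 α-helical membrane proteins that took no part in model building. The results show that the model gives satisfactory predictions. In addition, by applying feature selection we identified the important sequence information that influences the burial status of transmembrane residues. The burial status model can only indicate which transmembrane residues are exposed to the phospholipid layer; it cannot give the amount of exposed area. We therefore developed a quantitative model that predicts the solvent accessible surface area of α-helical and β-sheet transmembrane residues. The model is built on a training set of 78 α-helical membrane proteins and 24 β-barrel membrane proteins. We first characterize the windowed sequences with evolutionary information and, according to the residual sum of squares of the descriptors returned by the random forest algorithm, select the sequence features highly correlated with accessible surface area. Finally, the selected descriptors are fed to support vector machine and random forest algorithms to build the models. The solvent accessible surface area results show that the predictive and fitting ability of the random forest algorithm is better than that of the support vector machine. Obtaining the subcellular localization of membrane proteins is one of the important ways to understand their function. In Chapter 4, we developed a prediction model that can effectively identify all subcellular localizations of eukaryotic membrane proteins. The model is built as follows. First, all membrane protein sequences and their subcellular localization information are downloaded from the UniProt database and randomly divided into a training set and a test set. The sequence features of the membrane proteins are then described using evolutionary information, structure and physicochemical properties, and the prediction model is built with the K-nearest neighbor algorithm combined with Chou's function. The model was checked by leave-one-out cross-validation and on an external test set; the results show that it has good fitting and predictive ability and gives satisfactory predictions. More importantly, because of the introduction of Chou's function, the model can directly handle the complex classification problem of membrane proteins with multiple subcellular localizations. In Chapter 5, we propose a sequence-based model for predicting membrane protein function. The model can predict among 26 functional classes of membrane proteins and can directly return multiple functional classes for a single membrane protein. Likewise, the model starts entirely from the sequence information of membrane proteins and characterizes the sequences with sequence-based evolutionary, structural and physicochemical information. Cross-validation and external test set results show that the model has stable predictive ability and can be used for membrane protein function prediction.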
【Abstract】 Membrane proteins are crucial players in the cell and take a central role in processes ranging from the transport of ions and small molecules to sophisticated signaling pathways. Many are also prime current or future drug targets, and it has been estimated that about 60% of approved drugs are directed against membrane proteins. Despite their biological importance, structural and functional studies of membrane proteins remain notoriously hard, owing to the problems of purifying them and obtaining them in stable forms suitable for X-ray crystallography and electron microscopy (EM). Membrane proteins therefore remain one of the most important and most challenging research subjects in a number of disciplines. This dissertation focuses on structural and functional studies of membrane proteins, using various mathematical and bioinformatics approaches to study the relationship between sequence, structure and function. The ultimate purpose is to build sequence-based models that predict the structure and function of membrane proteins. Most importantly, we hope the resulting models can address major issues (structure determination, subcellular localization and functional studies) for membrane proteins from sequence information alone. In Chapter 1, we first review the current understanding of membrane protein structure, biogenesis, folding and function. We then discuss current structure and function prediction methods against the background of what is known about membrane proteins. Finally, we introduce the data resources, sequence representations and mathematical prediction methods used in this dissertation for membrane protein structure and function prediction. In Chapter 2, we present a novel and concise method for predicting the burial status (whether a residue is exposed to the lipid bilayer or buried within the protein core) of transmembrane residues of α-helical membrane proteins.
Using a sliding window, the sequence information contained in the immediate neighbors of the central residue is first extracted. Two strategies are then used to generate features that encode the window. The main features include the conservation index and sequence-based structural and physicochemical features. The features highly correlated with burial status are then selected using recursive feature elimination (RFE). Finally, least squares support vector machines (LS-SVMs) are used to develop the classification model, owing to their good performance and low computational cost in classification model development. The model was developed from 43 membrane protein chains, and its predictive ability was evaluated on an independent test set of ten other non-redundant membrane protein chains. The prediction accuracy of our method was satisfactory. In addition, the position and composition of hydrophobic amino acid properties proved to be very important features influencing the burial status of a transmembrane (TM) residue. The burial status prediction model can only qualitatively identify exposed transmembrane residues; it cannot determine how much surface area is exposed. We therefore developed a sequence-based computational model for predicting the solvent accessible surface area of α-helical and β-barrel transmembrane residues. The main process of our model is described in Chapter 3. The model was developed from 78 α-helical membrane protein chains and 24 β-barrel membrane proteins. First, the evolutionary conservation in a set of α-helical and β-barrel transmembrane proteins was extracted using a sliding window. Thereafter, the decrease in "residual sum of squares" was used to rank all variables, and the conservation scores highly correlated with the accessible surface area of transmembrane residues were selected to build the model. Finally, the prediction models were developed using support vector machine and random forest methods.
The results show that our model performs well for both types of transmembrane residues and outperforms other prediction models that were developed for a specific type of transmembrane residue. The prediction results also show that the random forest model incorporating conservation scores is an effective sequence-based computational approach for predicting the solvent accessible surface area of transmembrane residues. Knowledge of the subcellular localization of membrane proteins is important and in many cases fundamental to understanding their function, for example in cellular function, biological processes, signal transduction, metabolic pathways and drug design. In Chapter 4, we aimed to develop a model that can predict the subcellular localization of membrane proteins, covering all localization sites in eukaryotes. The main process of our model is as follows. First, the dataset was downloaded from the UniProt database. The dataset was then divided into a development set and an independent test set. To represent the information about membrane proteins (MPs) comprehensively, we used sequence-derived structural and physicochemical features together with evolutionary information extracted through the concept of Chou's pseudo amino acid composition. We used the K-nearest neighbor (KNN) algorithm combined with Chou's score function to develop the computational model. The performance of the prediction model was evaluated by cross-validation and by its predictions on the test set. The results show that our computational method performs well in predicting multiple subcellular localization sites of membrane proteins in eukaryotes. In Chapter 5, the first sequence-based model for predicting the function of membrane proteins is presented. It can be used to classify eukaryotic membrane proteins among 26 functions. In addition, the predictor is powerful and flexible, particularly in dealing with proteins with multiple functions.
Sequence-based structural and physicochemical information and evolutionary information have both been fused into the predictor. The satisfactory prediction results from cross-validation and the independent test set show that our computational method is reliable for predicting multiple functions of membrane proteins in eukaryotes.
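
The sliding-window step used in Chapters 2 and 3 can be illustrated with a minimal sketch. This is not the dissertation's implementation. The padding symbol `X`, the window half-width, the function names and the hydrophobic residue set are all assumptions chosen for illustration, and the toy descriptor stands in for the conservation, structural and physicochemical features named above.

```python
from typing import List

PAD = "X"  # assumed padding symbol for window positions beyond the sequence ends


def sliding_windows(sequence: str, half_width: int = 3) -> List[str]:
    """Return one window of length 2 * half_width + 1 per residue, centred on it."""
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    padded = PAD * half_width + sequence + PAD * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def hydrophobic_fraction(window: str, hydrophobic: str = "AILMFVWC") -> float:
    """Toy descriptor: fraction of window positions holding a hydrophobic residue.

    The residue set is an illustrative assumption, not the feature set of the thesis.
    """
    return sum(res in hydrophobic for res in window) / len(window)


# Hypothetical 7-residue sequence; each residue gets its own window and descriptor.
windows = sliding_windows("MKTLLVA", half_width=2)
features = [hydrophobic_fraction(w) for w in windows]
```

Each central residue then carries one feature vector (here a single number) that a classifier or regressor could take as input. Padding keeps the window length constant for residues near the ends of the chain.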