基于智能计算的膜蛋白结构与相互作用预测研究
Study on Intelligent Computation Based Prediction of Membrane Protein Structure and Interaction
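【Author】 Zhao Peiying (赵培英);
【Advisor】 Ding Yongsheng (丁永生);
【Author Information】 Donghua University (东华大学), Control Theory and Control Engineering, 2010, Ph.D.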
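【Abstract (translated from the Chinese)】 Once genomic data are available, the most direct way to analyze the function of every gene, and to clarify the expression patterns and biological functions of all the proteins that the genome expresses and that actually carry out life activities, is to study protein structure. Within the study of membrane protein structure and function, predicting membrane protein types is important foundational work. Predicting membrane proteins with molecular biology methods can no longer keep pace with the growing number of membrane protein sequences. This thesis therefore applies intelligent computing techniques to mine the amino acid order information within membrane protein sequences, in order to better understand the relationship between membrane protein sequence, structure, and function. In addition, the growing volume of large-scale genome sequencing has supplied more membrane protein sequences and, with them, a basis for studying membrane protein interactions. Membrane protein interactions play an important role in life activities: they govern normal physiological processes and also play an important role in pathological processes; they offer clues for annotating the biological functions of unknown membrane proteins and supply information needed to study membrane protein structure and understand the mechanisms of life activities. This thesis studies membrane protein structure on the basis of membrane protein sequences, along two lines: prediction of membrane protein structural classes, and identification and prediction of membrane protein interactions. Using pseudo amino acid composition theory and the approximate entropy algorithm, parameter combinations are optimized, different classifiers are formed from different parameter combinations, and an ensemble classifier is finally built to predict membrane protein structural classes; a fuzzy support vector machine network is also established to classify membrane proteins using their biophysical properties. In the study of membrane protein interactions, a larger set of positive samples is collected, interaction features are extracted with the help of experimental data, and a fuzzy support vector machine algorithm is applied to identify membrane protein interactions; building on this, different feature representations are adopted, additional data sets are built, and the AdaBoost algorithm is applied to combine multiple weak classifiers to predict membrane protein interactions, so as to better study membrane protein structure and function. The specific research content of this thesis is as follows. In predicting the secondary structural classes of membrane proteins, pseudo amino acid composition theory is used to describe membrane protein sequences, with approximate entropy values as supplementary sequence information; using optimized weight coefficients, multiple different classifiers are built according to different parameter settings, and multiple fuzzy k-nearest-neighbor classifiers are combined into an ensemble; after training and testing, the ensemble classifier is applied to predict membrane protein structural classes, and jackknife tests demonstrate the effectiveness and practicality of the method. To address the unclassifiable regions that arise in conventional support vector machine classification, a fuzzy membership function is introduced to form a fuzzy support vector machine classifier; multiple such classifiers are combined into a fuzzy support vector machine network, which predicts membrane protein structural classes using the physicochemical property information of membrane protein sequences. Because of properties of membrane proteins such as hydrophobicity, their structural data make up a very small share of the overall protein database, and obtaining membrane protein interactions experimentally is even harder, so very few membrane protein interactions are known. This thesis proposes using a fuzzy support vector machine algorithm to identify unknown membrane protein pairs, collecting a larger set of positive samples and extracting interaction features with the help of experimental data; validation shows the algorithm to be effective. The principle of AdaBoost is that samples one weak learner fails to learn well become, as far as possible, the samples the next weak learner concentrates on. We therefore apply the AdaBoost algorithm to combine multiple weak classifiers, use different data sets, and extract membrane protein interaction features in different ways to obtain better feature representations; applying the ensemble classification system to classify and predict membrane protein interactions gave very good results. Finally, the work of the whole thesis is summarized, the shortcomings of the research are pointed out, and future research directions and priorities are discussed.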
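The abstract names approximate entropy as supplementary sequence information alongside pseudo amino acid composition, but gives no formulas. The Python sketch below is one plausible reading, not the thesis's implementation. It assumes the Kyte-Doolittle hydropathy scale for turning residues into numbers, a tolerance of 0.2 times the standard deviation of the series, embedding dimensions 2 and 3, and a weight of 0.05. All function names, parameter values, and the choice of scale are illustrative assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index. Using this scale is an assumption;
# the abstract does not say which physicochemical property was used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def approximate_entropy(series, m, r):
    """Approximate entropy ApEn(m, r) of a numeric series (self-matches counted)."""
    n = len(series)
    if n <= m + 1:
        raise ValueError("series too short for the requested dimension")

    def phi(dim):
        count = n - dim + 1
        windows = [series[i:i + dim] for i in range(count)]
        total = 0.0
        for a in windows:
            # Count windows within Chebyshev distance r of window a.
            matches = sum(
                1 for b in windows
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


def pseudo_composition(sequence, weight=0.05, dims=(2, 3), r_factor=0.2):
    """20 residue frequencies plus weighted ApEn terms, normalized to sum to 1.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    r = r_factor * std
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    extras = [weight * approximate_entropy(values, m, r) for m in dims]
    total = sum(freqs) + sum(extras)
    return [x / total for x in freqs + extras]
```

Calling `pseudo_composition` on a protein sequence string returns a 22-dimensional vector under these defaults. Varying `weight` and `dims` would yield the differently parameterized feature sets from which separate base classifiers could be built.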
【Abstract】 Once genetic data have been obtained, the most direct way to analyze the function of every gene, and to clarify the expression patterns and biological functions of the proteins that the genome expresses and that carry out life activities, is to study protein structure. In the study of membrane protein structure and function, predicting membrane protein types is an important foundation. However, predicting membrane protein types with molecular biological methods cannot keep pace with the growing number of membrane protein sequences. Given an amino acid sequence, which features should be derived from it, and how should they be formulated so that they correctly represent the relationship between the sequence and the structure or function of the corresponding protein? In other words, the characterization of amino acid sequences requires further study. In this thesis, intelligent computing techniques are used to mine the information in membrane protein sequences in order to better understand the relationship between membrane protein sequence, structure, and function. In addition, the growing volume of large-scale genome sequencing provides not only more membrane protein sequences but also the conditions for studying membrane protein interactions. Membrane protein interactions play an important role in life activities. They provide not only clues for annotating the unknown biological functions of membrane proteins but also information needed to study membrane protein structure and understand the mechanisms of life activities. In this thesis, we study the structure of membrane proteins based on their sequences. We focus on two areas: the prediction of membrane protein types and the prediction of membrane protein interactions.
Using pseudo amino acid composition theory and the approximate entropy algorithm, we optimize the parameter combinations, build several different classifiers from the different combinations, and finally construct an ensemble classifier by integrating these basic classifiers. The ensemble classifier is used to predict membrane protein structural classes. We also establish a fuzzy support vector machine network that classifies membrane proteins using their biophysical properties. In the study of membrane protein interactions, we collect more positive samples, extract features of membrane protein interactions from experimental data, and use a fuzzy support vector machine algorithm to identify membrane protein interactions. Building on this, we create additional data sets, use different feature representation methods, and apply the AdaBoost algorithm to integrate multiple weak classifiers to predict membrane protein interactions. The main contributions of the thesis are as follows. In the prediction of the secondary structural classes of membrane proteins, we first use pseudo amino acid composition theory to describe membrane protein sequences, with additional sequence information computed by the approximate entropy method. Next, using the optimized weighting factor, we build a number of different classifiers according to different parameter settings. We then integrate a number of fuzzy K-nearest-neighbor classifiers and, after training and testing, apply the ensemble classifier to predict membrane protein structural classes. Jackknife tests on the data sets show that the method is effective and practical. Classification with the traditional support vector machine algorithm leaves unclassifiable regions.
To resolve this problem, we introduce a fuzzy membership function to form a fuzzy support vector machine classifier and then integrate multiple such classifiers to build a fuzzy support vector machine network. Combined with information on the physicochemical properties of membrane protein sequences, the network is used to predict membrane protein structural classes. Because of the hydrophobic character of membrane proteins, their structural data make up a very small proportion of the protein database. Experimental determination of membrane protein interactions is even more difficult, so very little data on membrane protein interactions is known. In this thesis, we use a fuzzy support vector machine algorithm to identify unknown membrane protein pairs. We collect more positive samples and extract interaction features with the help of experimental data. The algorithm is shown to be effective. The principle of AdaBoost is that the samples a weak learner cannot learn well become, as far as possible, the samples the next weak learner focuses on. We therefore apply the AdaBoost algorithm to integrate multiple weak classifiers, test on different data sets, and extract the characteristics of membrane protein interactions in different ways to obtain better feature representations. Applying the ensemble classification system to classify and predict membrane protein interactions achieved good results. Finally, the thesis is summarized, and its shortcomings and directions for further work are discussed.
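The AdaBoost principle stated in the abstract, that samples one weak learner handles poorly are emphasized for the next, can be illustrated with a minimal sketch. The code below is a generic discrete AdaBoost with one-feature threshold classifiers (decision stumps) as weak learners. It is not the thesis's system: the stump learner, the function names, and the use of labels +1 (interacting) and -1 (non-interacting) are assumptions made for illustration.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Weak learner: predict `polarity` if the feature is <= threshold, else the opposite."""
    return polarity if row[feature] <= threshold else -polarity


def train_stump(X, y, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(
                    w for row, label, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, rounds=10):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = train_stump(X, y, weights)
        # Clip the error so that alpha stays finite for a perfect stump.
        error = min(max(error, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose
        # weight, so the next stump concentrates on the hard samples.
        weights = [
            w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
            for row, label, w in zip(X, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(
        alpha * stump_predict(row, feature, threshold, polarity)
        for alpha, feature, threshold, polarity in ensemble
    )
    return 1 if score >= 0 else -1
```

As a toy check with hypothetical data, take the one-dimensional samples 1 through 6 with labels +1, +1, -1, -1, +1, +1. No single stump separates them, but three boosting rounds are enough for the weighted vote to classify all six correctly. In the setting the abstract describes, each row of `X` would instead be a feature vector for a pair of membrane proteins.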