
膜蛋白分类的特征提取算法和数据集构建技术研究

Research on Feature Extraction Algorithm and Dataset Construction Technology in Membrane Protein Classification

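【Author】 曾聪 (Zeng Cong)

【Advisor】 王正华 (Wang Zhenghua)

【Author information】 National University of Defense Technology (国防科学技术大学), Computer Science and Technology, 2010, Master's thesis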

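【Abstract, translated from the Chinese original】 As one of the main components of biological membranes, membrane proteins play an extremely important role in living organisms. Membrane proteins are the main carriers of membrane function and the material basis on which cells perform their various functions. Research reported in recent years further shows that changes in the structure or function of certain membrane proteins are closely linked to the onset of human diseases, and the corresponding receptor membrane proteins have become important targets for drug design. This thesis therefore takes membrane proteins as its object of study. The Human Genome Project (HGP), proposed in the early 1990s, has achieved great success through the joint efforts of scientists around the world and has greatly advanced genomics and proteomics. With the massive growth of biological data, bioinformatics methods that rely on computer technology have gone beyond earlier research approaches. Predicting the type of a membrane protein from its primary sequence, in order to obtain related higher-order structural and functional information and thereby address the underlying biological questions, is an extremely important and challenging task, and it is the aim of this thesis. Collecting and constructing the dataset used for membrane protein classification is the foundation and precondition of the whole classification model; the quality of the dataset determines the accuracy of the algorithm, which makes it one of the key elements of computational membrane protein classification. Feature extraction from membrane protein sequences is the most basic problem in computational membrane protein classification and is also decisive for classification quality. This thesis analyzes the construction criteria of the common datasets, selects membrane protein sequences from the latest release of the SWISS-PROT database, and builds a new membrane protein dataset. Starting from the primary sequence, it studies the prediction of membrane protein structural and functional types, reviews the sequence feature extraction and classification algorithms already used in this field, analyzes the mathematical principles of the different algorithms in depth, and on that basis constructs a new feature extraction method for membrane proteins. It then compares the new feature extraction algorithm with other membrane protein classification models on the newly built dataset.
1) Construction of a new membrane protein sequence dataset. The membrane proteins used for classification come from the protein database SWISS-PROT. The standard datasets in common use, CE2059 and CE2625, were built on SWISS-PROT Release 35.0 (1997). As the database evolves, protein sequences are continually updated, the data volume and content turn over very quickly, and the database holds more and more proteins at ever larger scale with increasingly precise classification annotations. Building an up-to-date dataset is therefore a labor-intensive but highly significant task for membrane protein classification research. After analyzing the problems of CE2059 and CE2625, namely their early construction date and incomplete annotations, this thesis selects membrane protein sequences that meet the accepted standard dataset construction criteria from SWISS-PROT Release 57.0 (2009), the latest release of this international public database, and organizes them into a new, more complete standard training dataset. This supplements the field and lays the data foundation for subsequent research.
2) A feature extraction algorithm based on autocorrelation coefficients built from multiple amino acid residue indices. Feature extraction is another key element of the membrane protein classification problem and is decisive for classification quality. To obtain a prediction model with better classification performance, this thesis adds sequence-order correlation information about the amino acid residues on top of the amino acid composition of the sequence, so as to mine as much as possible of the structural and functional information contained in membrane protein sequences. Taking into account the physicochemical properties and long-range correlations of the amino acid residues in membrane protein sequences, it proposes a feature extraction algorithm based on autocorrelation coefficients built from multiple amino acid residue indices, and then applies feature dimensionality reduction to optimize the dimension and reduce computation. The model uses the newly built membrane protein sequence dataset as the training set. Its overall prediction accuracy in the self-consistency test, the jackknife test, and the independent test set is 96.78%, 91.03%, and 86.93%, respectively, a general improvement over existing membrane protein classification prediction models. This lays a good foundation for further research on the membrane protein classification problem.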

【Abstract】 As one of the main components of biological membranes, membrane proteins play a vital role in organisms. They carry out most membrane functions and form the material basis on which cells perform their various functions. Moreover, recent studies indicate that changes in the structure or function of some membrane proteins are closely related to the onset of human diseases, and the corresponding receptor membrane proteins have become important targets for drug design. That is why this thesis focuses on membrane proteins.
The Human Genome Project (HGP), proposed in the early 1990s, has achieved a great deal through the joint efforts of scientists all over the world, and it has greatly advanced genomics and proteomics. With the massive growth of biological data, bioinformatics methods that rely on computer technology are going beyond what traditional approaches could do.
Predicting the type of a membrane protein from its primary sequence, in order to obtain related higher-order structural and functional information, is fundamental to the study of membrane protein structure and function. This important and challenging work also provides clues for solving specific biological problems, and it is the goal of this thesis.
The construction of the membrane protein dataset is the foundation and premise of the whole prediction model; its quality affects the accuracy of the algorithm, which makes it one of the key elements in membrane protein classification research. Feature extraction from membrane protein sequences is another basic technique in computational protein classification and a key factor in classification performance.
This thesis collects membrane protein sequences from the latest release of SWISS-PROT to build a newer, more comprehensive, and more balanced dataset, following the construction standards of the common datasets CE2059 and CE2625. Starting from the primary sequences of membrane proteins, it studies the classification of their structural and functional types, proposes a new feature extraction algorithm, and tests and analyzes that algorithm on the new dataset. The main work of the thesis is summarized as follows:
(1) Construction of a new membrane protein dataset. Dataset construction is one of the key elements in membrane protein classification research. The commonly used datasets CE2059 and CE2625 are based on SWISS-PROT Release 35 (1997). As the database develops, the number, scale, and annotations of membrane protein sequences are updated regularly, which makes a new dataset built from the latest data both significant and necessary. Following the established construction criteria of the standard datasets, this thesis builds a larger and more balanced dataset from SWISS-PROT Release 57.0 (2009), the latest release at the time, which prepares the ground for further study.
(2) A new feature extraction algorithm. Feature extraction is another key step in this field. To obtain a classification model with better prediction accuracy and to mine more of the structural and functional information contained in membrane protein sequences, this thesis takes into account the physicochemical properties of amino acid residues and the long-range correlations between them. It constructs a new membrane protein classification model that combines two feature classes, the amino acid composition (AAC) and features built from several residue indices in the amino acid index database, with a support vector machine (SVM).
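The two feature classes described above can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the abstract does not give the autocorrelation formula, the chosen residue indices, or the maximum lag, so the mean-centered autocovariance form, the use of the Kyte-Doolittle hydropathy scale as the single example index, and the names `amino_acid_composition`, `autocorrelation`, and `extract_features` are all assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale. It stands in for "several residue
# indices"; which indices the thesis actually uses is not stated here.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def clean(seq):
    """Uppercase the sequence and drop anything that is not a standard residue."""
    return [a for a in seq.upper() if a in AMINO_ACIDS]


def amino_acid_composition(seq):
    """Return the 20 residue frequencies (AAC) in AMINO_ACIDS order."""
    residues = clean(seq)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def autocorrelation(seq, index, max_lag):
    """Return one coefficient per lag 1..max_lag for one residue index.

    Assumed form: the mean-centered product of index values that are
    `lag` positions apart, averaged over all such pairs.
    """
    values = [index[a] for a in clean(seq)]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    coeffs = []
    for lag in range(1, max_lag + 1):
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        coeffs.append(total / (n - lag))
    return coeffs


def extract_features(seq, indices, max_lag):
    """Concatenate AAC with the autocorrelation coefficients of each index."""
    features = amino_acid_composition(seq)
    for index in indices:
        features.extend(autocorrelation(seq, index, max_lag))
    return features


if __name__ == "__main__":
    vector = extract_features("MKTLLVAGAVIALSGC", [KYTE_DOOLITTLE], max_lag=5)
    print(len(vector))  # 20 AAC values + 5 lags for the single index
```

With k indices and a maximum lag of L, the vector has 20 + k·L entries, which is why a dimensionality reduction step becomes attractive as more indices are added.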
Under three standard tests (self-consistency, jackknife, and independent dataset), the overall prediction accuracy on the new membrane protein dataset described above is 96.78%, 91.03%, and 86.93%, respectively. Compared with existing models, the proposed method achieves higher prediction accuracy.
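To make the difference between the first two tests concrete, the sketch below computes a self-consistency accuracy (train and test on the same samples) and a jackknife accuracy (leave each sample out in turn). It is an illustration only: the thesis uses an SVM, whereas this example swaps in a 1-nearest-neighbor classifier so that it runs with the standard library alone, and the toy data and function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (not the thesis's SVM): label of the closest sample."""
    best = min(range(len(train_x)),
               key=lambda i: squared_distance(train_x[i], query))
    return train_y[best]


def self_consistency_accuracy(xs, ys, predict):
    """Train on all samples, then predict those same samples."""
    hits = sum(predict(xs, ys, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def jackknife_accuracy(xs, ys, predict):
    """Leave each sample out once, train on the rest, predict the held-out one."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


if __name__ == "__main__":
    # Hypothetical one-dimensional toy data; the last point sits nearer
    # to the other class, so it is misclassified once it is held out.
    xs = [[0.0], [0.1], [1.0], [1.1], [0.45]]
    ys = ["a", "a", "b", "b", "b"]
    print(self_consistency_accuracy(xs, ys, nearest_neighbor_predict))
    print(jackknife_accuracy(xs, ys, nearest_neighbor_predict))
```

The self-consistency score is usually the most optimistic of the three because every test sample was seen during training, which matches the ordering of the accuracies reported above.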
