
蛋白质分类问题的特征提取算法研究

Research on Algorithm in Feature Extraction of Protein Classification

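【Author】 Zhang Zhenhui (张振慧)

【Advisor】 Wang Zhenghua (王正华)

【Author information】 National University of Defense Technology, Applied Mathematics, 2006, Ph.D.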

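【Abstract (translated from Chinese)】 The Human Genome Project has filled protein databases with a vast amount of sequence information, yet our understanding of the higher-order structure and function of proteins lags far behind. Given this wealth of sequence data, developing theoretical and computational methods for studying protein structure and function is of great importance and is one of the core problems of bioinformatics in the post-genomic era. Because protein structure and function are so complex, it is difficult to capture their overall characteristics and classify all proteins with any simple method. Protein research instead relies on many specialized classification schemes, each of which has considerable practical value in its own field. As a branch of proteomics research, protein classification has therefore attracted growing attention in recent years. It is the prerequisite and foundation for a full understanding of protein structure and function, and it plays a very important role in molecular biology, cell biology, pharmacology and medicine. Feature extraction from protein sequences is the most basic problem in computational protein classification and the key factor determining classification quality. This thesis analyzes and studies this problem in depth. For four basic types of problems in protein classification, it proposes and implements four different feature extraction algorithms, and tests, validates and compares them on standard datasets. The main work and contributions are summarized as follows:
(1) The structural class of a protein provides important information for predicting its spatial structure. For a protein of unknown structure, knowing the structural class accurately not only improves the accuracy of secondary structure classification but also greatly narrows the conformational search space in tertiary structure prediction. Structural class is also closely related to certain protein functions. Based on the concept of the measure of diversity, this thesis constructs a new feature extraction algorithm for protein sequences, the k-substring diversity source method. Combining the k-substring diversity source with the minimum increment of diversity algorithm, it builds a new protein structural class classification model, SS+Diver. The model starts from the protein sequence alone, requires no other information, is computationally simple, and achieves high accuracy. On the standard dataset T359, the overall accuracy of SS+Diver in the jackknife test reaches 97.49%, which is 1.67 to 56.27 percentage points higher than existing classification models. The experimental results show that, compared with existing models, SS+Diver adapts and generalizes well and can be extended to other applications.
(2) Quaternary structure extends the primary, secondary and tertiary structure of a protein. It refers to the types, number and spatial arrangement of the subunits in an oligomeric protein and the interactions among them. Oligomeric proteins take part in a wide range of life processes, including metabolism, signal transduction and chromosome replication, so studying their quaternary structure is of great biological significance. This thesis proposes three different composite feature extraction algorithms and uses the nearest neighbor algorithm to study two problems: classifying dimeric versus non-dimeric proteins, and classifying seven classes of homo-oligomeric proteins. The experimental results show that, among the three composite algorithms, the model based on DPC_ACF is computationally simple and classifies well. On the standard dataset RG1639, its overall accuracy in the jackknife test reaches 90.2%, which is 2.7 to 31.3 percentage points higher than existing classification models. On the standard dataset CC3174, its overall accuracy in the jackknife test reaches 91.18%, which is 12.68 to 22.78 percentage points higher than existing classification models.
(3) Apoptosis proteins play an important role in the growth, development and homeostasis of organisms, and they are essential to understanding the mechanism of programmed cell death. The subcellular location of an apoptosis protein is closely related to the function it performs in the cell. Based on the ideas of coarse-graining and grouping, this thesis proposes a new feature extraction algorithm for protein sequences, encoding based on grouped weight (EBGW). Combining it with the component-coupled algorithm, the nearest neighbor algorithm and the support vector machine, respectively, yields three classification models: EBGW+CCA, EBGW+NNA and EBGW+SVM. The experimental results show that, on the same dataset and with the same classification algorithm, EBGW takes several physicochemical properties of amino acids into account and reveals the structural and functional information hidden in the letter sequence more effectively than feature extraction algorithms such as amino acid composition and the instability index, while remaining computationally simple. Compared with existing work on the standard datasets, the EBGW+SVM model classifies well, with substantial improvements in overall accuracy, per-class sensitivity and Matthews correlation coefficient.
(4) Membrane proteins occupy an important position in the cell. Successful methods for distinguishing membrane proteins from non-membrane proteins already exist. Being able to predict theoretically the type of a membrane protein and the way it associates with the phospholipid bilayer would be of great value for understanding the function of newly sequenced membrane proteins. This thesis introduces the concept of the sub-alphabet and, building on it, proposes a sub-polypeptide composition feature extraction algorithm based on sub-alphabets. The method extracts the cellular feature information contained in protein sequences and effectively improves classification performance. It also greatly reduces computational complexity, addressing the limitation that traditional polypeptide composition methods extract features well but are computationally expensive and therefore restricted in application. On the standard dataset CE2059, the overall accuracy of the proposed AAC_S6P2-based model is 0.1% higher than that of the model based on combining amino acid composition with dipeptide composition, while its running time is only 11.75% of the latter's. Compared with existing classification models, its overall accuracy is 1.02 to 25.16 percentage points higher.
(5) Finally, this thesis gives a preliminary discussion of the relationship between the performance of a classification model and the characteristics of the dataset.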
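The abstract names the k-substring diversity source and the minimum increment of diversity algorithm in item (1) but does not define them. The sketch below is one plausible reading, not the thesis's implementation: it assumes the commonly used definitions D = N log N − Σ nᵢ log nᵢ for the measure of diversity and ID(A, B) = D(A + B) − D(A) − D(B) for the increment of diversity, and it assumes that the "diversity source" of a sequence is its table of overlapping k-substring counts. The function names and the toy class sources are hypothetical.

```python
import math
from collections import Counter


def k_substring_counts(sequence, k):
    """Count every overlapping substring of length k (assumed 'diversity source')."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def diversity(counts):
    """Measure of diversity: D = N*log(N) - sum(n_i*log(n_i)), with N = sum(n_i)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(counts_a, counts_b):
    """ID(A, B) = D(A + B) - D(A) - D(B); a smaller value means more similar sources."""
    merged = counts_a + counts_b  # Counter addition sums counts key by key
    return diversity(merged) - diversity(counts_a) - diversity(counts_b)


def classify(sequence, class_sources, k=2):
    """Assign the class whose pooled source gives the minimum increment of diversity."""
    query = k_substring_counts(sequence, k)
    return min(
        class_sources,
        key=lambda label: increment_of_diversity(query, class_sources[label]),
    )


# Toy, made-up class sources; real ones would pool counts over a training set.
toy_sources = {
    "alpha": k_substring_counts("AAAALLLLAAAA", 2),
    "beta": k_substring_counts("VTVTVTVTVT", 2),
}
predicted = classify("AALLAA", toy_sources)
```

Under these assumptions the increment of diversity between a source and itself is zero, and it grows as the two count tables share fewer k-substrings, which is why the minimum is taken over classes.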

【Abstract】 With the success of the Human Genome Project, a widening gap has opened between the rapidly growing number of known protein sequences and the slow accumulation of known protein structures and functions. Reliable theoretical and computational approaches for predicting protein structure and function from this vast body of sequences are urgently needed, and developing them is a core task of bioinformatics in the post-genomic era.
Because protein structures and functions are so diverse, it is difficult to capture their important features with any simple classification scheme. There are many specialized ways of grouping proteins, each of which has proved useful in some field. As a branch of proteomics research, protein classification has attracted increasing attention. Any new breakthrough in this research will deepen our understanding of protein structure and function. Moreover, protein classification plays an important role in molecular biology, cellular biochemistry, pharmacology, medicine and related fields.
Feature extraction from protein sequences is a basic problem in protein classification research and a key factor in classification performance. This thesis studies algorithms for this problem, proposes four new feature extraction algorithms for four basic types of protein classification problems, and tests and analyzes them on standard datasets. The main work and contributions of this thesis are as follows:
1. Protein structural class is very important for protein structure prediction. For a protein of unknown structure, knowing the structural class increases the accuracy of secondary structure prediction and reduces the complexity of tertiary structure prediction. Based on the concept of the measure of diversity, the k-substring diversity source is presented. Combined with the increment of diversity algorithm, this new feature extraction approach is applied to protein structural class prediction. For the dataset T359, the overall accuracy of the SS+Diver model in the jackknife test is 97.49%, which is 1.67 to 56.27 percentage points higher than that of other existing models.
2. To understand the structure and function of a protein, an important task is to identify the quaternary structure of a new polypeptide chain, i.e., whether it forms a monomer, a dimer, or some other oligomer. A computational method for correctly classifying the quaternary structure of proteins would therefore be valuable in interpreting the raw data produced by large-scale genome sequencing projects. Three different composite feature extraction methods are proposed and, combined with the nearest neighbor algorithm, applied to protein quaternary structure prediction. The results show that the model based on DPC_ACF performs better than those based on the other composite methods. For the dataset RG1639, the overall classification accuracy of DPC_ACF in the jackknife test is 90.2%, which is 2.7 to 31.3 percentage points higher than that of other existing models. For the dataset CC3174, the overall classification accuracy of DPC_ACF in the jackknife test is 91.18%, which is 12.68 to 22.78 percentage points higher than that of existing models.
3. Apoptosis proteins play an important role in the growth and homeostasis of organisms. Knowing the functions of these proteins helps to clarify the mechanism of programmed cell death, and knowing the subcellular location of an apoptosis protein is important for understanding its function. Based on the ideas of coarse-grained description and grouping, a new approach for encoding protein sequences, named encoding based on grouped weight (EBGW), is presented. Combining it with the component-coupled algorithm, the nearest neighbor algorithm and the support vector machine, respectively, yields three classification models (EBGW+CCA, EBGW+NNA and EBGW+SVM), which are applied to predicting the subcellular location of apoptosis proteins. Experiments show that, on the same dataset and with the same classification algorithm, EBGW extracts features more effectively than amino acid composition and the instability index. The overall classification accuracy, and the sensitivity and Matthews correlation coefficient of each class, obtained with the EBGW+SVM model are all higher than those of existing models.
4. Membrane proteins are very important in the cell and can be discriminated from non-membrane proteins relatively easily. Determining the functions of new membrane proteins can be expedited significantly if an effective algorithm is found for predicting their types. Based on the concept of the sub-alphabet, the sub-polypeptide composition of a protein sequence is presented. The new algorithm not only captures more cellular information from the protein sequence but also greatly reduces computational complexity. For the dataset CE2059, the overall classification accuracy of the model with sub-polypeptide composition is 0.1% higher than that of the model with traditional polypeptide composition, while its computation time is only 11.75% of the latter's. Compared with existing models, the overall classification accuracy in the jackknife test is 1.02 to 25.16 percentage points higher.
5. Finally, the relation between the performance of a classification model and the characteristics of the training dataset is briefly discussed.
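Item 4 attributes the speed-up to computing peptide composition over a sub-alphabet rather than over all 20 amino acids. The sketch below illustrates why that shrinks the feature space. It assumes that a sub-alphabet is a partition of the 20 residues into a few groups and that sub-dipeptide composition is the frequency of adjacent group pairs. The six-group partition shown is a hypothetical physicochemical grouping chosen for illustration, and the function names are invented here; the abstract gives neither the thesis's actual grouping nor its exact feature definition.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical six-group sub-alphabet (not taken from the thesis).
GROUPS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # glycine and proline
}
RESIDUE_TO_GROUP = {res: g for g, members in GROUPS.items() for res in members}
GROUP_PAIRS = ["".join(p) for p in product(sorted(GROUPS), repeat=2)]  # 6 * 6 = 36


def amino_acid_composition(sequence):
    """20-dimensional frequency vector over the standard amino acids."""
    residues = [r for r in sequence if r in RESIDUE_TO_GROUP]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def sub_dipeptide_composition(sequence):
    """36-dimensional frequency vector of adjacent group pairs.

    Non-standard residues are dropped before pairing.
    """
    reduced = [RESIDUE_TO_GROUP[r] for r in sequence if r in RESIDUE_TO_GROUP]
    counts = dict.fromkeys(GROUP_PAIRS, 0)
    for first, second in zip(reduced, reduced[1:]):
        counts[first + second] += 1
    total = max(len(reduced) - 1, 0)
    return [counts[p] / total if total else 0.0 for p in GROUP_PAIRS]


def combined_features(sequence):
    """Concatenate both vectors: 20 + 36 = 56 features instead of 20 + 400 = 420."""
    return amino_acid_composition(sequence) + sub_dipeptide_composition(sequence)


example = combined_features("MKTAYIAKQRQISFVKSHFSRQ")
```

With six groups, dipeptide composition needs 36 dimensions rather than 400, and longer peptides scale as 6^n rather than 20^n, which is consistent with the reduced computation time the abstract reports.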
