
The Application of Feature Extraction and Classification Algorithms to Membrane Protein Classification Prediction

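【Author】 Wang Lipeng (王立鹏)

【Advisors】 Yuan Zhanting (袁占亭); Chen Xuhui (陈旭辉)

【Author information】 Lanzhou University of Technology, Control Theory and Control Engineering, 2010, PhD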

【Abstract】 Genes are self-replicating, permanently preserved units whose physiological function is expressed in the form of proteins. About 30% of the proteins in a cell are membrane proteins. As one of the main components of biological membranes, membrane proteins carry out most membrane functions and play a vital role in living organisms. Given the enormous volume of membrane protein sequence data, determining membrane protein structural types with traditional molecular biology experiments is time-consuming and laborious, runs into difficulties that cannot currently be resolved, and no longer meets practical needs. Feature extraction and classification of membrane protein sequences is one of the most basic problems in membrane protein classification prediction and is the key factor that determines classification quality. This thesis takes the classification prediction of membrane protein sequences as its subject and studies feature selection and classification algorithms for these sequences. The main work and innovations are summarized as follows.

(1) Linear dimensionality reduction is applied to membrane protein classification prediction. Among feature extraction algorithms for membrane proteins, dipeptide composition (DC) has gradually been shown to be more effective than the conventional amino acid composition (AAC). Although DC can achieve high prediction accuracy, the feature vectors it extracts are generally of very high dimension. While they describe the sequence information comprehensively, they also bring the "curse of dimensionality", which makes the prediction system computationally expensive. To address this, DC is first used to extract high-dimensional feature vectors from membrane protein sequences; a linear dimensionality reduction method then performs a second extraction on the high-dimensional DC space to obtain the important low-dimensional features; finally, classification is carried out on the reduced features. The results show that prediction accuracy is higher with linear dimensionality reduction than without it, which demonstrates that the approach is feasible and effective for membrane protein type prediction, simplifies the prediction system, and improves prediction efficiency.

(2) Five new combined feature extraction algorithms based on dimensionality reduction are proposed. Two are based on linear dimensionality reduction: DC_PCA, which combines dipeptide composition with principal component analysis, and DC_LDA, which combines dipeptide composition with linear discriminant analysis. Experiments show that classification models built on these combined algorithms achieve higher prediction accuracy than the traditional DC-based and AAC-based models. To obtain models with better classification performance, and to better predict the structural and functional information contained in membrane protein sequences, three further algorithms based on nonlinear dimensionality reduction are constructed: DC_KPCA, which combines dipeptide composition with kernel principal component analysis; DC_KLDA, which combines dipeptide composition with kernel linear discriminant analysis; and DC_NPE, which combines dipeptide composition with neighborhood preserving embedding. Experiments show that models built on these algorithms also achieve higher prediction accuracy than the traditional DC-based and AAC-based models. A comparison of the five combined algorithms shows that the DC_KLDA-based model is the most accurate. On the benchmark dataset CE2059, its overall accuracy under the jackknife test reaches 92.71%, which is 15.1 to 30.59 percentage points higher than commonly used AAC-based models. On the benchmark dataset CE2625, its overall accuracy on the independent test set reaches 94.12%, which is 14.69 to 31.42 percentage points higher than commonly used AAC-based models.

(3) DNA microarray (gene chip) technology has fundamentally improved the methods and efficiency of biotechnology research and has had a significant impact on genomics and post-genomic research, but the massive amount of information it yields poses new challenges for data analysis and feature extraction. To address the difficulty of maintaining high classification accuracy and efficiency when the dimensionality of gene data rises sharply, this thesis proposes a classifier for microarray data, the dimension reduction proximal support vector machine (DRPSVM), built on the traditional proximal support vector machine (PSVM). DRPSVM uses a dimensionality-reduced quadratic programming algorithm. It reduces the classification of gene data to a quadratic programming problem with only linear equality constraints, maintains good classification accuracy relative to the traditional PSVM, and lowers the time and space complexity of classification.
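To make the dimensionality contrast between AAC and DC concrete, the following is a minimal sketch of both feature extractors under common textbook definitions. It is not taken from the thesis: the function names, the 20-letter residue alphabet, and the choice to skip pairs containing non-standard residues are all assumptions made for illustration. It shows that AAC yields a 20-dimensional vector while DC yields a 400-dimensional one, which is the high-dimensional space that the dimensionality reduction step is then applied to.

```python
from itertools import product

# Assumption: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All ordered residue pairs, giving 20 * 20 = 400 dipeptides.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return the 20-dimensional AAC vector (residue frequencies)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    total = len(residues)
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the 400-dimensional DC vector (adjacent-pair frequencies).

    Assumption: pairs that include a non-standard residue are skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    toy_sequence = "ACAC"  # hypothetical toy input, not real protein data
    aac = amino_acid_composition(toy_sequence)
    dc = dipeptide_composition(toy_sequence)
    print(len(aac), len(dc))
```

In a pipeline of the kind the abstract describes, the 400-dimensional DC vectors of all training sequences would be stacked into a matrix and passed to a reduction method such as PCA or LDA before classification; that step is omitted here because the abstract does not specify its parameters.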
