
Research on Feature Representation and Classification Algorithms for Protein Subcellular Localization

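【Author】 Shi Jianyu (施建宇)

【Advisor】 Pan Quan (潘泉)

【Author information】 Northwestern Polytechnical University, Pattern Recognition and Intelligent Systems, 2006, PhD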

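【Abstract, translated from Chinese】 Proteomics is an important research direction in the post-genomic era. It seeks to explain the roles that proteins play in the cell and to reveal the interactions among proteins in the cellular environment, together with their functions. Determining the subcellular localization of a protein is an important step in functional annotation, but determining it by biological experiment is slow and costly, so new and more effective methods are urgently needed. Building on modern statistical pattern recognition theory and methods, this thesis studies feature representation, classification algorithms, multi-class classification strategies, and the handling of imbalanced data in subcellular localization prediction. The main contributions are as follows:

1. A moment descriptor feature representation is proposed, and the classification performance of three multi-class strategies based on support vector machines is studied in terms of prediction accuracy, number of support vectors, and training and testing time. The feature representation analyzes the amino acid composition from a statistical point of view and introduces amino acid order and position information: the coordinate mean and coordinate variance of each amino acid represent the expected value and the dispersion of the positions at which it occurs in the protein sequence. Experiments on two typical databases show that the moment descriptor expresses the position distribution of the amino acid residues in a protein sequence more effectively.

2. An amino acid composition distribution feature representation is proposed, an index for measuring imbalance is given, the effect of dataset imbalance on support vector machine classification is studied, and a training method based on weighted penalty coefficients is proposed. The feature representation divides the protein sequence evenly into several segments and computes the amino acid composition of each subsequence. It therefore contains the amino acid content of every subsequence and also reflects the interactions among subsequences in the spatial structure. The experiments show that (1) with the amino acid composition distribution feature, the combined information of the local subsequences exceeds the information of the whole sequence, so the relationships among the subsequences of a protein are expressed more effectively; and (2) training with weighted penalty coefficients reduces the negative effect of data imbalance on classification.

3. To address the non-stationarity of protein physicochemical signals, a multi-scale energy feature representation based on amino acid residue indices is proposed. The method uses an amino acid residue index to map the symbolic protein sequence to a numerical signal, applies a wavelet transform based on multi-resolution analysis to decompose the signal with the Mallat pyramid algorithm, computes the root-mean-square energy of the signal at several scales, and expresses the subcellular localization feature information as a vector. The experiments show that the method expresses the characteristics of protein physicochemical signals more effectively and has lower computational complexity.

4. To address the inconsistency among multiple subcellular localization features and their high dimensionality, a prediction method based on a multiple classifier system is proposed. The method uses the multiple classifier system to aggregate several kinds of features, fuses complementary pattern information, reduces the uncertainty of individual classifiers, eases the difficulty of building a classifier model on high-dimensional features, and reduces the corresponding computational burden. The experiments show that the classifier system predicts better than any single classifier, and that the method is more effective and more robust than other methods.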
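The two sequence features described in contributions 1 and 2 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the 20-letter alphabet ordering, the normalization of positions by sequence length, and the default of two segments are all assumptions made for illustration, since the abstract does not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def moment_descriptor(seq):
    """Concatenate composition, position mean, and position variance (60 values).

    Positions are normalized to (0, 1] by dividing by the sequence length;
    this normalization is an assumption, not taken from the thesis.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    positions = {a: [] for a in AMINO_ACIDS}
    for i, residue in enumerate(seq, start=1):
        if residue in positions:  # non-standard residues are skipped
            positions[residue].append(i / n)
    composition, means, variances = [], [], []
    for a in AMINO_ACIDS:
        p = positions[a]
        composition.append(len(p) / n)
        mean = sum(p) / len(p) if p else 0.0
        var = sum((x - mean) ** 2 for x in p) / len(p) if p else 0.0
        means.append(mean)
        variances.append(var)
    return composition + means + variances


def composition_distribution(seq, segments=2):
    """Amino acid composition of each of `segments` near-equal pieces, concatenated."""
    n = len(seq)
    if n == 0 or segments < 1:
        raise ValueError("need a non-empty sequence and at least one segment")
    features = []
    for k in range(segments):
        piece = seq[k * n // segments:(k + 1) * n // segments]
        for a in AMINO_ACIDS:
            features.append(piece.count(a) / len(piece) if piece else 0.0)
    return features
```

For a sequence such as `"AACA"`, `moment_descriptor` returns 60 values (20 compositions, 20 position means, 20 position variances), and `composition_distribution(seq, segments=k)` returns 20 × k values. Either vector could then be passed to a classifier such as a support vector machine.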

【Abstract】 As one of the most important areas of the post-genome era, proteomics aims to understand the potential roles of proteins, elucidate their interactions in a cellular context, and make the corresponding functional annotations. Determining the subcellular location of a protein is essential to its functional annotation. However, determining protein subcellular localization by biological experiment is time-consuming and costly and can hardly meet demand, so more effective methods are needed. Based on the modern theories and methods of statistical pattern recognition, this thesis studies feature representation, classification algorithms, multi-class classification, and the processing of imbalanced datasets for the prediction of protein subcellular localization. The main contributions are as follows:

1. A feature representation, the moment descriptor (MD), is proposed, and the performance of three multi-class approaches for support vector machines (SVM) is analyzed in terms of recognition rate, number of support vectors, and training and testing time. From the viewpoint of statistical theory, the method analyzes the amino acid composition (AAC) and takes into account the position of each amino acid in the protein sequence. It then uses the amino acid coordinate mean (AAM) and coordinate variance (AAV) to represent, respectively, the expectation and the variance of an amino acid's position in a protein sequence. Experiments on two classical databases show that MD represents the positions of amino acid residues in a protein sequence more effectively.

2. A feature representation, the amino acid composition distribution (AACD), is proposed, and both an imbalance index and a training algorithm with weighted penalty coefficients are presented to analyze the prediction performance of SVM on imbalanced datasets. The method divides a protein sequence equally into multiple segments and then calculates the AAC of each segment in turn. In this way it captures the AAC of each segment and also reflects the interaction among segments. The experiments show that the information of all segments together is more useful than that of the whole sequence, and that AACD represents the interaction among the segments of a protein sequence effectively. In addition, the presented training algorithm reduces the negative effect of the imbalance.

3. A feature representation, multi-scale energy (MSE), is proposed to handle the non-stationarity of protein physicochemical signals. The method encodes a protein sequence as a digital signal by mapping each residue of the sequence to its numerical value in an amino acid index. Using a wavelet transform based on multi-resolution analysis, the mapped signal is decomposed with the Mallat decomposition algorithm. The root-mean-square energy at each scale is then calculated, and the values are joined into a feature vector that represents the approximation and detail information of the signal. The experiments show that MSE represents the physicochemical properties of proteins more effectively and has lower computational complexity than other methods.

4. Based on a multiple classifier system (MCS), a novel method for predicting protein subcellular localization is introduced to deal with high dimensionality and disagreement among multiple features. The method aggregates multiple groups of features, fuses the complementary information of patterns, and decreases the uncertainty of individual classifiers. Furthermore, it avoids the difficulty of designing a classifier for high-dimensional vectors and the heavy computational burden they cause. The experimental results show that the presented method is better than any individual classifier and is more effective and robust than other methods.
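The multi-scale energy feature of contribution 3 can be sketched in Python as follows. Several choices here are assumptions, because the abstract does not name them: the Kyte-Doolittle hydropathy scale stands in for the amino acid index, the Haar wavelet stands in for the wavelet family, odd-length signals are padded by repeating the last sample, and the function names and the default of three levels are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values, used here only as an example amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def haar_step(signal):
    """One level of Mallat decomposition with the Haar wavelet.

    Returns (approximation, detail), each about half the input length.
    """
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # assumed padding for odd lengths
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def rms(values):
    """Root-mean-square of a non-empty list of coefficients."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def multi_scale_energy(seq, levels=3, index=KYTE_DOOLITTLE):
    """RMS energy of the detail coefficients at each level, then of the final approximation."""
    signal = [index[r] for r in seq if r in index]  # residues missing from the index are skipped
    if len(signal) < 2:
        raise ValueError("need at least two mappable residues")
    features = []
    for _ in range(levels):
        if len(signal) < 2:
            break  # sequence too short for further decomposition
        signal, detail = haar_step(signal)
        features.append(rms(detail))
    features.append(rms(signal))
    return features
```

The feature vector has at most `levels + 1` entries regardless of sequence length, which is what makes a fixed-size classifier input possible for proteins of different lengths.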
