基于机器学习方法的蛋白质亚细胞定位预测研究

Research on Protein Subcellular Localization Prediction Based on Machine Learning Methods

【Author】 Ma Junwei

【Supervisor】 Gu Hong

【Author Information】 Dalian University of Technology, Control Theory and Control Engineering, 2011, PhD

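【Abstract (translated from Chinese)】 With the explosive growth of biological information, collecting and analyzing it by experiment alone falls far short of what research requires. It is now widely recognized that intelligent data processing techniques can greatly reduce the time and cost involved. Protein sequence information is one of the main research topics in this field. This thesis applies machine learning methods to protein subcellular localization prediction and protein structural class prediction. The main work is as follows: 1. For subcellular localization prediction of Gram-negative bacterial proteins, an improved selective ensemble of Elman neural networks is proposed. The Elman network serves as the base classifier; several different algorithms are then used to train the Elman networks, which increases the diversity of the base classifiers; finally, the GASEN algorithm selects suitable networks for the ensemble so that the selected networks complement and coordinate with one another. Protein sequences are represented by amino acid composition, and the method achieves good results under three validation schemes: the self-consistency test, leave-one-out validation, and the independent test set. 2. For protein subcellular localization prediction, a novel prediction system, ELM-PCA, is constructed. It can fix in advance the parameter of the traditional pseudo amino acid composition (PseAAC) model that reflects the sequence-order effect. In this system, the parameter λ is first set to its maximum so as to include as much sequence-order information as possible; principal component analysis (PCA) is then used to extract the key principal features; finally, an Elman neural network is used as the classifier. Experiments show that ELM-PCA outperforms existing prediction systems. In addition, combining PCA with the PseAAC model yields a new protein representation model, PPseAAC, which outperforms the original model in experiments with several commonly used machine learning algorithms. 3. For protein structural class prediction, an improved locally linear embedding (LLE) algorithm is proposed. It overcomes the singularity that often arises in traditional LLE when computing the optimal reconstruction weights. The improved algorithm is based on the conjugate gradient algorithm, converges in a finite number of steps, and involves no matrix inversion. On this basis, the improved LLE algorithm is applied to protein structural class prediction with a k-NN classifier, with the PseAAC parameter λ greater than the sequence length L. In the jackknife test, the results show that the method has good predictive performance.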

【Abstract】 With the explosive growth of biological information, experimental methods of collecting and analyzing it fall far short of the needs of actual research. It is now widely recognized that intelligent data processing techniques can greatly save time and cost. Protein sequence information is one of the research focuses in this field. This thesis employs machine learning methods to study protein subcellular localization prediction and protein structural class prediction. The main contributions are as follows: 1. An improved selective ensemble of Elman neural networks is proposed for Gram-negative bacterial protein subcellular localization prediction. First, the Elman network is used as the base classifier; second, several different algorithms are employed to train the Elman networks in order to increase the diversity of the base classifiers; last, the GASEN algorithm is used to select appropriate networks for the ensemble, so that the selected networks complement and coordinate with each other. Amino acid composition is employed to represent the protein sequence. Experimental results show that the method achieves good performance in the self-consistency test, the jackknife test and the independent data set test. 2. A novel prediction system, ELM-PCA, is designed for protein subcellular localization prediction. It can determine in advance the parameter value that reflects the protein sequence-order effect in the traditional pseudo amino acid composition (PseAAC). First, the parameter λ is set to its maximum so as to contain as much sequence-order information as possible. Second, principal component analysis (PCA) is employed to extract the essential features. Finally, the Elman network is used as the classifier. Experimental results show that the system performs better than existing systems. In addition, PCA and PseAAC are combined into a new protein representation model, PPseAAC. Experiments with several common machine learning algorithms show that the new model is superior to the original model. 3. An improved locally linear embedding (LLE) algorithm is proposed for protein structural class prediction. It overcomes the singularity that often arises when solving for the optimal reconstruction weights in the traditional LLE algorithm. The improved algorithm is based on the conjugate gradient algorithm, which converges in a finite number of steps and does not involve matrix inversion. Furthermore, the algorithm is applied to protein structural class prediction, where a simple k-NN classifier is used and the parameter λ of PseAAC is greater than the sequence length L. Experimental results show that the proposed method performs well in the jackknife test.
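The third contribution can be illustrated with a minimal sketch of how LLE reconstruction weights might be computed by conjugate gradient without inverting the local Gram matrix. This is not the thesis's algorithm, whose details the abstract does not give. It assumes one particular formulation: the sum-to-one constraint is removed by projecting onto zero-sum vectors, and conjugate gradient is run on the resulting consistent system. The names `local_gram` and `lle_weights` and the tolerance value are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def local_gram(x, neighbors):
    """Gram matrix C[j][l] = (x - n_j) . (x - n_l) of the neighbor offsets."""
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    return [[dot(dj, dl) for dl in diffs] for dj in diffs]


def project(v):
    """Remove the mean so that the components sum to zero."""
    mean = sum(v) / len(v)
    return [vi - mean for vi in v]


def lle_weights(x, neighbors, tol=1e-20):
    """Weights w with sum(w) == 1 that minimize |x - sum_j w_j * n_j|^2.

    Assumed formulation (not taken from the thesis): write w = u + y with
    u uniform and sum(y) == 0, then solve (P C P) y = -P C u by conjugate
    gradient, where P is the zero-sum projection. The system is consistent
    even when C is singular, and no matrix inverse is formed.
    """
    k = len(neighbors)
    c = local_gram(x, neighbors)
    uniform = [1.0 / k] * k
    rhs = [-g for g in project(matvec(c, uniform))]

    y = [0.0] * k
    r = rhs[:]          # residual for the starting point y = 0
    p = r[:]            # first search direction
    rs = dot(r, r)
    rs0 = rs
    for _ in range(k):  # at most k steps in exact arithmetic
        if rs <= tol * rs0:
            break
        mp = project(matvec(c, project(p)))
        curvature = dot(p, mp)
        if curvature <= 0.0:
            break       # guard against round-off in a flat direction
        alpha = rs / curvature
        y = [yi + alpha * pi for yi, pi in zip(y, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, mp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return [ui + yi for ui, yi in zip(uniform, y)]
```

With three neighbors in two dimensions the local Gram matrix has rank at most two, so a direct solve of the usual linear system would fail, while the iteration above still returns weights that sum to one.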
