The Study on Protein Classification Based on Pseudo Amino Acid and Function Domain Composition

【Author】 Wang Pu (王普)

【Advisor】 Xiao Xuan (肖绚)

【Author information】 Jingdezhen Ceramic Institute (景德镇陶瓷学院), Mechanical Design and Theory, 2009, Master's thesis

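【Chinese abstract】 Since the launch of the Human Genome Project, protein databases have accumulated a massive amount of sequence information, but the understanding of protein structure and function lags far behind. In this situation, exploring theoretical and computational methods becomes especially important, because such methods can play a major supporting role in understanding protein structure and function. Protein classification, as a branch of proteomics research, has received increasing attention from researchers in recent years. It is the prerequisite and foundation for a comprehensive understanding of protein structure and function, and it plays a very important role in cell biology, molecular biology, medicine, and pharmacology. In building a computational model for protein classification, the feature extraction algorithm is the most basic problem, and sometimes it is the deciding factor in classification quality. This thesis analyzes and studies this problem in detail, proposes a pseudo amino acid composition based on cellular automaton image parameters and a SMART functional domain representation, and validates them on standard datasets, where they greatly improve the prediction success rate. The main work and contributions of this thesis are summarized as follows: (1) Using a digital coding model of amino acids, the thesis generates cellular automaton images of protein sequences and proposes a pseudo amino acid composition representation based on image texture features. An augmented covariance algorithm is used to predict the secondary structural class of proteins, and the simulation results show good classification performance. (2) The thesis proposes a new hybrid representation of protein sequence features: SMART functional domain composition combined with pseudo amino acid composition. To understand the structure and function of a protein sequence, an important preliminary task is to identify the quaternary structure type of a new polypeptide chain. The thesis applies the nearest neighbor algorithm to the classification of seven classes of homo-oligomeric proteins. The experimental results show that the method is computationally simple and classifies well. In addition, the thesis extends quaternary structure classification of protein sequences by constructing a quaternary structure superfamily dataset and studying its classification with the functional domain and pseudo amino acid methods. (3) A two-level classifier for G protein-coupled receptors is designed, and the cellular automaton image texture features and functional domain distributions of the sequences are analyzed in some depth.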
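
The cellular automaton image mentioned in item (1) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the 5-bit code (simply the residue's index in an alphabetical amino acid string), the elementary rule number, the periodic boundary, and the function names `encode`, `evolve`, and `ca_image`. The abstract does not state the thesis's actual coding table or evolution rule.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical


def encode(sequence):
    """Map each residue to a hypothetical 5-bit code (its index in AMINO_ACIDS)."""
    bits = []
    for residue in sequence:
        index = AMINO_ACIDS.index(residue)
        bits.extend(int(b) for b in format(index, "05b"))
    return bits


def evolve(row, rule):
    """Apply one step of an elementary cellular automaton (periodic boundary)."""
    n = len(row)
    out = []
    for i in range(n):
        # Pack left, center, right cells into a 3-bit neighborhood index.
        neighborhood = (row[(i - 1) % n] << 2) | (row[i] << 1) | row[(i + 1) % n]
        # The rule number's bit at that index gives the next state of the cell.
        out.append((rule >> neighborhood) & 1)
    return out


def ca_image(sequence, steps=8, rule=90):
    """Stack the encoded sequence and its successive evolutions into a 0/1 image."""
    rows = [encode(sequence)]
    for _ in range(steps):
        rows.append(evolve(rows[-1], rule))
    return rows
```

Calling `ca_image("ACD", steps=3)` yields a 4-row by 15-column binary grid. Texture statistics computed over such a grid would then serve as the additional components of the feature vector.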

【Abstract】 Since the Human Genome Project (HGP) was launched, abundant sequence information has been stored in biological databases. However, the understanding of protein structure and function lags far behind. In this situation, it is very important to explore theoretical and computational approaches, which can support the prediction of protein structures and functions from the vast number of available sequences. In recent years, protein classification, as an important aspect of proteomics, has attracted more and more attention. Feature extraction from protein sequences is a basic problem in protein classification research, and it is sometimes the key factor in classification performance. This thesis studies this problem in depth and proposes several new feature extraction algorithms, such as characteristic parameters based on the cellular automaton image (CAI) and the SMART functional domain composition, which perform very well on some protein classification problems. The main work and contributions of this thesis are as follows: (1) Investigating the prediction of the secondary structural class of proteins. Based on the concept of the CAI, a new approach is presented. The jackknife cross-validation test demonstrated that the overall success rate of the new approach was significantly higher than those of the other approaches. (2) For protein quaternary structure prediction, two different composite feature extraction methods are proposed; combined with the nearest neighbor algorithm, they yield good results. (3) Designing a two-level classifier for G protein-coupled receptors (GPCRs), with a careful study of the CAI texture features and the SMART functional domain distributions of the different sequence classes.
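
The combination of a nearest neighbor classifier with the jackknife (leave-one-out) test described above can be sketched as follows. This is an illustrative sketch only: it uses the plain 20-component amino acid composition as the feature vector, not the pseudo amino acid or SMART domain features of the thesis, and the function names and the toy dataset in the usage note are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the 20-dimensional amino acid composition (residue frequencies)."""
    total = len(sequence)
    return [sequence.count(residue) / total for residue in AMINO_ACIDS]


def nearest_neighbor(query, samples):
    """Return the label of the sample closest to `query` (Euclidean distance).

    `samples` is a list of (feature_vector, label) pairs.
    """
    closest = min(samples, key=lambda sample: math.dist(query, sample[0]))
    return closest[1]


def jackknife_accuracy(dataset):
    """Leave-one-out success rate over a list of (sequence, label) pairs."""
    features = [(composition(seq), label) for seq, label in dataset]
    correct = 0
    for i, (vector, label) in enumerate(features):
        # Hold out sample i and classify it against all remaining samples.
        rest = features[:i] + features[i + 1:]
        if nearest_neighbor(vector, rest) == label:
            correct += 1
    return correct / len(features)
```

For example, `jackknife_accuracy([("AAAAAC", "a"), ("AAAACC", "a"), ("WWWWWY", "b"), ("WWWWYY", "b")])` holds out each toy sequence in turn and classifies it from the other three. Swapping `composition` for a richer feature extractor leaves the evaluation loop unchanged.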
