支持向量机及密码子偏性在序列识别中的应用

Support Vector Machine and Codon Usage for Sequence Recognition

【Author】 Zhou Tong (周童)

【Advisor】 Lu Zuhong (陆祖宏)

【Author information】 Southeast University, Biomedical Engineering, 2006, PhD

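【Chinese Abstract】 With the completion of the Human Genome Project and the genome projects for model organisms, the biological data in public databases are growing ever faster. How to interpret these massive data and extract useful biological information from them has become an urgent problem for the next stage of genome research. The aim of this work is to apply machine learning methods, combined with sequence features of nucleic acids or proteins, to problems in bioinformatics. The research falls into two parts: an analysis of synonymous codon usage bias in genes; and the recognition of biological sequences with support vector machines, using codon usage bias as the sequence feature. In the first part, we analyzed the codon usage patterns of influenza A viruses, chlamydiae and yeast, and explored the underlying factors that lead these organisms to adopt their respective patterns. Genomic base composition and translational selection pressure are regarded as the two main determinants of codon usage in a species. In the organisms we analyzed, however, these underlying factors are not all the same. Beyond the two main factors, we found that strand-specific base differences arising during DNA replication, the hydropathy of the encoded protein, the functional category of the gene, and the meiotic recombination rate of the region in which the gene lies can all influence synonymous codon usage bias. These exploratory studies are important for understanding the evolution of species and for guiding in vitro gene expression. We also found that codon usage bias differs among the segments of a gene, and we defined a corresponding statistic: codon segment usage bias. Computational analysis of yeast and coronavirus genes showed that codon usage bias near the start of the mRNA coding region differs from that of the whole sequence: rare codons occur near the coding start site more often than in other segments, which can be explained by the "minor codon modulator hypothesis". We also observed that minor codons occur relatively frequently near the coding end of coronavirus genes, and we speculate that this may be related to the regulation of gene expression. In the second part of the thesis, we used support vector machines together with synonymous codon usage bias to study several topical problems in bioinformatics. We used nucleotide sequence information for the first time to recognize the types of G-protein coupled receptors (earlier work relied mainly on amino acid sequence information) and obtained good prediction performance. We classified the ORF sequences located in meiotic recombination hotspots and coldspots of the yeast genome; the results show that codon usage bias is a good statistic for distinguishing hot from cold regions, and we also found differences in amino acid composition between the proteins encoded by hot and cold ORFs. We examined how well support vector machines with codon usage bias identify horizontally transferred genes in bacterial genomes, and we propose that genes on the leading and lagging strands of the recipient genome should be treated separately, which gives better predictions of horizontally transferred genes. In addition, using support vector machines with dinucleotide usage frequencies as the sequence feature, we predicted the degradation efficiency of interfering RNAs and obtained better prediction performance than the usual scoring algorithms based on sequence features.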
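One common way to quantify synonymous codon usage bias as a sequence feature is relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is illustrative only. RSCU is an assumed choice of measure, the function name `rscu` is hypothetical, and the abstract does not state which statistic the thesis actually uses or how its "codon segment usage bias" is defined.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return RSCU values for the codons of one coding sequence (DNA alphabet).

    Hypothetical helper: stop codons, single-codon amino acids (Met, Trp)
    and amino acids absent from the sequence are left out of the result.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families, one per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or len(family) == 1 or total == 0:
            continue
        for codon in family:
            # Observed count over the count expected under equal usage.
            values[codon] = counts[codon] * len(family) / total
    return values


if __name__ == "__main__":
    # Four alanine codons: GCT twice, GCC once, GCA once, GCG never.
    print(rscu("GCTGCTGCCGCA"))
```

A vector of such values, one entry per codon, is the kind of fixed-length feature a support vector machine classifier could take as input.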

【Abstract】 With the completion of the genome projects for human and several model organisms, the amount of biological data available in public databases is growing ever more rapidly. How to extract biological information from these raw data has become an urgent problem for genome research. In this thesis, synonymous codon usage of genes in influenza A viruses, chlamydiae and yeast is analyzed. Codon usage is found to be influenced by several factors. Although genomic base composition and gene expression level are thought to be the dominant factors affecting codon usage, other factors such as strand-specific mutational bias, the hydropathy level of the corresponding protein, gene function and meiotic recombination rate are also related to codon usage variation. Codon usage is assumed to vary among different regions of a given gene. The synonymous codon usage in the translational initiation and termination regions of genes in yeast and Coronavirus is analyzed. Minor codons are found to be used preferentially in the translational initiation region, which is thought to have a negative effect on gene expression and can be explained by the 'minor codon modulator hypothesis'. In addition, minor codons are observed to be used preferentially in the terminal regions of genes in Coronavirus, which may also regulate the level of gene expression. Based on the results of the codon usage analysis, the support vector machine (SVM) is applied to several topical problems in bioinformatics. First, nucleotide sequence information is used for the first time to recognize the family of G-protein coupled receptors, with high prediction accuracy. Second, a novel SVM method relying on codon composition differences is presented for classifying ORFs of Saccharomyces cerevisiae as hot or cold according to whether they lie in meiotic recombination hotspots or coldspots. Moreover, a considerable correlation is found between meiotic recombination rate and the composition of certain amino acid residues, which probably reflects structural and functional dissimilarity between the hot and cold groups. Third, the prediction of horizontally transferred genes is improved by an SVM-based algorithm that handles genes on the leading strand and the lagging strand separately. In addition, a small interfering RNA (siRNA) efficacy prediction algorithm is developed using SVM with dinucleotide composition as the sequence attribute. This algorithm performs better than several previously published methods.
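To make "dinucleotide composition as the sequence attribute" concrete, the following minimal sketch turns an RNA sequence into a 16-dimensional frequency vector of overlapping dinucleotides. It is an assumption about how such a feature could be built, not the thesis's implementation: the function name `dinucleotide_composition`, the use of overlapping windows and the normalization to frequencies are all illustrative choices.

```python
from itertools import product

# Fixed feature order: AA, AC, AG, AU, CA, ..., UU (16 dinucleotides).
DINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=2)]


def dinucleotide_composition(seq):
    """Return overlapping dinucleotide frequencies as a list of 16 floats.

    Hypothetical helper: DNA input is accepted by mapping T to U, and
    windows containing characters other than A, C, G, U are skipped.
    """
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


if __name__ == "__main__":
    vector = dinucleotide_composition("AUGGC")
    # Print only the dinucleotides that occur in the example sequence.
    for name, value in zip(DINUCLEOTIDES, vector):
        if value > 0:
            print(name, value)
```

Each siRNA would yield one such vector, which could then be passed, together with an efficacy label, to an SVM classifier of the reader's choice.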

  • 【Online publication contributor】 Southeast University
  • 【Online publication year/issue】 2007, Issue 04