
Studies on the Application of Maximum Information Principle, Energy and Selection Constraints to the Prediction and Analysis of Splice Sites in Genes

【Author】 晋宏营 (Jin Hongying)

【Advisor】 罗辽复 (Luo Liaofu)

【Author information】 Inner Mongolia University, Theoretical Physics, 2009, PhD

【Abstract】 Identifying all genes in a genome and clarifying their functions requires not only experimental approaches but also theoretical methods that can guide experiments. The maximum information principle (MIP) is a fundamental principle of non-equilibrium statistical theory. It provides a good model of the mutation-selection mechanism in biological evolution and can serve as an important basis for extracting information in bioinformatics. Prediction of complete gene structure is an important topic in current research, and a crucial step is the accurate identification of splice sites (both constitutive and alternative) and of the various alternative splicing events. Predicting the flanking competitors of known splice sites is the key to predicting alternative 5’ or alternative 3’ splice site events.

In this dissertation, the maximum information principle is applied to the theoretical analysis of the splicing reaction, and an expression for the reaction free energy of a splice site segment is derived. By introducing the concept of a selection pressure index and a corresponding constraint, an expression for the selection pressure index of k-mers in a sequence segment is derived. Applying the theory to the prediction of splice sites and their flanking competitors yields high prediction accuracy. The main contributions are as follows:

1. Starting from the basic physical principles of the splicing reaction, the traditional maximum information principle is used to analyze the conserved segments around splice sites. By introducing the concept of the reaction free energy of a splice site segment in the splicing reaction, together with a corresponding constraint, and assuming that the reaction free energy is additive, an expression for the reaction free energy of a splice site segment is derived. As a simplified model, this expression can be used to estimate the free energy change associated with a 5’ or 3’ splice site segment during the splicing reaction. When tested on splice site prediction, it gives high accuracy, which suggests that it reflects the actual splicing reaction reasonably well.

2. As a first step toward a theoretical estimate of the splicing reaction free energy, the accuracy of this expression still needs to be improved. The additivity assumption is therefore extended to include the dependencies among bases in a splice site segment, and the traditional maximum information principle is modified to include background probabilities. This leads to a more accurate expression for the reaction free energy that accounts for background probabilities and covers the dependencies among bases in the segment more completely. When this expression is used to predict splice sites, the accuracy is clearly higher than before the modification, indicating that the improved expression agrees better with the splicing reaction process.

3. The improved reaction free energy expression is used to predict alternative and constitutive splice sites and their flanking competitors in human and mouse genes. The results are good, with accuracy comparable to currently popular methods such as the maximum entropy model. For predicting the flanking competitors of known splice sites, the estimated reaction free energy of the candidate competitor segment itself gives higher accuracy than another measure, the difference between the estimated reaction free energies of the known splice site segment and the candidate competitor segment. This implies that, over a large number of splice sites taken together, free energy competition between a known splice site segment and its flanking competitor segment is not the sole primary factor determining alternative splice site selection.

4. To quantify the strength of natural selection acting on a sequence segment or on the k-mers within it, the concept of a selection pressure index and a corresponding constraint are introduced, and the maximum information principle is used to derive an expression for the selection pressure index of k-mers in a sequence segment. The expression is easy to relate to function and can therefore be used to estimate certain functional physical quantities; the method for estimating the splicing reaction free energy described above can also be placed within the framework of the selection pressure index theory. When the theory is applied to the prediction of constitutive and alternative splice sites in human and mouse, an integrated method that combines three measures by quadratic discriminant analysis (the estimated reaction free energy and the average selection pressure indexes of k-mers in the two flanking sequences) predicts clearly better than the reaction free energy measure alone.

5. Based on the information content of sequences, an information discrepancy index that can be used for coding region prediction is constructed; its predictive ability is comparable to that of the heterogeneity index. The selection acting on k-mers in the flanking sequences of splice sites is analyzed with the selection pressure index, leading to several conclusions of interest, for example that the GT dinucleotide on the left side of the 5’ splice site and AG on both the left and right sides of the 3’ splice site are under relatively strong negative selection. The selection acting on k-mers is also found to differ considerably between the left and right flanking sequences of a splice site, and two prediction measures are designed on the basis of this result. Seven measures, including the estimated reaction free energy, are integrated by quadratic discriminant analysis to predict the flanking competitors of known splice sites. The resulting accuracy is higher than that of the other methods reported in the literature and is the highest among flanking competitor prediction methods to date.
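The abstract does not give the derived free energy expression itself. As a rough illustration of the idea in contributions 1 and 2, the Python sketch below uses the standard additive log-odds form that follows from a maximum-entropy argument with an additive energy constraint and a background distribution: each position contributes -ln(p_i(b) / q(b)), in units of kT. This is an assumption made for illustration, not the dissertation's formula; it ignores the base-to-base dependencies described in contribution 2, and the function names, pseudocount, and toy sequences are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Estimate per-position base frequencies from aligned, equal-length segments.

    A pseudocount is added so that unseen bases never get probability zero.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def energy_score(segment, freqs, background):
    """Additive free-energy-like score in units of kT (assumed log-odds form).

    Lower (more negative) values mean the segment looks more like the
    training sites relative to the background composition.
    """
    return -sum(
        math.log(freqs[i][base] / background[base])
        for i, base in enumerate(segment)
    )


# Toy, made-up donor-like segments; real training data would be annotated sites.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
uniform_background = {b: 0.25 for b in BASES}
toy_freqs = position_frequencies(toy_sites)

for candidate in ("AGGTAAGT", "CCCCCCCC"):
    print(candidate, round(energy_score(candidate, toy_freqs, uniform_background), 3))
```

In this toy setting the consensus-like segment scores below zero and the unrelated segment scores above zero. A candidate site, or a candidate flanking competitor, would be ranked by this score, which matches the abstract's observation that the competitor segment's own score is the more useful measure.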

  • 【Online publication contributor】 Inner Mongolia University
  • 【Online publication year and issue】 2010, Issue 04