Research on Human RNA Polymerase II Promoter Recognition

Research on Human POL Ⅱ Promoter Recognizing

【Author】 Zhi Hui (智慧)

【Advisor】 Li Tonghua (李通化)

【Author information】 Tongji University, Analytical Chemistry, 2008, Master's thesis

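【Abstract (translated from the Chinese)】 Promoter recognition is an important part of gene recognition. Understanding promoter regions not only supports laboratory analysis, but also helps in understanding whole-genome function, the mechanisms that regulate gene expression, and the relationship between human disease and promoter polymorphisms or mutations. This thesis aims to recognize and classify human RNA polymerase (POL) II promoter data and to improve recognition accuracy. We apply novel encoding methods to human promoter sequences, build and use suitable consensus models, and classify the promoter data with support vector machines (SVM), improving the accuracy of promoter recognition.

First, we obtained the DNA promoter and non-promoter sequence data used for classification from the Eukaryotic Promoter Database (EPD) and from non-promoter databases. The positive and negative data sets were each divided into 5 parts and into 10 parts, for 5-fold and 10-fold cross-validation. In addition, we obtained experimentally determined human chromosome promoter data from the Database of Transcriptional Start Sites (DBTSS) for use in later work.

Next, after preprocessing the data (including ensuring that it was non-redundant), we encoded the base sequences and selected suitable parameters and encoding methods. This is the focus and the main difficulty of the study, and it falls into three steps according to the encoding used. In step one, we adopted a knowledge-based statistical encoding method and extended it into six sub-encodings: single-base statistical features, adjacent double-base statistical features, double-base statistical features one position apart, double-base statistical features two positions apart, double-base statistical features three positions apart, and adjacent triple-base statistical features. The encoded data were then classified with SVM. Under 10-fold cross-validation the accuracy reached 89.68%, with sensitivity between 86.24% and 90.11% and specificity between 85.91% and 98.35%, an improvement of about 5% over other SVM-based promoter recognition tools. In step two, we used CpG encoding and pentamer encoding to encode human RNA POL II promoter sequences from different perspectives, extract variable information, and identify the encoding with the best prediction results and the most sensible combination for the later work. In step three, we tried a new encoding method, Pattern Dictionary encoding (developed in our laboratory). To suit the characteristics of promoter data, the four bases A, T, C and G were combined in pairs and expanded into sixteen characters for encoding, which increases the number of feature variables.

Then, based on the recognition results of these encodings, and varying the encoding method, the sample selection, the kernel function and so on, we built consensus models from member sub-models of different types and performed recognition with a two-layer SVM. Because the consensus model takes into account both the independence of each sub-model and the differences between models, it exploits their complementary strengths and raises the final recognition accuracy.

Finally, we applied the best recognition models and the consensus-model approach to promoter data from human chromosome 22, where the recognition accuracy reached 90.98%.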
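The six sub-encodings of step one can be pictured as counting base tuples read at fixed offsets along a sequence. The following is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not say which statistics the knowledge-based encoding computes, so plain relative frequencies stand in for them here, and the names `gapped_kmer_frequencies`, `SUB_ENCODINGS` and `encode` are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def gapped_kmer_frequencies(seq, offsets):
    """Relative frequency of base tuples read at the given offsets.

    offsets=(0,) reads single bases, (0, 1) adjacent pairs,
    (0, 2) pairs one base apart, (0, 1, 2) adjacent triples.
    Plain frequencies are an assumption standing in for the
    statistical features used in the thesis.
    """
    seq = seq.upper()
    keys = ["".join(p) for p in product(BASES, repeat=len(offsets))]
    counts = dict.fromkeys(keys, 0)
    span = offsets[-1]
    total = 0
    for i in range(len(seq) - span):
        word = "".join(seq[i + o] for o in offsets)
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]


# The six sub-encodings named in the abstract, written as offset patterns.
SUB_ENCODINGS = {
    "single": (0,),
    "adjacent_pair": (0, 1),
    "pair_gap1": (0, 2),
    "pair_gap2": (0, 3),
    "pair_gap3": (0, 4),
    "adjacent_triple": (0, 1, 2),
}


def encode(seq):
    """Concatenate all six sub-encodings into one feature vector."""
    features = []
    for offsets in SUB_ENCODINGS.values():
        features.extend(gapped_kmer_frequencies(seq, offsets))
    return features
```

Under these assumptions each sequence becomes a vector of 4 + 4×16 + 64 = 132 values, which could then be passed to any SVM implementation.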

【Abstract】 Promoter recognition is an important part of gene recognition research. Knowledge of promoter regions not only supports laboratory analysis, but also helps in understanding whole-genome function, the mechanisms of gene expression and regulation, and the relationship between human diseases and promoter polymorphisms or mutations.

This thesis aims to recognize human RNA POL II promoters, classify promoter sequences, and improve the accuracy of the recognition results. We applied novel encoding methods to human promoter sequences, built suitable consensus models, and classified the sequences with a Support Vector Machine (SVM), which improved recognition accuracy.

Firstly, we obtained the promoter and non-promoter sequence data used in the recognition study from the Eukaryotic Promoter Database (EPD) and from non-promoter databases. Both the positive and the negative data were divided into 5 and into 10 parts, for 5-fold and 10-fold cross-validation. In addition, we obtained experimentally determined human chromosome promoter data from the DataBase of Transcriptional Start Sites (DBTSS) for use in the later work.

Secondly, we preprocessed the sequence data (including ensuring that it was non-redundant), encoded the sequences, and selected suitable parameters and encoding methods.
This part of the work is the focus and the main difficulty of the research, and we divided it into three steps. In step one, we applied a knowledge-based statistical encoding method, which was expanded into six sub-encodings: single-base statistical encoding, adjacent dual-base statistical encoding, one-base-apart dual-base statistical encoding, two-base-apart dual-base statistical encoding, three-base-apart dual-base statistical encoding, and adjacent triple-base statistical encoding. We then classified the data with SVM. The 10-fold cross-validation accuracy reached 89.68%, with sensitivities from 86.24% to 90.11% and specificities from 85.91% to 98.35%, roughly 5% higher than other SVM-based promoter recognition tools. In step two, we applied CpG island and pentamer encoding methods, which encode the promoter sequences from a different perspective, extracted the variable information, and selected the encoding method with the best recognition result for the later work. In step three, we tried the Pattern Dictionary encoding method and, to suit the characteristics of promoter sequence data, expanded the four bases into 16 symbols by combining any two of A, T, C and G, which increases the number of variables.

Thirdly, we built consensus models based on the results of the different encoding methods. By varying the encoding method, the sample selection method, the kernel function, etc., we built consensus models from different sub-models and performed recognition with a two-layer SVM.
This improved the final recognition accuracy, because the consensus models account for the independence of and the differences between the sub-models and exploit their complementary strengths.

Finally, we applied the best recognition model to promoter recognition on human chromosome 22, where the recognition accuracy reached 90.98%.
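The evaluation protocol described in the abstract (positive and negative sets split separately into folds, results reported as accuracy, sensitivity and specificity) can be sketched as follows. This is an illustrative sketch only: the function names, the shuffling seed and the label convention (1 = promoter, 0 = non-promoter) are assumptions, and the classifier itself is left out.

```python
import random


def split_folds(items, k, seed=0):
    """Shuffle items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_rounds(positives, negatives, k):
    """Yield (train, test) lists of (sequence, label) pairs, one per fold.

    Positives and negatives are split separately, so every fold
    contains both classes.
    """
    pos_folds = split_folds(positives, k)
    neg_folds = split_folds(negatives, k)
    for held_out in range(k):
        train, test = [], []
        for i in range(k):
            target = test if i == held_out else train
            target.extend((s, 1) for s in pos_folds[i])
            target.extend((s, 0) for s in neg_folds[i])
        yield train, test


def evaluate(true_labels, predicted_labels):
    """Accuracy, sensitivity and specificity for binary labels (1 = promoter)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```

Sensitivity here is the fraction of true promoters recognized as promoters, and specificity is the fraction of non-promoters correctly rejected, which is the usual reading of the ranges quoted above.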

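  • 【Online publication contributor】 Tongji University
  • 【Online publication year and issue】 2008, Issue 07
  • 【Classification code】 Q55
  • 【Times cited】 1
  • 【Downloads】 120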