Research on Distance Conservation of Transcription Regulatory Motifs and Prediction of Transcription Start Sites in Human Genome

【Author】 Lü Jun (吕军)

【Advisor】 Luo Liaofu (罗辽复)

【Author information】 Inner Mongolia University, Theoretical Physics, 2008, PhD

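【Summary】 Understanding the network of transcription-regulatory interactions in the human genome is an immediate challenge for modern molecular biology. A central problem is how to extract evolutionary information and search for evolutionary conservation by comparing the promoters of closely related species. By comparing how the nucleotide k-mers found in human transcription factor binding site (TFBS) sequences are distributed in human and mouse, we found that the average distance between a pair of transcription regulatory 7-mer motifs is conserved between human and mouse promoters. We call this "distance conservation". It is a new kind of evolutionary conservation that does not depend on the strict positioning of bases in the genome sequence. This k-mer distance conservation can be used to develop alignment-free methods for fast genome-wide discovery of transcription regulatory motifs. In this thesis we use distance conservation to search genome-wide for conserved transcription regulatory motifs, with a success rate of 90%. As a further test of distance conservation, we studied human tissue-specific transcription regulatory motif pairs and found that, in the two-dimensional space spanned by the distance parameters, the motif pairs can be clearly distinguished from their controls for 28 tissues. On this basis we built feature vectors from the distance parameters and used Fisher discriminant analysis to predict the most probable pairs among the top 140 transcription regulatory motif pairs of the 28 human tissues.

The other piece of work on transcription regulation in this thesis is the prediction of transcription start sites (TSS) in the human genome. Accurate identification of promoter sequences and transcription start sites is crucial for interpreting human transcription regulatory networks. With advances in statistical theory and the successful application of machine learning algorithms to prediction problems in bioinformatics, developing new and efficient theoretical prediction models to assist genome-scale annotation of transcription start sites has become one of the mainstream directions in bioinformatics. The UCSC (University of California Santa Cruz) genome browser site, for example, has accepted many gene prediction models as genome-scale auxiliary annotation tools. In this thesis we apply the Increment of Diversity with Quadratic Discriminant analysis (IDQD) method to predict transcription start sites in the human genome. On a typical TSS dataset with a positive-to-negative ratio of 1:58, both the sensitivity and the positive predictive value of our predictions exceed 65%. Evaluating the algorithm with ROC and PRC at positive-to-negative ratios of 1:679 and 1:113, auROC exceeds 96% in both cases and auPRC is 26% and 64%, respectively. In a whole-genome search of chromosomes 4, 21 and 22, we predicted the TSSs of single promoters and the first TSS at the 5' end of alternative promoters; at positive-to-negative ratios of 1:138 and 1:68, auROC is 93% and 97% and auPRC is 40% and 65%, respectively. Under the same evaluation criteria, these results are better than the most recently reported SVM prediction accuracy from abroad. Our results show that the IDQD method is capable of solving complex classification problems in bioinformatics. The IDQD program and the data related to human genome TSS prediction are available at http://jichubu.imut.edu.cn/IDQD/idqd.htm.

The thesis has five chapters. Chapters 1 to 3 discuss distance conservation, and Chapters 4 and 5 discuss the IDQD algorithm and its application to TSS prediction in the human genome. Chapter 1 introduces the concept of distance conservation. Chapter 2 applies it to build an alignment-free prediction model for transcription regulatory motifs, which gives the first test case of distance conservation. Chapter 3 applies it to predict human tissue-specific transcription regulatory motif pairs, which gives the second test case. Chapter 4 describes the IDQD algorithm in detail, and Chapter 5 applies it to predict transcription start sites in the human genome.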
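
The idea of comparing the average distance between two 7-mers in a human promoter and in its mouse counterpart can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy sequences, the choice of averaging over all occurrence pairs, and the normalized difference used as a divergence score. The thesis defines its own distance and divergence parameters, which are not given in this record.

```python
from itertools import product


def kmer_positions(seq, kmer):
    """Return the start index of every (possibly overlapping) occurrence of kmer."""
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        start = seq.find(kmer, start + 1)
    return positions


def mean_pair_distance(seq, kmer_a, kmer_b):
    """Average |pos_a - pos_b| over all occurrence pairs; None if either k-mer is absent."""
    pos_a = kmer_positions(seq, kmer_a)
    pos_b = kmer_positions(seq, kmer_b)
    if not pos_a or not pos_b:
        return None
    dists = [abs(a - b) for a, b in product(pos_a, pos_b)]
    return sum(dists) / len(dists)


def distance_divergence(d_human, d_mouse):
    """Hypothetical divergence score: normalized difference of the two mean distances."""
    total = d_human + d_mouse
    if total == 0:
        return 0.0
    return abs(d_human - d_mouse) / total


# Toy promoters (not real data): the same two 7-mers separated by spacers of different length.
human = "ACGTACG" + "T" * 10 + "GGGTTTA"
mouse = "ACGTACG" + "T" * 12 + "GGGTTTA"

d_h = mean_pair_distance(human, "ACGTACG", "GGGTTTA")
d_m = mean_pair_distance(mouse, "ACGTACG", "GGGTTTA")
print(d_h, d_m, distance_divergence(d_h, d_m))
```

A small score means the two species keep the motifs at a similar separation even though the bases in between differ, which is why no sequence alignment is needed for this kind of comparison.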

【Abstract】 Understanding the interaction network among transcription-regulation elements in human is an immediate challenge for modern molecular biology. A central problem is how to extract evolutionary information and search for evolutionary conservation by comparing the promoters of closely related species. Through comparative studies of the k-mer distribution in human and mouse transcription factor binding site (TFBS) sequences, we have discovered that the average distance between a pair of transcription regulatory 7-mer motifs is conserved between human and mouse promoters. This distance conservation is a new kind of evolutionary conservation that is not based on the strict location of bases in the genome sequence. The conservation of k-mer distance makes it possible to propose a non-alignment-based approach for fast genome-wide discovery of transcription regulatory motifs. We demonstrated distance conservation by a genome-wide search for conserved regulatory 7-mer motifs with a success rate of 90%. Then, after defining a human-mouse pair distance divergence parameter, we studied tissue-specific motif pairs and found that, for 28 tissues, the parameter for motif pairs is 11 to 16 times smaller than for their controls, and that these pairs can be clearly differentiated on a 2-dimensional parameter plane. Finally, the mechanism of distance conservation, which is supposed to be related to the module structure of TFBSs, is discussed briefly.

The accurate identification of promoter sequences and transcription start sites is a challenge in the construction of human transcription-regulation networks, and new methods are needed to improve prediction. We used the method of Increment of Diversity with Quadratic Discriminant analysis (IDQD) to predict transcription start sites (TSS). In prediction on a typical TSS set, both sensitivity and positive predictive value are higher than 65% at a positives/negatives ratio of 1:58. Performance was evaluated with Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) curves, which give an area under the ROC curve (auROC) higher than 96% and an area under the PRC (auPRC) of about 26% for a positives/negatives ratio of 1:679 and 64% for a positives/negatives ratio of 1:113. In whole-genome searching, we predicted TSSs of genes without and with alternative promoters in chromosomes 4, 21 and 22, and obtained auROC = 93% and auPRC = 40% for a positives/negatives ratio of 1:138, and auROC = 97% and auPRC = 65% for a positives/negatives ratio of 1:68. The work shows that the IDQD method is capable of solving complicated classification problems in bioinformatics. The implementation of the IDQD algorithm, the datasets and online-only supplementary data are available at the web site http://jichubu.imut.edu.cn/IDQD/idqd.htm.
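
The reason precision-based figures such as auPRC drop as the positives/negatives ratio becomes more skewed, while auROC stays high, follows from a standard identity: with sensitivity TPR, false positive rate FPR and r negatives per positive, PPV = TPR / (TPR + r * FPR). The Python sketch below works through that arithmetic. The function names are illustrative, and holding one operating point fixed across ratios is an assumption made only to show the effect; it is not how the reported auPRC values were obtained.

```python
def ppv(sensitivity, fpr, neg_per_pos):
    """Positive predictive value for one positive and neg_per_pos negatives."""
    true_pos = sensitivity          # expected true positives per positive example
    false_pos = fpr * neg_per_pos   # expected false positives per positive example
    return true_pos / (true_pos + false_pos)


def implied_fpr(sensitivity, ppv_value, neg_per_pos):
    """False positive rate implied by a given sensitivity, PPV and class ratio."""
    return sensitivity * (1.0 - ppv_value) / (ppv_value * neg_per_pos)


# Sensitivity 65% and PPV 65% at a 1:58 ratio imply a very small false positive rate.
fpr_58 = implied_fpr(0.65, 0.65, 58)
print(f"implied FPR at 1:58: {fpr_58:.4%}")

# Assumption for illustration: keep the same sensitivity and FPR and only change the ratio.
for ratio in (58, 113, 679):
    print(f"1:{ratio} -> PPV = {ppv(0.65, fpr_58, ratio):.3f}")
```

Because the false positive rate is multiplied by the number of negatives per positive, even a classifier with a tiny FPR loses precision quickly on heavily imbalanced data, which is why the ratio is quoted alongside every result.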

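  • 【Online publication contributor】 Inner Mongolia University
  • 【Online publication year/issue】 2009, Issue 02
  • 【Classification code】 Q987
  • 【Times cited】 2
  • 【Downloads】 244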