哺乳动物转录因子及其靶基因的挖掘分析

Data Mining for Mammalian Transcription Factors and Downstream Targets

【Author】 Zheng Guangyong (郑广勇)

【Advisor】 Zhu Yangyong (朱扬勇)

【Author information】 Fudan University, Bioinformatics, 2009, PhD

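【Abstract (Chinese original, translated)】 Transcription factors (TFs) are the core functional proteins of transcriptional regulation. They bind cis-regulatory elements (CREs) and tightly control the expression of downstream genes, and they play an indispensable role in many important biochemical processes of living organisms. Given this importance, identifying TFs and the downstream target genes they regulate has become one of the research hotspots of the post-genomic era. Traditionally, biologists have identified TFs and their downstream targets mainly through experimental methods. Experimental methods yield relatively accurate data, but their experimental cycles are long, so they cannot supply abundant transcriptional regulation data in a short time. In recent years, biologists have begun to introduce computational biology methods to speed up research on transcriptional regulation, with work concentrated on TF identification and on building CRE models. For TF identification, biologists mainly use machine learning algorithms to build identification tools. TF identification methods based on the BLAST algorithm and the nearest neighbor algorithm have been established, but they do not perform very well on mammals. For CREs, biologists have tried a variety of indicators to build models that characterize the recognition preference between TFs and CREs, but the binding rules are complex and are still being explored. This thesis uses protein domain and functional site information to form the feature vector of a protein sequence, and on that basis builds an automatic TF detector based on the support vector machine (SVM) algorithm. It then couples SVM with the error-correcting output coding (ECOC) algorithm to build an automatic TF classifier. The identification and classification tools were tested on experimentally verified data. The results show that the detector and classifier perform well, reaching accuracies of 88.22% for TF identification and 97.83% for TF classification. To evaluate the two tools further, they were compared with TF identification and classification tools built on BLAST and the nearest neighbor algorithm; the comparison shows that the detector and classifier identify and classify TFs better than those tools. The detector and classifier were then applied to the protein sequences in the human, mouse, and rat genomes, yielding a large number of putative TFs. Building on the TF identification work, and in order to obtain the downstream target genes of TFs, the idea of reverse engineering was introduced and a tool for mining TF-downstream regulated gene pairs was developed. The tool was then used to mine human, mouse, and rat gene expression data, yielding abundant information on the downstream target genes of TFs. Fisher's exact test was used to check the reliability of this information; the results show that, to a certain extent, the mining tool is effective and the target gene information obtained is credible. To study further how TFs regulate their downstream targets, a preliminary exploration of the binding rules between TFs and CREs was carried out: by integrating multiple biological indicators and applying the decision tree algorithm, a combined descriptive model of CREs was built. The combined model was tested on multiple sets of TF-CRE interaction data from the human, mouse, and rat genomes; the results show that it describes the recognition preference between TFs and CREs well, and thereby addresses the question of how the two bind. On the basis of this work, and to make the mined TF and target gene information convenient for biologists to use, a comprehensive mammalian TF analysis platform was constructed. The platform contains abundant transcriptional regulation data and provides a convenient online TF prediction tool. The platform will become an important resource for the transcriptional regulation field and will provide strong support for research in related areas. This thesis mined and analyzed mammalian TFs and their target genes, effectively addressing the current shortage of accumulated mammalian transcriptional regulation data. On that basis, it made a preliminary study of the binding rules between TFs and CREs, improving understanding of transcriptional regulation mechanisms at the molecular level. We believe that a panoramic study of TFs will help people interpret genomic information at the systems level.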
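
The abstract describes two ideas that a small sketch may make more concrete: encoding a protein as a feature vector of domains and functional sites, and error-correcting output coding (ECOC) for multi-class TF classification. The Python below is an illustration under stated assumptions, not the thesis implementation. The vocabulary entries, the class labels, and the 7-bit codebook are all hypothetical, and the bit predictions that the thesis obtains from binary SVMs are supplied by hand here so that only the encoding and decoding steps are shown.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Sequence

# Hypothetical vocabulary of protein-domain / functional-site identifiers.
VOCABULARY: List[str] = ["homeobox", "zf-C2H2", "bHLH", "bZIP", "kinase"]


def encode(annotations: Iterable[str],
           vocabulary: Sequence[str] = VOCABULARY) -> List[int]:
    """Return a 0/1 presence vector over the vocabulary for one protein."""
    present = set(annotations)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical ECOC codebook: 4 classes, 7 bits per codeword.
# Every pair of codewords differs in 4 positions, so a single wrong
# bit from one binary learner still decodes to the correct class.
CODEBOOK: Dict[str, List[int]] = {
    "class_1": [1, 1, 1, 1, 1, 1, 1],
    "class_2": [0, 0, 0, 0, 1, 1, 1],
    "class_3": [0, 0, 1, 1, 0, 0, 1],
    "class_4": [0, 1, 0, 1, 0, 1, 0],
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits: Sequence[int],
           codebook: Dict[str, List[int]] = CODEBOOK) -> str:
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))


def min_distance(codebook: Dict[str, List[int]] = CODEBOOK) -> int:
    """Smallest pairwise Hamming distance between codewords."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))


if __name__ == "__main__":
    # Feature vector for a protein annotated with two vocabulary terms.
    print(encode(["zf-C2H2", "kinase"]))
    # Codeword of class_3 with its first bit flipped by a wrong learner.
    print(decode([1, 0, 1, 1, 0, 0, 1]))
```

In a full pipeline each of the seven bit positions would be predicted by one binary classifier trained on the encoded feature vectors; the decoding step then tolerates an error in any single classifier.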

【Abstract】 Transcription factors (TFs) are core functional proteins of transcriptional regulation. They control the expression level of downstream genes (TF targets) by interacting with cis-regulatory elements (CREs), and they play significant roles in many vital biological processes of an organism. Because of their importance to transcription, the investigation of TFs and their targets has become an active research area in the post-genome era.

Traditionally, biologists have investigated TFs and their targets with experimental approaches. These approaches yield accurate information about transcriptional regulation, but they are time-consuming and cannot provide abundant information in a short time. Biologists have therefore recently begun to explore transcriptional regulation with computational methods, and most of this work focuses on TF identification and CRE modeling. For TF identification, machine learning algorithms are generally used to build analysis tools. Identification methods based on BLAST and the nearest neighbour algorithm (NNA) have been built, but their performance is unsatisfactory when applied to mammals. For CRE modeling, biologists try to describe the binding preference between TFs and CREs by constructing models with various features. Nevertheless, CRE modeling is still ongoing because the interaction mechanism between TFs and CREs is complicated.

In our work, the support vector machine (SVM) algorithm was used to construct an automatic detector for TF identification, with protein domains and functional sites employed as feature vectors. A TF classifier was then built by combining the error-correcting output coding (ECOC) algorithm with the SVM methodology. Datasets validated by biological experiments were used to test the performance of the detector and classifier. The test results demonstrated that the two tools had excellent capability for TF analysis: the overall success rates of TF identification and classification reached 88.22% and 97.83%, respectively. To evaluate these tools further, we compared them with tools built from BLAST and NNA. The comparison showed that our tools were superior to the BLAST and NNA tools for TF analysis. The detector and classifier were then used to analyse the protein sequences of human, mouse, and rat, and a large number of putative TFs were obtained.

Subsequently, a mining tool for TF-target pairs was developed based on reverse engineering theory, so as to obtain the genes regulated by TFs. The mining tool was used to analyse microarray data of human, mouse, and rat, and many TF-target pairs were obtained. Fisher's exact test was carried out to assess the reliability of these TF-target pairs. The results indicated that the approach used to predict TF-target pairs was valid, and that the inferred information on the downstream genes of TFs was believable to some extent.

To explore further the regulatory relationship between TFs and their targets, we investigated the interaction mechanism between TFs and CREs. A combinational model of CREs was constructed with a decision tree by assembling several biological features. Many TF-CRE interaction pairs from human, mouse, and rat were then used to estimate the performance of the combinational model. The results showed that the model describes the binding preference and interaction mechanism between TFs and CREs well.

Finally, an integrated TF platform was built so that biologists can conveniently use the information on TFs and their targets acquired in our work. The platform contains abundant transcriptional regulation data and also provides a prediction tool for TFs. We believe that the platform will serve as an important resource for the community of transcription researchers and will provide strong support for the exploration of transcriptional regulation.

Currently, data on transcriptional regulation in mammals are far from sufficient. To address this problem, we mined and presented a great deal of information about TFs and their targets in human, mouse, and rat. Moreover, we investigated the binding characteristics between TFs and CREs, which will increase knowledge of the mechanism of transcriptional regulation. In summary, we think this comprehensive study of TFs will help people interpret genome information at the systems level.
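
The abstract reports that Fisher's exact test was used to assess the reliability of predicted TF-target pairs, without giving the contingency table. As a hedged illustration, the Python sketch below computes a one-sided Fisher p-value for a 2x2 table from the hypergeometric distribution. The row and column meanings (predicted target or not, independent supporting evidence or not) and the counts in the usage example are assumptions made for this sketch, not values from the thesis.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the null hypothesis of independence, where X
    is the top-left cell and all row and column totals are held fixed.

    Assumed layout (hypothetical, for illustration only):
        rows    = gene predicted as a target of the TF / not predicted
        columns = gene has independent supporting evidence / has none
    """
    row1 = a + b          # number of predicted targets
    col1 = a + c          # number of genes with supporting evidence
    n = a + b + c + d     # all genes considered
    total = comb(n, col1)  # ways to choose which genes have evidence
    upper = min(row1, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as
    # the observed one in the direction of enrichment.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 12 of 20 predicted targets have supporting
    # evidence, against 30 of 180 genes that were not predicted.
    p_value = fisher_exact_greater(12, 8, 30, 150)
    print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value under this layout would indicate that predicted targets are enriched for supporting evidence beyond what chance overlap explains. `math.comb` requires Python 3.8 or later.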

  • 【Online publication contributor】 Fudan University
  • 【Online publication issue】 2010, Issue 02