Transcriptional Regulatory Relationship Mining Based on Microarray Data and Sequence Features (original title: 基于基因表达谱及序列特征的转录调控关系挖掘)
【Author】 Liu Wanlin (刘万霖);
【Author information】 Academy of Military Medical Sciences of the Chinese People's Liberation Army, Biochemistry and Molecular Biology, 2010, PhD
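【Abstract (translated from the Chinese 摘要)】 Transcriptional regulation of genes is widespread in living organisms and is essential for normal physiological function. Studying gene regulatory networks deepens our understanding of the characteristics of biological systems. Advances in experimental technology, especially the broad adoption of high-throughput experiments such as microarrays, have produced biological data on a massive scale. Bioinformatic methods that mine transcriptional regulatory relationships from microarray and other high-throughput data have therefore attracted growing attention in the academic community. However, current bioinformatic studies of transcriptional regulation still have several problems. A considerable share of the work centers on one specific physiological or pathological question, designs targeted wet-lab experiments, and then mines the resulting data; such methods generalize poorly. Other, more macroscopic studies tend to introduce complex models without further extracting the features or properties of the expression profile data itself. Some studies analyze the sequence features of the elements of a transcriptional regulatory relationship but consider only known patterns or feature structures, which biases the data mining. Still others use regulatory-relationship strengths derived from microarray or ChIP-chip data, which carry large errors because of the limited sensitivity of these data. To address these problems, this thesis studies methods for mining transcriptional regulatory relationships based on gene microarray expression profiles and sequence features, and obtains the following results.

First, we construct a new parameter system for mining regulatory relationships from microarray expression profiles. Starting from the gene expression levels that microarrays report, we introduce and propose several parameters and methods, including expression profile correlation, dynamic range of variation, and expression-level vectors. These describe properties of the elements of a regulatory relationship such as the similarity of their expression levels, differences in dynamic range, differences in statistical properties, and the consistency of expression levels across conditions, and they are used to analyze transcriptional regulatory relationships. We combine them with a functional co-annotation analysis of transcription factors and target genes, which measures the functional consistency of the elements and improves the accuracy of model predictions. On this basis, we use a Bayesian model to integrate several groups of parameters and obtain the probability that a transcriptional regulatory relationship exists. To strengthen the power and credibility of the predictions, we also propose a joint likelihood ratio to describe the properties of paired parameters. Using the time delay of perturbations reflected in time-series microarray data, we select suitable parameters to help determine the direction of a regulatory relationship. This yields complete regulatory relationships and lays a foundation for accurately constructing gene regulatory networks.

Second, we propose an unsupervised machine learning and optimization method for microarray expression profile features. Parametric learning does yield intuitive parameters that ease subsequent analysis. But extracting information from high-dimensional microarray data through parameterization can lose information or introduce preconceived bias. In addition, the heavy noise in microarray data adversely affects the mining of transcriptional regulatory relationships. We therefore replace empirical parameter selection with unsupervised dimensionality-reduction algorithms, which extract representative expression information and exclude interfering information. We define an expression pattern parameter for regulatory relationship pairs and extract the main features of expression levels through non-negative matrix factorization and principal component analysis, which improves the accuracy of regulatory relationship prediction.

Third, we propose an unbiased method for extracting the sequence features of the elements of a regulatory relationship. Because of the limits of microarray expression profiling, regulatory relationships that involve genes whose expression varies little across conditions or over time are hard to recover from microarray data. Examining the sequence features of these elements is therefore necessary. We use features of amino acid sequences, combined with mathematical dimensionality-reduction algorithms, to extract the sequence features of the elements. Drawing on prior knowledge, we train model parameters with machine learning methods and propose an unbiased extraction method for finding the characteristic sequences of the elements. We also use spatial vectors as the mathematical representation of characteristic sequences and build a suitable model that links sequence features to the presence or absence of a regulatory relationship. The results show that mining transcriptional regulatory relationships from sequence is feasible. Further analysis shows that different feature selection methods and clustering methods have little effect on the results, and that further improving the feature extraction method yields better prediction accuracy. In sum, a vector space model built from sequence information can predict the existence of transcriptional regulatory relationships fairly effectively. The method is both important and feasible, and it can complement and cross-check microarray-based methods.

Unlike other methods that construct gene regulatory networks by computing globally over microarray expression profiles, this thesis searches for multiple parameters, supplemented by other biological knowledge, to mine the connection between the elements of a regulatory relationship and their expression profiles, and so constructs a relatively fine-grained and accurate gene regulatory network. Together with unbiased sequence feature extraction for transcription factors and target genes, it develops a new method for predicting regulatory relationships from sequence features. Finally, it establishes a comprehensive set of methods that combines different data sources and multiple strategies for mining transcriptional regulatory relationships. These methods can, to some extent, avoid or reduce the shortcomings of existing methods and improve the sensitivity and coverage of the mining, thereby advancing our understanding of biochemical networks, represented by gene regulatory networks, and of biological systems as a whole. The parts of the thesis build on and support one another.

The main innovations of this thesis are: the construction of a new parameter system for mining transcriptional regulatory relationships from microarray expression profiles; unsupervised machine learning and optimization of microarray expression profile features; and unbiased extraction of the sequence features of the elements of a regulatory relationship. These lines of research support and complement one another in predicting and mining transcriptional regulatory relationships. In addition, as methodological research, this work is highly general and extensible. Genetic testing for disease is increasingly a research focus, and at present microarrays are the analysis tool best suited to this field. The fast, parametric expression profile analysis system established here will therefore help genotype research and analysis that uses microarrays in clinical diagnosis.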
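The Bayesian integration step described above can be pictured with a minimal sketch. It assumes that each parameter (for example expression correlation or functional co-annotation) has already been converted into a likelihood ratio and that the parameters are treated as conditionally independent, as in a naive Bayes model. The function name `posterior_probability`, the prior, and the example likelihood ratios are illustrative assumptions and are not taken from the thesis.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine evidence for one candidate regulatory pair (naive Bayes).

    prior: assumed prior probability that the pair is a true regulatory
        relationship (hypothetical value supplied by the caller).
    likelihood_ratios: one ratio per evidence source,
        P(evidence | regulation) / P(evidence | no regulation).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Work in log-odds so that multiplying many ratios stays numerically tame.
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    # Convert posterior log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical pair: 1% prior, two evidence sources with ratios 10 and 5.
    print(posterior_probability(0.01, [10.0, 5.0]))
```

Under the independence assumption the posterior odds are simply the prior odds multiplied by every likelihood ratio. The joint likelihood ratio mentioned in the abstract would instead assign a single ratio to a pair of parameters considered together, which this sketch does not attempt to reproduce.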
【Abstract】 The properties of a biological system include system structures, system dynamics, control methods, and design methods. Biological systems can be depicted as various biological networks, such as metabolic networks, signal transduction networks, and regulatory networks.

As a basic process of biological activity, gene regulation plays a dominant role in the biological system. By analyzing gene regulation with experimental and bioinformatic methods, we can extract the structural features of a biological system. Through gene regulatory network analysis we can also identify complex regulatory relationships, uncover the regulatory patterns in the cell, and gain a systematic view of biological processes.

With the deeper development and broader application of high-throughput techniques in life science research, microarray data are emerging massively and rapidly, which has made gene regulatory network reconstruction a research hotspot.

Many algorithms have been developed to construct gene regulatory networks from microarray data. Unfortunately, most of this work focuses on a specific biological or pathological problem and mines data from purpose-designed wet-lab experiments. Moreover, most models do not produce intuitive parameters. An open question is whether simple but potentially informative basic characteristics of microarray data remain to be uncovered.

To overcome these shortcomings, we integrated multiple parameters that characterize expression profile features and combined them with other biological evidence. We also extracted sequence features of regulatory elements without relying on prior knowledge. Combining these different lines of evidence, we developed a new approach to predicting regulatory relationships. Our research is based on the model organism Saccharomyces cerevisiae. The first step is to select features that measure expression profiles. We then extract sequence features of the regulatory elements. Finally, we construct a comprehensive method for inferring gene regulatory relationships, which expands our knowledge of biological systems.

Based on expression correlation, expression level variation, and vectors derived from microarray datasets, we first introduced several novel parameters to describe the characteristics of regulating gene pairs. We then used a naïve Bayesian network model to integrate these features with the functional co-annotation between transcription factors and their target genes. This model proved more effective than the earlier single-feature models. With this model, and using the time-delay character of time-series microarray datasets, we can predict the existence and the direction of regulatory relationships and assess the accuracy and coverage of each. This helps to build an integrated prediction and evaluation system.

The parametric approach has both pros and cons. A series of parameters can serve as intuitive indexes. However, information extraction may lose information or be misleading, and the noise in microarray data may disturb the results. We therefore chose a machine learning approach instead of manual selection. We introduced an expression pattern index, FAB. With this index, we extracted the main features of expression level and excluded interfering elements via Principal Component Analysis. This approach proved able to improve the accuracy of regulatory relationship prediction.

Not all essential genes can be detected by knock-out or knock-down experiments, because of expression diversity. In such cases, sequence feature analysis should be considered. We used a dimension-reduction algorithm to extract sequence features of the regulatory elements. With the help of prior knowledge, we adopted a support vector machine-based method to find the sequence features of regulatory elements. The results show that mining regulatory relationships from sequence features is feasible. The accuracy remains stable when the clustering method and the clustering features are changed, and the parameters extracted by tensor analysis have also been verified to be acceptable. This approach may be a suitable complement to the microarray-based approach.

Unlike other methods that compute over global expression profiles, our approach is mainly based on several novel parameters that can serve as intuitive indicators. Combined with some prior knowledge, it can improve the accuracy of regulatory relationship mining. The feature selection results for regulatory elements show the advantages of using sequence features to mine regulatory relationships.

To summarize, we first proposed a novel parametric approach to inferring gene regulatory relationships from microarray datasets. We then used machine learning to extract expression features and mine regulatory relationships. Finally, we developed a new strategy for mining gene regulatory relationships based on sequence feature analysis, which can greatly improve the sensitivity and coverage of transcriptional regulatory mining.

As microarray technology develops, our approaches promise to contribute further to regulatory network research as well as to genotype analysis in clinical diagnosis.
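The idea of using the time delay in time-series data to suggest the direction of a regulatory relationship can be illustrated with a small sketch. This is not the thesis's procedure: it assumes, purely for illustration, that direction is guessed by comparing lagged Pearson correlations in both directions, and the function names (`pearson`, `lagged_correlation`, `guess_direction`), the lag of one time point, and the example profiles are all hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag]."""
    if len(leader) != len(follower):
        raise ValueError("profiles must have the same length")
    if lag < 0 or lag > len(leader) - 2:
        raise ValueError("lag must leave at least two overlapping time points")
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def guess_direction(a, b, lag=1):
    """Return 'a->b', 'b->a', or 'undetermined' from lagged correlations.

    Absolute values are compared so that a delayed repressor (negative
    correlation) counts as evidence just like a delayed activator.
    """
    a_leads = abs(lagged_correlation(a, b, lag))
    b_leads = abs(lagged_correlation(b, a, lag))
    if a_leads > b_leads:
        return "a->b"
    if b_leads > a_leads:
        return "b->a"
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical profiles: the second gene repeats the first one step later.
    regulator = [0, 0, 1, 3, 2, 0, 0, 0]
    target = [0, 0, 0, 1, 3, 2, 0, 0]
    print(guess_direction(regulator, target))
```

A lagged correlation of this kind only suggests which profile changes first. Real time-series microarray data are short and noisy, so a result like this would serve as one piece of supporting evidence alongside the other parameters rather than as a decision on its own.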
【Key words】 Bioinformatics; Gene regulation; Data mining; Microarray;
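- 【Online publication contributor】 Academy of Military Medical Sciences of the Chinese People's Liberation Army 【Online publication year and issue】 2011, Issue 03
- 【Classification number】 Q75
- 【Download count】 561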