
基于计算智能的基因调控网络建模研究

Research on Computational Intelligence Method of Gene Regulatory Network Modeling

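【Author】 Yang Bin (杨斌)

【Advisors】 Jiang Mingyan (江铭炎); Chen Yuehui (陈月辉)

【Author information】 Shandong University, Signal and Information Processing, 2014, Ph.D.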

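【Chinese abstract】 With the completion of genome sequencing, studying the function of individual genes or proteins can no longer fundamentally reveal the laws governing the occurrence and development of life phenomena, so over the past decade systems biology has gradually become a focus of attention among the many branches of biology. Systems biology is a rapidly developing, emerging interdisciplinary field that combines knowledge and skills from biology, chemistry, physics, mathematics, and computer science, with the aim of studying the physiological mechanisms of biological systems from a systematic, global perspective. Modeling gene regulatory networks from gene expression data has in recent years become one of the effective approaches in systems biology. Accurate construction of gene regulatory networks greatly affects the precision of disease treatment and has far-reaching implications for understanding cellular activity and the functional mechanisms of disease-causing genes, as well as for the prevention, diagnosis, and treatment of complex diseases. Although research in China and abroad has produced some results, a gene regulatory network is a large and complex system that is strongly coupled, stochastic, time-varying, and strongly nonlinear. Existing methods are relatively simple, cannot accurately identify transcriptional regulatory relationships between genes, and yield too many false-positive relationships. How to build accurate gene regulatory models effectively is a current research focus.

This dissertation uses computational intelligence methods to mine gene expression data, reconstruct gene regulatory networks, and model the biochemical reactions involved in gene regulation, and applies these methods to microarray-derived gene expression profiles of coronary atherosclerotic plaque. The main work and contributions are as follows.

1. Modeling from the macroscopic perspective. Because existing models construct gene regulatory networks inaccurately, this dissertation proposes using the flexible neural tree (FNT) model to construct gene regulatory networks and to predict time series derived from gene expression profiles. A structure-evolution algorithm similar to genetic programming optimizes the hierarchical structure of the FNT model, and the parameters encoded in the structure are optimized by simulated annealing. The two optimization algorithms are applied alternately, and the loop ends when a satisfactory solution is found or the iteration limit is reached. To improve the accuracy of the constructed networks, the model selection criterion AIC and majority voting are used to identify the minimal set of regulatory genes for each target gene. Experimental results show that, compared with the Elman neural network, fuzzy neural network, radial basis function neural network, recurrent neural network, recurrent fuzzy neural network, and ensembles of these models, the FNT model predicts gene expression time series more accurately and constructs more accurate gene regulatory networks.

Each single model for constructing gene regulatory networks has its own strengths and weaknesses and is limited in practice. Gene regulatory networks built by systems biology methods that combine several models are more accurate and stable than those built by a single model, and this is also a trend in modeling research. This dissertation proposes, for the first time, a network reconstruction method that combines multiple models, namely gene regulatory network construction based on mutual information and hybrid models. In this method, a linear model and a nonlinear model are each used to construct a gene regulatory network, and the two resulting network structures are then integrated to obtain the final network. Additive tree models encode the linear and nonlinear models, with genetic programming optimizing the model structure and particle swarm optimization optimizing the parameters. The fitness function contains a sparsity coefficient and a correlation coefficient. The sparsity coefficient reflects the fact that, in real gene regulatory networks, only a very small fraction of the candidate regulators of each target gene are true regulators. The correlation coefficient uses mutual information to evaluate the relevance of gene pairs and to select the regulators most relevant to the target gene. Experimental results show that this method is more accurate than other classical single-model methods: it maintains a high true positive rate while keeping the false positive rate low.

2. Microarray data processing, regulatory pathway construction, and analysis of the chromosomal distribution of disease-causing genes. Using the human whole-genome HU133 Plus 2.0 microarray and tissue samples of coronary arteries with atherosclerotic plaque and of normal coronary arteries provided by Qilu Hospital and Liaocheng People's Hospital, this dissertation builds gene expression profiles of coronary atherosclerotic plaque and of normal tissue. Comparing the two sets of profiles yields 1104 differentially expressed genes, which are then analyzed with GO functional classification and pathway analysis to understand the changes in their biological functions and pathways. GO analysis finds that the differentially expressed genes in coronary atherosclerosis are involved in several biological functions, such as cell adhesion and biological adhesion. Pathway analysis finds that the genes are significantly enriched in the focal adhesion pathway. The method based on mutual information and hybrid models, proposed in Chapter 4, is used to predict the regulatory relationships among the differentially expressed genes in the focal adhesion pathway; it correctly predicts the Rho kinase regulatory mechanism, which demonstrates the effectiveness of the network construction method.

The dissertation also collects genomic data for five species (human, mouse, zebrafish, fruit fly, and nematode), protein-coding disease-causing genes for 14 diseases, and leukemia-associated mutation data, and analyzes their gene density distribution across chromosomes. The results show that genes are distributed heterogeneously across chromosomes, that protein-coding disease-causing genes have similar inter-chromosomal distribution patterns, and that protein-coding disease-causing genes involved in certain biological processes are enriched on one or a few chromosomes. Human chromosome 19 has the highest or second-highest frequency of protein-coding disease-causing genes, which may be related to the fact that this chromosome carries more genes involved in transcriptional regulation. These findings could improve the efficiency of disease-related gene screening studies, such as GWAS, genome-wide linkage analysis, and whole-genome sequencing, by targeting specific chromosomes.

3. Modeling from the microscopic and stochastic perspective. Gene regulation involves a large number of biochemical reactions, and in these processes discreteness and stochasticity may play an important role, especially when regulatory molecular species are present in small numbers and interact slowly. This dissertation proposes a new framework for automatically deriving and simulating stochastic and delayed stochastic biochemical reaction models. The additive reaction model (Additive Reaction Model) encodes chemical reaction models and, for the first time, combines the three elements of stochasticity, discreteness, and delay. A hybrid evolutionary strategy that nests a genetic algorithm and particle swarm optimization identifies the structure and parameters of the additive reaction model. Experimental results show that the additive reaction model and the hybrid evolutionary strategy can accurately identify biochemical reaction models.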
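As an illustration of the mutual-information relevance score mentioned under contribution 1, the following is a minimal sketch of a plug-in mutual information estimate between two expression profiles. The abstract does not say how the dissertation estimates mutual information, so the equal-width binning, the function names `discretize` and `mutual_information`, and the `bins` parameter are all assumptions made for this example.

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Map each value to an equal-width bin index in [0, bins - 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant profile carries no information; put everything in bin 0.
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=3):
    """Plug-in estimate of I(X;Y) in bits from paired samples x and y."""
    xs, ys = discretize(x, bins), discretize(y, bins)
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi
```

A candidate regulator whose profile has a higher score against the target gene's profile would rank as more relevant. With short time series the plug-in estimate is biased, so the bin count is a tuning choice rather than a fixed value.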

【Abstract】 With the completion of genome sequencing, research on the function of a single gene or protein can no longer fundamentally reveal the laws governing the occurrence and development of biological phenomena, so over the past decade systems biology has become a center of attention among the many branches of biology. Systems biology is a new, rapidly developing interdisciplinary field that combines knowledge and skills from many disciplines, such as biology, chemistry, physics, mathematics, and computer science. Its purpose is to study the physiological mechanisms of biological systems from a systematic and global perspective. Modeling gene regulatory networks (GRNs) from gene expression data has in recent years become one of the effective approaches in systems biology. Accurate construction of GRNs greatly affects the accuracy of disease treatment, deepens the understanding of cellular activities and of the functional mechanisms of causative genes, and has a profound impact on the prevention, diagnosis, and treatment of complex diseases. Although some progress has been made, a gene regulatory network is a large and complex system that is strongly coupled, stochastic, time-varying, and strongly nonlinear. Existing methods are relatively simple, cannot accurately identify transcriptional regulatory relationships among genes, and produce too many false-positive relationships. How to establish a precise model of gene regulation effectively is a current research focus.

This dissertation uses computational intelligence methods to mine gene expression data, reconstruct gene regulatory networks, and model the biochemical reactions involved in gene regulation. The methods are applied to microarray gene expression profiles of coronary atherosclerotic plaque. The main contributions and innovations of the dissertation are as follows.

1. Modeling from the macroscopic perspective. Because existing models reconstruct gene regulatory networks inaccurately, the flexible neural tree (FNT) model is proposed to construct gene regulatory networks and to forecast time series from gene expression profiles. A tree-structure-based evolutionary algorithm similar to genetic programming optimizes the hierarchical structure of the FNT model, and simulated annealing optimizes the parameters encoded in the structure. The two optimization algorithms are applied alternately, and the loop continues until a satisfactory solution is found or the iteration limit is reached. To improve the accuracy of the gene regulatory networks, the Akaike information criterion (AIC) and majority voting are used to identify the minimal set of regulators of each target gene. Experimental results show that, compared with the Elman neural network, fuzzy neural network, RBF neural network, recurrent neural network, recurrent fuzzy neural network, and ensembles of these models, the FNT model forecasts gene expression profiles more accurately and reconstructs networks more accurately.

Every single-model method for inferring gene regulatory networks has its strengths and weaknesses. Compared with a single model, a combination of multiple models constructs gene regulatory networks more accurately and stably, and this is also a research trend. This dissertation is the first to present a method that combines multiple models, namely RMIHM (Gene Regulatory Network Reconstruction Based on Mutual Information and Hybrid Models). In this method, a linear model and a nonlinear model are each used to construct a gene regulatory network, and the overall network integrates the network topologies obtained from the two models. Additive tree models encode the linear and nonlinear models; genetic programming evolves the structure of each additive tree model and particle swarm optimization optimizes its parameters. The fitness function contains a sparsity coefficient and a correlation coefficient. The sparsity coefficient reflects the condition that only a tiny fraction of the candidate regulators of each target gene are true regulators. The correlation coefficient uses mutual information from information theory to evaluate the correlation between gene pairs, in order to select the most relevant regulators of each target gene. Experimental results show that the method is more accurate than classical single-model methods: the true positive rate is higher and the false positive rate is lower.

2. Microarray data processing, regulatory pathway construction, and features of the inter-chromosomal distribution of disease-related genes in the human genome. All samples of atherosclerotic plaque in coronary arteries and of normal coronary artery tissue were provided by the tissue banks of Qilu Hospital and Liaocheng People's Hospital. The Human Genome U133 Plus 2.0 Array (Affymetrix) was used to build gene expression profiles of the atherosclerotic plaque and normal tissue samples. Comparing the two kinds of expression profiles yielded 1104 differentially expressed genes. These genes were analyzed using GO functional classification and pathway analysis in order to understand the changes in their biological functions and pathways. GO analysis found that the differentially expressed genes in coronary atherosclerosis are involved in multiple biological functions, such as cell adhesion and biological adhesion. Pathway analysis revealed that the genes are significantly enriched in the focal adhesion pathway. The RMIHM method, introduced in the fourth chapter, was used to predict the regulatory relationships among the differentially expressed genes in the focal adhesion pathway. It correctly predicted the Rho kinase regulatory mechanism, which demonstrates the effectiveness of the approach.

The dissertation also collected genomic data for five species (human, mouse, zebrafish, fruit fly, and C. elegans), disease-related protein-coding genes for 14 diseases, and leukemia-associated mutation data. Analysis of the inter-chromosomal distribution of genes showed a heterogeneous pattern. Disease-associated protein-coding genes had similar inter-chromosomal distribution patterns, and those involved in certain biological processes tended to be enriched on one or a few chromosomes. Human chromosome 19 had the highest or second-highest frequency of disease-associated protein-coding genes, which may be related to the fact that this chromosome harbors more genes involved in transcriptional regulation. These findings could help improve the efficiency of disease-associated gene screening studies, such as GWAS, genome-wide linkage analysis, and whole-genome sequencing, by targeting specific chromosomes.

3. Modeling from the microscopic and stochastic perspective. Gene regulation involves a large number of biochemical reactions. Discreteness and stochasticity may play important roles, particularly in systems with low numbers of regulatory molecules and slow interactions between them. This dissertation presents a new modeling approach for the automated design of stochastic and delayed stochastic biochemical reaction models. The additive reaction model is proposed to encode chemical reactions, integrating stochastic, discrete, and delayed modeling into one computational framework for the first time. A genetic algorithm and particle swarm optimization are combined as a nested hybrid evolutionary strategy to identify the structure and parameters of the model. Experimental results show that the additive reaction model and the nested hybrid evolutionary strategy can accurately identify stochastic and delayed stochastic biochemical reaction models.
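To make the role of discreteness and stochasticity in contribution 3 concrete, here is a minimal sketch of the standard Gillespie direct method applied to a one-species production and degradation system. This is not the dissertation's additive reaction model and it omits delays; the function name `gillespie_birth_death` and the rate parameters `k_prod` and `k_deg` are hypothetical names chosen for this example.

```python
import random


def gillespie_birth_death(k_prod, k_deg, x0, t_end, seed=0):
    """Simulate 0 -> X (rate k_prod) and X -> 0 (rate k_deg * x) exactly.

    Returns a list of (time, molecule_count) pairs, starting at (0.0, x0).
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while True:
        a_prod = k_prod          # propensity of the production reaction
        a_deg = k_deg * x        # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total <= 0.0:
            break                # no reaction can fire any more
        # The waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        trajectory.append((t, x))
    return trajectory
```

Calling `gillespie_birth_death(10.0, 1.0, 0, 5.0)` gives one random trajectory whose counts fluctuate around `k_prod / k_deg`. A delayed variant would additionally queue the products of a reaction and release them after a waiting period, which this sketch does not attempt.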

  • 【Online publication contributor】 Shandong University
  • 【Online publication issue】 2014, No. 10