Research on Relevant Problems of Tumor DNA Microarray Expression Data Analysis
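【Author】 Wang Guangyun (王广云);
【Advisor】 Wang Zhengzhi (王正志);
【Author information】 National University of Defense Technology, Control Science and Engineering, 2009, PhD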
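【Summary】 As the Tumor Genome Project has progressed, DNA microarray technology has come into wide use in tumor research. Tumor microarrays supply tumor genomics with large volumes of transcript-level gene expression data, which reflect how expression levels change across the developmental stages and physiological states of different tissue cells. The corresponding data analysis techniques make it possible to reveal the nature of tumors at the genome level, offer a new and systematic approach to studying tumor-related genes, and have attracted much attention in clinical diagnosis and treatment. A number of genes involved in tumor initiation and progression have been confirmed, and something is known about their functions and regulatory mechanisms. These results, however, fall far short of what is needed to map the tumor genome and overcome cancer. How to analyze tumor microarray expression data effectively, and how to use existing knowledge to assist that analysis, so as to find tumor-related genes and determine their functions and regulatory mechanisms, has therefore become a pressing problem in tumor genomics. Against this background, this dissertation takes tumor microarray expression data analysis as its subject and studies three problems: preprocessing of tumor gene expression data, cluster analysis, and construction of gene regulatory networks. Its main contents and contributions are as follows.
(1) Missing value estimation and normalization. The study of missing value estimation found that the similarity among gene expression profiles strongly affects estimation accuracy, and that the spatial distribution of the expression data of the complete genes (those without missing values) used for estimation is a good basis for estimating missing values. The dissertation therefore proposes a missing value estimation method based on K-nearest Neighbor and Support Vector Regression (KNN-SVR). The method takes the subset of complete genes most similar to the target gene as the training set and fits a regression model with SVR to estimate the missing values, improving both the accuracy and the stability of the estimates. The study of classification-based diagnosis and subtype identification from tumor expression profiles found that analyzing data processed with current normalization methods introduces class bias and leads to misclassified samples. The dissertation therefore extends the normalization methods to use class information, making the expression data better suited to classification-based diagnosis and subtype identification.
(2) Clustering of time-series tumor microarray expression data. To handle the asynchronous and local regulatory relationships that are common among genes, the dissertation takes cell-cycle gene expression data as its object of study, introduces the concept of the local maximum correlation coefficient, and uses it to define the correlation between genes. It then gives the rules to follow when setting the maximum time-delay range and the minimum sample length for local correlation in identifying asynchronous and local regulation. Finally, it modifies the K-means algorithm on the basis of the local maximum correlation coefficient, yielding a clustering method built on that coefficient. The core of the method is the local maximum correlation coefficient, which identifies local and asynchronous correlations between expression profiles without destroying their overall correlation, and thus provides a more effective similarity measure for clustering functionally similar and co-regulated genes.
(3) Clustering of non-time-series tumor microarray expression data. To remove noise from non-time-series expression data and identify weakly differentially expressed genes, the dissertation proposes a denoising CICA (Constrained Independent Component Analysis) model and uses it to cluster non-time-series tumor expression data. The clustering method has two parts. First, with the Ljung-Box Q statistic as the constraint on "whiteness" and maximal Gaussianity as the objective, a Gaussian white noise component is extracted and used to denoise the expression data. Then CICA clusters the denoised data, with the expression levels of the genes under study as the constraint and maximal non-Gaussianity as the objective, separating out the related biological processes or functional classes. The method preserves the detail in the expression data while denoising it, and improves the identification of weakly differentially expressed genes.
(4) Construction of gene regulatory networks. The dissertation first builds an N-order dynamic Bayesian network model to capture the multiple time delays of gene regulation. Then, because an ideal regulatory network cannot be obtained from expression data alone, it proposes, on the basis of the N-order dynamic Bayesian network, a method for constructing multi-time-delay gene regulatory networks that incorporates prior information from multiple sources. The method converts each source of prior information, according to its characteristics, into a prior probability over network structures with its own distribution, combines these priors with time-series microarray expression data, and learns the structure of the N-order dynamic Bayesian network by Markov Chain Monte Carlo (MCMC). Assuming that the expression data and the prior information are mutually independent, the method also factorizes the computation of the structure acceptance probability during MCMC learning, which gives a flexible way to fuse expression data with multi-source prior information so that both contribute to learning the network. The method identifies multi-time-delay regulatory relationships between genes well and also reduces the influence of data noise.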
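The local maximum correlation coefficient of item (2) can be illustrated with a small sketch. The text above does not give its formula, so the definition used here is an assumption: the largest Pearson correlation found over all time shifts up to `max_lag` and all contiguous windows of at least `min_len` samples. The names `pearson`, `lmcc`, `max_lag` and `min_len` are illustrative and are not taken from the dissertation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def lmcc(x, y, max_lag=2, min_len=4):
    """Assumed LMCC: best Pearson correlation over time shifts and contiguous windows."""
    n = len(x)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        # Align x[t] with y[t + lag] on the overlapping time points.
        if lag >= 0:
            xs, ys = x[:n - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:n + lag]
        m = len(xs)
        # Scan every contiguous window holding at least min_len samples.
        for start in range(0, m - min_len + 1):
            for end in range(start + min_len, m + 1):
                best = max(best, pearson(xs[start:end], ys[start:end]))
    return best
```

Under this assumed definition, two profiles where one is a delayed copy of the other can have a global correlation near zero and still score close to 1, which is the kind of asynchronous relationship item (2) is concerned with. `max_lag` and `min_len` play the roles of the maximum time-delay range and the minimum local sample length mentioned there.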
【Abstract】 With the development of the Tumor Genome Project, DNA microarrays have become widely used in tumor research. Tumor DNA microarrays provide a large amount of gene expression data for tumor genomic research, reflecting the variation of gene expression levels across the developmental stages and physiological states of different tissue cells. Because it can uncover the nature of tumors at the genomic level and offers a new, systematic research method, the analysis of tumor gene expression data has received great attention. Researchers have so far confirmed some tumor genes and accumulated some knowledge of oncogenesis and of the regulatory mechanisms of tumor genes, but these achievements are far from sufficient to understand and cure tumors. How to analyze tumor gene expression data effectively has therefore become an urgent problem. Taking tumor DNA microarray expression data analysis as its topic, this dissertation studies the related preprocessing techniques, cluster analysis algorithms, and gene regulatory network modeling methods. The main contents and contributions of the dissertation are summarized as follows:
(1) Methods for missing value estimation and normalization of gene expression data. For the missing value estimation problem, we found that the similarity between gene expression profiles influences the estimation precision, and that the spatial distribution of the expression data of genes without missing values is a useful reference for estimating missing values. This dissertation therefore presents a new missing value estimation method based on K-nearest Neighbor and Support Vector Regression (KNN-SVR). The algorithm takes as its training set the genes that have no missing values and are highly similar to the gene whose missing values are to be estimated, and builds regression models with SVR to estimate the missing values. It improves the accuracy and stability of the estimates.
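The KNN-SVR estimation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: similarity is taken to be Euclidean distance on the observed columns, the regressor is a linear epsilon-insensitive SVR trained by plain subgradient descent (a library SVR with a nonlinear kernel could be substituted), and all function names and hyperparameter values are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest_complete_genes(target, complete, missing_col, k):
    """Indices of the k complete genes closest to `target` on its observed columns."""
    cols = [j for j in range(len(target)) if j != missing_col]

    def dist(gene):
        return euclidean([target[j] for j in cols], [gene[j] for j in cols])

    return sorted(range(len(complete)), key=lambda i: dist(complete[i]))[:k]

def fit_linear_svr(X, y, epsilon=0.05, lam=0.01, lr=0.01, epochs=2000):
    """Linear epsilon-insensitive SVR fitted by subgradient descent (assumed stand-in)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0   # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            err = sum(wi * v for wi, v in zip(w, xi)) + b - yi
            if abs(err) > epsilon:             # outside the tube: loss slope is +1 or -1
                s = 1.0 if err > 0 else -1.0
                gw = [g + s * v / n for g, v in zip(gw, xi)]
                gb += s / n
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def knn_svr_impute(target, complete, missing_col, k=5):
    """Estimate target[missing_col] from the k most similar complete genes."""
    cols = [j for j in range(len(target)) if j != missing_col]
    idx = nearest_complete_genes(target, complete, missing_col, k)
    # Each neighbour gene is one training sample: observed columns -> missing column.
    X = [[complete[i][j] for j in cols] for i in idx]
    y = [complete[i][missing_col] for i in idx]
    w, b = fit_linear_svr(X, y)
    return sum(wi * target[j] for wi, j in zip(w, cols)) + b
```

Restricting the training set to the nearest complete genes is what separates this from fitting one regression on all genes: dissimilar profiles never enter the model. A gene with several missing entries would need one such model per missing column.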
In the classification and class discovery of tumor gene expression data, current normalization methods are likely to cause samples to be misclassified. This dissertation therefore extends the normalization methods to use class information when normalizing gene expression data, which makes the data better suited to classification and class discovery analyses.
(2) Methods for gene cluster analysis of tumor time-series microarray data. To identify asynchronous or local correlation in expression profiles, this dissertation presents the concept of the Local Maximum Correlative Coefficient (LMCC) and uses it to define the correlative relationship between genes. The rules for setting the maximum time delay and the minimum local time segment are then studied. Finally, the dissertation presents a new clustering method that uses LMCC as the similarity measure in the K-means method, with corresponding modifications. The method identifies asynchronous and local correlation well, and LMCC provides a more effective similarity measure.
(3) Methods for gene cluster analysis of tumor non-time-series microarray data. To eliminate noise and identify genes with weak differential expression in microarray data, this dissertation presents a denoising Constrained Independent Component Analysis (CICA) model, deCICA, and uses it to cluster tumor non-time-series microarray data. The clustering method based on the deCICA model has two parts. First, it extracts a Gaussian white noise component to remove the noise in the gene expression data, using the Ljung-Box Q statistic as the constraint on the 'white' character and Gaussianity maximization as the objective.
Second, it uses the CICA model to cluster the denoised gene expression data, using the expression data of the target genes as the constraint and non-Gaussianity maximization as the objective to separate the related biological processes or functional clusters. Because it removes part of the noise while retaining the detailed information in the expression data, the method identifies genes with weak differential expression effectively.
(4) Methods for constructing gene regulatory networks. This dissertation first builds the N-order Dynamic Bayesian Network (N-DBN) to model the multiple time delays in gene regulation, and then presents a new method, N-DBN-MP, which constructs multi-time-delay gene regulatory networks with N-DBN by combining expression data with multiple independent sources of prior knowledge. So that they can be combined with time-series microarray data, the method transforms the multiple independent sources of prior knowledge into different prior probability distributions according to their characteristics, and uses a Markov Chain Monte Carlo (MCMC) algorithm to learn the network structure of the N-DBN. During MCMC learning, the acceptance probability of a network structure is decomposed under the hypothesis that the microarray data are independent of the prior knowledge, which achieves the fusion of microarray data and prior knowledge. N-DBN-MP not only identifies regulatory relationships between genes effectively but also reduces the effect of noise in microarray data.
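The decomposed acceptance probability of item (4) can be sketched as a Metropolis-Hastings step in log space, where the data term and each prior source contribute separate additive differences. Everything concrete below is an assumption made for illustration: the energy-style prior, the edge-toggle proposal, the toy likelihood standing in for the N-DBN score, and all names and values.

```python
import math
import random

def log_acceptance(current, proposed, log_lik, log_priors, log_proposal_ratio=0.0):
    """Log Metropolis-Hastings acceptance probability as a sum of separate terms."""
    data_term = log_lik(proposed) - log_lik(current)
    # One difference per prior source; independence lets them simply add.
    prior_term = sum(lp(proposed) - lp(current) for lp in log_priors)
    return min(0.0, data_term + prior_term + log_proposal_ratio)

def energy_log_prior(supported_edges, beta):
    """Unnormalized log prior: -beta for every edge this source does not support."""
    def log_prior(graph):
        return -beta * len(graph - supported_edges)
    return log_prior

def toggle_edge(graph, candidates, rng):
    """Symmetric proposal: add or remove one candidate edge chosen at random."""
    return graph ^ frozenset([rng.choice(candidates)])

def run_chain(candidates, log_lik, log_priors, steps, seed=0):
    rng = random.Random(seed)
    graph = frozenset()
    for _ in range(steps):
        proposed = toggle_edge(graph, candidates, rng)
        log_a = log_acceptance(graph, proposed, log_lik, log_priors)
        if rng.random() < math.exp(log_a):
            graph = proposed
    return graph

# Toy setup: edges are (regulator, target, lag) triples, so a regulator may act at several delays.
candidates = [("g1", "g2", 1), ("g1", "g3", 2), ("g2", "g3", 1), ("g3", "g1", 1)]
true_edges = frozenset(candidates[:2])

def toy_log_lik(graph):
    # Hypothetical stand-in for the N-DBN likelihood of the expression data.
    return 2.0 * len(graph & true_edges) - 1.0 * len(graph - true_edges)

priors = [
    energy_log_prior(frozenset([candidates[0]]), beta=1.0),
    energy_log_prior(frozenset([candidates[1], candidates[2]]), beta=0.5),
]

final_graph = run_chain(candidates, toy_log_lik, priors, steps=200)
```

Because each source enters only through its own difference term, adding or dropping a source changes the `priors` list and nothing else, and normalizing constants of the individual priors cancel in the ratio.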
【Key words】 Tumor; DNA Microarray; Missing Value Estimation; Cluster Analysis; Gene Regulatory Network; LMCC; deCICA; N-DBN;