基因芯片表达数据分析相关问题研究

Research on Relevant Problems of DNA Microarray Expression Data Analysis

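【Author】 Qiu Langbo (邱浪波)

【Advisor】 Wang Zhengzhi (王正志)

【Author information】 National University of Defense Technology, Control Science and Engineering, 2007, Ph.D.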

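【Summary】 This dissertation studies the analysis of DNA microarray expression data, focusing on three areas: preprocessing of microarray expression data, microarray-related problems in cancer research, and modeling of gene expression regulatory networks. Its main contents and contributions are as follows:

1) Research on normalization of systematic bias in oligonucleotide arrays
A microarray experiment involves multiple arrays, so variation between arrays caused by non-biological factors must be removed to make data from different arrays comparable. In comparative analysis, normalization reduces the systematic bias between arrays so that the measurements reflect true differences in biological function. The dissertation studies normalization of systematic bias in oligonucleotide arrays and proposes an iterative, robust baseline-array normalization method. The probes on each array are ranked and a subset of probes with the smallest rank differences is selected; a pseudo-baseline array is then computed with the Tukey biweight algorithm; finally, each target array is normalized against the pseudo-baseline array by nonlinear M-A normalization. This procedure is iterated until the maximum number of iterations is reached or the change in probe hybridization intensity before and after normalization falls below a threshold. On the HG U133A Spike-in Dataset, the standard benchmark dataset provided by Affymetrix, a comparison with several existing methods shows that the new method performs better.

2) Research on missing value estimation for microarray expression data
Missing data are common in microarray experiments and degrade the accuracy of downstream analysis. Missing value estimation is an effective way to reduce this impact without running additional experiments. A kernel weighting function built on similarity information localizes the regression estimate of the missing values, giving a missing value estimation method for gene expression based on weighted regression. On two different types of microarray expression data, the new algorithm is compared with several known algorithms. The experimental results show that it is more stable and more accurate than traditional missing value estimation algorithms.

3) Research on classification and diagnosis algorithms for cancer microarray expression data
Classifying cancer microarray expression data is a typical high-dimensional, small-sample classification problem, and many effective classification algorithms have been proposed. The dissertation proposes a two-step classification algorithm. The measured genes include a large number of redundant genes that are not differentially expressed; to reduce their effect on classification, genes are first pre-selected with the ReliefF method to obtain a smaller subset of classification genes. Classification models are then built on this subset, one based on the relevance vector machine and one based on an immune-optimized support vector machine. On four real cancer microarray expression datasets, a comparison with several other algorithms shows that the new algorithm achieves better classification accuracy and good stability.

4) Research on subtype identification algorithms for cancer microarray expression data
Cancer is a highly heterogeneous disease, and different causes can lead to the same phenotype. Accurate subtype diagnosis is difficult on the basis of clinical pathology alone. Microarray technology offers a high-throughput means of observing the onset and progression of cancer at the molecular level, and gene expression data can be used to identify the subtypes of tumor tissue samples accurately. Support vector clustering is a boundary-detection-based clustering method that performs well on irregularly shaped class distributions and can reveal the true class distribution of the samples. The dissertation presents a subtype identification algorithm based on support vector clustering. Two cancer microarray expression datasets are analyzed; using an automatically generated parameter sequence, the samples are partitioned at different levels of granularity. The results show that the method identifies sample subtypes more accurately and automatically discovers the true class distribution of the tumor samples.

5) Research on reverse-engineering-based modeling of gene expression regulatory networks
The mechanism of a gene expression regulatory network involves not only interactions between genes but also interactions among various regulatory factors, such as regulatory proteins and siRNA, which are not easy to measure directly. The state-space model describes this complex regulatory mechanism well. Gene expression regulatory networks are typically sparse: the expression of a gene is regulated by only a very small number of genes and regulatory factors, and genes that regulate one another show strong correlation in their continuous expression levels. To exploit this sparse, modular structure, genes are first decomposed into clusters by correlation clustering, and the mutual regulation among the genes in each cluster is then modeled with state-space equations. Combining the modeling results obtained at different numbers of clusters yields gene interactions that are conserved across levels, and thus a sparse regulatory network. Analysis of human T-cell cycle gene expression data shows that, as the number of clusters increases, decomposition modeling reconstructs the network better. Sparse regulatory network models with different degrees of conservation are also established.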
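The pseudo-baseline step in item 1 can be illustrated with a minimal sketch. It assumes the one-step Tukey biweight location estimate in the form commonly used for Affymetrix data (tuning constant c = 5, small epsilon to avoid division by zero); the dissertation's exact variant, the probe-subset selection, and the M-A normalization are not reproduced here. The function names `tukey_biweight` and `pseudo_baseline` are hypothetical.

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (assumed form, c and eps are assumptions)."""
    values = list(values)
    m = median(values)
    # Median absolute deviation from the median, used as a robust scale.
    mad = median([abs(v - m) for v in values])
    num = 0.0
    den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Bisquare weight: points far from the median get weight zero.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den


def pseudo_baseline(arrays):
    """Combine arrays probe by probe into one robust pseudo-baseline array.

    `arrays` is a list of equal-length lists of (log) intensities,
    one list per array; this layout is an assumption of the sketch.
    """
    return [tukey_biweight(probe_values) for probe_values in zip(*arrays)]


if __name__ == "__main__":
    arrays = [[1.0, 2.0], [1.0, 2.0], [1.0, 50.0]]
    print(pseudo_baseline(arrays))
```

In this sketch an outlying intensity on one array receives zero weight, so it does not pull the baseline value for that probe, which is the property that makes a robust estimator attractive for building a reference array.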

【Abstract】 This dissertation studies DNA microarray expression data preprocessing techniques, classification and class discovery algorithms in cancer research, and gene regulatory network modeling methods. The main contents and contributions of the dissertation are summarized as follows:

1) Research on methods to normalize system bias in high-density oligonucleotide array gene expression data
In multi-array experiments there is system bias, introduced by experimental factors such as spot location (often referred to as a print-tip effect), arrays, dyes, and various interactions of these effects. To make the arrays comparable with each other, the raw expression profile data must be normalized. Normalization is the key step in low-level processing. Many normalization methods have been developed, e.g. scaling normalization, nonlinear normalization, and quantile normalization. A new baseline normalization method is presented. First, the subset of probes with the minimum rank range is selected; second, a pseudo-baseline is computed by the Tukey biweight algorithm; finally, nonlinear normalization is performed against the pseudo-baseline. The iterative strategy reduces the sensitivity of the baseline method to the choice of baseline. The method is compared with other methods on the standard test dataset. The results show that the novel method performs better than the others in several respects.

2) Research on algorithms for missing value estimation of microarray expression data
In microarray experiments, missing values do occur and affect the stability and precision of expression data analysis. Compared with running additional experiments, missing value estimation is the preferred way to reduce the influence of missing values on downstream processing. A new method based on weighted regression is presented, in which a kernel weight based on the similarity between the target gene and the sample genes localizes the missing value estimation.
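The idea of localizing the estimate with a similarity-based kernel weight can be sketched as follows. The abstract does not specify the regression model, so this example substitutes a Nadaraya-Watson style kernel-weighted average, the simplest form of locally weighted regression, with a Gaussian kernel on Euclidean distance. The function name `kernel_weighted_impute` and the `bandwidth` parameter are hypothetical.

```python
import math


def kernel_weighted_impute(target, candidates, missing_idx, bandwidth=1.0):
    """Estimate target[missing_idx] from complete candidate genes.

    Assumptions of this sketch (not taken from the dissertation):
    similarity is Euclidean distance over the observed samples, and the
    kernel is Gaussian with the given bandwidth.
    """
    observed = [i for i in range(len(target)) if i != missing_idx]
    num = 0.0
    den = 0.0
    for gene in candidates:
        # Squared distance between target and candidate on observed samples.
        dist2 = sum((target[i] - gene[i]) ** 2 for i in observed)
        # Similar genes (small distance) receive weights close to 1.
        w = math.exp(-dist2 / (2.0 * bandwidth ** 2))
        num += w * gene[missing_idx]
        den += w
    if den == 0.0:
        raise ValueError("no candidate gene is close enough to the target")
    return num / den


if __name__ == "__main__":
    target = [1.0, 2.0, None]          # value at index 2 is missing
    candidates = [[1.0, 2.0, 3.0], [1.0, 2.0, 5.0]]
    print(kernel_weighted_impute(target, candidates, missing_idx=2))
```

A smaller bandwidth makes the estimate depend on fewer, more similar genes; a larger one approaches a plain average over all candidates.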
On two real microarray expression datasets, the novel method is compared with several existing methods. Experimental results show that the novel method has better stability and precision than the existing methods tested.

3) Research on algorithms for cancer microarray expression classification
DNA microarray technology can measure the expression levels of thousands of genes simultaneously. It has become an important tool in cancer biology. In combination with classification methods, microarray technology can support clinical management decisions for individual patients. Cancer microarray expression classification is a typical high-dimensional, small-sample problem. A gene expression dataset contains many genes that are redundant for classification, so selecting the most relevant genes is an important issue. A robust two-step approach is presented. First, to reduce computational complexity, a gene pre-selection procedure based on ReliefF reduces the huge number of genes under consideration. Second, the relevance vector machine and the support vector machine optimized by an immune clonal algorithm are each applied to the gene subset for classification. On four real cancer microarray datasets, the new approach is compared with several existing methods. The experimental results show that the proposed approach achieves high classification accuracy and is more robust.

4) Research on methods for class discovery of cancer microarray expression
Cancer is a highly heterogeneous disease, and different causes can lead to the same phenotype. It is very difficult to distinguish the classes of a cancer on the basis of clinical pathology. DNA microarray technology provides a high-throughput tool for observing the occurrence and evolution of cancer at the molecular level.
The different classes of a cancer can be accurately discovered from microarray expression profiles. Many clustering methods have been widely used to discover cancer classes. Support vector clustering is a boundary-based clustering method that handles irregular classes well and can automatically find the true classes. An algorithm for discovering tumor classes based on support vector clustering is presented. Many gene expression profiles are redundant for class discovery. Therefore, variance filtering first selects a small number of genes with the largest variance as features. Second, support vector clustering is used to discover the cancer classes. On two cancer microarray datasets, with an automatically generated parameter sequence, the presented method partitions the cancer samples at different levels of granularity. The results show that this method discovers the classes of cancer samples more accurately and automatically finds the true number of classes.

5) Research on modeling methods for gene regulatory networks
A gene regulatory network involves not only interactions between genes but also interactions of various regulatory factors, such as regulatory proteins and siRNA, which cannot be measured directly. The state-space model is a special type of dynamic Bayesian network, based on the assumption that the observed variables depend on state variables with Markov dynamics. The state-space model can therefore describe the complex mechanism of gene regulatory networks accurately. Because of their computational complexity, model-based methods have difficulty modeling larger gene regulatory networks directly.
Gene regulatory networks are typically sparse: the expression of one gene is controlled by only a very small number of genes and regulatory factors, and the continuous expression profiles of mutually regulating genes show strong correlation. In view of this sparse characteristic of gene regulation, genes are first clustered by correlation clustering, and the mutual regulation of the genes in each cluster is then modeled with the state-space model. To obtain a sparse network, the gene interactions that are conserved across different numbers of clusters are integrated. On human T-cell cycle expression data, the dissertation analyzes how well the model reconstructs the network's dynamic behavior. The results show that, as the number of clusters increases, decomposition modeling reconstructs the network better. The dissertation also establishes several sparse regulatory networks with different levels of conservation.
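The final integration step, keeping only interactions that recur across different numbers of clusters, can be sketched as a simple vote over edge sets. This is an illustrative assumption about how conservation might be scored; the abstract does not give the dissertation's actual criterion. The function name `conserved_edges`, the `min_fraction` threshold, and the representation of an edge as a `(regulator, target)` tuple are hypothetical.

```python
from collections import Counter


def conserved_edges(edge_sets, min_fraction=1.0):
    """Return edges present in at least `min_fraction` of the edge sets.

    Each element of `edge_sets` is the collection of (regulator, target)
    edges inferred at one cluster-number level (assumed representation).
    """
    if not edge_sets:
        return set()
    # Count each edge once per level, even if a level lists it twice.
    counts = Counter(edge for edges in edge_sets for edge in set(edges))
    n_levels = len(edge_sets)
    return {edge for edge, n in counts.items() if n / n_levels >= min_fraction}


if __name__ == "__main__":
    levels = [
        {("A", "B"), ("B", "C")},
        {("A", "B")},
        {("A", "B"), ("B", "C"), ("C", "A")},
    ]
    print(conserved_edges(levels))        # edges found at every level
    print(conserved_edges(levels, 0.5))   # edges found at half the levels or more
```

Lowering `min_fraction` yields denser networks, which mirrors the abstract's notion of sparse networks at different levels of conservation.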

  • 【Classification No.】TP391.41
  • 【Times cited】11
  • 【Downloads】1399