
Research on Microarray Data Mining Techniques

【Author】 Wang Mingyi (王明怡)

【Advisor】 Wu Ping (吴平)

【Author Information】 Zhejiang University, Bioinformatics, 2004, PhD

【Abstract】 Microarrays, a new molecular biology technology, make it possible to measure the expression levels of thousands of genes in a biological sample simultaneously. The genome-wide expression data generated this way offer a route to uncovering implicit, previously unknown and meaningful biological knowledge. A major challenge in this area is to develop bioinformatics tools for data collection and analysis.

This dissertation investigates several of the main problems in microarray data mining, including gene selection, tissue classification and the reconstruction of regulatory networks from gene expression data. The main contributions are summarized below.

Gene sets selected from microarray data by the usual ranking methods tend to contain many highly correlated genes, which degrades classifier performance. To filter out these redundant genes (features), an unsupervised feature selection algorithm is proposed. The algorithm has two steps: partitioning the original feature set into a number of homogeneous subsets (clusters), and selecting a representative feature from each cluster. The partitioning follows the k-nearest-neighbor (k-NN) principle, using pairwise feature correlation as the similarity measure. The method does not require the number of clusters to be specified in advance, and its computational complexity is low. Experiments on real biological data show that the algorithm significantly increases the classification accuracy of existing classifiers.

The main challenge in supervised classification of tissue samples from large-scale gene expression data is that the number of genes far exceeds the number of samples. To address this, a classification method based on artificial neural network ensembles is proposed. Significant genes for classification are selected with the Wilcoxon test, each member of the ensemble is trained on a different dataset generated by the convex pseudo-data method, and the predictions of the individual networks are combined by simple averaging. Experiments on real biological data show that this method outperforms single neural networks, 1-nearest-neighbor classifiers and decision trees.

A Bayesian network is a graphical model of a joint multivariate probability distribution that captures conditional independence relationships between variables. Such models are attractive because they can describe the complex stochastic processes of gene expression. Two methods for learning Bayesian networks, hill climbing and Markov chain Monte Carlo (MCMC), were compared on simulated microarray data. The results suggest that MCMC performs better than hill climbing. However, under the conditions of real microarray data, Bayesian networks performed no better than chance at determining whether a regulatory connection exists between a pair of genes.

Mining microarray data has great potential for discovering causal relationships in gene regulation pathways. A constraint-based causal discovery method is presented to search for underlying causal relationships between genes. The search was run on the published data set of Hughes et al., which contains 300 expression profiles for yeast, and found a number of causal relationships. A cursory analysis shows that some of these relationships are biologically sensible, while others suggest new hypotheses that may deserve further investigation. The results indicate that the proposed approach is computationally feasible and can identify interesting causal structures.
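The abstract describes the unsupervised feature selection step only in outline (correlation-based partitioning under the k-NN principle, then one representative per cluster), so the following is a minimal sketch under stated assumptions rather than the dissertation's algorithm. It assumes a dissimilarity of 1 - |Pearson r| and a greedy rule that keeps the gene whose k-th nearest neighbor is closest and discards those k neighbors. The function names `pearson` and `select_representatives` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_representatives(features, k):
    """Return sorted indices of retained genes.

    features: list of expression vectors, one per gene (assumed layout).
    k: neighborhood size; the number of clusters is not given by the caller.
    """
    n = len(features)
    # Assumed dissimilarity: strongly (anti-)correlated genes are close.
    dist = [[1.0 - abs(pearson(features[i], features[j])) for j in range(n)]
            for i in range(n)]
    remaining = set(range(n))
    selected = []
    while remaining:
        kk = min(k, len(remaining) - 1)
        if kk == 0:
            # A single gene is left; keep it as its own cluster.
            selected.extend(sorted(remaining))
            break
        best, best_neighbors, best_radius = None, None, None
        for i in sorted(remaining):
            # The kk nearest remaining neighbors of gene i.
            neighbors = sorted((j for j in remaining if j != i),
                               key=lambda j: (dist[i][j], j))[:kk]
            radius = dist[i][neighbors[-1]]  # distance to the kk-th neighbor
            if best is None or radius < best_radius:
                best, best_neighbors, best_radius = i, neighbors, radius
        # Keep the most compact gene as representative, drop its neighbors.
        selected.append(best)
        remaining -= {best, *best_neighbors}
    return sorted(selected)


# Toy data: genes 0 and 1 are perfectly correlated, as are genes 2 and 3.
toy = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, -1, 1], [2, -2, -2, 2]]
print(select_representatives(toy, k=1))
```

In this sketch the number of retained genes emerges from the data and the choice of k, which is one way to read the abstract's statement that the number of clusters need not be specified in advance.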

  • 【Online Publication Contributor】 Zhejiang University
  • 【Online Publication Year/Issue】 2005, Issue 01