基因芯片数据的聚类功能评价算法和判别分析算法研究
The Research on the Clustering Functional Evaluation Algorithm and the Discriminant Analysis Algorithm for Gene Chip Dataset
【Author】 Wu Feizhen (吴飞珍);
【Supervisor】 Ma Wenli (马文丽);
【Author information】 Shanghai University, Electronic Biotechnology and Equipment, 2009, PhD
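【Abstract (translated from the Chinese)】 The completion of the Human Genome Project (HGP) marked the start of the post-genome era, in which research focuses on gene function. Thanks to their speed, high throughput, and high accuracy, gene chips have become an indispensable tool for studying gene function. Gene chip data analysis is an important part of gene chip research and belongs to the field of bioinformatics. This dissertation studies two topics in gene chip data analysis in depth: functional evaluation of clustering, and discriminant analysis.

First, cluster analysis is an important method of gene chip data analysis. Its aim is to group genes by expression pattern and to infer gene function from the grouping. However, clustering results depend on the clustering algorithm and its parameters, and different algorithms and parameters often give different results, so evaluating clustering results from the standpoint of gene functional similarity is a difficult point in cluster analysis. Chapters 4 and 5 take this as the starting point for studying functional evaluation algorithms for clustering. A new method for computing the semantic similarity of gene annotations was developed. It computes the functional similarity of genes from their annotations in the Gene Ontology (GO), and experiments on the yeast isoleucine metabolism pathway and glutamate biosynthesis pathway demonstrated its accuracy. Building on this method, a functional evaluation algorithm for the clustering of gene chip data was developed. It evaluates clustering quality by the degree of functional difference between clusters and the degree of functional similarity within clusters. Using yeast expression data as an example, the method was shown to evaluate the quality of clustering results accurately, and high-quality clustering results can be obtained under its guidance.

Second, discriminant analysis is also an important part of gene chip data analysis and is one of the key problems that must be solved before gene chips can be used in clinical diagnosis. China has a high incidence of liver cancer. Both microRNA chip data and gene chip data can be used to predict liver cancer metastasis. A microRNA exerts its biological function by regulating the expression of its target genes. Is there a regulatory relationship between the microRNAs used for prediction and the genes used for prediction, that is, between feature microRNAs and feature genes? Chapter 6 takes this question as the starting point for studying the extraction of feature microRNAs and feature genes related to liver cancer metastasis, and the relationship between the two. A method called t-cross-weight was developed, which computes a weight for each gene from t-tests on repeated random samples. Its advantage is that the feature gene set can be enlarged step by step in order of gene weight during discriminant analysis and combined with support vector machines using different kernel functions, so that suitable sets of feature microRNAs and feature genes can be chosen under the guidance of the cross-validation trend. As a result, 100 feature microRNAs and 710 feature genes were selected from the microRNA chip dataset and the gene chip dataset, respectively. Using the expression data of the 100 microRNAs, a support vector machine with a polynomial kernel predicted liver cancer metastasis with an accuracy above 83.99%; using the expression data of the 710 feature genes, a support vector machine with a linear kernel reached an accuracy above 96.76%, indicating good prediction accuracy. Further analysis of these feature microRNAs and feature genes found a regulatory relationship between them, which suggests that liver cancer metastasis may be related to these feature microRNAs regulating the corresponding feature genes. The analysis also found that the functions of the feature gene set are mainly enriched in the cell cycle pathway (P=0.0006), which suggests that changes in the cell cycle pathway may be closely related to liver cancer metastasis.

The main innovations of this dissertation are as follows:
(1) A new algorithm for computing the semantic similarity of gene annotations was developed. It expresses gene functional similarity as a numerical measure, overcoming the earlier limitation that gene similarity could only be compared vaguely. It can compare the similarity of large numbers of genes and is more efficient and accurate than manual comparison.
(2) A new algorithm for evaluating the clustering results of gene expression data was developed. It evaluates clustering results from the standpoint of gene functional similarity, addressing the earlier limitation that clustering results could only be evaluated by the mathematical characteristics of the data, so that higher-quality clustering results can be obtained.
(3) A new method for extracting feature genes was proposed. It converts the results of multiple t-tests into gene weights and selects the feature gene set and kernel function according to the weights, in combination with support vector machines using different kernel functions. This overcomes the drawbacks of choosing the feature gene set and kernel function by random trials.
(4) A regulatory relationship was found between the feature microRNAs and the feature genes related to liver cancer metastasis.

The research on functional evaluation algorithms for clustering of gene chip data and on the extraction of feature genes for liver cancer metastasis has important academic and practical value. First, the functional evaluation algorithm yields higher-quality clustering results and a more accurate classification of gene function. Second, the extracted feature microRNAs and feature genes can improve the accuracy of predicting liver cancer metastasis. The constructed microRNAs-Genes regulatory network offers a new line of thought for research on the mechanism of liver cancer metastasis. In addition, the gene annotation semantic similarity algorithm and the t-cross-weight method can be applied to other, similar studies of gene annotation similarity comparison and discriminant analysis, respectively.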
【Abstract】 The completion of the Human Genome Project (HGP) marked the beginning of the post-genome era, in which research focuses on gene function. The gene chip, also called the DNA microarray, is fast, high-throughput, and highly accurate, and has become an indispensable tool for studying gene function. Data analysis is an important aspect of gene chip technology and belongs to the field of bioinformatics. This dissertation focuses on two questions: functional evaluation of clustering, and discriminant analysis of gene chip datasets.

Cluster analysis is an important approach in gene chip data analysis. Its purpose is to divide genes into groups based on their expression patterns and then to predict gene function from these groups. However, clustering results are usually influenced by the clustering algorithm and its parameters, and different algorithms or parameters often produce very different results. How to evaluate these results, especially from the perspective of biological functional similarity, is a challenge in the cluster analysis of gene chip datasets. Chapters IV and V address this question by studying functional evaluation algorithms for clustering. A new approach to measuring the semantic similarity of gene annotations was developed; it measures the functional similarity of genes based on the locations of their Gene Ontology (GO) terms. The yeast isoleucine metabolic pathway and glutamic acid biosynthesis pathway were used as examples to show the accuracy of this algorithm. Based on it, a novel functional evaluation algorithm was proposed to measure the quality of clustering results. This algorithm assesses clustering quality using both the degree of difference between gene functions in separate clusters and the degree of similarity between gene functions in the same cluster.
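The evaluation idea described above, scoring a clustering by within-cluster functional similarity against between-cluster functional similarity, can be illustrated with a minimal Python sketch. The abstract does not give the actual formula, so the score used here (mean intra-cluster similarity minus mean inter-cluster similarity), the function name `functional_score`, and the toy similarity table are all assumptions made for illustration. In the dissertation the pairwise similarities would come from the GO-based annotation similarity measure.

```python
from itertools import combinations
from statistics import mean


def functional_score(clusters, sim):
    """Return (intra, inter, intra - inter) for one clustering.

    clusters: list of lists of gene names.
    sim: function (gene_a, gene_b) -> similarity, assumed symmetric.
    This scoring rule is an illustrative assumption, not the thesis formula.
    """
    # Similarities of all gene pairs that share a cluster.
    intra_pairs = [
        sim(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    ]
    # Similarities of all gene pairs drawn from two different clusters.
    inter_pairs = [
        sim(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    intra = mean(intra_pairs) if intra_pairs else 0.0
    inter = mean(inter_pairs) if inter_pairs else 0.0
    return intra, inter, intra - inter


# Hypothetical pairwise functional similarities (not data from the thesis).
TOY_SIM = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g3", "g4")): 0.8,
    frozenset(("g1", "g3")): 0.2,
    frozenset(("g1", "g4")): 0.1,
    frozenset(("g2", "g3")): 0.3,
    frozenset(("g2", "g4")): 0.2,
}


def toy_sim(a, b):
    """Look up the similarity of an unordered gene pair."""
    return TOY_SIM[frozenset((a, b))]


# Two candidate clusterings of the same four genes.
good = functional_score([["g1", "g2"], ["g3", "g4"]], toy_sim)
bad = functional_score([["g1", "g3"], ["g2", "g4"]], toy_sim)
print(good[2] > bad[2])  # the functionally coherent clustering scores higher
```

Under this assumed score, a clustering that groups functionally similar genes together ranks above one that splits them, which is the kind of comparison the evaluation algorithm is meant to support when choosing among clustering algorithms and parameters.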
Taking yeast expression data as an example, the results show that the method can accurately evaluate the quality of clustering results. Under the guidance of this evaluation approach, higher-quality clustering results can be obtained.

Discriminant analysis of DNA microarray data is also important, and it must be solved before gene chips can be used in clinical diagnosis. China has a high incidence of liver cancer. Both microRNA chip datasets and gene chip datasets can be used to predict the metastasis of hepatocellular carcinoma (HCC). A microRNA regulates the expression of its target genes. Is there a regulatory relationship between metastasis-related microRNAs (feature microRNAs) and metastasis-related genes (feature genes) in HCC? Taking this problem as its starting point, Chapter VI studies the identification of metastasis-related microRNAs and genes and analyzes their relationship. A novel approach, called t-cross-weight, was developed. It calculates a weight for each gene through t-tests on repeated random samples. The advantage of t-cross-weight is that the set of feature microRNAs or feature genes can be gradually broadened in order of weight and combined with support vector machines (SVMs) using different kernel functions, so that appropriate sets of feature microRNAs and feature genes can be identified under the guidance of the k-fold cross-validation trend. In this way, 100 microRNAs and 710 genes were identified. Using the expression of the 100 feature microRNAs and SVMs with a polynomial kernel, the accuracy of predicting HCC metastasis is greater than 83.99%; using the expression of the 710 feature genes and SVMs with a linear kernel, the accuracy is over 96.76%, which indicates good prediction accuracy.
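The weighting step of t-cross-weight can be sketched in a few lines of Python. The abstract only states that weights come from t-tests on repeated random samples, so everything specific below is an assumption: the use of Welch's t statistic, the rule that a feature's weight is the fraction of rounds in which it ranks among the `top_k` features by |t|, the names `welch_t` and `resampled_t_weights`, and the toy expression values.

```python
import random
from statistics import mean, variance


def welch_t(xs, ys):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    se2 = variance(xs) / len(xs) + variance(ys) / len(ys)
    if se2 == 0:
        return 0.0
    return (mean(xs) - mean(ys)) / se2 ** 0.5


def resampled_t_weights(expr, labels, rounds=200, frac=0.8, top_k=2, seed=0):
    """Weight each feature by how often it ranks in the top_k by |t|.

    expr: dict feature -> list of expression values, one per sample.
    labels: list of 0/1 class labels aligned with the value lists.
    Each class must contain at least 2 samples.
    The weighting rule is an illustrative assumption, not the thesis method.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = max(2, int(frac * len(pos)))
    n_neg = max(2, int(frac * len(neg)))
    hits = {f: 0 for f in expr}
    for _ in range(rounds):
        # Draw a random subsample of each class without replacement.
        p = rng.sample(pos, n_pos)
        n = rng.sample(neg, n_neg)
        scores = {
            f: abs(welch_t([v[i] for i in p], [v[i] for i in n]))
            for f, v in expr.items()
        }
        # Credit the top_k features of this round.
        for f in sorted(scores, key=scores.get, reverse=True)[:top_k]:
            hits[f] += 1
    return {f: hits[f] / rounds for f in expr}


# Hypothetical data: 5 metastatic (1) and 5 non-metastatic (0) samples.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
expr = {
    "geneA": [5.1, 5.3, 4.9, 5.2, 5.0, 1.0, 1.2, 0.9, 1.1, 1.0],
    "geneB": [2.0, 2.1, 1.9, 2.2, 2.0, 2.1, 2.0, 1.9, 2.0, 2.2],
    "geneC": [3.0, 3.4, 2.9, 3.3, 3.1, 2.0, 2.3, 1.9, 2.2, 2.1],
    "geneD": [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 1.0, 0.9, 1.2, 1.0],
}
weights = resampled_t_weights(expr, labels)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked[:2])  # the two differentially expressed genes lead the ranking
```

Features sorted by such weights could then be added to the classifier a few at a time, with cross-validated SVM accuracy tracked for each kernel, which is the selection loop the abstract describes. That loop is omitted here because it would need an SVM library.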
Further analysis of these feature microRNAs and genes found a regulatory relationship between them, which suggests that HCC metastasis may be associated with some feature microRNAs regulating some feature genes. Enrichment analysis of the feature genes with DAVID, an online tool, shows that they are enriched in the cell cycle pathway (p=0.0006), indicating that the cell cycle pathway may be closely related to HCC metastasis.

The main innovations of this dissertation are as follows:
1. A new algorithm was developed to measure the semantic similarity of gene annotations. It expresses gene functional similarity as a numerical value, overcoming the vagueness of earlier gene function comparisons. Similarities among a large number of genes can be obtained easily with the algorithm, which makes it more efficient and accurate than manual comparison.
2. A novel functional evaluation algorithm for clustering was developed. It assesses clustering results from the perspective of gene functional similarity, overcoming the earlier drawback that clustering quality was evaluated only from the mathematical characteristics of the data. Higher-quality results can therefore be obtained.
3. A new method to identify feature genes was proposed. It converts the results of repeated t-tests into weight values, and feature genes are identified according to these weights together with SVMs using different kernel functions. This overcomes the shortcoming that feature genes and kernel functions were previously selected by random trials.
4. A regulatory relationship was found between feature microRNAs and feature genes in HCC metastasis.

The research on the functional evaluation algorithm for clustering and on the identification of metastasis-related genes and microRNAs in HCC has important academic and practical value.
First, the functional evaluation algorithm for clustering yields higher-quality clustering results, which divide genes into more accurate functional groups. Second, the feature microRNAs and genes chosen by t-cross-weight can improve the accuracy of predicting HCC metastasis. Finally, the microRNAs-Genes association network offers a new idea for research on the mechanism of HCC metastasis. In addition, the gene annotation semantic similarity algorithm and t-cross-weight can be applied to other, similar gene function comparison and discriminant analysis problems, respectively.
【Key words】 Gene Chip; Cluster Analysis; Discriminant Analysis; Hepatocellular Carcinoma; Metastasis; t-Cross-Weight;