Research on Classification of Gene Expression Data Based on Laplace Spectra Theory (基于Laplace谱的基因表达谱数据分类研究)

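【Author】 Zhuang Zhenhua (庄振华)

【Advisors】 Wang Nian (王年); Liang Dong (梁栋)

【Author information】 Anhui University, Signal and Information Processing, 2010, Master's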

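【Abstract, translated from the Chinese】 Classification of gene expression profile data analyzes the expression data obtained from DNA microarray experiments in order to uncover differences in gene expression between samples and to find the underlying links between genes and tissue lesions. Although pattern recognition algorithms have advanced considerably in recent years, many problems remain open in classifying gene expression profile data. Because of the way they are acquired, such data are high-dimensional with few samples. Traditional machine learning methods do not achieve good classification results on data of this kind, and their very high computational complexity greatly reduces the efficiency of data analysis. This thesis studies the classification of gene expression profile data on the basis of spectral graph theory: it introduces feature representations that reflect graph structure into the classification task, studies feature extraction for gene expression profile data and classification methods based on spectral graph theory, and analyzes the performance of the algorithms. The main research contents are:

1. Gene expression profile data contain a large amount of biological information, and how effectively feature genes are selected strongly affects both the accuracy and the real-time performance of an algorithm. This thesis proposes a feature extraction method for cancer gene expression data that uses an entropy measure as the criterion. The gene expression data are first screened and the entropy of each gene is computed; the genes with the largest entropy are then taken as feature genes and classified with a support vector machine. Leave-one-out and grouped experiments on prostate cancer gene expression data both confirm the effectiveness of the method.

2. An algorithm based on the Laplacian spectrum is applied to the classification of cancer gene expression profile data. For each class, the method first selects the samples with the smallest Euclidean distance to the class center and builds a complete Laplacian graph on them with Gaussian weights; this graph is taken as the standard graph representing the class. The test sample then replaces each vertex of the standard graph in turn, each resulting graph is matched against the standard graph by feature-point matching, and the total number of matched points is computed. Finally, the test sample is assigned to the class with the largest total number of matched points.

3. A clustering algorithm for cancer gene expression profile data based on the Fiedler vector of a graph is proposed. All samples, drawn from the different classes, are used to build a complete Laplacian graph with Gaussian weights; the Fiedler vector is obtained by SVD, and the samples are classified according to the sign of their corresponding Fiedler vector components.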
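The entropy-based gene selection in item 1 can be illustrated with a minimal Python sketch. The abstract does not say how the entropy of a gene is computed, so this sketch assumes that each gene's expression values are discretized into equal-width histogram bins and scored by Shannon entropy. The names `gene_entropy` and `top_entropy_genes` and the parameter `n_bins` are hypothetical, and the screening and support vector machine steps are omitted.

```python
import numpy as np


def gene_entropy(values, n_bins=10):
    """Shannon entropy (in bits) of one gene's expression values.

    Assumption: values are discretized into n_bins equal-width bins.
    """
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()  # drop empty bins, normalize
    return float(-(p * np.log2(p)).sum())


def top_entropy_genes(X, k, n_bins=10):
    """Return the indices of the k highest-entropy genes and all entropies.

    X is a (samples x genes) matrix of expression values.
    """
    X = np.asarray(X, dtype=float)
    ent = np.array([gene_entropy(X[:, j], n_bins) for j in range(X.shape[1])])
    order = np.argsort(ent)[::-1][:k]  # indices sorted by descending entropy
    return order, ent
```

The selected column indices would then be used to slice the expression matrix before training a classifier. The bin count is a free choice here and can change the ranking, so it should be treated as a tuning parameter rather than a fixed part of the method.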

【Abstract】 Classification of gene expression data is an important way to uncover expression differences between samples and the relationship between genes and tissue lesions. Although pattern recognition algorithms have developed significantly in recent years, many problems remain to be solved in the classification of gene expression data. Because gene expression data are high-dimensional with few samples, traditional machine learning methods cannot achieve the desired results, and their high computational complexity greatly reduces the efficiency of data analysis. This dissertation introduces spectral graph theory into the classification of gene expression data, uses it to extract features from the data, and proposes several classification algorithms. The main research contents and achievements are as follows:

1. DNA microarray technology has had a far-reaching impact on the biomedical field, and classification methods are an important tool for analyzing tumor gene expression data. This dissertation proposes an algorithm that selects informative genes from tumor gene expression data using entropy as the indicator. The tumor gene expression data are first screened and the entropy of each gene is calculated. The genes with the highest entropy are then selected and classified with an SVM. The effectiveness of the algorithm is shown by leave-one-out and grouped experiments.

2. We introduce a novel classification algorithm for gene expression data based on the Laplacian spectra of graphs. First, the class center is obtained by averaging each class in the training set, and for each class the Laplacian matrix of a complete graph, called the standard graph, is constructed from the samples with the smallest Euclidean distance to the class center. Then the vertices of the standard graph are replaced in turn by the test sample, and the total number of matched points between the resulting graphs and the standard graph is calculated. Finally, the test sample is assigned to the class with the largest total number of matched points.

3. This dissertation proposes an algorithm for classification of gene expression data based on the Fiedler vector. First, the Laplacian matrix of a complete graph is constructed on all samples from the different classes. Then the Fiedler vector is obtained from the singular value decomposition of this Laplacian matrix. Finally, the samples are divided into two classes according to the signs of the Fiedler vector components.
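The Fiedler-vector step in item 3 can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: the Gaussian weight is assumed to be exp(-d²/(2σ²)) with a hand-picked `sigma`, the function names are hypothetical, and the toy samples are made up for the example. Because the graph Laplacian is symmetric positive semidefinite, its SVD yields its eigenvectors up to sign, so the Fiedler vector is read off as the singular vector belonging to the second-smallest singular value.

```python
import numpy as np


def gaussian_laplacian(X, sigma=1.0):
    """Laplacian L = D - W of the complete graph on the rows of X.

    Assumption: edge weights are exp(-||xi - xj||^2 / (2 * sigma^2)).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)       # pairwise squared distances
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                   # no self-loops
    return np.diag(W.sum(axis=1)) - W


def fiedler_split(X, sigma=1.0):
    """Split samples into two groups by the sign of the Fiedler vector."""
    L = gaussian_laplacian(X, sigma)
    U, s, _ = np.linalg.svd(L)                 # singular values in descending order
    fiedler = U[:, -2]                         # second-smallest singular value
    labels = (fiedler >= 0).astype(int)
    return labels, fiedler


# Toy data (hypothetical): two small, well-separated groups of samples.
samples = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                    [3.0, 3.0], [3.2, 3.0], [3.0, 3.2]])
labels, fiedler = fiedler_split(samples, sigma=1.0)
print(labels)
```

Which group receives label 0 or 1 is arbitrary, since the sign of a singular vector is not fixed. The pairwise-difference array grows with the square of the sample count times the number of genes, which is acceptable for the small sample sizes typical of microarray data but would need a different distance computation at scale.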

  • 【Online publication contributor】 Anhui University
  • 【Online publication year and issue】 2010, Issue 12