节点文献

生物子序列频数分布与肿瘤亚型分类模型研究

Frequency Distribution of Biological Subsequence and Classification Model for Tumor Subtype

【作者】 王树林

【导师】 陈火旺; 王戟;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2007, 博士

【摘要】 生物信息的爆炸式增长吸引了大量科研人员加入到生物信息学研究领域,使得生物信息学很快成为全球关注与研究的焦点。我们主要研究了生物信息学中的两个基本问题:(1)关于k-长DNA子序列在基因组全序列中出现频数的分布问题;(2)关于基于基因表达谱的肿瘤分子诊断问题。对于这两个问题的研究,都取得了非常好的实验结果,具有理论和现实意义,有助于生物信息学的发展。针对问题一,分别从DNA序列的可视化表示、k-长DNA子序列出现频数分布及其计数算法三个方面展开研究。针对问题二,分别从肿瘤特征抽取和信息基因选择两个方面研究了肿瘤亚型分类模型。DNA序列可视化表示对于研究其结构与功能具有至关重要的意义,它有助于重复子序列的识别、内含子与外显子的区分以及DNA序列进化等方面的研究。我们首先综述性研究了几种DNA序列的可视化表示方法,比较了生成DNA序列分形图像的Hao方法与经典的混沌游戏表示方法的异同点,讨论了禁止子序列中回文子序列情况,阐述了迭代函数系统产生分形吸引子的数学机理,详细介绍了根据Moore自动机与迭代函数系统定义的混沌自动机,并研究了以DNA序列驱动混沌自动机产生分形图像的方法,提出DNA序列三联密码子的分形图像表示方法,并对其进行了初步分析研究,提出进一步需要解决的问题。我们在生成DNA序列分形图像的Hao方法的基础上进一步提出一种能够直观显示k-长DNA子序列频数分布差异性的三维频数分布图生成方法,其优点是能够更加直观地观察k-长DNA子序列频数分布。然后把三维频数分布图转化为我们提出的一维对数频谱图,突出显示了频数分布的局部特征,并以一维对数频谱图为依据提出k-长DNA子序列频数区划分准则,详细研究了甚高频数区的n阶零间隔现象,发现并论证了,n阶零间隔分布就是基因组进化过程所留痕迹的有力证据,并给出一维对数频谱图特征的生物学解释。实验发现许多DNA序列频数概率分布近似服从非中心F分布,这个新发现有一定的普适性;对于分布呈多峰现象的DNA序列,可采用多个非中心F分布的叠加来拟合。在比较了非中心F分布与Gamma分布后,提出一种结合二者在拟合方面具有互补优势的新分布,实验证明这种新分布能够更好地吻合实际DNA序列的频数分布。然后研究了两种最特异出现频数(最高出现频数与出现频数为1的k-长DNA子序列个数)与k值的关系,发现不同物种的这两种关系具有良好的一致性,比如发现k-长DNA子序列最高出现频数与k值的关系与指数概率分布函数只相差一个常数因子。最后探讨了DNA序列的进化模型。因为现实世界中的基因组规模非常大,所以对k-长DNA子序列的出现频数进行计数并不是一件容易的事。我们提出并研究了k-长DNA子序列在DNA全序列中出现频数的计数问题,设计并实现了k-长DNA子序列内部计数算法和外部计数算法。该算法通过一个哈希函数把k-长DNA子序列映射为整数关键字从而把k-长DNA子序列出现频数的计数问题转化为整数关键字的重复计数问题,使得能够利用经典B树算法来解决频数计数问题,并针对待解问题的特点提出三种改进措施以进一步提高算法的性能。基于基因表达谱的肿瘤亚型分类方法有望成为临床医学上一种快速有效的肿瘤分子诊断方法,但由于目前肿瘤基因表达谱样本集存在维数过高、样本量很小以及噪音很大等特点,使得选择肿瘤信息基因或从基因表达谱中抽取肿瘤分类特征成为一件有挑战性的工作。国内外专家学者对肿瘤分类问题已开展了广泛深入的研究。我们在总结肿瘤分类研究成果的基础上概括出基于基因表达谱的肿瘤分类过程模型,阐述了分类过程模型的关键环节及其常用方法,提出肿瘤分类过程模型的分类方法,并过程模型比较了前人的研究成果,指出目前肿瘤分类研究中存在的问题。针对肿瘤特征抽取问题,设计了六种方法以获得肿瘤分类特征,分别是:1)主成份分析方法PCA,2)因子分析方法FA,3)独立分量分析方法ICA,4)小波包分解方法WPD,5)基于离散余弦变换(DCT)的PCA方法,6)基于离散Fourier变换(DFT)的PCA方法。实验采用两种肿瘤样本集(结肠癌和急性白血病样本集)验证了这六种方法的有效性。实验结果表明,所提出的方法不仅分类性能好而且各有其特点,都能在保持较高的分类准确率前提下大幅地降低基因表达谱数据维数。在分类性能方面,基于DCT变换的PCA方法是一个比较理想的数据降维方法,对于结肠癌组织样本,交叉验证识别准确率高达96.77%,而对于急性白血病组织样本,其准确率高达100%。因子分析方法和独立分量分析方法有助于分析样本集的结构特征,实验发现只需少量的因子或独立分量就可以获得很高的分类性能,由此推测,只需3~4个肿瘤信息基因就可以获得很高的分类性能的假设,为设计优秀的肿瘤信息基因选择算法提供了先验知识。尽管采用肿瘤特征抽取方法获得了好的实验结果,但是肿瘤信息基因选择仍是必不可少的工作。从基因表达谱的成千上万个基因中选择尽可能多的、分类能力尽可能强而基因数量却尽可能少的信息基因子集是一个挑战性工作。在没有先验知识的情况下,在如此大的基因空间中进行穷尽搜索是不可能的事情。为此我们提出了两类近似算法来解决肿瘤信息基因的选择问题。一类是采用经典粗糙集模型和邻域粗糙集模型的属性约简算法进行信息基因选择的方法。由于采用经典粗糙集模型的属性约简算法需要对数据进行离散化处理而导致信息损失,致使选出的肿瘤信息基因分类性能不高。为避免这个问题,我们又以邻域粗糙集模型的属性约简算法FARNeM(forward attribute reduction based on neighborhood model)为基础,设计了十一种信息基因选择算法以解决肿瘤亚型分类问题。实验结果表明,该方法能够快速搜索到分类准确率更高的信息基因子集。为提高NEC(neighborhood classifier)分类器在样本不均衡时的分类性能,对NEC分类器进行改进提出了一种适合于样本不均衡数据集的加权邻域分类器;同时我们还把适合于多分类问题的特征选择算法Simba(iterative search margin based algorithm)引入到肿瘤分类领域中,以丰富肿瘤信息基因选择方法的多样性;为增加分类模型的可信度提出一种基于邻域粗糙集模型的概率神经网络集成方法对肿瘤样本集进行分类;为实用的肿瘤分子诊断软件研制奠定了基础。另一类是根据获得的肿瘤基因表达谱样本集的结构特征提出的以支持向量机分类器为评估准则的肿瘤信息基因启发式宽度优先搜索算法,其优点是能够同时搜索到基因数量尽可能少而分类能力尽可能强的多个肿瘤信息基因子集。实验采用了三种肿瘤样本集验证了这种分类算法的可行性和有效性。对于急性白血病组织样本集,只需2个信息基因就能获得100%的4-折交叉验证分类准确率(共发现14个这样的两基因子集);而对于难以分类的结肠癌组织样本集,只需4个信息基因就可获得100%的4-折交叉验证分类准确率(共发现7个这样的四基因子集);对于小圆蓝细胞肿瘤(Small Round Blue Cells Tumor,SRBCT)数据集,同样只需4个信息基因就能获得100%的4-折交叉验证分类准确率(共发现504个这样的四基因子集);实验结果与我们的预测假设十分吻合。与国内外其它优秀的肿瘤分类算法相比,我们的实验结果在综合分类性能方面超过目前所有已知的分类算法。为更加客观地评价肿瘤分类模型的分类性能,我们提出一种能够消除肿瘤样本集的不同划分对分类性能造成影响的一种称之为全折交叉验证的方法,实验证明这是一种更加客观反映分类性能的评估方法;同时针对多肿瘤亚型样本集提出一种推断肿瘤亚型相关信息基因的方法。

【Abstract】 The exponential growth of the cumulative biological data has attracted a number of scientists to be engaged on the study of bioinformatics which has become the focus of world’s attention. The two basic problems on bioinformatics are investigated in this dissertation. One is the occurrence frequency distribution of k-mers in whole chromosome, and another is the molecular diagnosis of tumor based on gene expression profiles. The satisfactory experiment results have been achieved from the study on the two problems, which has a profound implication theoretically and realistically and which is helpful to the development of bioinformatics. The visual and fractal representation of DNA sequences, the occurrence frequency distribution of k-mers in DNA sequences and the counting algorithm for k-mers in DNA sequences are deeply investigated to the first basic problem. The feature extraction and informative gene selection for tumor classification are studied on the second problem.The visual representation of DNA sequences plays a profound role in DNA structure and its function, which is especially helpful to the recognition of repeatitive subsequences, the recognition of intron and exon, and the evolution of DNA sequences, etc. Firstly, the two methods including Hao’s method and the classic chaos game representation method that can generate the fractal image of DNA sequences are introduced, and further the comparison and contrast between the two methods are provided in details. On the basis of these work, the palindromes are discussed in forbidden k-mers. Secondly, after the mathematical principle of the iterated function system generating fractal attractor is introduced, the chaos automata is defined according to the Moore finite state machine together with the iterated function system; and the method of chaos automata generating fractal image driven by DNA sequences is investigated deeply and further. At last, the method of generating fractal image driven by codon sequences in DNA sequences is proposed and investigated, and further many unsolved problem are presented.Furthermore, on the basis of Hao’s method which allows the depiction of the occurrence frequency of k-mers in the form of fractal image, a novel method that can generate 3D occurrence frequency distribution map (OFDM) of k-mers in genome is proposed, and its advantage is that the difference in occurrence frequency of k-mers is obviously exhibited for biologists. Then the partition criterion for occurrence frequency segment is proposed according to the 1D logarithm histogram into which is transformed from 3D OFDM. The 1D histogram can show the local feature of the occurrence frequency distribution (OFD) of k-mers, i.e. the OFD of k-mers in ultrahigh frequency segment appears discontinuous in integer. The palindromes in forbidden k-mers are further studied in forbidden segment. The phenomena of n-order zero interval (n-OZI) in ultrahigh frequency segment is deeply investigated. Moreover, it is proposed that the distribution of n-OZI is the mark in the process of genome evolving and many features of the logarithm histogram of occurrence frequency are successfully explained from the view of biology. On the basis of many experiments, it is discovered and validated that the OFD of k-mers is subjected to non-central F distribution. Adopting several non-central F distributions can fit the density distribution of the occurrence frequency of k-mers in genome which has the same number of peaks. On the basis of experiments, the comparison between non-central F distribution and Gamma distribution which was proposed to fit genome distribution by Hsieh and Luo is studied through experiments. Due to the complement of the two distributions in fitting the density distribution of genome, a new probability distribution which combines non-central F distribution with Gamma distribution is presented, and experiments show that the new distribution is better than any single of the two distributions in fitting genome density distribution. After the relationship between the maximal frequency of k-mers in genome and the length of k-mers and the relationship between the number of different k-mers which occur only once in genome and the length of k-mers are deeply investigated, and it is discovered that the two relationships among many species are consistent each other, which are the evidences on the neutral evolution theory of genome.The problem that all k-mers in whole genome are counted simultaneously is researched, which is not an easy task because the size of the natural genome is too great. Therefore, the internal and external counting algorithms which count all k-mers in genome are designed and implemented. These algorithms firstly translate the problem of counting all k-mers into the problem of counting integer keys by the help of a hash function which maps a k-mer into an integer, so we can apply the classic B-tree algorithm to solve the problem of counting all k-mers in genome. At last, three measures are proposed to further improve the efficiency of the algorithm according to the features of DNA sequences.The tumor diagnosis method based on gene expression profiles will be developed into the fast and effective method in clinical domain in the near future. Although DNA microarray experiments provide us with huge amount of gene expression data, only a few of genes relate to tumor among the gene expression profiles. Moreover, it is a challenging task to extract feature or select informative genes related to tumor from gene expression profiles because of its characteristics such as the high dimensionality, the small sample set and many noises and redundancy in gene expression profiles. Therefore, the molecular diagnosis of tumor has been broadly and deeply investigated and a large number of relevant papers are published. At first we detailedly introduce the techniques and methods in tumor classification process model, and then the classification method for the tumor classification process model is proposed. At last the research results in tumor classification are summarized in the past several years, and the problem such as how to rationally assess the results of tumor classification and the further research on tumor classification are presented.After analyzing those tumor classification methods, we have designed six feature extraction approaches to extract the classification feature of tumor based on gene expression profiles. The six approaches are principal component analysis (PCA), factor analysis (FA), independent component analysis (ICA), wavelet package decomposition (WPD), PCA method based on discrete cosine transform (DCT) and PCA based on discrete Fourier transform (DFT) which are assessed by experiments on two well-known datasets which are the colon dataset and the leukemia dataset. The experiments with support vector machines (SVM) show that the six approaches not only can obtain good classification performance but also have their own characteristics. The proposed six approaches can extract a small quantity of components which are validated classification features related to tumor when retaining higher recognition rate; and the results of classification can be visualized in the form of 2D or 3D scatter plot. Among the six approaches, the PCA method based on DCT is the best method on reducing dimensionality and can achieve the highest classification accuracy; the cross-validation accuracy of 96.77% has been achieved for colon dataset using SVM classifier and 100% for leukemia dataset also. To FA and ICA, experiment results indicate that only several components can obtain higher classification performance. Therefore, we can deduce that only several genes (i.e. 3 or 4) in gene expression profiles can achieve higher classification accuracy, which is the basis of our further designing informative gene selection algorithm.The informative gene selection is need despite that the proposed feature extraction methods are effective in tumor classification. However, the accurate classification of tumor by selecting the tumor-related genes from thousands of genes is a difficulty task due to the large number of redundant genes, and usually it is impossible to apply an exhaustive algorithm to search informative gene subset in such large gene space. Therefore, two kinds of approximate algorithm which can search informative gene subsets are proposed and implemented.One is applying attribute reduction method based on classical rough set model and neighborhood rough set model to searching informative gene subsets. The classification accuracy rate of informative gene subsets obtained by using attribute reduction method based on classical rough set model is not high because there is information loss caused by discretization of gene expression profiles. To avoid this problem we design eleven gene selection methods based on the FARNeM (forward attribute reduction based on neighborhood model) algorithm to classify tumor sample set, which are proved effectively and quickly in searching informative gene subset. To improve the classification accuracy of the imbalanced tumor dataset, a weighted neighborhood classifier based on the neighborhood classifier is proposed, which is proved more effectively to tumor dataset with complex boundary between every two subtypes, and we introduce feature algorithm Simba (iterative search margin based algorithm) to the tumor classification domain based on gene selection, which also adapts to multi-class tumor dataset. The probability neural network ensemble based on the neighborhood rough set model is proposed to classify tumor dataset to obtain more reliable experiment results.Another method is the novel heuristic breadth-first search algorithm (HBSA) based on SVM classifier which can simultaneously find many informative gene subsets in which the number of informative genes is almost least but its classification accuracy is almost highest in spite of its characteristic of time-consuming. Experiments on three tumor sample sets show that the proposed approach is feasible and effective. The 4-fold cross-validation accuracy of 100% has been achieved by only two genes for leukemia dataset (14 2-gene subsets are found totally like this) and 100% by only four genes for colon dataset (7 4-gene subsets are found totally like this), which are superior to the results of other classification methods, and 100% by only four genes for SRBCT (Small Round Blue Cells Tumor) dataset (504 4-gene subsets are found totally like this). Experiment results are consistent with our prediction assumption. Compared with other tumor classification methods, our experiment results are obviously superior. To reflect the true classification accuracy rate of classifier, we proposed a full-fold cross-validated method which can eliminate the affect of the different partition for tumor sample set and can more objectively evaluate classification model.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络