Biological Classification Based on K-mer Frequency Statistics

【Author】 Chen Xin (陈鑫)

【Advisor】 Liang Yanchun (梁艳春)

【Author information】 Jilin University, Computer Science and Technology, 2011, Master's thesis

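【Abstract (translated from the Chinese)】 Biological species classification has a history of several hundred years. Over that time it accumulated quite detailed classification methods and gave rise to the discipline of morphological taxonomy. Yet the number of species that remain undiscovered or unclassified is still enormous, and traditional morphological taxonomy has hit a bottleneck when faced with work this laborious. As sequencing technology has developed, the cost of DNA sequencing has begun to fall, and biologists have come to recognize that the genome sequence is the carrier of an organism's most essential characteristics, so sequence content should be brought into classification work. At present, bioinformaticians typically classify species by selecting a fragment of the whole genome with suitable properties to represent the species, then comparing species on that fragment. This technique has produced satisfactory results, but it has limitations and shortcomings, and because different researchers select different fragments, it is difficult to unify classification standards. This thesis tries a different approach to building a standard system that unifies organisms' own sequence features. The approach rests on one premise: the frequencies of short k-mer fragments in genomic sequences are fairly stable over the course of evolution. Given this stability, we try to describe an organism using most of its genome sequence rather than a small part. Counting k-mer frequencies over these sequences yields a feature vector representing the species, and this feature vector is used to classify and identify species, so that all species can be classified under one uniform standard. We tried the method on bacteria and viruses and obtained some positive results: classification above the genus level was very accurate, and results at the subspecies or variant level also reached a certain accuracy.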
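
The feature vector described in the abstract can be illustrated with a minimal Python sketch. It counts overlapping k-mers in one sequence and normalizes the counts to frequencies in a fixed lexicographic order. The function name `kmer_frequency_vector`, the choice k = 2, and the toy sequence are illustrative assumptions; the abstract does not state the thesis's actual parameters or implementation.

```python
from itertools import product


def kmer_frequency_vector(sequence, k=2, alphabet="ACGT"):
    """Return k-mer frequencies in fixed lexicographic order (hypothetical helper)."""
    # Enumerate all 4**k possible k-mers so every species gets the same vector layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    # Slide a window of length k one base at a time (overlapping k-mers).
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip k-mers containing ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values())
    # Normalize so that vectors from genomes of different lengths are comparable.
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy input, not real genomic data.
vector = kmer_frequency_vector("ACGTACGT", k=2)
```

Using a fixed ordering of all possible k-mers, rather than only those observed, keeps the vectors of different species aligned position by position, which is what makes a later distance comparison meaningful.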

【Abstract】 Over the past two hundred years, taxonomists have established a classification system based on anatomical features, together with very detailed classification methods. Using this system, researchers have classified a million species, but more than ten million unknown species have not yet been accurately classified. Spending large amounts of time observing the anatomical details of each species is unrealistic; biologists need a more efficient and more convenient way to complete the classification.

The rapid development of genome sequencing technology offers biologists hope of solving this problem. It is now recognized that the characteristic information of living things is contained in their gene sequences, so how to parse these sequences and apply their features to classification has become a new research hotspot. In deciphering a ciphertext, the frequency of single words is often a key to the decipherment. Similarly, we conjecture that a short fragment composed of several bases in a gene sequence can be viewed as a word, and we study whether the frequencies of such short fragments represent the characteristic properties of a species.

The main work of this thesis is to introduce a method for classifying species based on k-mer frequency statistics, along with its verification. The method applies to any species that can be sequenced, classifies quickly, and places all species under a uniform standard. Its main idea is to divide a large-scale genome sequence into equal-length windows and count the frequency of each k-mer fragment in each window. Because the frequencies are quite conserved across most windows, we use these conserved values as the frequency characteristics for species classification. Once the representative statistic has been computed for every fragment, we arrange the values in a fixed order to generate a standard feature vector representing each species. By studying the distances between these feature vectors, we can classify the species.

We use two groups of experimental data, bacteria and viruses, to verify the validity of the new method, and compare the results with existing classifications. In the bacterial group, we selected six bacteria from three groups belonging to different orders and different genera. After de-noising, pattern generation, analysis of the vector distances between species, and classification, the results are very similar to the current biological classification, indicating that the method classifies very well at the order level. We then tested viruses, choosing 8 viral sequences belonging to 4 different kinds. Because of the characteristics of viral sequences, we omitted the de-noising step. The final classification is consistent with the known result, and this was verified on a larger data set. We believe the method is quite accurate for classification at the virus species level. We then attempted classification at a lower level than virus species. Classification of a small data set is accurate, but the distances between species become smaller and influence the classification results. We verified this effect on a larger data set; the results indicate that the method's ability to classify at the sub-species level is acceptable.

Because the method based on k-mer frequency statistics has few restrictions on its application, classifies very quickly, and is accurate above the genus level, it may in the future become a standard method for studying a new species at the beginning of detailed research.
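
The windowed procedure in the abstract (equal-length windows, per-window k-mer frequencies, one conserved value per k-mer, then distances between feature vectors) might be sketched as follows. This is a sketch under stated assumptions, not the thesis's implementation: the abstract does not say how the conserved value is chosen or which distance is used, so the median across windows and the Euclidean distance are stand-ins, and the function names, k = 2, the window size of 8, and the toy sequences are all hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import median


def window_kmer_frequencies(window, kmers, k):
    """Frequencies of each k-mer inside one window, in the order given by `kmers`."""
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(window) - k + 1):
        word = window[i:i + k]
        if word in counts:  # ignore k-mers with ambiguous bases
            counts[word] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def species_feature_vector(genome, k=2, window_size=8):
    """Summarize a genome as one value per k-mer (hypothetical helper)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    genome = genome.upper()
    # Non-overlapping, equal-length windows; a trailing partial window is dropped.
    windows = [genome[s:s + window_size]
               for s in range(0, len(genome) - window_size + 1, window_size)]
    if not windows:
        raise ValueError("genome is shorter than one window")
    per_window = [window_kmer_frequencies(w, kmers, k) for w in windows]
    # Assumption: the median across windows stands in for the "conserved" value.
    return [median(column) for column in zip(*per_window)]


def euclidean_distance(u, v):
    """Assumed distance measure between two feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy sequences, not real genomes: a and b differ by one base, c is unrelated.
vec_a = species_feature_vector("ACGTACGTACGTACGT")
vec_b = species_feature_vector("ACGTACGTACGAACGT")
vec_c = species_feature_vector("TTTTGGGGTTTTGGGG")
d_ab = euclidean_distance(vec_a, vec_b)
d_ac = euclidean_distance(vec_a, vec_c)
```

In this toy case the two near-identical sequences end up closer to each other than to the unrelated one, which is the property a distance-based grouping relies on. Real genomes would need much larger windows and a de-noising step of the kind the abstract mentions for bacteria.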

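  • 【Online publication contributor】 Jilin University
  • 【Online publication year and issue】 2011, Issue 10
  • 【Classification code】 Q19
  • 【Citation count】 1
  • 【Download count】 527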