Research and Application of Gene Sequence Alignment Algorithms in SNP Analysis

【Author】 Kang Xiaojun (康晓军)

【Advisor】 He Liyuan (贺立源)

【Author Information】 Huazhong Agricultural University, Resources and Environmental Information Engineering, 2011, Ph.D.

【Abstract】 In recent years, life science research has advanced rapidly. With the Human Genome Project (HGP) essentially complete and modern biotechnology developing quickly, the large volume of biological information now available provides a solid data foundation for uncovering the mysteries of life. In the post-genome era, the focus of life science research has shifted from acquiring biological information to studying genome function and the laws governing its variation, which creates an urgent need to process massive data sets. At the same time, revolutionary advances in computer and network technology provide strong support for processing such data, and bioinformatics has developed rapidly on this basis. It is expected to contribute greatly to deciphering the genetic code, understanding the genetic basis of disease, determining gene function, and predicting structure and function.

A single-nucleotide polymorphism (SNP) is a difference between DNA sequences caused by nucleotide variation in the genome during the evolution of a species, mainly including base deletions, insertions, transitions, and transversions. The genetic information carried at these variant sites is one of the important factors leading to some hereditary diseases and tumors, so gene mutations and SNPs play an extremely important role in biology, bioinformatics, and biomedical research.

Biological information data takes the form of gene sequence data, and comparing sequences reveals information about function and structure. Pairwise and multiple sequence alignment are among the current research hotspots in bioinformatics. Gene sequences are also commonly analyzed with clustering or classification algorithms. This thesis studies the analysis of SNPs in gene expression data based on sequence alignment algorithms. The main work and contributions are summarized as follows:

1) The thesis first introduces the basic concepts of bioinformatics and their significance, and surveys the current state of research in China and abroad.
2) The clustering algorithms commonly applied to gene expression data are studied in some detail, and a preliminary analysis is carried out through experiments.
3) The current state of research on gene sequence alignment algorithms is reviewed and analyzed, providing the basis for the alignment algorithm used in this thesis.
4) Based on the study of sequence alignment algorithms, the thesis proposes an experimental scheme for finding SNPs in massive gene sequence data. By improving the classic BLAST algorithm, the algorithm is parallelized and implemented on both a PC platform and a high-performance cluster, and it is analyzed and tested in detail with experimental data. The experiments indicate that the scheme achieves fairly satisfactory results in terms of both time complexity and result quality.
5) Building on the proposed scheme and algorithm, a sequence analysis system is designed and implemented for the Windows operating system and a cluster platform. Its main functions include import and export of gene sequence data, SNP analysis, sequence alignment, parameter setting, output of result data, and color-coded viewing of results.
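
The abstract describes SNPs as differences between aligned DNA sequences: base deletions, insertions, transitions, and transversions. The following is only a minimal sketch of that classification, not the thesis's improved BLAST method. It assumes the two sequences have already been aligned by some other tool, that gaps are written as '-', and that the names `classify_site` and `find_variants` are hypothetical helpers introduced here for illustration.

```python
from typing import List, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_site(ref_base: str, query_base: str) -> str:
    """Classify one aligned column; '-' marks a gap (assumed convention)."""
    if ref_base == query_base:
        return "match"
    if ref_base == "-":
        return "insertion"   # base present only in the query
    if query_base == "-":
        return "deletion"    # base present only in the reference
    if ref_base not in BASES or query_base not in BASES:
        return "ambiguous"   # e.g. 'N'; not classified in this sketch
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = (ref_base in PURINES) == (query_base in PURINES)
    return "transition" if same_class else "transversion"


def find_variants(ref_aln: str, query_aln: str) -> List[Tuple[int, str, str, str]]:
    """Return (reference position, ref base, query base, kind) per variant.

    Positions are 1-based in the ungapped reference; an insertion is
    reported at the position of the preceding reference base.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r != "-":
            ref_pos += 1
        kind = classify_site(r, q)
        if kind != "match":
            variants.append((ref_pos, r, q, kind))
    return variants


if __name__ == "__main__":
    for variant in find_variants("ACGT-ACGT", "ATGTTAC-T"):
        print(variant)
```

In a pipeline like the one the abstract outlines, the aligned pairs would come from the alignment stage, and a per-column scan of this kind would run on each reported alignment. How the thesis actually filters or scores candidate sites is not stated in the abstract.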
