
Genetic Data Information Analysis Methods and Their Applications

【Author】 Wu Ronghui (吴蓉晖)

【Advisor】 Li Renfa (李仁发)

【Author Information】 Hunan University, Computer Application Technology, 2012, Ph.D.

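【Summary】 In recent years, the successful completion of the Human Genome Project has produced massive amounts of biomolecular sequence data. Analyzing, storing, processing, and studying these data has driven the integration of computer science, molecular biology, and mathematics, and the resulting disciplines of bioinformatics and computational biology have gradually become an active research area in the natural sciences. Current bioinformatics research mainly targets data at two levels, nucleic acids and proteins, and covers the analysis of their sequences, structures, and functions. Specific research areas include sequence alignment, sequence encoding, sequence representation, sequence comparison, feature selection, feature extraction, molecular evolution, similarity analysis, protein structure prediction, comparative genomics, computer-aided identification of (protein-coding) genes, protein and RNA structure prediction, and computer-aided drug design. All data mining work on biological information ultimately aims at a fundamental understanding of the pathogenic mechanisms of human disease, so that diseases can be prevented and treated effectively, especially complex diseases with high mortality. Starting the analysis from the DNA sequence, the basic human genetic material, is an effective route to understanding complex human diseases. With complex human diseases as its target and a gene expression profile analysis platform as its tool, this thesis studies computational methods for analyzing gene expression profile data, identifying gene sequences, and detecting mutations in important feature genes. The aim is to supply more useful DNA sequence information for research on complex diseases and to lay a solid foundation for systems biology in that research. The thesis first reviews the state of research in this field in China and abroad and analyzes existing gene selection methods. It then describes how current sequence feature representation methods are classified and compares the strengths and weaknesses of the existing methods in each class. On that basis, it presents the proposed feature gene selection method, the graphical representation of DNA sequences, and the representation based on structural composition. The main research results are as follows. (1) A new hybrid gene selection method for gene microarray data is proposed. A filter method first categorizes the microarray data according to the genes' differential expression, and the high-scoring key genes are selected. An ant colony algorithm then clusters the gene expression data, and a support vector machine is used to verify how well the candidate genes classify the samples. Experiments confirm the effectiveness of the method on this class of problems. (2) Based on different classifications of the nucleotide bases, a DC-R curve and a DC-Y curve are constructed in the Cartesian coordinate system, yielding a new two-dimensional graphical representation of gene feature sequences, the DC-curve. Several properties of the curve are described, such as the absence of loops, non-degeneracy, and a one-to-one correspondence with the DNA sequence. Methods for mutation analysis and similarity analysis of gene sequences are given on the basis of this curve. (3) Based on the composition of the gene sequence and the position information of the nucleotides, a new DNA encoding representation based on the structural composition of DNA is proposed. In addition, building on sequence representations that use information-theoretic statistical features, an improved representation that combines information theory with sequence statistical features is proposed. Finally, methods for sequence similarity analysis and phylogenetic tree construction are given for each of these two representations.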
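The filter stage of the hybrid gene selection method can be illustrated with a minimal sketch. The record does not say which score the thesis uses, so the signal-to-noise score below is an assumption chosen only for illustration, and the names `signal_to_noise`, `rank_genes`, and the toy data are hypothetical. The ant colony clustering and SVM validation stages are not shown.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Score one gene: |difference of class means| / (sum of class standard deviations).

    Assumed score for illustration; each class needs at least two samples
    because statistics.stdev requires two or more data points.
    """
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / spread


def rank_genes(expression, labels, top_k):
    """Return the names of the top_k genes, highest score first.

    expression: dict mapping gene name -> list of expression values, one per sample
    labels:     list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = signal_to_noise(class0, class1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes measured on six samples (three per class).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],  # strongly differential
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # essentially no class difference
    "gene_c": [3.0, 3.5, 2.5, 4.0, 4.5, 3.5],  # moderately differential
}
top = rank_genes(expression, labels, top_k=2)
print(top)
```

In a pipeline of the kind the summary describes, the genes kept by such a filter would then be passed to the clustering and classifier-based validation stages.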

【Abstract】 In recent years, with the successful completion of the Human Genome Project (HGP), massive amounts of biomolecular sequence data have been produced. The need to analyze, store, process, and study these data has promoted collaboration among computer science, molecular biology, and mathematics, and has given rise to new disciplines such as bioinformatics and computational biology, which have gradually developed into an active research area in the natural sciences. Bioinformatics research currently focuses on data at two levels, nucleic acids and proteins, including the analysis of their sequences, structures, and functions. Specific areas include sequence alignment, sequence encoding, sequence representation, sequence comparison, feature selection, feature extraction, molecular evolution, similarity analysis, protein structure prediction, comparative genomics, computer-aided identification of (protein-coding) genes, protein and RNA structure prediction, and computer-aided drug design. The ultimate goal of data mining in bioinformatics is a fundamental understanding of the pathogenic mechanisms of human disease, so that diseases can be prevented and treated effectively, especially complex diseases with high mortality. Starting the analysis from the DNA sequence, the basic human genetic material, is an effective way to understand complex human diseases. With complex human diseases as its target, this thesis uses a gene expression profile analysis platform as a tool and studies computational methods for analyzing gene expression profile data, identifying gene sequences, and detecting mutations in important feature genes. This provides more useful information at the DNA sequence level and lays a solid foundation for the study of complex diseases in systems biology. The thesis starts with a detailed review of the literature and a discussion of existing gene selection methods. 
We then describe how current sequence feature representation methods are classified and summarize the advantages and disadvantages of the existing methods in each class. On that basis, we present the proposed methods in detail: a feature gene selection method, a graphical representation of DNA sequences, and a representation based on structural composition. The major research contributions of this thesis are as follows. First, we propose a new hybrid gene selection method for gene microarray data. A filter method categorizes the expression data according to the genes' differential expression, and the high-scoring genes are selected. We then apply an ant colony algorithm to cluster the gene expression data and use a support vector machine (SVM) to evaluate how well the candidate genes classify the samples. Experiments validate the effectiveness of the method on this class of problems. Second, based on different classifications of the nucleotide bases, we construct a DC-R curve and a DC-Y curve in the Cartesian coordinate system and thereby propose a new two-dimensional graphical representation of gene feature sequences, the DC-curve. We describe several properties of this curve, such as the absence of loops, non-degeneracy, and a one-to-one correspondence with the DNA sequence. Based on the curve, we give methods for mutation analysis and similarity analysis of gene sequences. Third, based on the composition of the gene sequence and the position information of the nucleotides, we propose a new DNA encoding method based on the structural composition of DNA. We also propose an improved sequence representation that combines information theory with the statistical features of the sequence. Finally, we use each of these two representations to analyze sequence similarity and to construct phylogenetic trees.
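The idea of turning a base classification into a two-dimensional curve can be illustrated with a generic purine/pyrimidine walk. This is not the DC-curve: the abstract does not define the DC-R and DC-Y constructions, so the mapping below, the function names, and the example sequences are assumptions made only for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def purine_pyrimidine_walk(sequence):
    """Map a DNA string to points (i, y_i): y rises by 1 on a purine, falls by 1 on a pyrimidine.

    The x-coordinate is the position index, so it increases at every step
    and the resulting curve cannot cross itself or form a loop.
    """
    points = [(0, 0)]
    y = 0
    for i, base in enumerate(sequence.upper(), start=1):
        if base in PURINES:
            y += 1
        elif base in PYRIMIDINES:
            y -= 1
        else:
            raise ValueError(f"unexpected base {base!r} at position {i}")
        points.append((i, y))
    return points


def walk_distance(seq_a, seq_b):
    """Mean absolute gap between the y-coordinates of two equal-length walks."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    walk_a = purine_pyrimidine_walk(seq_a)
    walk_b = purine_pyrimidine_walk(seq_b)
    gaps = [abs(pa[1] - pb[1]) for pa, pb in zip(walk_a, walk_b)]
    return sum(gaps) / len(gaps)


# Hypothetical example sequences.
reference = "ATGGTGCACC"
transition = "ATGGTGCATC"    # C -> T at position 9 (pyrimidine to pyrimidine)
transversion = "ATGGAGCACC"  # T -> A at position 5 (pyrimidine to purine)

print(walk_distance(reference, transition))
print(walk_distance(reference, transversion))
```

A single walk of this kind gives identical curves for substitutions within one base class (the C to T change above), so it is degenerate. Combining curves built from more than one base classification is one way a representation could avoid that, which may be related to why the abstract pairs two curves, although the record itself does not state the reason.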

  • 【Online Publication Contributor】 Hunan University
  • 【Online Publication Year/Issue】 2014, Issue 09