A New Visual Representation for RNA Secondary Structure and Its Application (一种新的RNA二级结构可视化表示及其应用研究)

【Author】 Liang Cheng (梁成)

【Advisor】 Li Renfa (李仁发)

【Author information】 Hunan University, Computer Science and Technology, 2010, Master's thesis

【Abstract】 The comparison and analysis of biological sequences, which generally means DNA, RNA, or protein sequences, is one of the active research topics in bioinformatics. As research has progressed, RNA, which carries genetic information, has become a focus of study. Because RNA secondary structure is more conserved than the primary sequence and contains abundant information that can be used for classification and phylogenetic analysis, analyzing RNA secondary structure is of considerable significance. This thesis studies the similarity between RNA secondary structures and proposes two similarity-analysis methods, one based on a new visual representation and one based on Lempel-Ziv complexity, offering a new approach to the visual representation and analysis of biological sequences. The main work is as follows:

(1) We propose a new visual representation of RNA secondary structure, the CZ curve, and present two of its properties. Based on the CZ curve, we plot coordinate projection graphs of the points corresponding to RNA secondary structures; some similarity information and the base composition of the characteristic sequences can be read directly from these graphs. We then apply the CZ curve to the similarity analysis of RNA secondary structures and report the comparison results. From the resulting similarity matrix, we build a phylogenetic tree for 11 real RNA secondary structures using an agglomerative hierarchical clustering algorithm. The experimental results show that the method not only analyzes the similarity of RNA secondary structures (including those with pseudoknots) effectively, but also groups different kinds of RNA secondary structures correctly. Moreover, the method needs only the geometric center of the characteristic curve corresponding to each characteristic sequence to compute the similarity matrix, so its computational cost is low.

(2) To address the problem that different RNA secondary structures may correspond to the same characteristic sequence, we propose a new representation of the characteristic sequence of an RNA secondary structure and give rules to follow during the conversion. We then use the Lempel-Ziv algorithm to analyze the similarity between the new characteristic sequences, taking two of the data sets used in Chapter 3 as test data. The results agree with the analyses in the related literature, indicating that the representation extracts the structural information of RNA secondary structures effectively and avoids mapping different secondary structures to the same characteristic sequence.
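As a rough illustration of the Lempel-Ziv step in item (2), the sketch below counts the components of the Lempel-Ziv (1976) exhaustive history of a string and turns those counts into a pairwise distance. The abstract does not say which Lempel-Ziv-based measure the thesis uses, so the distance formula here (a normalized measure commonly used for sequence comparison) is an assumption, as are the function names `lz_complexity` and `lz_distance` and the toy input strings, which are not real characteristic sequences.

```python
def lz_complexity(seq: str) -> int:
    """Count the components in the LZ76 exhaustive history of seq.

    Each component is the shortest substring starting at the current
    position that cannot be copied from the text preceding its last
    symbol; the final component may be cut short by the end of seq.
    """
    n = len(seq)
    i = 0
    count = 0
    while i < n:
        length = 1
        # Extend the candidate while it still occurs in the earlier text
        # (overlap with the candidate itself is allowed, except its last symbol).
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s: str, q: str) -> float:
    """Assumed normalized distance: how much each sequence adds to the
    complexity of the other, scaled by the larger individual complexity."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    if max(cs, cq) == 0:
        return 0.0  # both sequences are empty
    gain_sq = lz_complexity(s + q) - cs  # new components q adds after s
    gain_qs = lz_complexity(q + s) - cq  # new components s adds after q
    return max(gain_sq, gain_qs) / max(cs, cq)


if __name__ == "__main__":
    # Toy strings standing in for characteristic sequences (hypothetical data).
    a = "0001101001000101"
    b = "0101010101010101"
    print(lz_complexity(a), lz_complexity(b), lz_distance(a, b))
```

Filling a matrix with `lz_distance` over all pairs of characteristic sequences would give an input suitable for a clustering step; the distance is symmetric by construction but is not guaranteed to be exactly zero for identical sequences.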

  • 【Online publication contributor】 Hunan University
  • 【Online publication issue】 2011, Issue 04