节点文献

信息散度和Alignment空间的一些研究

【作者】 卢国祥

【导师】 沈世镒;

【作者基本信息】 南开大学 , 生物信息学, 2009, 博士

【摘要】 随着信息科学的不断发展,人们对信息论学科的认识日益加深,信息论学科与其他学科的交叉渗透也越来越广。目前对信息论的研究已经从香农当年仅限于通信系统的数学理论的狭义范围扩展开来,像智能计算、生物、金融等领域都开始大量运用信息论的有关知识。在以上这些领域中,涉及到了各种随机分布差异的概念,需要利用信息量去衡量它们之间的区别。另外,在通信科学与生命科学中,对差错的概念已经有所推广,不再仅仅是经典的字符改变形成的差错,而且还扩展到了字符的插入与删除形成的差错。面对这些新问题,本文从三个部分进行了初步的探讨。第一部分:信息散度(Information Divergence)的研究本文的第二章主要讨论了信息散度的问题。信息散度在信息论中又被称为离散量,主要用来衡量两个随机分布之间的差异。比如最早提出的相对熵(即Kullback-Leibler散度)就是其中最为人所熟知的一种。本章首先介绍了一些著名的信息散度,然后讨论了信息散度与概率分布空间中度量的关系。2003年,Endres和Schindelin在论文“A New Metric for Probability Distributions”中将Jensen-Shannon散度(有的论文也称为capacitory discrimination)作了改进,证明了改进后的结果可以成为概率分布空间中的度量。本章在此论文的基础之上继续研究,从而得到了一类由概率分布生成的新度量。文中证明了得到新度量的充分必要条件,讨论了新度量的最值问题。本章最后对Jensen-Shannon散度的凸性作了一点探讨。第二部分:Fq上的Alignment空间的相关研究以及计数问题在数据处理问题中,差错的类型有多种,除了符号的替换之外还有数据的插入与丢失等等情况发生,本文称这样的差错为广义差错或者突变误差。由广义差错可以得到一种非线性空间——Alignment空间,这种空间在编码、密码、计算机与生物信息等等领域中有着广泛的应用。比如带插入/删除的信道编码、生物序列比对、图像处理等,都需要用到广义差错和Alignment空间中的有关概念与性质。本文在这一部分对广义差错和Alignment空间作了详细的说明和讨论。本文的第三章介绍了Fq集合上的Alignment空间的相关概念。首先我们对F-q集合上的Alignment空间和Alignment距离的定义作了说明,然后对Alignment距离的计算方法作了介绍,这个计算方法就是经典的动态规划算法。接下来文中讨论了广义差错的Levenshtein距离与Alignment距离的关系,最后介绍了该空间的一些简单的性质。第四章主要讨论一种研究Alignment空间的途径——序列的模结构理论以及虚拟符号的运算理论。本章首先简要介绍了序列的模结构理论,然后详细介绍了比对序列的虚拟符号运算理论,严格证明了两序列的比对序列间虚拟符号运算子的存在性,并且证明了等位运算子成为保距运算子和微调运算子的充分必要条件。第五章主要讨论Alignment空间中的计数问题。Alignment空间中的计数问题主要分为两类,本章开始对其作了说明。然后文中详细讨论了F2上的n维Alignment子空间中Alignment距离为n与Alignment距离为2的序列对数目。得到了F2上的n维Alignment子空间中Alignment距离为n的序列有2n对,F2上的n维Alignment子空间中Alignment距离为2的序列有(2n2-7n+11)-6对的结果,并且得到了Alignment距离为n的序列对满足的充分必要条件,说明了它们的最长的最小罚分比对序列就是最短的最大得分比对序列的结论。第三部分:由一般拓扑度量空间生成的Alignment空间在第二部分讨论的基础之上,Alignment空间还可以继续扩展到更一般的情况,由一般的拓扑度量空间同样可以产生Alignment空间。第六章中首先对由一般拓扑度量空间所产生的Alignment空间和其中的Alignment距离的定义作了说明,然后证明了此时得到的Alignment距离与一类推广的Levenshtein距离等价的结论,并且利用模结构理论详细证明了此距离满足度量空间中度量的三个条件。本章最后给出了该空间在生物信息学中的一个应用——蛋白质三维结构形态的比对问题。

【Abstract】 With the development of information science, people have recognized that the information theory has been not only used in mathematical theory of communication and secrecy systems, but also used in other fields, such as intelligent computing, biology, financial modeling. In all the fields above, random distributions is a very significant concept. People need " information " to identify the distinction between two different distributions. Therefore, information theory can play an important role in these areas. Besides, the concept of error has promoted. At previous time, the error only refers to the substitutions of symbols. Recently, the deletions and insertions of symbols are also considered as errors yet. We call all these errors generalized errors. Facing those problems mentioned, we mainly study three aspects below in this paper.1. The information divergence is studied. Information divergence is a way to measure the difference between two random distributions. For example, relative entropy is the most well-known information divergence. In chapter 2, we introduce some famous information divergences and discuss the relationship of them and metrics in probability distribution space. We improve the chief conclusion in the paper " A New Metric for Probability Distributions " and obtain a class of new metrics for probability distributions. The new metrics based on the important concept of relative entropy and Jensen - Shannon divergence. The sufficient and necessary conditions for the new metrics to hold are provided next. The maxima and minima of the new metrics are given, too. At last, the convexity of the Jensen - Shannon divergence is discussed.2. The Alignment space over F_q is studied. Alignment space is a nonlinear space defined by generalized errors. In chapter 3-5, the detailed introductions of the related concepts of Alignment space over F_q are presented. Firstly, we give the definitions of the Alignment space over F_q and the Alignment distance in it. We illustrate that the Alignment distance can be computed by classical dynamic programming algorithm. The properties and relations between Levenshtein distance by expansion sequences and Alignment distance are discussed. In order to research the Alignment space, we introduce the modulus structure theory of sequences and virtual symbol operation theory. That there must be a virtual symbol operation between two alignment sequences is proved strictly in this part. The sufficient and necessary conditions for isometric operator and micro-adapted operator are also proved. At last, we study the structure of the Alignment space, give the number of the sequence pairs whose Alignment distance is n in the n - dimensional Alignment subspace is 2n and whose Alignment distance is 2 in the n - dimensional Alignment subspace is (2n~2—7n + 11)—6, obtained the sufficient and necessary conditions for sequence pairs whose Alignment distance is n in the n - dimensional Alignment subspace and the result that the maximum score alignment sequences are the minimum penalty alignment sequences in this case.3. The Alignment space generated by general metric space is studied. The Alignment space mentioned in part 2 can be extended to general condition. Firstly, the definitions of the Alignment space generated by general metric space and the Alignment distance in it are described. And then, we prove that the Alignment distance is equivalent to a class of Levenshtein distances. Using modulus structure theory, the Alignment distance is proved to be a metric in metric space. At last, we give an example on the protein structure sequence alignment in 3-dimensional space as the application of Alignment space.

  • 【网络出版投稿人】 南开大学
  • 【网络出版年期】2010年 07期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络