Study on Techniques of Searching for Approximate Repeats in DNA Sequences Based on Hamming Distance

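【Author】 Zhao Yi (赵毅)

【Advisor】 Wang Guoren (王国仁)

【Author information】 Northeastern University (东北大学), Computer System Architecture, 2008, Master's thesis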

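【Summary】 Bioinformatics emerged as the launch of the Human Genome Project drove a rapid increase in biological data such as gene and protein sequences. It studies information phenomena in biological systems by combining mathematics, computer science and information science. Within this broad field, repeat searching is a fundamental problem in DNA sequence analysis, and approximate repeat searching in particular has long been a major research topic because of its biological significance and the inherent complexity of the problem. This thesis studies search techniques for two important classes of approximate repeats in DNA sequences: approximate tandem repeats and approximate inverted repeats. After giving a formal definition for each class, it designs the corresponding indexing techniques and search algorithms to find and identify them.

For approximate tandem repeats, the thesis first defines pattern similarity and neighbor similarity, both based on Hamming distance, to measure how similar the patterns of an approximate tandem repeat are, and proposes a new definition, Largest Neighbor-similarity-based Approximate Tandem Repeats (LNATR). It then divides the DNA sequence into pattern units and designs an index structure, the Pattern Unit Array (PUA), for LNATR searching. Finally, it designs a PUA-based LNATR search algorithm that joins and grows patterns on the array according to successor information, and compares it with the search algorithm proposed by Gad M. Landau et al.

For approximate inverted repeats, the thesis first defines a matching degree, based on Hamming distance, to measure how closely the two patterns of an approximate inverted repeat match. Taking into account that a gap may exist between the two patterns, it proposes a new definition, Largest Matching-degree-based Approximate Inverted Repeats (LMAIR). It then designs an indexing technique, the Boundary Index (BI), for LMAIR searching. Finally, on top of the Boundary Index it designs a basic LMAIR search algorithm and an optimized LMAIR search algorithm, and compares the two.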
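
The tandem-repeat measures above rest on Hamming distance. The abstract does not give the thesis's formulas for pattern similarity or neighbor similarity, so the following is only a minimal sketch under stated assumptions: similarity is taken to be the fraction of matching positions (1 - d / length), and the names `similarity`, `tandem_run` and `min_sim` are hypothetical. It is a naive scan for illustration, not the Pattern Unit Array algorithm.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def similarity(a: str, b: str) -> float:
    """ASSUMED measure (not the thesis's definition):
    fraction of matching positions, 1 - d / length."""
    return 1.0 - hamming_distance(a, b) / len(a)


def tandem_run(seq: str, start: int, period: int, min_sim: float) -> int:
    """Count consecutive units of length `period`, beginning at `start`,
    where each unit is at least `min_sim` similar to the unit before it.
    Hypothetical helper: compares neighbors only, in the spirit of a
    neighbor-based similarity."""
    if period <= 0 or start < 0 or start + period > len(seq):
        return 0
    units = 1
    pos = start
    while pos + 2 * period <= len(seq):
        prev_unit = seq[pos:pos + period]
        next_unit = seq[pos + period:pos + 2 * period]
        if similarity(prev_unit, next_unit) < min_sim:
            break
        units += 1
        pos += period
    return units
```

For example, `tandem_run("ACGTACCTACCAGGGG", 0, 4, 0.75)` walks the units `ACGT`, `ACCT`, `ACCA` and stops at `GGGG`. Comparing only adjacent units lets a repeat drift gradually away from its first copy, which is one plausible reason to distinguish a neighbor-based measure from a pattern-wide one.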

【Abstract】 With the launch of the Human Genome Project and the rapid growth of biological data, bioinformatics is gradually becoming one of the most important research fields. It studies biological systems by applying mathematics, computer science and information science. Within bioinformatics, repeat searching is a basic and important problem in DNA sequence analysis. Approximate repeat searching in particular has attracted much attention from researchers, because approximate repeats have great biological significance and the search problem itself is new and complicated.

This thesis focuses on searching for two important kinds of approximate repeats: approximate tandem repeats and approximate inverted repeats. Based on the proposed definitions of the two kinds of repeats, an indexing structure and a corresponding search algorithm are designed for each.

For approximate tandem repeats, pattern-similarity and neighbor-similarity are first proposed as similarity measures based on Hamming distance, and a new definition, Largest Neighbor-similarity-based Approximate Tandem Repeats (LNATR), is presented. A new indexing structure named Pattern Unit Array (PUA) is then designed, on which an effective LNATR search algorithm is built and compared with another approximate tandem repeat search algorithm designed by Gad M. Landau.

For approximate inverted repeats, the thesis first presents matching-degree, based on Hamming distance, to measure the similarity between the two patterns of an inverted repeat, and on that basis presents a new definition, Largest Matching-degree-based Approximate Inverted Repeats (LMAIR). The Boundary Index (BI) is then designed to support LMAIR searching. Finally, a simple LMAIR search algorithm and an optimized LMAIR search algorithm are proposed based on BI, and the two algorithms are compared.
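
The inverted-repeat idea can be illustrated with a brute-force scan. This is a sketch under assumptions, not the Boundary Index method: the abstract does not state the matching-degree formula, so it is assumed here to be the fraction of positions at which the left arm equals the reverse complement of the right arm. The names `matching_degree`, `inverted_repeats`, `arm_len`, `max_gap` and `min_degree` are hypothetical.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    """Complement every base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


def matching_degree(left: str, right: str) -> float:
    """ASSUMED measure (not the thesis's definition): fraction of positions
    where `left` agrees with the reverse complement of `right`."""
    if len(left) != len(right):
        raise ValueError("arms must have equal length")
    rc = reverse_complement(right)
    mismatches = sum(x != y for x, y in zip(left, rc))  # Hamming distance
    return 1.0 - mismatches / len(left)


def inverted_repeats(seq: str, arm_len: int, max_gap: int, min_degree: float):
    """Brute-force scan: return (left_start, gap, degree) for every pair of
    arms of length `arm_len`, separated by 0..max_gap bases, whose matching
    degree is at least `min_degree`."""
    hits = []
    for i in range(len(seq) - 2 * arm_len + 1):
        left = seq[i:i + arm_len]
        for gap in range(max_gap + 1):
            j = i + arm_len + gap          # start of the right arm
            if j + arm_len > len(seq):
                break
            degree = matching_degree(left, seq[j:j + arm_len])
            if degree >= min_degree:
                hits.append((i, gap, degree))
    return hits
```

In `"ACGGTTTCCGT"` the arm `ACGG` at position 0 and the arm `CCGT` after a 3-base gap form an exact inverted repeat, since the reverse complement of `CCGT` is `ACGG`. The scan costs on the order of sequence length times `max_gap` times `arm_len` for one fixed arm length, which suggests why an index is attractive for this search.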

  • 【Online publication contributor】 Northeastern University (东北大学)
  • 【Online publication issue】 2012, Issue 03