节点文献

多模型下的近似字符串匹配算法研究

Approximate String Matching under Different Models

【作者】 赵华

【导师】 卢正鼎; 路松峰;

【作者基本信息】 华中科技大学 , 计算机应用技术, 2013, 博士

【摘要】 近似字符串匹配是模式匹配研究领域中的一个重要问题。近年来,随着各学科的迅速发展,在许多不同背景下对于近似串匹配问题的研究逐渐受到人们的关注特别是在计算生物学等新兴学科中,有许多近似串匹配的新模型被陆续提出。一方而,这些模型在各学科中均有非常重要的应开用;另一方而,由于这些模型被提出的时问大多不长,因此对它们的研究并不充分。因此,对不同模型下的近似串匹配问题进行研究就成为了在模式匹配和诸多相关领域中的研究重点和难点。针对这一现状,研究的主要内容就是三种近年被提出的模型下的近似串匹配问题。这三种模型分别是:属性匹配模型,交换匹配模型以及块匹配模型。针对属性匹配,提出了一种新的索引结构CIDS-PIP (compressed indexed data structure for property indexing problem)及相应的匹配算法。在该算法中采用了压缩后缀数组作为核心的索引结构。为了进一步降低索引的空间开销,针对不同的属性规模,提出了两种解决方案。针对这两种方案设计了不同的辅助索引结构,以同时满足较高的查询效率和较低的空间开销。与现有的支持属性匹配的索引相比,CIDS-PIP的空间开销更低。针对交换匹配,提出了一种新的离线匹配算法,并证明更精确的模式交换版本的个数上限。该交换匹配离线算法是建立在已有的全文索引的基础上,而非设计全新的索引数据结构。在该算法中,采用后缀数组或压缩后缀数组作为索引结构。当模式长度较小时,该算法可以达到良好的时间效率,并明显优于现有的属性匹配在线算法。此外,还解决了近似交换匹配问题。实验证明,相对于已有的在线交换匹配算法,该算法的时间效率大幅提高。块模式匹配模型是Julio N等人在2011年提出的一种匹配模型,现在在此基础上进行的研究并不多见。针对在线和离线两种情况下的块模式匹配问题进行了研究,并且分别给出了新的算法。相对于现有的在线匹配算法,新的在线算法需要更低的空间开销,而时间开销并没有增加。相开对于现有的离线匹配算法,新的离线算法的时间效率更高。与此同时,还指出了在Julio等人论文中的一处不正确的表述,并对其算法的时间复杂度进行了修正。此外,对块模式匹配问题的两个衍生问题:融合块模式匹配问题和突变块模式匹配问题进行了研究并提出了新的算法。

【Abstract】 Approximate string matching is an important issue in the research areas of pattern matching. With the rapid development of various disciplines, the research of approximate string matching in many different areas gradually attracts lots of attentions. Particularly, in the emerging disciplines such as computational biology, many models for approximate string matching are proposed. On one hand, these models can be applied widely in many areas. On the other hand, the research on pattern matching under these models is not enough, since most of these models have been proposed for not long time. In view of this situation, we studied the approximate pattern matching problems under three different models, which are property pattern matching, swap matching and blocked pattern matching.For the property matching problem, an off-line matching algorithm is proposed. In that algorithm, the indexes based on compressed suffix arrays are used as the care structure. In order to reduce more space overhead, we gave two solutions for different sizes of the property. In these two solutions, different auxiliary structures are designed in order to meet the higher query efficiency and lower space overhead. Compared with other existing index supporting the property matching, the space overhead of our solution is much smaller. As return, the time complexity of the matching is a little higher.For the swap matching problem, an off-line matching algorithm is proposed, and it is the first off-line swap matching algorithm. At the same time, a more accurate upper bound of the number of swap versions of a pattern is given. In this solution, the existing full text index is used as our index instead of designing new ones. Our solution is more effective than the existing on-line algorithms when the pattern is short.The Blocked Pattern Matching (BPM) Problem is studied. Blocked pattern matching problem is proposed by Julio N. et.al in2011. Now there is little research on this problem. For the blocked pattern matching problem and other related problems, a group of algorithms which improved the time efficiency or space efficiency separately are given. An incorrect statement in Julio’s paper is found, and a solution for the problem which needs less searching time is proposed. The mutated blocked pattern matching problem and fused blocked pattern matching problem are studied, and new algorithms are given.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络