节点文献

DNA数据压缩方法的研究

Research of DNA Data Compression Method

【作者】 谭丽

【导师】 孙季丰;

【作者基本信息】 华南理工大学 , 信息与通信工程, 2014, 博士

【摘要】 随着DNA测序技术的发展,生物医学研究面临着如何存储和传输DNA数据的问题。DNA数据压缩技术成为其中解决问题的重要方法之一,即以高效的压缩编码方法,将DNA数据存储于较小的空间。由于DNA数据的特殊性,使用传统的压缩算法并不是很理想,因此出现了专门针对DNA数据的压缩算法,这类算法主要有两大类,即替代法和统计法。替代法是在原始DNA数据中寻找重复出现频率高的DNA片段,并将这些片段编入字典,用字典索引值替代原始DNA数据中的这些片段;统计法是通过统计归纳出原始DNA数据中每个碱基符号出现的规律,估计出其对应的概率统计模型,然后利用该模型对DNA数据进行熵编码。这两类压缩方法对DNA基准测序序列具有较好的压缩效果,而对高通量DNA数据压缩有限,故出现一系列针对高通量DNA数据的压缩方法。目前,高通量DNA数据的压缩方法主要为重测序的DNA数据压缩方法和从头测序的DNA数据压缩方法。对于重测序的DNA数据压缩方法主要是基于参考基因组压缩,基于参考基因组的DNA数据压缩方法,虽然获得的压缩比很高,但对参考序列依赖性太强,而实际上有些测序序列并不存在现成的参考基因组,同时由于压缩和解压都需要相同的参考基因组,故参考基因组必须事先保存在本地,从而导致算法资源开销较大;对于从头测序的DNA数据压缩方法,不依赖外部的参考基因组,自完备性较好,但在一定程度上仍受限于短读拼接技术。本文针对这些问题,首先在前两章中调研了DNA数据压缩方法的研究现状,并对相关的DNA数据压缩技术及压缩方法所面临的挑战与展望进行了分析与讨论。最后,提出了几种DNA数据压缩方法,并对算法进行了分析。本论文主要有如下几方面的贡献:1.在经典DNA数据压缩算法的基础上,提出一种基于扩展操作的DNA数据压缩算法(DNA sequence compression using extended operations, DNAEC)。该算法将三种标准编辑操作扩展为八种操作,利用LZ算法思想对精确匹配、互补回文、自匹配三种模型进行压缩,而对非重复片段使用基于上下文的二阶二进制算术编码器进行压缩编码。最后通过实验仿真,与典型的DNA压缩算法相比,在DNA基准测序序列上该算法的压缩性能得到了提升,特别是对较长的DNA数据,其压缩效果更为明显。2.在Memetic算法框架下,提出一种基于CPMA (Collaborative particle swarmoptimization-based memetic algorithm)的DNA数据压缩方法。CPMA分别采用综合学习粒子群优化算法和动态调整的混沌搜索算子进行全局搜索和局部搜索,接着寻找全局最优的基于扩展操作的近似重复矢量码书,并用此码书压缩DNA数据。实验结果表明,CPMA比其它优化算法有很大的改善,对文中采用的大部分测试函数,其解都非常接近全局最优点;对于DNA基准测序序列,与经典DNA数据压缩算法相比,基于CPMA的压缩性能得到了显著提升。3.在Memetic算法框架下,提出另外一种混合粒子群优化(Hybrid particle swarmoptimization based memetic algorithm, HPMA)的DNA数据压缩方法。HPMA采用动态综合学习粒子群优化算法作为全局搜索方法,然后分别用两种不同的局部搜索算子,即中心对称变异差分进化算子和自适应混沌搜索算子,接着寻找全局最优的基于扩展操作的近似重复矢量码书,用此码书压缩DNA数据。最后对算法进行实验仿真,在19个高维测试函数上,HPMA能获得比论文提及的优化算法更好的优化性能和伸缩性能;在DNA基准测序序列上,与经典DNA数据压缩算法相比其压缩率得到明显的提高。4.考虑到基于参考基因组的DNA数据压缩算法对参考序列依赖性太强的特点,提出一种高通量DNA数据的压缩算法(Codebook index transformation for high-throughputDNA data, CITD),该算法先采用码书索引变换模型,将传统码书索引值的表示方法变换成由四个标准碱基字符替代的四进制数值方式,并采用一种界定替换串与非替换串的简明编码方法,接着通过信息熵的大小来决定是否进行BWT (Burrows-Wheelertransformation),最后用MTF (Move to front)变换和Huffman熵编码对高通量DNA数据进行压缩。在多个测序数据集上的实验结果表明,CITD在大多数情况下可以获得比所对比的高通量DNA数据专用压缩方法更优的压缩性能。5.提出另外一种高通量DNA数据压缩算法,即DNAC-K(K-means clustering forhigh-throughput DNA data)。该算法首先利用K-means方法建立DNA数据聚类族,接着在聚类族中对子序列之间进行序列比对,最后用Huffman熵编码对高通量DNA数据进行压缩。实验结果表明,在大多数测序数据集上DNAC-K可以获得比从前的高通量DNA数据专用压缩方法更优的压缩性能。

【Abstract】 Due to the recent advancements in DNA sequencing technologies, the biomedicalresearch is faced with the problems of storage and transmission of the huge amount of DNAdata. The use of compression techniques provides an important candidate solution for thestorage and transmission challenges of DNA data, in other words, the DNA data is stored inthe smaller space by the effective coding methods. Because of the particularity of the DNAdata, it is not very ideal with traditional compression algorithms. Thus compressionalgorithms for DNA data specially appeared, including substitutional methods and statisticalmethods. The substitutional methods look for high frequently repeated fragments of DNA inoriginal DNA data, and index these fragments into a dictionary, which will be used to indteadthese fragments of DNA data; The statistical methods estimate the corresponding probabilitystatistical model through statistical rule of original DNA data, and then DNA data is entropycoded by this statistical model. These two types of compression methods have bettercompression effect for the benchmark DNA sequences, but limited for high-throughput DNAdata compression. So, a series of high-throughput DNA data compression methods appeared.At present, the high-throughput DNA data compression methods mainly includeresequencing DNA data compression method and de novo sequencing DNA data compressionmethod. The resequencing DNA data compression method mainly based on the referencegenome, which obtain the high compression ratio but require reference genome, but in fact,some sequences have no existing reference data, and because the same reference data isneeded by compression and decompression, so these reference genome must be stored in thelocal, increasing the costs. The de novo ssequencing DNA data compression methods do notrely on external reference genome and the completeness is better, but is still limited by theshort read splicing technology.In view of the above problem, the research status of DNA data compression method aresurveyed in this thesis, the challenges and prospects faced by DNA data compression methodare also discussed. Finally, several kinds of DNA data compression methods are proposed.The contribution of this dissertation mainly has the following several aspects:(1) A compression algorithm called DNAEC is proposed for DNA data compression, inwhich the extended operations are used. The three edit operations are extended to eightoperations so that the DNA data can form the three match patterns (exact match,complementary palindrome match and self repeat match), and then can be well compressed bythe idea of LZ, encoding the unmatched fragments with the order-2arithmetic code. Simulation results show better compression performance of DNAEC than typical DNA datacompression algorithms on benchmark DNA sequences, especially for longer DNA data.(2) On memetic algorithm framework, a DNA data compression method based oncollaborative particle swarm optimization-based memetic Algorithm (CPMA) is proposed, inwhich the comprehensive learning particle swarm optimization (CLPSO) is used for globalsearch and a dynamic adjustive chaotic search operator (DACSO) as the local searchrespectively. In CPMA, it looks for the global optimal code book based on extendedapproximate repeat vector (EARV), by which the DNA data is compressed. Experimentalresults demonstrate better performance of CPMA than the other optimization algorithms, andit is very close to the global optimal points in most of the test functions adopted by the thesis.The compression performance of the method based on CPMA is markedly improvedcompared to many of the classical DNA sequence compression algorithms on benchmarkDNA sequences.(3) A novel hybrid particle swarm optimization based memetic algorithm (HPMA) isproposed for DNA data compression, in which dynamic comprehensive learning particleswarm optimization (DCLPSO) method within the frameworkl of the memetic algorithm isused for global search and two adaptive local searches (LS) operators including centersymmetry mutation differential evolution (CSMDE) operator and adaptive chaotic searchoperator (ACSO) work in cooperative way. HPMA looks for the global optimal code bookbased on EARV, by which the DNA sequence will be compressed. Experiments wereconducted on19high-dimensional functions and11real DNA sequences. The results showthat HPMA is more competitive in both the performance and scability, and also attains bettercompression ability than other representative DNA specific algorithms on DNA sequencedata.(4) Considering the DNA data compression algorithm based on the reference genome,the dependence have been too strong. A novel high-throughput DNA data compressionmethod based on codebook index transformation (CITD) is proposed, in which the codebookindex transformation (CIT) model is adopted to substitute the traditional representation ofcodebook indexes by the quaternary values which are expressed by the four standard basecharacters, and adopts a simple encoding method to distinguish the replaced and non-replacedsubstring, and subsequently determines whether need to use the burrows-wheelertransformation (BWT) according to the value of information entropy, finally uses move tofront (MTF) transformation and Huffman entropy coding to compress the data. Experimentalresults on several sequencing data sets demonstrate better performance of CITD than the high-throughput DNA data compression algorithms cited in this thesis, in most cases.(5) A compression algorithm, called DNAC-K, is proposed, which is based on K-meansclustering for high-through DNA data and creates cluster of sequences based on K-meansclustering method at first, then the clusters are iterated according to the edit distances ofsubsequences, and finally, Huffman coding is adopted to encode the clustering result.Experimental results on several sequencing data sets demonstrate better performance ofDNAC-K than many of the previous high-throughput DNA data compression algorithms.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络