节点文献

大规模生物序列分析的高性能算法和模型

High Performance Algorithms and Models for Large-scale Biological Sequence Analysis

【作者】 杨矫云

【导师】 陈国良; 徐云;

【作者基本信息】 中国科学技术大学 , 计算机软件与理论, 2014, 博士

【摘要】 随着测序技术的发展,生物序列的规模呈现爆炸性增长,目前生物信息学中的计算方法与技术如何应对快速增长的序列数据,已成为当前生物信息学迫切需要解决的问题。为了适应大规模生物序列数据的分析和计算,本文主要从三个层面研究了数据组织、算法设计和并行化加速。数据组织就是建立数据表示和组织的模型,模型能尽量给出全局信息以及有利于分析和计算的效率提高;算法设计就是给出适应大规模数据处理的高效算法,算法具有低时间复杂度或尽可能短的时间内输出好的解(即尽快求解算法);并行化加速是实际大规模数据处理必须考虑的手段,着重要解决算法的并行化与有效的负载平衡。本文选取生物信息学中单体分型、模体发现和最长公共子序列三个重要的生物序列分析问题,来探究大规模生物序列数据处理中的关键技术和方法。本文的主要工作有:(1)单体分型问题:单体型是单条染色体上特异位点组成的序列,与人类疾病密切相关。生物实验测序通常得到两条单体型合并而成的基因型,因此需要将基因型分型成单体型。本文研究群体数据集的单体分型问题,首先建立了网络流模型,并在该模型上对已有的分型规则进行分析和综合,归纳出新的启发式知识,进而设计了新的单体分型算法FNphasing。在大规模数据集上,计算实验表明FNphasing算法的时间性能显著优于已有的算法,且精度也达到了目前最优。(2)模体发现问题:模体是生物序列中一些重复出现、保守的区域,通常具有重要的生物功能,通过发现模体可以帮助了解生命机体的原理和特征。本文研究(l,d)模体发现问题,首先采用新哈希策略来减少存储的潜在模体数目,进一步设计了新的剪枝策略,降低了算法的平均时间复杂度。在挑战性实例的求解上,计算实验表明新算法CVoting的时间性能比已有算法降低一个数量级,且空间消耗更少。(3)最长公共子序列问题:寻找序列间的最长公共子序列是序列相似性鉴定的一种重要手段,序列间的相似性可以作为物种共同起源的证据。本文研究多序列最长公共子序列(MLCS)问题,首先将该问题转化为图搜索,然后采用迭代最佳优先搜索策略设计了尽快求解算法Pro-MLCS,计算实验表明Pro-MLCS算法一般在总运行时间的前3%时间内即可输出最优解。在Pro-MLCS算法的基础上,进一步设计了空间增长缓慢的SA-MLCS算法和空间受限的SLA-MLCS算法。SA-MLCS算法采用迭代beam加宽的搜索策略,使得其找到与Pro-MLCS算法相同解所消耗的空间要少得多;而SLA-MLCS算法采用替换策略,使得其在SA-MLCS算法达到空间限制后能够继续搜索更好的解,进一步提高了可解问题规模。计算实验表明,在给定的空间限制内,SA-MLCS算法与SLA-MLCS算法能够处理的数据规模比Pro-MLCS算法高一个数量级。最后设计了Pro-MLCS算法的并行化版本:DPro-MLCS和DSDPro-MLCS,前者适用于分布式环境,后者适用于分布式-共享分层存储的集群环境。计算实验反映,二者均能达到了线性加速,且具有良好的尽快求解性能。本文所研究的大规模生物序列数据处理中的关键技术和方法,其主要创新之处如下:(1)数据组织:贡献在于全局表示模型的建立。对于单体型问题,本文构建了单体分型全局视图的网络流模型,该模型包含了原始数据的全局信息,使得单体分型的可行解与模型上的流存在一一对应关系,更有利于设计高效的分型算法。对于模体发现问题,本文采用新的哈希策略,减少了存储的潜在模体数目,使得空间消耗大大降低,减少了空间对大规模数据处理的制约。对于最长公共子序列问题,本文将该问题的解空间组织为搜索图,并转化为在图中寻找最长路径问题,高效的图搜索算法可以在该问题上的得到应用。(2)算法设计:贡献在于高效算法和尽快求解算法的设计。对于单体型问题,本文使用网络流模型的全局信息设计了高效的启发式搜索算法FNphasing,其在大规模数据处理的应用中,时间性能显著优于已有算法。对于模体发现问题,本文设计了新的剪枝算法减少哈希表的访问次数,使得新算法的平均时间复杂度达到目前最好。对于最长公共子序列问题,本文设计了尽快求解算法模式和空间受限的尽快求解算法模式。相比于已有的算法,尽快求解算法Pro-MLCS在求得相同解的情形下时间性能降低了一个数量级,而空间受限的尽快求解算法SLA-MLCS在相同的时间与空间限制下可求解问题规模提高了两个数量级。(3)并行化:贡献在于尽快求解算法的并行化。本文针对新提出的尽快求解算法,设计了一种跨层并行化策略,使得不同层之间的并行处理成为可能,并利于实现负载均衡,新的并行算法达到了线性加速,且维持了尽快求解性能,能够充分利用大规模集群环境的计算资源,能够处理大规模数据。

【Abstract】 With the development of new DNA sequencing technologies, an enormous amount of sequence data has been generated, which presents great challenges for computational biology problems. Therefore, how to deal with the rapid growth of data volumes be-comes a problem needing to be solved urgently. In order to handle the large scale datasets, we study the methods from three different levels, include data organization, algorithm development and parallelization. Data organization is to establish models for data presentation and organization, which contains the global information and could improve the efficiency of data analysis; Algorithm development is to design efficient algorithms for large scale datasets, which have small time complexity or ouput good solutions as soon as possible (i.e. anytime algorithms); Parallelization must be consid-ered when dealing with large scale datasets, and the parallelization of algorithms and load balance should be solved.In this dissertation, we choose three important biological sequences analysis prob-lems, including haplotype phasing, motif search and longest common subsequence, to study the methods and technologies of large scale data analysis. The works of this dis-sertation include:(1) Haplotype phasing problem:haplotype is composed of the SNPs on a single chromosome, which is related to human diseases. The sequenced data are often geno-types, i.e. the combination of two haplotypes, therefore, the genotypes need to be phased into haplotypes. In this dissertation, we studied the population haplotype phasing prob-lem. At first, we established the flow network model and analyzed the phasing rules on this model in order to obtain some heuristic knowledge. Then, a new haplotype phasing algorithm FNphasing was presented. When applied on large scale data sets, FNphasing is significantly faster than the state-of-the-art algorithms, and also achieves an equal or superior accuracy compared with other approaches.(2) Motif search problem:Motifs are recurring and conserved regions in biologi-cal sequences, which have molecular structural or functional features, the identification of them can help us better understand the mechanisms of life. In this dissertation, we studied (I, d) motif search problem. At first, we adopted a new hashing strategy to re-duce the number of potential motifs. Then a new pruning algorithm was designed to reduce the average time complexity. When solving challenging instances, the new al-gorithm CVoting is an order of magnitude faster than current algorithms, and uses much less space.(3) Longest common subsequence problem:finding the longest common subse- quence an important method to identify the similarity of sequences, which could serve as the measurement of how two different species are evolutionarily related. In this disser-tation, we studied multiple longest common subsequence (MLCS) problem. At first, we coverted the longest common subsequence problem into a graph search problem. Then, an anytime algorithm Pro-MLCS was designed by using iterative best-first search. The experimental results show that Pro-MLCS could achieve the optimal solution with only3%of the total runnning time. Furthermore, based on Pro-MLCS, two anytime algo-rithms were designed:memory efficient algorithm-SA-MLCS-with small space in-crease rate, and space bounded algorithm, SLA-MLCS. SA-MLCS adopted an iterative beam widening strategy, and could use much less space to find the same quality solution when compared to Pro-MLCS; While SLA-MLCS adopted a replacement strategy, and could continue to find better solutions after SA-MLCS reaches the space bound. The experimental results show SA-MLCS and SLA-MLCS could deal with an order of mag-nitude larger size instances than Pro-MLCS when given the same space bound. At last, we developed two parallel algorithms:DPro-MLCS and DSDPro-MLCS:the former one is for distributed memory architecture and the latter is for hierarchical distributed-shared memory architecture. Both algorithms could achieve linear speedup, and have good anytime property.This dissertation focuses on the methods of dealing with large scale biological sequence data, thus the contributions of this dissertation for three levels are as follows:(1) Data organization:the contribution is the establishment of the global repre-sentation model. For haplotype phasing problem, this dissertation constructed the flow network model, which contains all the information that could be derived from the orig-inal data, which makes the feasible solutions and the flows one-to-one mapping, which would be useful for the algorithm design. For motif search problem, this dissertation adopted a new hashing strategy to reduce the number of stored potential motifs, which brings down the space usage. For longest common subsequence problem, this disserta-tion reorganized the solution space into a graph, and formulated the problem into finding the longest path in the graph, thus the efficient graph search algorithms could be applied.(2) Algorithm development:the contribution is the design of efficient algorithms and anytime algorithms. For haplotype phasing problem, we developed a heuristic al-gorithm FNphasing based on flow network model, which is significantly faster than current algorithms when applied on large scale datasets. For motif search problem, we designed a new pruning method to reduce the access number of hashing table, mak-ing the average time complexity of the new algorithm achieve the best among current algorithms. For longest common subsequence problem, we established an anytime al-gorithm Pro-MLCS and a space bounded anytime algorithm SLA-MLCS. Compared to current algorithms, Pro-MLCS is an order of magnitude faster when achieving the same quality solution and SLA-MLCS could handle an order of magnitude larger size instances when given the same time and space bounds.(3) Parallelization:the contribution is the parallelization of anytime algorithm. This dissertation designed a new parallel strategy to parallelize the anytime algorithm for longest common subsequence problem. The new strategy makes the parallel com-putations of different layers possible. The new parallel algorithms could achieve nearly linear speedup, and have good anytime property. It could make full use of the compu-tational resources of clusters to deal with larger scale datasets.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络