节点文献

基于DNA序列的进化树构建算法的研究

Study of Methods of Constructing Evolutionary Trees with DNA Sequences

【作者】 李建伏

【导师】 郭茂祖;

【作者基本信息】 哈尔滨工业大学 , 人工智能与信息处理, 2008, 博士

【摘要】 进化树又称为系统发生树,是一种用来描述各种生命实体之间进化关系的树型结构。一个可靠的进化树推断不仅有助于了解生物的进化历史和进化机制,而且对生物医药学、计算分子生物学其他分支的研究具有重要意义。目前常见的进化树构建方法有距离法、最大简约法和最大似然法三种。其中,最大似然法被认为比其他两种方法更为准确,但是其计算复杂度非常高。为了提高最大似然法,本文从以下4个方面展开了深入研究:(1)提出了一个快速高效的分支交换操作由于计算复杂度非常高,实际应用中的最大似然法都是基于启发式的。虽然不同的启发式算法具有不同的搜索策略,但是其基本思路都是通过一系列的分支交换操作来提高起始进化树。并且,启发式算法的性能在一定程度上取决于其所采用的分支交换操作的搜索能力。目前常见的分支交换操作有NNI(Nearest Neighbor Interchange) , SPR(Subtree Prune and Regraft)和TBR(Tree Bisection and Reconnection),其中TBR的搜索空间最广。但研究表明,TBR的搜索空间还不够广,容易陷入局部极优。另一种分支交换操作p-ECR (p-Edge Contraction and Refinement)具有更广的搜索空间,能够在一定程度上避免局部极优。但是由于计算复杂度非常高,p-ECR很少被使用。本文结合邻接法和p-ECR提出了一个分支交换操作p-ECRNJ。p-ECRNJ的基本思想是利用邻接法分解p-ECR中产生的未分解节点。这样,每一次p-ECRNJ操作只需要O(p3)去分解未分解节点,而无须花费大量的时间去尝试所有的可能(最多(2p+1)!!),从而大大降低了时间复杂度。对12组真实数据的测试结果表明基于p-ECRNJ的进化树构建算法能够在合理的时间内找到比其他流行的算法更好的进化树。并且,p-ECRNJ还可以有效地提高其他分支交换操作,如NNI。从而,证明了p-ECRNJ的有效性。(2)提出了一种PSO的进化树构建算法爬山算法在进化树重构中得到了广泛应用并取得了一定的成功。但是由于进化树的似然函数通常含有多个局部极优解,而爬山算法没有逃离局部极优的能力,因此基于爬山算法的进化树构建算法很容易陷入局部极优。从本质上讲,粒子群优化算法(Particle Swarm Optimization,简称PSO)是一种并行的、动态的爬山算法,具有一定的逃离局部极优的能力。本文提出了一个基于PSO的进化树构建算法PSOML。该算法采用PSO的基本框架,利用p-ECRNJ完成粒子状态的更新,其中p值是根据粒子的速度确定的。对真实数据的测试结果表明虽然要花费较长的时间,PSOML的准确性要明显优于基于爬山算法的PHYML和RAxML。并且,PSOML可以在合理的时间内找到比基于遗传算法GARLI更好的进化树。(3)结合Quartet puzzling和邻接法提出了一种进化树构建方法由于最大似然法的时间复杂度非常高,基于分治思想的最大似然法——quartet方法得到了人们的关注。最流行的quartet方法是Quartet Puzzling(简称QP)。它首先利用最大似然法估计quartet拓扑结构集合Q,然后利用一个贪心算法将Q进行重组构成一个包含所有序列的进化树。研究表明,QP的准确性不够高,甚至比邻接法还要低。如何快速有效地将Q进行重组是QP所面临的一个难题。另一方面,邻接法具有很好的理论特性,但其准确性取决于作为输入的距离矩阵的质量。长分支一直是困扰邻接法的一个问题。本文结合邻接法和QP提出了一个进化树构建算法QPNJ。QPNJ的基本思想是首先用最大似然法估计quartet拓扑结构集合Q,然后根据Q估计序列之间的进化距离,构成距离矩阵M,最后利用邻接法根据M构建进化树。QP和邻接法的这种结合会达到取长补短的效果。一方面利用更有理论依据的邻接法完成Q的重组可以提高QP的准确性,另一方面利用quartet估计序列之间的进化距离可以在一定程度上避免邻接法所面临的长分支问题。理论上,QPNJ与QP具有相同的时间复杂度。需要指出的是,邻接法和QP的结合不是唯一的,任意类似于邻接法的聚类过程如Weightor都可以按照QPNJ的思想代替邻接法与QP相结合。模拟实验表明,QPNJ和QPWNJ(结合了Weightor与QP)比QP更加准确,甚至比邻接法和Weightor还准确。并且,QPNJ和QPWNJ的准确性不像QP一样依赖于模型树的结构。从而证明了QP与聚类算法如邻接法和Weighbor的结合是有效的。(4)结合同伦方法和SEM提出了一种进化树构建算法根据当前物种构建进化树是一个典型的从非完全数据学习的问题。结构期望最大化(Structural Expectation Maximization,简称SEM)算法是一个根据不完全数据求解模型结构的有效方法,它通过迭代交替搜索的简单方式能够简化最大似然估计问题。但是,由于SEM算法直接采用贝叶斯公式计算隐变量的条件概率,使得每次迭代的结果都是上次迭代中期望似然值的最优解,因此算法对于初始点的选择具有很强的依赖性。尤其是进化树的似然函数往往具有多个局部极优解,所以直接利用SEM构建进化树很容易陷入局部极优。同伦方法是一个全局方法,其基本思想是构造一个同伦函数将一个已知解的问题与待解问题联系起来,然后从已知解的问题开始,利用同伦参数的变化,最终求得待解问题的最优解。本文结合同伦方法和SEM提出了一种进化树构建算法HSEMPHY。HSEMPHY首先利用最大熵原理计算隐变量的条件概率,引入同伦参数β,然后借鉴同伦方法的过程优化似然函数。对真实数据的测试结果表明HSEMPHY与目前两个流行的进化树构建算法PHYML和RAxML性能相当,并且对初始点具有较低的敏感性。

【Abstract】 An evolutionary tree, also known as phylogenetic tree, is a tree structure showing the evolutionary relationship among various organisms. A reliable evolutionary tree inference not only helps to understand the evolutionary history and evolution mechanisms in biology, it is also of great significance to many branches research in biomedical science and computational molecular biology.Currently, a rich variety of evolutionary tree reconstruction methods have been developed, which fall into three categories, (a) distance based methods, (b) maximum parsimony methods and (c) approaches applying the maximum likelihood principle. It is known that maximum likelihood can recover correct trees more frequently than other methods. But high computational complexity is the main disadvantage. In order to improve maximum likelihood, the paper launched an in-depth study from the following four aspects:(1) Proposed a fast and effective topological rearrangement operatorDue to the very high computational complexity, maximum likelihood in application is based on heuristics. Although using different search strategies, the heuristics are all to try to improve a starting tree / starting trees by a series of elementary topological rearrangement operators, until local optima is found. It is obvious that the performance of the heuristics depends on the degree of exhaustiveness of the topological rearrangement operators on some extent. The topological rearrangement operators often used in heuristics are Nearest Neighbor Interchange (NNI), Subtree Prune and Regraft (SPR) and Tree Bisection and Reconnection (TBR), and TBR is the most exhaustive. Even TBR searches, however, can often get trapped in local optima, since there are not many trees accessible in one step from any given tree. Another more exhaustive topological rearrangement operator is p-Edge Contraction and Refinement (p-ECR). However, due to its high computation complexity, p-ECR has rarely been used in practice. This paper presented a fast and efficient topological rearrangement operation combining p-ECR and neighbor joining. The main idea of p-ECRNJ is to to use neighbor joining to refine the unresolved nodes produced in p-ECR. Thus, in very iteration, p-ECRNJ only spends O(p3) to resolve unresolved nodes, and without having to spend a lot of time to evaluate all possible trees (at most (2p+1)!!), which significantly reduced the time complexity of p-ECR. Experiments with 12 real datasets show that p-ECRNJ based heuristics can find better trees than other two popular evolutionary tree reconstruction methods (PHYML and RAxML) in considerable time. Moreover, p-ECRNJ can improve efficiently other local topological rearrangement operators, such as NNI. All the results prove that p-ECRNJ is efficient.(2) Proposed a PSO based evolutionary tree reconstruction methodHill climbing has been widely used in evolutionary tree reconstruction and has achieved some success. However, for hill-climbing strategies to be guaranteed to find the global optimum the optimality landscape must be unimodal. Moreover, research has demonstrated that multiple maxima exist under the maximum likelihood principle. Therefore, hill climbing based evolutionary tree reconstruction methods often return local optima. In principle, PSO (Particle Swarm Optimization) is the parallel and dynamic version of hill climbing and has a higher probability to break away from the local optima effectively. This paper presented a PSO based evolutionary tree reconstruction method named PSOML(Particle Swarm Optimization based Maximum Likelihood). PSOML uses the framework of the basic PSO principle, every particle in the swarm corresponds to a possible evolutionary tree and the update of particles’state is accomplished by p-ECRNJ, while p is dependent on particles’velocity. Experiments with real datasets show that PSOML clearly outperforms PHYML and RAxML in terms of solution optimality, although it cannot compete in terms of runtimes. Moreover, PSOML can find better trees than GARLI across a range of datasets in considerable time.(3) Proposed an evolutionary tree reconstruction combing Quartet Puzzling and neighbor joiningDue to the high time complexity of maximum likelihood, divide-and-conqure based quartet methods of phylogeny reconstruction are currently receiving considerable attention. The most popular is Quartet Puzzling (QP). It first use maximum likelihood to estimate the topology quartet set Q, and then use a greedy algorithm to merge Q into an evolutionary tree containing all sequences. But research shows that QP is not as accurate as expected. How to effectively recombine the information contained in Q is still a hard problem facing QP. Neighbor joining is a clustering technique, which has many good characteristics. However, the performance of neighbor joining depends on the quality of the estimated evolutionary distances of different sequences. Long branch is still a problem facing neighbor joining. Therefore, this paper proposed a reconstruction method QPNJ (Quartet puzzling and Neighbor Joning) combing neighbor joining and QP. The main idea of QPNJ is to use maximum likelihood to estimate the set Q of quartet topologies, estimate the evolutionary distances between every two sequences according to Q, and use neighbor joining to reconstruct an evolutionary tree according to the evolutionary distances. One hand, using neighbor joining with better theoretical guarantees as the recombining technique can improve QP; On the other hand, QPNJ can avoid the long branch problem faced by neighbor joining. In theory, QPNJ has the same time complexity with QP. It is worth noting that the combination of QP and neighbor joining is not only, any clustering method like neighbor joining can replace neighbor joining to combine with QP. Finally, simulation experiments show that QPNJ and QPWNJ(the combination of QP and Weighbor) are more accurate than the QP, and even than neighor joining and Weighbor. Moreover, not like QP, QPNJ and QPWNJ do not rely on the structures of model trees. These experiments results prove the combination of QP and clustering methods such as neighbor joining and Weighbor is efficient.(4) Proposed an evolutionary tree reconstruction combing homotopy with SEMReconstructing evolutionary trees according to current sequences is a typical machine learning problem from incomplete data. SEM(Structural Expectation Maximization) is very efficient to estimate model structures using maximum likelihood with incomplete data. Starting from a structure, SEM iteratively probabilistically complete the data according to the distribution induced by current model and use the completed data to evaluate different candidate structures. However, in SEM, the condition probability of the hidden variables is directly computed by Bayes’rule and the structure obtained in every iteration is optimized with respect to the expected likelihood value of the optima in last iteration. As a result, at later iterations of the procedure, the trees that maximize this expected likelihood value will tend to be similar to the tree found in the previous iteration. Therefore, the performance of SEM depends on the starting points. Moreover, research has demonstrated that multiple maxima exist under the maximum likelihood principle. Thus, the reconstruction algorithm using SEM can often be strapped in local optima. The homotopy method belongs to the field of global optimization techniques. The main idea is that a smoothed version of the objective function is first optimized. With enough smoothing, the optimization will be convex and the global optimum can be found. Smoothing then increases and the new optimal are computed, where the solution found in the previous step serves as a starting point. The algorithm iterates until there is no smoothing. On the basis of SEM and homotopy, this paper proposed an evolutionary tree reconstruction, which is named HSEMPHY. HSEMPHY computes the condition probability of hidden variables through maximum entropy principle. It can reduce the influence of the initial value on the final solution by introducing the homotopy parameterβand simulating the process of homotopy continuation principle. Experimental results with real datasets show that HSEMPHY is at least as good as the two most popular methods PHYML and RAxML while it is very robust to poor starting trees.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络