迁移工作流容错执行模型及其实现方法研究

Study on Fault-Tolerant Execution Model and Implement Methods in the Migrating Workflow System

【Author】 Lu Zhaoxia (卢朝霞)

【Advisor】 Zeng Guangzhou (曾广周)

【Author information】 Shandong University, Computer Software and Theory, 2009, PhD

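【Chinese Abstract】 Migrating workflow is a class of workflow management technology based on the mobile agent computing paradigm. It uses the mobile agent as the paradigm for constructing one or more task-executing entities (called migrating instances) and uses work places to map the network nodes and services of workflow participants. The network nodes represent the sites where migrating instances work, and the work place services comprise runtime services and workflow services. A migrating instance can use local resources and services at a work place to perform one or more tasks and, when necessary, migrate with its task specification and current execution results to another work place that meets its requirements and continue working there. Multiple migrating instances created for the same workflow can work cooperatively to meet the needs of parallel business process management.

The migrating workflow management system model consists of a migrating workflow management machine and a number of work places that have established mutual trust. The management machine is used by the workflow initiator to organize, manage, and monitor workflows; the work places represent the enterprises, organizations, or individuals participating in the collaborative business and fulfil service commitments to migrating instances. The main component deployed on the workflow management machine is the migrating workflow engine, which supports workflow alliance management, business process definition, and the creation, dispatch, and monitoring of migrating instances. If a business process can be decomposed into several sub-processes handled in parallel, multiple migrating instances can be created, each responsible for one sub-process with a relatively independent goal. A work place, comprising a docking station and a work host network, is where migrating instances run. The docking station accepts service queries and migration requests from migrating instances, provides the runtime environment and runtime services once an instance arrives, and requests data services and functional services from the work host network on the instance's behalf. If the migrating workflow management machine is deployed together with a work place, every workflow participant can organize and launch its own workflows; the migrating workflow model therefore allows multiple business process management activities to run in the same system at the same time.

Because migrating instances run in a heterogeneous, inter-organizational network environment, task execution is easily affected by uncertain factors such as host faults, link faults, communication faults, and faults in service programs and service resources. These factors not only disturb the normal execution of a migrating workflow but may also cause a migrating instance to die prematurely or even cause the workflow to fail. Fault tolerance for migrating instances is therefore an indispensable mechanism for ensuring the reachability, correctness, and reliability of migrating workflows. It covers three aspects: execution fault tolerance, communication fault tolerance, and state fault tolerance.

· Execution fault tolerance: workflow tasks can be executed reliably by migrating instances at every work place. In the migrating workflow model, workflow tasks are completed through the continuous migration of migrating instances between work places and the on-site use of services. A work place must provide migrating instances not only with a place to run but also with reliable workflow services; any physical host fault or logical service fault disturbs the normal completion of an instance's tasks. In particular, long transactional tasks that demand high reliability (such as ticket booking and payment) access important databases, so the transactional properties of their operations must be guaranteed and the execution of the migrating instance needs the necessary fault-tolerance guarantees.

· Communication fault tolerance: mails exchanged between migrating instances can be reliably sent and delivered. In the migrating workflow model, communication is the basis of cooperation between migrating instances; only if mails are reliably sent and delivered can cooperation succeed. Communication between migrating instances can fail for two reasons: (1) a physical fault in the communication link prevents the mail from being sent; (2) the movement of a migrating instance prevents the mail from being reliably delivered, that is, by the time the mail reaches the target host the receiving instance has already left. For physical link faults, the mail can be retransmitted over a backup link. For delivery failures caused by instance movement, fault tolerance can be achieved by designing a sound mechanism for tracking instance locations and forwarding mails.

· State fault tolerance: the exceptional states of migrating instances can be caught and recovered in time. The states of a migrating instance comprise normal execution states and exceptional states; an exceptional state mainly means that the instance has become untrackable or unavailable because of a physical fault or a security attack. Because parallel business sub-processes are usually related in data and time, the migrating instances executing the same workflow also have specific behavioral dependencies. If the state of one migrating instance becomes exceptional, it may cause state exceptions or execution blocking in other instances. A state fault-tolerance mechanism must therefore not only catch and recover the exceptional state of a single instance in time, but also analyze how far the exception reaches and effectively limit its spread.

Supported by the National Natural Science Foundation of China and based on the migrating workflow system model, this thesis draws on research results from other fields and focuses on the fault-tolerant execution model of migrating workflow and its implementation methods, including a fault-tolerant execution method for migrating instances, a reliable communication method between migrating instances, and methods for the collaborative monitoring and coordinated failure recovery of multiple migrating instances. These results are analyzed and validated through a concrete application case. The main work of this thesis includes:

1. Research on the fault-tolerant execution model of migrating workflow. To achieve reliable execution of migrating workflows, the thesis establishes a system-level fault-tolerant execution model. The model describes the faults present in the system, and the corresponding fault-tolerance mechanisms, at three levels: the service layer, the instance layer, and the coordination layer. The thesis gives the framework of the fault-tolerant execution model, designs an experimental migrating workflow use case, and builds a fault-tolerant execution environment for migrating workflows.

2. Research on the fault-tolerant execution model of migrating instances and its implementation. To achieve fault-tolerant execution of migrating instances, the thesis distinguishes two types of workflow task: time-critical tasks and business-critical tasks. Time-critical tasks are short transactional tasks with demanding response-time requirements, such as real-time data processing and online software updates. Business-critical tasks are long transactional tasks with demanding reliability requirements, such as ticket booking, shopping, and money transfer. Because business-critical tasks usually modify databases, exactly-once execution and transactional properties must be guaranteed. For migrating instances executing business-critical tasks, the thesis studies a fault-tolerant execution stage construction model based on space replication; it introduces the concept of a dynamic stage, defines dynamic priority, and designs and implements a stage work place selection strategy and a dynamic stage construction algorithm. Performance and efficiency analysis shows that the model reduces the time and communication overhead of stage commit and improves the efficiency of fault-tolerant execution.

3. Research on the fault-tolerant communication model of migrating instances and its implementation. To achieve communication fault tolerance between migrating instances, and targeting the failure in which mails cannot be reliably delivered because instances move during communication, the thesis studies a reliable communication model for migrating instances based on service domain partitioning and the "post office-mailbox" principle. It defines the communication model and its architecture, designs the naming and addressing scheme for migrating instances, gives the main communication algorithms, and analyzes the model's properties and communication efficiency. Experiments show that the model, which imitates real-world mail delivery, is simple and easy to implement and offers good reliability and efficiency.

4. Research on the state monitoring model of migrating instances and its coordinated recovery method. To achieve state fault tolerance, and targeting faults in which the exceptional state of one migrating instance leaves other instances in inconsistent states or blocks their execution, the thesis studies a collaborative monitoring model for concurrent processes carried out cooperatively by multiple agents, together with a corresponding checkpoint algorithm. It defines the collaborative monitoring model, designs the monitor management algorithms (covering monitor creation, movement, and exit), and describes how monitoring information is acquired and processed as well as the monitoring-based checkpoint procedure. Performance and efficiency analysis shows that the model monitors migrating instances effectively and, through the coordination of monitors, reaches a globally consistent state after fault recovery.

The innovations of this work are mainly as follows:

1. For the problem of migrating instance execution being blocked by work place faults, a fault-tolerant execution stage construction model based on space replication is proposed. The model effectively reduces the time and communication overhead of space replication and improves the usability of execution fault tolerance for migrating instances. By planning the task execution stages of a migrating instance sensibly, it avoids unnecessary revisits to work places during execution and reduces total running time. By defining and computing a dynamic service priority for work places, it lets a work place have different service priorities for different migrating instances at different times, which reflects more accurately how suitable the work place is as an instance's runtime environment. By preferring, wherever possible, work places that were used in the previous stage, it allows a migrating instance to select the best work places in the execution environment while reducing the communication overhead of stage commit.

2. For communication failures caused by communication link faults and the mobility of migrating instances, a fault-tolerant communication model based on service domain partitioning is proposed. The model is simple to use, highly reliable, and efficient, and adapts well as the system grows. Using the concept of business relevance, it partitions the work places into service domains, sets up a post office in each service domain, creates mailboxes in the post office for instances created in the domain and for instances that have migrated in from other domains, and builds an address caching mechanism through an acquaintance address book to ease address lookup. Every migrating instance has two mailboxes: a source mailbox and an active mailbox. The active mailbox is carried by the instance, so that mail can be delivered and collected directly; the source mailbox stays at the post office where the instance was created and supports mail forwarding when direct delivery fails. The model has the following advantages: (1) Double-mailbox mechanism. It effectively prevents mail loss during migration and guarantees exactly-once delivery of mails. (2) Acquaintance address book mechanism. It supports efficient, transparent addressing of migrating instances; it not only shortens address lookup time but also reduces the dependence of instance communication on the place of creation and lowers the system's overhead for instance registration and deregistration, which strengthens robustness and improves efficiency.

3. For the problem of catching and recovering exceptional states of migrating instances, a hierarchical collaborative monitoring model (HCM~3) is proposed. The model effectively captures and processes the state information of migrating instances and prevents workflow failure caused by the premature death of an instance. It treats the state monitoring of migrating instances as a cooperative process among multiple monitors: several monitors jointly monitor all migrating instances executing the same workflow and, by coordinating with one another, catch, handle, and recover exceptional states at different levels. The model has the following characteristics: (1) Hierarchical monitoring. The monitors form a hierarchy that corresponds to the hierarchy among migrating instances. Monitoring content and methods can be customized for different instances, exceptions arising at the migrating instance level and at the process level can be diagnosed and handled, and the behavior of monitors at different levels can be coordinated. (2) Concurrent monitoring. Monitoring can proceed at different levels simultaneously, with consistency of state reached through coordination among monitors; this partly solves the single-point bottleneck of centralized monitoring and improves monitoring efficiency. (3) Monitoring reliability and efficiency. The model spreads the risk of monitoring failure over multiple monitors, which makes it more reliable than centralized monitoring, while assigning only one monitor to each migrating instance avoids the extra overhead of excessive redundant monitors.

Because of the particular nature of migrating workflow, and because migrating workflow management is still a newly opened research area, both the theory and its applications are far from mature. Further research includes:

1. Further refinement of the collaborative monitoring model. The model presented here is still at the proof-of-concept stage. On the one hand, the system makes assumptions about many parameters, for example it considers only a limited set of fault types and ignores monitor failures; on the other hand, the experimental case is rather simple and has not been combined with the system for extensive, in-depth quantitative analysis. Future work will take the complexity and dynamics of the environment into account, improve the algorithms, and, building on the existing qualitative analysis, carry out in-depth quantitative analysis of all aspects of system performance in order to obtain objective evaluation criteria.

2. Goal-oriented task decomposition and migrating instance execution mechanisms. The research in this thesis is based on the process-oriented migrating workflow approach. Because decomposing the business process and dividing instance execution into stages require the business process to be explicitly defined in advance, the system designer must have complete workflow knowledge. For large-scale, inter-organizational collaborative business processes, this is very hard to achieve. Future work will study goal-driven migrating workflow mechanisms to reduce the dependence of system performance on the designer's prior knowledge and improve ease of use.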

【Abstract】 Migrating workflow is a workflow management technology based on mobile agents. The task-performing agent, called a migrating instance, is built on the mobile agent paradigm. Work places map the network nodes and services of workflow participants: network nodes are the working sites of migrating instances, and the services they provide comprise runtime services and workflow services. A migrating instance can use local resources to perform one or several tasks at a work place. If necessary, it can migrate with its task list and current results to another suitable work place and continue its work there. Migrating instances created for the same migrating workflow can work collaboratively to meet the needs of parallel business process management.

The migrating workflow management system is composed of a migrating workflow management machine and several work places with established trust relations. The management machine is used by the workflow sponsor to organize, manage, and supervise workflows. A work place, representing an organization or corporation that participates in the collaborative work, provides services for migrating instances. The workflow engine, located on the workflow management machine, supports workflow alliance management, business process definition, and the creation, dispatching, and monitoring of migrating instances. If a business flow can be divided into several parallel business processes, multiple migrating instances can be created, each performing one of the processes. A work place, comprising a docking station and a work host network, is the working location of migrating instances. The docking station receives queries and migration requests from migrating instances and provides the runtime environment and runtime services when an instance arrives. Moreover, it requests data services and functional services on behalf of migrating instances. 
If the migrating workflow engine and a work place are deployed together, each workflow participant can organize and launch its own workflow. Hence multiple business processes are permitted to run in the same workflow system.

The running environment of a migrating instance is an inter-organizational network, so task execution is prone to be affected by uncertainty, e.g. host faults, channel failures, communication failures, and service and resource malfunctions. Such faults or failures disturb the execution of a migrating workflow; moreover, they can cause the death of a migrating instance or, even worse, the abortion of the migrating workflow. Hence fault tolerance of migrating instances is necessary to ensure the reachability, correctness, and reliability of migrating workflows. The fault tolerance of migrating instances includes three aspects: fault tolerance of execution, fault tolerance of communication, and fault tolerance of state.

· Fault tolerance of execution: a migrating instance can perform tasks reliably at any work place. In a migrating workflow system, workflow tasks are performed by migrating instances that move consecutively between work places and make use of local services. A work place provides not only the runtime environment but also reliable workflow services for migrating instances. Physical host faults or logical service malfunctions can disturb the normal execution of a migrating instance. Especially for long transactional tasks that demand high reliability, such as booking or payment, transactional properties must be ensured because important databases are accessed. Hence a fault-tolerant scheme is indispensable for migrating instances.

· Fault tolerance of communication: the mails exchanged between migrating instances can be sent and delivered reliably. In a migrating workflow system, communication is the basis of cooperation. Only when mails are sent and delivered reliably can the success of cooperation be ensured. 
There are two factors that can cause communication failure between migrating instances: (1) physical faults of the communication channel, which prevent a mail from being sent; (2) migration of the receiving instance, which prevents a mail from being delivered, i.e., when the mail reaches the target host, the receiver has already left. For physical channel faults, the mail can be resent through a backup channel; for delivery failures caused by instance migration, a location tracking and mail forwarding mechanism is needed.

· Fault tolerance of state: exceptional states of migrating instances can be caught and recovered in time. The states of a migrating instance are divided into regular states and exceptional states. An exceptional state means that the migrating instance has become untrackable or unavailable because of physical faults or attacks. Business processes executed in parallel are often related in data or time. If one migrating instance enters an exceptional state, other migrating instances may become exceptional or blocked. Therefore, state fault tolerance must not only catch and recover exceptional states in time, but also determine the affected scope and effectively restrict the spread of exceptional states.

This study is supported by the National Natural Science Foundation of China under Grant No. 60473123 and No. 60573169 and is based on the migrating workflow framework. The thesis draws on research results from related fields and focuses on the fault-tolerant execution model of migrating workflow. Several implementation schemes are presented, including the fault-tolerant execution model, the reliable communication model of migrating instances, and the collaborative monitoring and coordinated recovery scheme. The results have been analyzed and validated through an experimental case. The main contributions of this thesis are as follows:

1. 
Research on the fault-tolerant execution model of migrating workflow. To perform workflow tasks reliably, a fault-tolerant execution model of workflow is presented in this thesis. The model has a hierarchical structure made up of a service layer, an instance layer, and a coordination layer. The possible faults of the three layers are described and the corresponding fault-tolerant implementation scheme is established. Moreover, the framework of the fault-tolerance model is provided, an experimental case is devised, and an experimental environment is established, which form the research basis of the later chapters.

2. Research on the fault-tolerant execution model and implementation of migrating instances. To implement the fault-tolerant execution of migrating instances, workflow tasks are divided into two types: time-critical tasks (TCT) and business-critical tasks (BCT). The former are short transactional tasks requiring strict response times, e.g. real-time data processing and online software updating; the latter are long transactional tasks requiring high reliability, e.g. booking, paying, and money transfer, for which transactional properties must be ensured when modifying an important database. In this thesis, a fault-tolerant stage construction model based on the space replication method is provided. The definitions of dynamic stage and dynamic priority are presented. Moreover, the stage work place selection algorithm and the dynamic stage construction algorithm are implemented. Performance analyses and experimental results show that the model can reduce the time and communication costs of stage commit and hence improve the efficiency of fault-tolerant execution.

3. 
Research on the fault-tolerant communication model and implementation of migrating instances. To implement fault-tolerant communication between migrating instances, we studied a reliable communication model based on service domains and the "post office-mailbox" principle. The communication model is tailored to mail delivery failures caused by the movement of migrating instances. In this thesis, the definition of the communication model and the corresponding system architecture are described, the naming and addressing scheme is designed, and the main communication algorithms are proposed. Moreover, the model characteristics and communication efficiency are analyzed. The experimental results show that the communication scheme is simple, reliable, and efficient.

4. Research on the state monitoring and coordinated recovery model and its implementation. To implement fault tolerance of state, we studied a collaborative monitoring model for collaborative parallel processes with multiple executing agents, together with a corresponding checkpoint algorithm. The model is tailored to the state inconsistency and execution blocking caused by the exceptional state of a migrating instance. The collaborative monitoring model and the monitor management algorithms are presented. The process of capturing and handling monitoring information is described. Moreover, a checkpoint method based on the monitoring model is provided. Performance analyses and experimental results show that the model monitors migrating instances effectively and, through the coordination of monitors, can recover from failure to a consistent state.

The main innovative contributions of this thesis are:

1. To avoid execution blocking caused by work place failures, a fault-tolerant execution stage construction model based on space replication is provided. 
The model optimizes the efficiency of stage construction, reduces time and communication costs, and improves the usability of the fault-tolerant execution method. By planning the task execution of a migrating instance according to the execution capability of the work places, it avoids unnecessary revisits to work places and reduces the total running time of the migrating instance. Moreover, a method of evaluating work places, called dynamic priority, is defined: the priority of a work place differs for different migrating instances at different times, so it reflects how suitable the work place is as the runtime environment of a given migrating instance. In addition, the work place selection algorithm selects the most suitable work places for a migrating instance while reducing the communication costs of stage commit.

2. To address communication failures caused by network faults or by the movement of migrating instances, a reliable communication model based on service domains is presented. Compared with traditional communication methods for mobile agents, the model is easy to use, reliable, efficient, and adaptable to larger system scales. The model divides the work places into several service domains and sets up a post office in each domain, in which the mailboxes of migrating instances are located. Each migrating instance has two mailboxes: a source mailbox and an active mailbox. The source mailbox is located at the home post office, while the active mailbox travels along with the migrating instance. An address_book is kept at each post office to cache the addresses of migrating instances that have been communicated with, for use in later queries. 
The model has the following advantages: (1) Double mailboxes. Every instance has two mailboxes: the source mailbox (hmb) acts as a guide, while the active mailbox (amb) is the component that actually receives messages. This ensures reliable delivery with the exactly-once property and reduces the bandwidth consumed by triangular routing and the overheads of registration and deregistration. (2) Transparent and efficient addressing. The addressing strategy decreases addressing time and lessens the dependency on the home, which makes the system more robust and scalable. (3) Fault tolerance. The model avoids message loss caused by the movement of migrating instances and ensures sequential, exactly-once delivery of mails. Preliminary experiments show that the model can satisfy the reliability, adaptability, and efficiency requirements of a migrating workflow system. Future work will focus on establishing a more secure system that can be applied in more general environments.

3. To catch and recover exceptional states, a hierarchical collaborative monitoring model (HCM~3) is provided. The model can capture and handle the state of migrating instances and avoid workflow failures caused by the death of a migrating instance. The model treats the monitoring of the whole workflow as a coordinated parallel process: it dispatches multiple monitors to collaboratively monitor all migrating instances performing a workflow, and catches, handles, and recovers exceptional states at different levels through the coordination of monitors. The model has the following merits: (1) Hierarchical. Monitors have hierarchical relations with one another, which makes it possible to tailor monitoring to different migrating instances, diagnose exceptions when they appear, and coordinate the work of monitors at different levels. (2) Parallel. Monitoring proceeds in parallel, and the system state is kept consistent through coordination among monitors. The model avoids a single point of failure and improves monitoring efficiency. (3) Reliable and efficient. 
The model is reliable because it distributes monitoring tasks over several monitors. At the same time, only one monitor is assigned to each migrating instance, which avoids the additional costs introduced by redundant monitors.

Since migrating workflow is an emerging research field, it is far from mature in both theory and applications. To further the study started in this thesis, the author proposes the following future work:

1. HCM~3 needs further improvement. It is currently at a preliminary stage, because many assumptions were made when building the system and only a simple case was used in the experiments. Further work will refine the model by considering the complexity and dynamics of the application environment. In addition to qualitative analyses, thorough quantitative analyses of many aspects of the model will be carried out to obtain a more objective evaluation.

2. Research on a goal-oriented task decomposition and migrating instance execution scheme will be undertaken. In this thesis, the flow decomposition and instance dispatching scheme is only a direct and simple division of the business flow, which relies on an explicit definition of, and complete knowledge about, the business flow. Adaptability is achieved by binding the implementation details to migrating instances at runtime. Further work will study a goal-oriented workflow scheme, which can lessen the dependency on the designer's knowledge of the workflow and hence improve usability.
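
The double-mailbox delivery described under the second innovative contribution can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the names `Mailbox`, `PostOffice`, `create_instance`, `migrate`, and `send` are hypothetical, the source mailbox is reduced to a pointer to the instance's current post office, and exactly-once delivery is approximated by discarding mails whose ID has already been stored (an assumption made here for illustration). Instances in transit and post office failures are not modelled.

```python
class Mailbox:
    """Active mailbox; keeps mails by ID so a repeated delivery is ignored."""

    def __init__(self):
        self.mails = {}

    def put(self, mail_id, body):
        if mail_id in self.mails:      # duplicate: the mail was already delivered
            return False
        self.mails[mail_id] = body
        return True


class PostOffice:
    """One post office per service domain (hypothetical structure)."""

    def __init__(self, domain):
        self.domain = domain
        self.active = {}        # instance name -> active mailbox hosted here now
        self.source = {}        # home office only: instance name -> current office
        self.address_book = {}  # cache: instance name -> office where it was last found


def create_instance(home, name):
    """Register a new instance at its home post office."""
    home.active[name] = Mailbox()
    home.source[name] = home


def migrate(name, home, src, dst):
    """The active mailbox travels with the instance; the home record is updated."""
    dst.active[name] = src.active.pop(name)
    home.source[name] = dst


def send(sender_office, home, name, mail_id, body):
    """Try the cached address first; fall back to forwarding via the home office."""
    cached = sender_office.address_book.get(name, home)
    if name in cached.active:
        cached.active[name].put(mail_id, body)
        return "direct"
    # Direct delivery failed because the receiver has moved: ask the home office.
    current = home.source[name]
    current.active[name].put(mail_id, body)
    sender_office.address_book[name] = current   # remember for later mails
    return "forwarded"
```

In this sketch a sender pays for the detour through the home post office only once after each migration; later mails use the cached address, which mirrors the role the abstract gives to the address book.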

  • 【Online publication contributor】 Shandong University
  • 【Online publication year and issue】 2010, Issue 04