
生物信息学网格环境下任务调度关键技术研究

Research on Task Scheduling Policy in Bioinformatics Grid Environments

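【Author】 Jiang Wenchao (姜文超)

【Advisor】 Zhou Yanhong (周艳红)

【Author information】 Huazhong University of Science and Technology, Computer Application Technology, 2009, PhD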

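【Abstract, translated from the Chinese】 With the arrival of the post-genomic era, the explosive growth of biological data poses a severe challenge to the performance of computing resources. Grid technology has received unprecedented attention as a major means of meeting that challenge, and bioinformatics grids dedicated to processing biological data have emerged as a result. Beyond the technical challenges faced by conventional grids, biological data are large in volume, mutually independent or only weakly correlated, coarse in task granularity, and dependent on multi-party collaboration. These properties place special demands on resource management, task scheduling, load balancing, and related techniques in a bioinformatics grid. Those techniques must be adapted to the characteristics of bioinformatics applications so that low-level resources and high-level applications are combined organically. Doing so improves resource utilization and task execution efficiency, makes the grid platform simpler for biology researchers to use, and lets the bioinformatics grid realize its full service potential as key infrastructure for biological research.

Based on a detailed analysis of grid resource management models, a two-tier resource definition mechanism is proposed. It considers both the physical characteristics of low-level system resources and the logical associations of high-level applications. Key operations such as task scheduling, load balancing, and the dynamic organization of service flows can therefore take both physical and application characteristics into account, achieve the best match between services and tasks, and avoid the topology mismatch that a vertical resource definition mechanism may cause. Building on this two-tier transverse resource definition, new strategies are given for task scheduling, load balancing, and dynamic composition scheduling of service flows in bioinformatics grids.

Complex biological applications are usually completed by multiple subtasks cooperating according to specific application logic. Following the idea of optimizing the composition of related services, a two-level service scheduling policy based on logical group partitioning, SP2SP, is given. The set of services that satisfies the requirements of a complex application is determined from the logical relationships among its subtasks and is defined as a logical service group. A first-level optimized match is made between complex applications and logical service groups; a second-level match based on QoS and weighted queues is then made inside each logical service group. SP2SP reduces the number of interactions between the scheduler and the information service, achieves resource reservation, takes task priority into account, improves the execution efficiency of grid tasks, and ensures fairness when multiple tasks compete for resources.

Load balancing is an indispensable functional module for guaranteeing the overall performance of a grid system. Because dynamic task migration during load balancing in a bioinformatics grid can trigger large data transfers, a load balancing policy, M²ON, based on the min-cost and max-flow channel M²C is proposed. M²ON searches a semantic overlay network for grid nodes whose computing performance meets the requirements, uses M²C to assess the communication state between the source node and each candidate target node, and finally fuses the two through a bilinear interpolation function (DLI) into an integrated impacting factor (IIF) that serves as the basis for selecting the final target node. M²ON avoids the topology mismatch that traditional single-overlay models may cause and reduces the share of task or data transfer overhead in total task completion time, thereby improving the execution efficiency of grid tasks.

To make the grid platform simpler to use, multiple cooperating grid services can be organized automatically into a service flow according to specific application logic. Once the service flow is fixed, large and uneven task granularity may cause unbalanced resource load, which in turn lowers overall resource utilization. Exploiting the independence or weak correlation among biological data, a multi-level pipeline service scheduling policy based on task granularity decomposition, MP-GridWF, is given. Combined with a replica creation mechanism, a service scheduling policy based on multi-level pipelining and multi-granularity replica creation, MP&MR-GridWF, is then given. MP-GridWF and MP&MR-GridWF successively improve the execution efficiency of grid tasks that require multiple services to cooperate serially.

Drawing on the research above, the bioinformatics grid sub-platform H-BioGrid was built on the ChinaGrid Support Platform (CGSP) as part of the national science and technology infrastructure platform NPPC. Tests with real applications show that the proposed methods improve grid resource utilization and task execution efficiency and reduce the average completion time of grid tasks. H-BioGrid can integrate any software or hardware resource that is to join the platform. Several bioinformatics applications and databases developed in the laboratory have been deployed and publicly released on it, providing support for bioinformatics research in China and abroad.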
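The benefit that MP-GridWF draws from task granularity decomposition can be illustrated with a small pipeline model. The sketch below is not the thesis's algorithm: it assumes a service flow of serial stages, each with a hypothetical fixed cost per unit of data, and assumes that chunks of weakly correlated data can be processed independently. The function name `pipeline_makespan` and all numbers are illustrative.

```python
from typing import List


def pipeline_makespan(chunk_sizes: List[float],
                      stage_cost_per_unit: List[float]) -> float:
    """Completion time of chunks flowing in order through serial stages.

    Assumption: each stage handles one chunk at a time, and a chunk
    enters stage s only after it has left stage s-1.
    """
    # finish[s] = time at which stage s finished its most recent chunk
    finish = [0.0] * len(stage_cost_per_unit)
    for size in chunk_sizes:
        previous_stage_done = 0.0
        for s, cost in enumerate(stage_cost_per_unit):
            # A chunk starts when both the chunk and the stage are free.
            start = max(previous_stage_done, finish[s])
            finish[s] = start + size * cost
            previous_stage_done = finish[s]
    return finish[-1] if chunk_sizes else 0.0


# Hypothetical three-service flow, 12 units of data in total.
stages = [2.0, 3.0, 1.0]
whole = pipeline_makespan([12.0], stages)      # one coarse-grained task
split = pipeline_makespan([3.0] * 4, stages)   # four equal chunks
print(whole, split)
```

Under these assumptions the unsplit task occupies one stage at a time while the others sit idle, whereas the split task keeps several stages busy at once, which is the effect the abstract attributes to multi-level pipelining.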

【Abstract】 Bioinformatics, which focuses on creating and developing advanced computational techniques to manage and extract useful information from DNA, RNA, and protein sequences, is fast emerging as an important discipline for life science research. Bioinformatics applications are extremely compute-intensive or data-intensive, which motivates the use of Grid technology. However, some functional modules of a grid, such as resource management, scheduling, and load balancing, must be adjusted to accommodate bioinformatics applications, which are compute- or data-intensive, operate on mutually independent data, have coarse task granularity, and are cooperative.

A novel bipartite model for resource management is proposed, based on a detailed analysis of grid resource management modes. This bipartite model lets grid users observe grid resources in terms of both low-level physical characteristics and high-level application characteristics. Based on the bipartite model, a novel grid service scheduling policy, a load balancing method based on a min-cost and max-flow channel, and a dynamic combinatorial optimization method for grid services are presented to enhance the global performance of the grid platform.

Complicated bioinformatics applications usually include multiple sub-tasks that need to interact with each other to accomplish the whole task. A set of optimal services, selected under certain performance constraints, is invoked to satisfy such tasks. Coordinating the optimal invocation of these services is therefore important for increasing responsiveness and for ensuring optimal application execution and system usage in general. We present a method called SP2SP, a 2-level grid Service Scheduling Policy based on Logical Subnet Partitioning, which tackles the service scheduling problem in Bio-Grid environments in three steps: 1) a similarity-based logical subnet partitioning algorithm that classifies individual services into different subsets according to similarity constraints based on performance metrics; 2) a requirement-based prediction algorithm that maps bioinformatics applications composed of multiple sub-tasks onto an optimal subnet; and 3) a multi-priority-queue-based service scheduling algorithm, used inside each subnet, that allocates each sub-task to an optimal physical service within the subnet. Comprehensive experiments were performed on the sub-grid platform of NPPC to evaluate the proposed SP2SP mechanism. The results show that SP2SP outperforms the scheduling algorithms it was compared against. In particular, SP2SP performs best in scenarios where a group of tasks has similar resource requirements or where tasks need to cooperate with each other to obtain better overall performance.

To balance load among all grid nodes, a bipartite model for load balancing (LB) in grid computing environments, called the Transverse viewpoint based Bi-Tier model (TBT), is proposed. TBT can efficiently eliminate topology mismatching between the overlay and physical networks during the load transfer process. As an implementation of TBT, a novel LB policy called M²ON (Min-cost and Max-flow Channel based Overlay Network) is presented. In M²ON, the communication capability is denoted as M²C (Min-cost and Max-flow Channel), which is obtained using a Labeled Tree Probing (LTP) method. The computing capacity is denoted as the Idle Factor (IF), which is obtained from the semantic overlay. The higher- and lower-level characteristics are combined into an Integrated Impacting Factor (IIF) using a Double Linear Inserting (DLI) function, that is, bilinear interpolation. Based on the IIF, optimal topology matching can be achieved in the LB process. Extensive experiments and simulations have been performed and are discussed. The results show that M²ON achieves more accurate topology matching with a minimal increase in overall locating time while achieving higher overall system performance.

Based on the theory and research results described above, a bioinformatics grid platform called H-BioGrid has been designed and constructed. The platform can integrate any hardware, software, or data resource that is to join it. Several bioinformatics applications and databases developed in our lab have already been deployed on H-BioGrid and are freely accessible to bioinformatics researchers worldwide.
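The way M²ON combines the Idle Factor and the M²C channel measure into a single IIF can be sketched with ordinary bilinear interpolation on the unit square. The abstract does not give the DLI function's inputs, normalization, or corner values, so everything below is an assumption made for illustration: both inputs are taken as already normalized to [0, 1], the four corner scores in `CORNERS` are hypothetical, and the names `bilinear` and `pick_target` are not from the thesis.

```python
from typing import Dict, Tuple


def bilinear(x: float, y: float,
             f00: float, f10: float, f01: float, f11: float) -> float:
    """Bilinear interpolation on the unit square.

    fXY is the value at corner (x=X, y=Y); x and y must lie in [0, 1].
    """
    return (f00 * (1 - x) * (1 - y) + f10 * x * (1 - y)
            + f01 * (1 - x) * y + f11 * x * y)


# Hypothetical corner scores (assumption, not from the thesis):
# busy node with a poor channel, idle node with a poor channel,
# busy node with a good channel, idle node with a good channel.
CORNERS = (0.0, 0.3, 0.2, 1.0)


def pick_target(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Return the candidate whose combined score is highest.

    Each value is (idle_factor, channel_quality), both in [0, 1].
    """
    return max(candidates,
               key=lambda name: bilinear(*candidates[name], *CORNERS))


nodes = {"A": (0.9, 0.2),   # very idle, but poorly connected to the source
         "B": (0.6, 0.8)}   # moderately idle, well connected
print(pick_target(nodes))
```

With these assumed corner values, a node that is idle but poorly connected scores lower than a moderately idle node with a good channel, which mirrors the abstract's point that communication state should weigh in target selection alongside computing capacity.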
