Event-based Monitoring and Management of the Distributed System

Original title: 基于事件的分布式系统监控

【Author】 司徒放 (Situ Fang)

【Advisor】 曹健 (Cao Jian)

【Author Information】 Shanghai Jiao Tong University, Computer Applications, 2010, Master's

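【Abstract (translated from the Chinese original)】 As distributed systems grow more complex, run-time monitoring plays an increasingly important role in improving system performance and reliability. This thesis proposes a new approach that combines a monitoring probe platform with complex event processing (CEP) to monitor distributed systems at run time, reduce the difficulty of developing and using monitoring components, and make monitoring and management more efficient. The probe platform runs on top of the monitored resources and provides a common management interface to JMX components. Monitoring components are wrapped as JMX probes, so the platform can deploy probes at run time, generate their metadata, and manage and look them up; this unifies how probe information is queried and how probe operations are invoked, and remains compatible with existing JMX products. Probes report monitoring information as events. To improve the efficiency and reliability of event transmission over the network, events pass through extended event filters before transmission and are then encapsulated in messages sent to the monitoring server. To respond quickly to large volumes of probe events and to analyze the temporal and correlation relationships among them, the monitoring server uses monitoring rules based on complex event processing and hands the monitoring events to a complex event engine for real-time processing. Monitoring rules describe complex events in an SQL-like syntax, applying filtering, correlation, and aggregation to the incoming basic monitoring events to abstract higher-level management events. Once a management event is determined to have occurred, the corresponding management decision action is triggered, which invokes operations on the monitoring probes to configure and manage the distributed system automatically at run time. The monitoring and CEP techniques described above have been applied in a simulation computing platform. Drawing on the practical requirements and experience of that project, the thesis uses simulation job dispatching and scheduling, job execution monitoring, system performance evaluation, and node information statistics as examples to demonstrate the monitoring system's event definition, rule configuration, response action binding, and decision scheduling.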

【Abstract】 With the increasing complexity of distributed systems, run-time monitoring and management have become essential for improving system performance and reliability. This thesis proposes a new method for distributed system monitoring that combines monitoring probes with complex event processing (CEP) technology; it performs automated run-time management of distributed systems and improves monitoring and management efficiency.

A JMX-based monitoring probe platform runs on top of the managed resources and provides a common management interface, which standardizes the run-time deployment, metadata generation, location, and configuration of monitoring components as probes. Because all monitoring components in the system are wrapped in probes and loaded by the probe platform, metadata structures and operation calls are unified and remain compatible with existing JMX products.

The efficiency and reliability of transmitting monitoring information over the network are also considered, through event-driven reporting and extended event filters. Monitoring events are filtered and encapsulated in messages on their way from the probes to the management server.

The management server adopts complex event processing to analyze the high volume of monitoring events and to correlate event time sequences in real time, which plays an important role in decision support for the monitoring service. Management rules abstract higher-level management events, using an SQL-like syntax to describe filtering, correlation, and aggregation over basic monitoring events. Once a management event is detected, the corresponding decision is triggered to perform automatic system management by manipulating probes on the probe platforms.

This distributed monitoring infrastructure has been used in an actual simulation platform. Based on the project's requirements and practical experience, the thesis closes with several cases that demonstrate the management process, including event definition, rule configuration, event-action binding, and decision execution.
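
To make the event pipeline described in the abstract more concrete, here is a minimal Python sketch of a probe-side filter feeding a server-side rule that aggregates events over a time window and triggers an action. It is an illustration under assumptions, not the thesis's implementation: the thesis builds on JMX probes and a complex event engine with an SQL-like rule language, neither of which is reproduced here. All names (`MonitoringEvent`, `probe_filter`, `WindowedAverageRule`, `run_demo`), the field layout, and the thresholds are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Tuple


@dataclass(frozen=True)
class MonitoringEvent:
    """A basic monitoring event reported by a probe (hypothetical fields)."""
    node: str
    metric: str
    value: float
    timestamp: float  # seconds


def probe_filter(event: MonitoringEvent, min_value: float = 0.5) -> bool:
    """Probe-side filter: only readings at or above min_value are transmitted."""
    return event.value >= min_value


class WindowedAverageRule:
    """Server-side rule, roughly equivalent in spirit to an SQL-like statement:

        SELECT node, avg(value) FROM events WHERE metric = <metric>
        WINDOW <window_seconds> GROUP BY node HAVING avg(value) > <threshold>

    Assumes events for each node arrive in timestamp order.
    """

    def __init__(self, metric: str, window_seconds: float, threshold: float,
                 action: Callable[[str, float], None]) -> None:
        self.metric = metric
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.action = action
        self._windows: Dict[str, Deque[MonitoringEvent]] = {}

    def on_event(self, event: MonitoringEvent) -> None:
        if event.metric != self.metric:  # filtering
            return
        window = self._windows.setdefault(event.node, deque())  # group by node
        window.append(event)
        # Evict events that have fallen out of the time window.
        while event.timestamp - window[0].timestamp > self.window_seconds:
            window.popleft()
        average = sum(e.value for e in window) / len(window)  # aggregation
        if average > self.threshold:
            # The higher-level "management event" has occurred: run the action.
            self.action(event.node, average)


def run_demo() -> List[Tuple[str, float]]:
    fired: List[Tuple[str, float]] = []
    rule = WindowedAverageRule(
        metric="cpu", window_seconds=10.0, threshold=0.8,
        action=lambda node, avg: fired.append((node, avg)),
    )
    raw_events = [
        MonitoringEvent("node-1", "cpu", 0.25, 0.0),   # dropped by the probe filter
        MonitoringEvent("node-1", "cpu", 0.75, 1.0),   # average 0.75, below threshold
        MonitoringEvent("node-1", "cpu", 1.0, 2.0),    # average 0.875, rule fires
        MonitoringEvent("node-2", "mem", 1.0, 2.5),    # different metric, ignored
        MonitoringEvent("node-1", "cpu", 0.5, 20.0),   # old events evicted, average 0.5
    ]
    for event in raw_events:
        if probe_filter(event):
            rule.on_event(event)
    return fired


if __name__ == "__main__":
    print(run_demo())  # [('node-1', 0.875)]
```

In a real deployment of the kind the abstract describes, the action would invoke an operation on a probe rather than append to a list, and the rule would be declared in the engine's rule language instead of being hand-coded.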
