节点文献

面向网络安全监控的流数据处理技术研究

Research on Key Technologies in Analysis Processing of Network Security Monitoring Data Stream

【作者】 甘亮

【导师】 贾焰;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2011, 博士

【摘要】 随着信息技术的不断发展,互联网在人们生活中扮演着越来越重要的角色。而随之而来的各种网络安全事件严重成胁着互联网的应用和发展。因此,以安全为目的的网络监控在维护网络正常高效运行、保障关键设施和确保信息系统安全等方面起着越来越重要的作用。如何对实时、海量的网络安全监控数据实施高效的在线分析,进而为各种应用提供进一步支持,成为网络安全和数据分析处理领域的一个研究重点。本文以网络安全监控为背景,针对网络安全监控的应用特点和实时监控数据分析处理存在的挑战性问题,从动态实时的数据流处理角度,研究了四类网络安全监控流数据的高效查询技术,分别为连续top.k监测、表连接优化、多查询优化、大时间窗口查询。本文的主要贡献为:1、改进了网络安全监控流中重要事件监测所使用的数据流连续top.k查询方法。建立有效的索引结构可以提高查询效率,而已有的数据流top.k索引一般基于网格索引,这些索引结构中存在大量自由数据点f可证明非top.k结果1。针对这一问题,本文提出了基于逆支配点集(reverse dominant point set,RI)PS)的top.k索引方法A.MC’R索引,该索引中通过逆支配点集性质_亨9枝了大量自由数据点,同时给出数据流中数据点加入和删除时肛MC!R的更新算法。理论分析和真实数据集上的实验证明肛McR索引在存储容量、查询效率等方面存在着优势。2、针对网络安全监控数据流分析中重要的表连接操作,研究了监控流与超大维表的表连接优化方法。IP地址维的数据量达到2”条记录,表连接过程中此类超大维表无法将整个表常驻内存,需划分为多个子块分块循环读入内存,造成磁盘I/频繁。基于这一问题,本文提出将超大维表按列分解压缩并常驻内存的多动态索引嵌套循环连接算法fMDI.NL.Join)。超大维表按其m个属性列划分为m个子维表并压缩,同时建立子维表索引;表连接时按查询语句中的投影所确定的属性列来动态调度对应的子维表索引,以确保使用冗余最小的子维表并利用索引提高连接探测效率。理论分析和实验表明该连接算法特别适合于维表连接键存在较大冗余的超大维表f如IP维表、银行账户维表等1。由于维表压缩为较小子表且压缩是无损的,表连接时每个嵌套循环过程探测扫描维表开销减少,提高了表连接效率并降低了存储维表的内存开销。3、对网络安全监控数据流的多个并发查询进行优化。数据流多查询优化可使用优化多查询计划和物化共享中间结果两种方法。现有的物化共享中间结果研究中使用物化视图或索引方式,且未对结果压缩存储。针对这一问题,本文提出了采用流数据方形式存储物化共享中间结果的方法,以及流数据方的压缩存储结构——压缩流数据方fcompressed Stream(:ube)。该方法以经典数据方树形压缩存储结构Qc.tree和Dw”f为基础,为进一步减少单位时间流数据方切片占用的内存空间,根据多查询需求将压缩存储结构中的完全物化所有中间结果改为部分物化;并建立结点物化收益模型,采用动态选择的方法按每个查询频率选择非物化结点。压缩流数据方以经典的压缩存储结构为基础,动态选择方法_亨9枝了大量非查询需求的物化结点,该方法虽然不可避免的_亨9枝了部分查询需求的物化结点,使查询性能降低,但在有限内存空间中较好的压缩了流数据方。以streamC)c_rree为例,通过理论分析和实验,在网络安全数据流中单位时间处理少量查询情况下可_亨9枝大量结点,以少量的降低查询响应速度为代价,有效地压缩了流数据方存储空间。4、针对网络安全监控流的大时间窗口查询问题进行了研究。由于计算和存储资源限制,数据流系统的查询时间窗口w限定在当前一段时间范围内,超出窗口w的大时间窗口查询无法计算。针对这一问题,本文通过综合数据流系统的实时计算及传统数据库系统的海量存储优势提出了增量式混合存储体系结构。数据流系统计算时采用“分治法”将时间窗口等分为多个不重叠的小时间块,只计算小时间块而获得整个时间窗口的结果,从而提高数据流系统计算效率。在传统数据库中采用物化视图存储窗口w外查询结果,并增量式接收小时问块结果来更新物化视图,避免了传统数据库中低效更新计算,提高了传统数据库的物化视图更新效率。该系统计算范围限于可增量计算的聚集查询,可适用于联机在线分析领域。综上所述,本文基于网络安全监控流数据处理应用中亟待解决的数据分析处理与查询问题,就数据流连续top.七监测、表连接、多查询优化、大时间窗口查询等几个重要的基础问题的关键算法进行了突破性研究,对于促进大规模网络安全监控数据分析处理的理论研究和实用化具有一定的理论意义和应用价值。

【Abstract】 The continuous development of informationization makes the Internet increasinglysignificant in our lives; while the emergence of various security events during theprocess highly threatens the application&development of the network. As a result of it,network security monitoring is essential to network maintenance, key infrastructureprotection and information system security. And one of the most challenging issuesoccurring in network monitoring and data analysis processing is how to process andquery the real-time massive monitoring data in an efficient manner, thus to providesupport for various follow-up applications.Based on the background of network security monitoring (NSM) and faced withthe above challenges, this dissertation focuses on the cost-efficient solutions to severalbasic problems derived from the difficulty in analyzing the dynamic real-time datastream, such as continuous top-k queries, join, multi-query optimization, big timewindow queries. The main contributions are concluded as follows:1. Continuous top-k queries on NSM data streams. An index of data stream canimprove the performance of queries efficiently. However, a grid index is usually usedfor continuous top-k, in which a lot of free data points (proved to be not top-k results)are included. Directing to prune those points, we propose an index structure, k-maxcalculating region (k-MCR) based on the reverse dominant point set (RDPS) and gridindex. We get k-MCR through calculating RDPS within grid index and pruning the unitsapproximately and quickly in grid, putting forward the updating algorithm of adding inand deleting data points in data stream. Analytical and experimental evidences displaythat k-MCR index approach performs better on both storage of index and efficiency ofqueries.2. Join algorithm for huge dimension tables on NSM data streams. We proposeMDI-NL join algorithm to optimize join between huge dimension tables (such asIPaddress tables, containing 232 tuples) and NSM data streams, which reduces theconsumption of CPU power and memory capacity. Generally speaking, a hugedimension table should be partitioned into small sub-tables by row and each sub-table isloaded into memory in turn, resulting in frequent disk I/O. However, Compressing hugedimension tables into small in-memory tables will improve the efficiency. So, we findthat the join key of a dimension table is so large and redundant for join, which can beused to compress dimension tables losslessly. We divide each dimension table into nsub-dimension tables by column, then compress those sub-dimension tables’s join keyand build an index for each sub-dimension table. In join operation, MDI-NL selectsindices of sub-dimensional tables dynamically according to the projection of query,which makes sure that each choosed sub-dimensional tables has the least join key rebundancy. Theoretical analysis and experimental evidences show that MDI-NL ismore adaptable for join with the huge dimension table in which there are largeredundant join keys. Since dimensional tables are compressed to relatively small andlossless sub-dimensional tables, the cost of scanning in each nest-loop and storage spaceis decreased, and therefore performance of table join is improved.3. Multi-query optimization on NSM data streams. Two types of methods can beadopted for multi-query optimization in data stream: optimization of query plan, andmaterialized intermediate results. It is common used of materialized views or indices inthe existing approaches of materialized intermediate results, which has no compressionon intermediate results. To solve such a problem, we propose compressed stream cubeas the storage structure of materialized intermediate results based on the compressedcube (QC-tree and Dwarf). For further reducing the space, part of the nodes arematerialized in the compressed model, different from the traditional method that all thenodes are materialized in a compressed model. In our model, the compressed nodes aredivided into basic nodes and additive nodes; in which the former ones are those thatmeet the validity of queries, and the latter ones are used to increase the speed inresponse of a specific query. This approach adopts an effective pruning technique tominimize the number of elements in the limited memory space. By this method, lots ofunused nodes in a query are pruned; while some useful materialized nodes are alsoinevitably pruned at the same time, leading to the comparatively decrease of efficiency.Theoretical analysis and experimental evidences (StreamQCtree as an example) indicatethat our approach can discard a great number of nodes under the circumstance ofprocessing fewer queries in a certain amount of time, which can effectively reduce thestorage of stream cube via a little efficiency loss in query.4. Big time window queries on NSM data streams. In the context of limited CPUpower and storage, DSMS (Data Stream Management System) just can deal withqueries within current time window W, while big time window queries which arebeyond the time window W can not be handled. Aiming at this problem, we propose anincremental hybrid storage architecture that combines real-time DSMS and mass storageDBMS (DataBase Management System). To improve the efficiency of DSMS, timedimension is "divide-and-conquer" into small non-overlapping time window blocks, andwe only handle data within the same small block and combine those small blocksqueries into a big time window query. Furthermore, materialized views are used to storethe query results beyond window W in DBMS, and updated by incrementally loadingthe results in expired small blocks from DSMS, free from the inefficient updating,which improves the efficiency in updating and maintenance of materialized views. Itshould be mentioned that this system only can do with aggregation operation in OLAP.This dissertation addresses the problem needed to be solved urgently incost-efficient analysis processing and queries for NSM data streams. A lot of work has been done on the continuous top-k, join, multi-query and big time window query,bringing a breakthrough to the field. This is a promotion of large-scale networkmonitoring data processing on both theoretical study and practical applications.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络